c - nvcc compiler not optimizing
Why is the compiler not doing trivial optimizations that could be done in the kernel? I have the following code for matrix multiplication:
__global__ void matrixmultiply(float * a, float * b, float * c, int numarows, int numacolumns, int numbrows, int numbcolumns, int numcrows, int numccolumns) {
    int n = numacolumns;
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    int col = blockIdx.y * blockDim.y + threadIdx.y;
    if ((row < numcrows) && (col < numccolumns)) {
        for (int k = 0; k < n; ++k) {
            // read-modify-write of global memory c on every iteration
            c[row * numccolumns + col] += a[row * numacolumns + k] * b[k * numbcolumns + col];
        }
    }
}
The example runs faster if I use a temporary register variable, cvalue, to store the sum:
__global__ void matrixmultiply(float * a, float * b, float * c, int numarows, int numacolumns, int numbrows, int numbcolumns, int numcrows, int numccolumns) {
    int n = numacolumns;
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    int col = blockIdx.y * blockDim.y + threadIdx.y;
    if ((row < numcrows) && (col < numccolumns)) {
        float cvalue = 0;  // running sum kept in a register
        for (int k = 0; k < n; ++k) {
            cvalue += a[row * numacolumns + k] * b[k * numbcolumns + col];
        }
        c[row * numccolumns + col] = cvalue;  // single write to global memory
    }
}
In the latter case, global memory c is accessed only once, whereas in the first case it is accessed many times inside the loop. Isn't this the kind of optimization that compilers do? The two versions differ by about 30% in performance in my tests; I'm compiling with nvcc -O3 ...
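Since c is not declared __restrict__, the compiler cannot know whether c is the same matrix as a or b, so it cannot perform the optimization mentioned: if c overlapped a or b, each write to c inside the loop could change a value that a later iteration reads. When I instead used float* __restrict__ c, the times for both versions became practically the same. Chris Dodd.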