C - nvcc compiler not optimizing
Why is the compiler not doing some trivial optimizations that could be done in the kernel? I have the following code for matrix multiplication:
    __global__ void matrixmultiply(float * a, float * b, float * c,
                                   int numarows, int numacolumns,
                                   int numbrows, int numbcolumns,
                                   int numcrows, int numccolumns) {
        int n = numacolumns;
        int row = blockIdx.x * blockDim.x + threadIdx.x;
        int col = blockIdx.y * blockDim.y + threadIdx.y;
        if ((row < numcrows) && (col < numccolumns)) {
            for (int k = 0; k < n; ++k) {
                // accumulates directly into global memory on every iteration
                c[row * numccolumns + col] += a[row * numacolumns + k] * b[k * numbcolumns + col];
            }
        }
    }

This example goes faster if I use a temporary register variable cvalue to store the sum:
    __global__ void matrixmultiply(float * a, float * b, float * c,
                                   int numarows, int numacolumns,
                                   int numbrows, int numbcolumns,
                                   int numcrows, int numccolumns) {
        int n = numacolumns;
        int row = blockIdx.x * blockDim.x + threadIdx.x;
        int col = blockIdx.y * blockDim.y + threadIdx.y;
        if ((row < numcrows) && (col < numccolumns)) {
            float cvalue = 0;
            for (int k = 0; k < n; ++k) {
                // accumulates in a register
                cvalue += a[row * numacolumns + k] * b[k * numbcolumns + col];
            }
            // single write to global memory
            c[row * numccolumns + col] = cvalue;
        }
    }

In the latter case, the global memory of c is written only once, whereas in the first case it is accessed many times inside the loop. Isn't this the kind of optimization that compilers do? The two versions differ by about 30% in performance in my tests. I'm compiling with nvcc -O3 ...
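Since c is not declared __restrict__, the compiler cannot know whether c is the same matrix as a or b, so it cannot perform the optimization mentioned: a store to c inside the loop might change a value of a or b that a later iteration reads. When I instead used float* __restrict__ c, the times for both versions became practically the same. Chris Dodd.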
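As a rough sketch of the fix described above, the program below keeps the loop from the first kernel but qualifies the pointers with __restrict__. The kernel name matrixmultiply_restrict, the trimmed parameter list, the test sizes and the CHECK macro are made up for illustration. Marking a and b as const __restrict__ as well is an assumption that goes beyond the answer, which only mentions c. Whether the compiler then keeps the running sum in a register depends on the nvcc version and target architecture, so it is worth timing it yourself.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cmath>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical helper: abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                     \
    do {                                                                \
        cudaError_t err = (call);                                       \
        if (err != cudaSuccess) {                                       \
            fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err)); \
            exit(1);                                                    \
        }                                                               \
    } while (0)

// Same loop as the first kernel in the question. __restrict__ promises the
// compiler that a, b and c do not overlap, so a store to c cannot change
// anything that is later read through a or b.
__global__ void matrixmultiply_restrict(const float* __restrict__ a,
                                        const float* __restrict__ b,
                                        float* __restrict__ c,
                                        int numacolumns, int numbcolumns,
                                        int numcrows, int numccolumns) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    int col = blockIdx.y * blockDim.y + threadIdx.y;
    if (row < numcrows && col < numccolumns) {
        for (int k = 0; k < numacolumns; ++k) {
            c[row * numccolumns + col] +=
                a[row * numacolumns + k] * b[k * numbcolumns + col];
        }
    }
}

int main() {
    const int m = 64, n = 48, p = 32;  // a is m x n, b is n x p, c is m x p
    std::vector<float> ha(m * n), hb(n * p), hc(m * p);
    for (int i = 0; i < m * n; ++i) ha[i] = float(i % 7);
    for (int i = 0; i < n * p; ++i) hb[i] = float(i % 5);

    float *da, *db, *dc;
    CHECK(cudaMalloc(&da, ha.size() * sizeof(float)));
    CHECK(cudaMalloc(&db, hb.size() * sizeof(float)));
    CHECK(cudaMalloc(&dc, hc.size() * sizeof(float)));
    CHECK(cudaMemcpy(da, ha.data(), ha.size() * sizeof(float), cudaMemcpyHostToDevice));
    CHECK(cudaMemcpy(db, hb.data(), hb.size() * sizeof(float), cudaMemcpyHostToDevice));
    // The kernel accumulates with +=, so c has to start at zero.
    CHECK(cudaMemset(dc, 0, hc.size() * sizeof(float)));

    dim3 block(16, 16);
    dim3 grid((m + block.x - 1) / block.x, (p + block.y - 1) / block.y);
    matrixmultiply_restrict<<<grid, block>>>(da, db, dc, n, p, m, p);
    CHECK(cudaGetLastError());
    // This blocking copy also waits for the kernel to finish.
    CHECK(cudaMemcpy(hc.data(), dc, hc.size() * sizeof(float), cudaMemcpyDeviceToHost));

    // Compare against a plain CPU reference.
    int mismatches = 0;
    for (int row = 0; row < m; ++row) {
        for (int col = 0; col < p; ++col) {
            float ref = 0.0f;
            for (int k = 0; k < n; ++k) ref += ha[row * n + k] * hb[k * p + col];
            if (fabsf(ref - hc[row * p + col]) > 1e-3f) ++mismatches;
        }
    }
    printf("mismatches: %d\n", mismatches);

    CHECK(cudaFree(da));
    CHECK(cudaFree(db));
    CHECK(cudaFree(dc));
    return mismatches == 0 ? 0 : 1;
}
```

Note that __restrict__ is a promise made by the programmer, not something the compiler checks: launching this kernel with c pointing into the same buffer as a or b would be undefined behavior.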