c - nvcc compiler not optimizing


Why is the compiler not doing some trivial optimizations that could be done in the kernel? I have the following code for matrix multiplication:

__global__ void matrixmultiply(float *a, float *b, float *c,
                               int numarows, int numacolumns,
                               int numbrows, int numbcolumns,
                               int numcrows, int numccolumns) {
    int n = numacolumns;
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    int col = blockIdx.y * blockDim.y + threadIdx.y;
    if ((row < numcrows) && (col < numccolumns)) {
        for (int k = 0; k < n; ++k) {
            // Read-modify-write on global memory c in every iteration;
            // c must already be zeroed, since this accumulates into it.
            c[row * numccolumns + col] +=
                a[row * numacolumns + k] * b[k * numbcolumns + col];
        }
    }
}

The example runs faster if I use a temporary register variable, cvalue, to store the sum:

__global__ void matrixmultiply(float *a, float *b, float *c,
                               int numarows, int numacolumns,
                               int numbrows, int numbcolumns,
                               int numcrows, int numccolumns) {
    int n = numacolumns;
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    int col = blockIdx.y * blockDim.y + threadIdx.y;
    if ((row < numcrows) && (col < numccolumns)) {
        // Accumulate in a local variable (a register) instead of global memory.
        float cvalue = 0;
        for (int k = 0; k < n; ++k) {
            cvalue += a[row * numacolumns + k] * b[k * numbcolumns + col];
        }
        // Single write to global memory c after the loop.
        c[row * numccolumns + col] = cvalue;
    }
}

In the last case, the global memory c is accessed once, whereas in the first case it is accessed many times in the loop. Isn't this the kind of optimization that compilers do? The two versions differ by about 30% in performance in my tests. I'm compiling with nvcc -O3 ...

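Since c is not declared __restrict__, the compiler cannot know whether c points to the same matrix as a or b, so it cannot perform the optimization mentioned: a write to c inside the loop might change a value that a later iteration reads through a or b. When I instead used float* __restrict__ c, the times for both versions became practically the same. Credit: Chris Dodd.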
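The aliasing problem can be reproduced in a few lines of plain host-side C, without a GPU. The sketch below is only an illustration: sum_in_place and sum_in_register are hypothetical names that mirror the two access patterns of the kernels (store on every iteration versus accumulate locally and store once). When the output pointer overlaps the input, the two functions give different results, which is why a compiler may not turn the first form into the second unless it is told the pointers do not alias.

```c
#include <assert.h>

/* Hypothetical helper: updates *out on every iteration,
   like the first kernel's "c[...] +=" inside the loop. */
void sum_in_place(const float *in, float *out, int n) {
    assert(in != 0 && out != 0 && n >= 0);
    for (int k = 0; k < n; ++k) {
        *out += in[k];   /* if out aliases in[k], this read sees earlier writes */
    }
}

/* Hypothetical helper: accumulates in a local variable and stores once,
   like the second kernel's cvalue. */
void sum_in_register(const float *in, float *out, int n) {
    assert(in != 0 && out != 0 && n >= 0);
    float acc = *out;    /* start from the same initial value as sum_in_place */
    for (int k = 0; k < n; ++k) {
        acc += in[k];    /* reads never observe a write to *out */
    }
    *out = acc;
}
```

With separate input and output the two functions agree. If the output is deliberately aliased to the last input element, for example by calling each function with (x, &x[2], 3) on its own copy of x = {1, 2, 3}, sum_in_place leaves x[2] at 12 while sum_in_register leaves it at 9. Declaring the output pointer restrict (or __restrict__ in CUDA) is a promise that this overlap never happens, which frees the compiler to keep the running sum in a register.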

