c - Performance worsens when using SSE (Simple addition of integer arrays) -


I'm trying to use SSE intrinsics to add two 32-bit signed int arrays, but I'm getting very poor performance compared to linear addition.

Platform: Intel Core i3 550, GCC 4.4.3, Ubuntu 10.04 (a bit old, yeah).

    #define iter 1000

    typedef union sint4_u {
        __m128i v;
        sint32_t x[4];
    } sint4;

The functions:

    void compute(sint32_t *a, sint32_t *b, sint32_t *c)
    {
        sint32_t len = 96000;
        sint32_t i, j;

        __m128i x __attribute__ ((aligned(16)));
        __m128i y __attribute__ ((aligned(16)));
        sint4 z;

        for(j = 0; j < iter; j++) {
            for(i = 0; i < len; i += 4) {
                /* _mm_set_epi32 takes its arguments from the highest lane down,
                   so a[i + 0] lands in lane 3 and is read back from z.x[3]. */
                x = _mm_set_epi32(a[i + 0], a[i + 1], a[i + 2], a[i + 3]);
                y = _mm_set_epi32(b[i + 0], b[i + 1], b[i + 2], b[i + 3]);
                z.v = _mm_add_epi32(x, y);

                c[i + 0] = z.x[3];
                c[i + 1] = z.x[2];
                c[i + 2] = z.x[1];
                c[i + 3] = z.x[0];
            }
        }

        return;
    }

    void compute_s(sint32_t *a, sint32_t *b, sint32_t *c)
    {
        sint32_t len = 96000;
        sint32_t i, j;

        for(j = 0; j < iter; j++) {
            for(i = 0; i < len; i++) {
                c[i] = a[i] + b[i];
            }
        }

        return;
    }

The results:

    ➜  c  gcc -msse4.2 simd.c
    ➜  c  ./a.out
    time elapsed (sse): 612.520000 ms
    time elapsed (scalar): 401.713000 ms
    ➜  c  gcc -O3 -msse4.2 simd.c
    ➜  c  ./a.out
    time elapsed (sse): 135.124000 ms
    time elapsed (scalar): 46.438000 ms

On using -O3, the SSE version becomes 3 times slower (!!). What am I doing wrong? Even if I skip loading into c in compute, it still takes 100 ms without optimizations.

EDIT: As suggested in the comments, I replaced _mm_set with _mm_load; here are the updated times:

    ➜  c    gcc audproc.c -msse4
    ➜  c    ./a.out
    time elapsed (sse): 303.931000 ms
    time elapsed (scalar): 413.701000 ms
    ➜  c    gcc -O3 audproc.c -msse4
    ➜  c    ./a.out
    time elapsed (sse): 82.532000 ms
    time elapsed (scalar): 48.104000 ms

Much better, but still nowhere close to the theoretical gain of 4x. Also, why is the vectorized version slower at -O3? And how do I get rid of the warning? (I tried adding a __vector__ declaration but got more warnings instead. :( )

    audproc.c: In function ‘compute’:
    audproc.c:54: warning: passing argument 1 of ‘_mm_load_si128’ from incompatible pointer type
    /usr/lib/gcc/i486-linux-gnu/4.4.3/include/emmintrin.h:677: note: expected ‘const long long int __vector__ *’ but argument is of type ‘const sint32_t *’
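On the warning itself: _mm_load_si128 is declared to take a pointer to __m128i, so passing a pointer to a 32-bit integer type directly is what GCC complains about, and an explicit cast at the call site is the usual way to silence it. The following is only a minimal sketch of that cast. It assumes int32_t from stdint.h in place of the question's sint32_t typedef, and add4_aligned is a hypothetical helper name. Note that _mm_load_si128 and _mm_store_si128 require 16-byte aligned addresses.

```c
#include <emmintrin.h>  /* SSE2 integer intrinsics */
#include <stdint.h>

/* Hypothetical helper: adds four 32-bit ints from a and b into c.
   All three pointers are assumed to be 16-byte aligned. */
static void add4_aligned(const int32_t *a, const int32_t *b, int32_t *c)
{
    /* The explicit (const __m128i *) casts avoid the
       "incompatible pointer type" warning. */
    __m128i x = _mm_load_si128((const __m128i *)a);
    __m128i y = _mm_load_si128((const __m128i *)b);

    _mm_store_si128((__m128i *)c, _mm_add_epi32(x, y));
}
```

If the arrays cannot be guaranteed to be 16-byte aligned, the unaligned variants _mm_loadu_si128 and _mm_storeu_si128 accept the same casts.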

As mentioned in the comments, in order to get the performance benefits of SIMD you should avoid scalar operations in the loop, i.e. get rid of the _mm_set_epi32 pseudo-intrinsics and the union for storing SIMD results. Here is a fixed version of your function:

    void compute(const sint32_t *a, const sint32_t *b, sint32_t *c)
    {
        sint32_t len = 96000;
        sint32_t i, j;

        for(j = 0; j < iter; j++)
        {
            for(i = 0; i < len; i += 4)
            {
                /* Unaligned loads and stores: no alignment requirement on a, b, c. */
                __m128i x = _mm_loadu_si128((const __m128i *)&a[i]);
                __m128i y = _mm_loadu_si128((const __m128i *)&b[i]);
                __m128i z = _mm_add_epi32(x, y);

                _mm_storeu_si128((__m128i *)&c[i], z);
            }
        }
    }
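Before comparing timings it is worth checking that the vectorized loop produces the same output as the scalar one. The sketch below is one way to set that up, not part of the answer above. It assumes int32_t in place of sint32_t, the names add_scalar and add_sse are illustrative, and the scalar tail loop is an addition for lengths that are not a multiple of 4 (the original len of 96000 does not need it).

```c
#include <emmintrin.h>  /* SSE2 integer intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Scalar reference implementation (illustrative name). */
static void add_scalar(const int32_t *a, const int32_t *b, int32_t *c, size_t len)
{
    for (size_t i = 0; i < len; i++)
        c[i] = a[i] + b[i];
}

/* SSE2 version using unaligned loads and stores (illustrative name). */
static void add_sse(const int32_t *a, const int32_t *b, int32_t *c, size_t len)
{
    size_t i = 0;

    /* Main loop: four 32-bit lanes per iteration. */
    for (; i + 4 <= len; i += 4) {
        __m128i x = _mm_loadu_si128((const __m128i *)&a[i]);
        __m128i y = _mm_loadu_si128((const __m128i *)&b[i]);
        _mm_storeu_si128((__m128i *)&c[i], _mm_add_epi32(x, y));
    }

    /* Tail: up to three leftover elements handled one at a time. */
    for (; i < len; i++)
        c[i] = a[i] + b[i];
}
```

Running both functions on the same inputs and comparing the two output arrays element by element gives a quick sanity check before any benchmarking.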
