c - Performance worsens when using SSE (Simple addition of integer arrays)
I'm trying to use SSE intrinsics to add two 32-bit signed int arrays, but I'm getting very poor performance compared to plain linear addition.
Platform: Intel Core i3 550, GCC 4.4.3, Ubuntu 10.04 (a bit old, yeah).
#define iter 1000

typedef union sint4_u {
    __m128i v;
    sint32_t x[4];
} sint4;
The functions:
void compute(sint32_t *a, sint32_t *b, sint32_t *c)
{
    sint32_t len = 96000;
    sint32_t i, j;
    __m128i x __attribute__ ((aligned(16)));
    __m128i y __attribute__ ((aligned(16)));
    sint4 z;

    for(j = 0; j < iter; j++) {
        for(i = 0; i < len; i += 4) {
            /* _mm_set_epi32 takes its arguments from the highest lane to the lowest */
            x = _mm_set_epi32(a[i + 0], a[i + 1], a[i + 2], a[i + 3]);
            y = _mm_set_epi32(b[i + 0], b[i + 1], b[i + 2], b[i + 3]);
            z.v = _mm_add_epi32(x, y);
            /* so the lanes are read back in reverse order */
            c[i + 0] = z.x[3];
            c[i + 1] = z.x[2];
            c[i + 2] = z.x[1];
            c[i + 3] = z.x[0];
        }
    }
    return;
}

void compute_s(sint32_t *a, sint32_t *b, sint32_t *c)
{
    sint32_t len = 96000;
    sint32_t i, j;

    for(j = 0; j < iter; j++) {
        for(i = 0; i < len; i++) {
            c[i] = a[i] + b[i];
        }
    }
    return;
}
The results:
➜ c gcc -msse4.2 simd.c
➜ c ./a.out
time elapsed (sse): 612.520000 ms
time elapsed (scalar): 401.713000 ms
➜ c gcc -O3 -msse4.2 simd.c
➜ c ./a.out
time elapsed (sse): 135.124000 ms
time elapsed (scalar): 46.438000 ms
With -O3, the SSE version becomes 3 times slower than the scalar one (!!). Am I doing something wrong? Even if I skip loading into c in compute, it still takes 100 ms without optimizations.
Edit: as suggested in the comments, I replaced _mm_set with _mm_load. Here are the updated times:
➜ c gcc audproc.c -msse4
➜ c ./a.out
time elapsed (sse): 303.931000 ms
time elapsed (scalar): 413.701000 ms
➜ c gcc -O3 audproc.c -msse4
➜ c ./a.out
time elapsed (sse): 82.532000 ms
time elapsed (scalar): 48.104000 ms
Much better, but still nowhere close to the theoretical gain of 4x. Also, why is the vectorized version slower at -O3? And how do I get rid of the warning below? (I tried adding a __vector__ declaration, but I got more warnings instead. :( )
audproc.c: In function ‘compute’:
audproc.c:54: warning: passing argument 1 of ‘_mm_load_si128’ from incompatible pointer type
/usr/lib/gcc/i486-linux-gnu/4.4.3/include/emmintrin.h:677: note: expected ‘const long long int __vector__ *’ but argument is of type ‘const sint32_t *’
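For reference, the note in that warning says _mm_load_si128 expects a pointer to the vector type, so one way to silence it is an explicit cast to const __m128i *. The following is only a minimal sketch: add_aligned is a hypothetical name, int32_t from stdint.h stands in for the sint32_t typedef used above, and it assumes the arrays are 16-byte aligned and that len is a multiple of 4.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: adds len 32-bit ints element-wise.
   Assumes a, b and c are 16-byte aligned and len is a multiple of 4. */
void add_aligned(const int32_t *a, const int32_t *b, int32_t *c, int len)
{
    int i;

    assert(len % 4 == 0);
    for (i = 0; i < len; i += 4) {
        /* The explicit cast matches the parameter type of _mm_load_si128,
           so the incompatible pointer type warning no longer applies. */
        __m128i x = _mm_load_si128((const __m128i *)&a[i]);
        __m128i y = _mm_load_si128((const __m128i *)&b[i]);
        _mm_store_si128((__m128i *)&c[i], _mm_add_epi32(x, y));
    }
}
```

If alignment cannot be guaranteed, the same cast works with _mm_loadu_si128 and _mm_storeu_si128, which accept unaligned addresses.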
As mentioned in the comments, to get the performance benefits of SIMD you should avoid scalar operations inside the loop, i.e. get rid of the _mm_set_epi32 pseudo-intrinsics and of the union used for storing the SIMD results. Here is a fixed version of your function:
void compute(const sint32_t *a, const sint32_t *b, sint32_t *c)
{
    sint32_t len = 96000;
    sint32_t i, j;

    for(j = 0; j < iter; j++) {
        for(i = 0; i < len; i += 4) {
            /* load four ints at a time directly from memory (no alignment required) */
            __m128i x = _mm_loadu_si128((const __m128i *)&a[i]);
            __m128i y = _mm_loadu_si128((const __m128i *)&b[i]);
            __m128i z = _mm_add_epi32(x, y);
            /* store all four sums with a single vector store */
            _mm_storeu_si128((__m128i *)&c[i], z);
        }
    }
}
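The function above relies on len being a multiple of 4 (96000 is). As a rough sketch of how the same loop could cope with arbitrary lengths and be checked against the scalar version, something like the following might work. The names add_scalar and add_sse are hypothetical, and int32_t from stdint.h stands in for sint32_t.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Scalar reference, same arithmetic as compute_s but for a single pass. */
void add_scalar(const int32_t *a, const int32_t *b, int32_t *c, int len)
{
    int i;

    for (i = 0; i < len; i++)
        c[i] = a[i] + b[i];
}

/* SSE2 version: four elements per iteration, then a scalar tail
   for the remaining len % 4 elements. No alignment is assumed. */
void add_sse(const int32_t *a, const int32_t *b, int32_t *c, int len)
{
    int i = 0;

    assert(len >= 0);
    for (; i + 4 <= len; i += 4) {
        __m128i x = _mm_loadu_si128((const __m128i *)&a[i]);
        __m128i y = _mm_loadu_si128((const __m128i *)&b[i]);
        _mm_storeu_si128((__m128i *)&c[i], _mm_add_epi32(x, y));
    }
    for (; i < len; i++)  /* leftover elements, at most three */
        c[i] = a[i] + b[i];
}
```

Running both functions on the same input and comparing the outputs element by element is a cheap sanity check before timing anything. When benchmarking, keep in mind that GCC at -O3 may auto-vectorize the scalar loop on its own, so the scalar baseline can already be SIMD code and a 4x gap should not be expected there.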