c++ - How can I shuffle bits efficiently? -


i need shuffle 16 bit unsigned integer in way indexes land in lower byte, , odd indexes land in upper byte.

input: fedcba98 76543210 (contiguously numbered)  output: fdb97531 eca86420 (even , odd separated) 

my code looks @ moment:

typedef unsigned short u16;  u16 segregate(u16 x) {     u16 g = (x & 0x0001);     u16 h = (x & 0x0004) >> 1;     u16 = (x & 0x0010) >> 2;     u16 j = (x & 0x0040) >> 3;     u16 k = (x & 0x0100) >> 4;     u16 l = (x & 0x0400) >> 5;     u16 m = (x & 0x1000) >> 6;     u16 n = (x & 0x4000) >> 7;      u16 o = (x & 0x0002) << 7;     u16 p = (x & 0x0008) << 6;     u16 q = (x & 0x0020) << 5;     u16 r = (x & 0x0080) << 4;     u16 s = (x & 0x0200) << 3;     u16 t = (x & 0x0800) << 2;     u16 u = (x & 0x2000) << 1;     u16 v = (x & 0x8000);      return g | h | | j | k | l | m | n | o | p | q | r | s | t | u | v; } 

i wonder if there more elegant solution extracting , shifting each individual bit?

there convenient web resource helps solving many bit permutation problems: code generator bit permutations. in particular case feeding "0 2 4 6 8 10 12 14 1 3 5 7 9 11 13 15" page produces pretty fast code.

unfortunately code generator cannot produce 64-bit code (though download sources , add option). if need perform 4 permutations in parallel using 64-bit instructions, have extend involved bitmasks 64 bits manually:

uint64_t bit_permute_step(uint64_t x, uint64_t m, unsigned shift) {   uint64_t t;   t = ((x >> shift) ^ x) & m;   x = (x ^ t) ^ (t << shift);   return x; }  uint64_t segregate4(uint64_t x) { // generated http://programming.sirrida.de/calcperm.php, extended 64-bit   x = bit_permute_step(x, 0x2222222222222222ull, 1);   x = bit_permute_step(x, 0x0c0c0c0c0c0c0c0cull, 2);   x = bit_permute_step(x, 0x00f000f000f000f0ull, 4);   return x; } 

level of parallelism increased more (8 or 16 permutations @ once) sse instructions. (and recent versions of gcc can vectorize code automatically).

if parallelism not required , data cache not extensively used other parts of program, better alternative use lookup table. various lut approacehes discussed in other answers, still more said here:

  1. the first , last bits of 16-bit word never permuted, need shuffle bits 1..14. (if want perform task single lut access) enough have lut 16k entries means 32k of memory.
  2. we combine table lookup , computation approaches. 2 lookups in single 256-byte table shuffle each source byte separately. after need exchange 2 middle 4-bit nibbles. allows keep lookup table small, uses 2 memory accesses, , needs not calculations (i.e. balances calculations , memory accesses).

here implementation of second approach:

#define b10(x)          x+0x00,      x+0x10,      x+0x01,      x+0x11 #define b32(x)      b10(x+0x00), b10(x+0x20), b10(x+0x02), b10(x+0x22) #define b54(x)      b32(x+0x00), b32(x+0x40), b32(x+0x04), b32(x+0x44) uint8_t lut[256] = {b54(  0x00), b54(  0x80), b54(  0x08), b54(  0x88)}; #undef b54 #undef b32 #undef b10  uint_fast16_t segregatelut(uint_fast16_t x) {   uint_fast16_t low = lut[x & 0x00ff];   low |= low << 4;   uint_fast16_t high = lut[x >> 8] << 4;   high |= high << 4;   return (low & 0x0f0f) | (high & 0xf0f0); } 

but fastest approach (if portability not issue) using pext instruction bmi2 instruction set as noted nils pipenbrinck. pair of 64-bit pext perform 4 16-bit shuffles in parallel. since pext instruction intended kind of bit permutations, approach outperforms others.


Comments

Popular posts from this blog

javascript - RequestAnimationFrame not working when exiting fullscreen switching space on Safari -

jsf - How to ajax update an item in the footer of a PrimeFaces dataTable? -

django - CSRF verification failed. Request aborted. CSRF cookie not set -