c++ - How can I shuffle bits efficiently? -
i need shuffle 16 bit unsigned integer in way indexes land in lower byte, , odd indexes land in upper byte.
input: fedcba98 76543210 (contiguously numbered) output: fdb97531 eca86420 (even , odd separated) my code looks @ moment:
typedef unsigned short u16; u16 segregate(u16 x) { u16 g = (x & 0x0001); u16 h = (x & 0x0004) >> 1; u16 = (x & 0x0010) >> 2; u16 j = (x & 0x0040) >> 3; u16 k = (x & 0x0100) >> 4; u16 l = (x & 0x0400) >> 5; u16 m = (x & 0x1000) >> 6; u16 n = (x & 0x4000) >> 7; u16 o = (x & 0x0002) << 7; u16 p = (x & 0x0008) << 6; u16 q = (x & 0x0020) << 5; u16 r = (x & 0x0080) << 4; u16 s = (x & 0x0200) << 3; u16 t = (x & 0x0800) << 2; u16 u = (x & 0x2000) << 1; u16 v = (x & 0x8000); return g | h | | j | k | l | m | n | o | p | q | r | s | t | u | v; } i wonder if there more elegant solution extracting , shifting each individual bit?
there convenient web resource helps solving many bit permutation problems: code generator bit permutations. in particular case feeding "0 2 4 6 8 10 12 14 1 3 5 7 9 11 13 15" page produces pretty fast code.
unfortunately code generator cannot produce 64-bit code (though download sources , add option). if need perform 4 permutations in parallel using 64-bit instructions, have extend involved bitmasks 64 bits manually:
uint64_t bit_permute_step(uint64_t x, uint64_t m, unsigned shift) { uint64_t t; t = ((x >> shift) ^ x) & m; x = (x ^ t) ^ (t << shift); return x; } uint64_t segregate4(uint64_t x) { // generated http://programming.sirrida.de/calcperm.php, extended 64-bit x = bit_permute_step(x, 0x2222222222222222ull, 1); x = bit_permute_step(x, 0x0c0c0c0c0c0c0c0cull, 2); x = bit_permute_step(x, 0x00f000f000f000f0ull, 4); return x; } level of parallelism increased more (8 or 16 permutations @ once) sse instructions. (and recent versions of gcc can vectorize code automatically).
if parallelism not required , data cache not extensively used other parts of program, better alternative use lookup table. various lut approacehes discussed in other answers, still more said here:
- the first , last bits of 16-bit word never permuted, need shuffle bits 1..14. (if want perform task single lut access) enough have lut 16k entries means 32k of memory.
- we combine table lookup , computation approaches. 2 lookups in single 256-byte table shuffle each source byte separately. after need exchange 2 middle 4-bit nibbles. allows keep lookup table small, uses 2 memory accesses, , needs not calculations (i.e. balances calculations , memory accesses).
here implementation of second approach:
#define b10(x) x+0x00, x+0x10, x+0x01, x+0x11 #define b32(x) b10(x+0x00), b10(x+0x20), b10(x+0x02), b10(x+0x22) #define b54(x) b32(x+0x00), b32(x+0x40), b32(x+0x04), b32(x+0x44) uint8_t lut[256] = {b54( 0x00), b54( 0x80), b54( 0x08), b54( 0x88)}; #undef b54 #undef b32 #undef b10 uint_fast16_t segregatelut(uint_fast16_t x) { uint_fast16_t low = lut[x & 0x00ff]; low |= low << 4; uint_fast16_t high = lut[x >> 8] << 4; high |= high << 4; return (low & 0x0f0f) | (high & 0xf0f0); } but fastest approach (if portability not issue) using pext instruction bmi2 instruction set as noted nils pipenbrinck. pair of 64-bit pext perform 4 16-bit shuffles in parallel. since pext instruction intended kind of bit permutations, approach outperforms others.
Comments
Post a Comment