![]() |
VOOZH | about |
| Author: | Wojciech Muła |
|---|---|
| Added on: | 2016-01-17 |
| Last update: | 2017-11-29 (spelling) |
Contents
Surprisingly good results of base64 encoding with SIMD instructions forced me to check the opposite algorithm, i.e. the decoding. The decoding is slightly more complicated as it has to check the input's validity.
A decoder must also consider character '=' at the end of input, but since it's done once, I didn't bother with this in a sample code.
2016-12-18 note: in the initial version of this text I wrongly assumed order of input words, Alfred Klomp noted that the standard imposes a specific order. Today's change fixes this error.
In a basic scalar version the minimum input quantity is 4 character, which are transformed into 3 bytes. The scalar version uses a lookup table translating from ASCII code into a 6-bit data. For non-base64 characters the lookup table contains a special marker. After decoding four characters and checking for errors, the 6-bit words are merged to form a 24-bit word.
The inner loop of the algorithm looks like this.
// 1. translate input constuint8_tb0=lookup[input[i+0]];constuint8_tb1=lookup[input[i+1]];constuint8_tb2=lookup[input[i+2]];constuint8_tb3=lookup[input[i+3]];if(b0,b1,b2orb3invalid){reporterror}// 2. save output bytes // input: [00dddddd|00cccccc|00bbbbbb|00aaaaaa] // output: [00000000|ccdddddd|bbbbcccc|aaaaaabb] *out++=(b1>>4)|(b0<<2);*out++=(b2>>2)|(b1<<4);*out++=b3|(b2<<6);
Forming the output is expansive, it requires:
It is possible to get rid off the bit-shift and use four lookup tables for each input byte. Each table contains already-shifted 6-bit data fields, thanks to that only bit-ors are required. It is at cost of bigger memory requirements. The improved version requires 4kB (4*4*256), while the basic one requires only 256 bytes.
Following code shows the idea.
// 1. lookup and build output word at the same time constuint32_tdword=lookup32[0][input[i+0]]|lookup32[1][input[i+1]]|lookup32[2][input[i+2]]|lookup32[3][input[i+3]];if(dwordisinvalid){reporterror}// 2. save 32 bits reinterpret_cast<uint32_t*>(out)=dword;out+=3;
Another negative feature is that the method saves 4 bytes instead of 3.
In an SSE version the lookup table must be encoded as a sequence of instructions. Such program will perform parallel translation of all input bytes.
Following table describes how to map an input character into a 6-bit data ( is the character code), according to the base64 standard.
| range | expression | after constant folding |
|---|---|---|
| A-Z | i - ord('A') | i - 65 |
| a-z | i - ord('a') + 26 | i - 71 |
| 0-9 | i - ord('0') + 52 | i + 4 |
| '+' | i - ord('+') + 62 | i + 19 |
| '/' | i - ord('/') + 63 | i + 16 |
| other | error | |
The basic step of the vector version produces a vector of constants (from the third column) for characters matching given range. After checking all five ranges, the vectors are merged. Since none of constants is zero, then if the merged vector has any zero byte, it means that the input contains an invalid character.
If everything is OK, then the last step is to simple add the input vector and the merged one. The result vector contains 6-bit data in each byte.
Procedure doing the translation.
__m128ilookup_base(const__m128iinput){// shift for range 'A' - 'Z' const__m128ige_A=_mm_cmpgt_epi8(input,packed_byte('A'-1));const__m128ile_Z=_mm_cmplt_epi8(input,packed_byte('Z'+1));const__m128irange_AZ=_mm_and_si128(packed_byte(-65),_mm_and_si128(ge_A,le_Z));// shift for range 'a' - 'z' const__m128ige_a=_mm_cmpgt_epi8(input,packed_byte('a'-1));const__m128ile_z=_mm_cmplt_epi8(input,packed_byte('z'+1));const__m128irange_az=_mm_and_si128(packed_byte(-71),_mm_and_si128(ge_a,le_z));// shift for range '0' - '9' const__m128ige_0=_mm_cmpgt_epi8(input,packed_byte('0'-1));const__m128ile_9=_mm_cmplt_epi8(input,packed_byte('9'+1));const__m128irange_09=_mm_and_si128(packed_byte(4),_mm_and_si128(ge_0,le_9));// shift for character '+' const__m128ieq_plus=_mm_cmpeq_epi8(input,packed_byte('+'));const__m128ichar_plus=_mm_and_si128(packed_byte(19),eq_plus);// shift for character '/' const__m128ieq_slash=_mm_cmpeq_epi8(input,packed_byte('/'));const__m128ichar_slash=_mm_and_si128(packed_byte(16),eq_slash);// merge partial results const__m128ishift=_mm_or_si128(range_AZ,_mm_or_si128(range_az,_mm_or_si128(range_09,_mm_or_si128(char_plus,char_slash))));constautomask=_mm_movemask_epi8(_mm_cmpeq_epi8(shift,packed_byte(0)));if(mask){// report an error }return_mm_add_epi8(input,shift);}
Number of operations:
Number of constants: 9.
Total number of operations is 23. This is 1.4375 instructions per character.
Note: the SSE compare for less or greater works on signed values, in this case it is not a problem.
Instead of creating partial shifts (constants , and so on) and then merging them with a series of bit-ors, the shift vector could be updated using the byte blend instruction ().
Below is an updated procedure.
__m128ilookup_byte_blend(const__m128iinput){__m128ishift;// shift for range 'A' - 'Z' const__m128ige_A=_mm_cmpgt_epi8(input,packed_byte('A'-1));const__m128ile_Z=_mm_cmplt_epi8(input,packed_byte('Z'+1));shift=_mm_and_si128(packed_byte(-65),_mm_and_si128(ge_A,le_Z));// shift for range 'a' - 'z' const__m128ige_a=_mm_cmpgt_epi8(input,packed_byte('a'-1));const__m128ile_z=_mm_cmplt_epi8(input,packed_byte('z'+1));shift=_mm_blendv_epi8(shift,packed_byte(-71),_mm_and_si128(ge_a,le_z));// shift for range '0' - '9' const__m128ige_0=_mm_cmpgt_epi8(input,packed_byte('0'-1));const__m128ile_9=_mm_cmplt_epi8(input,packed_byte('9'+1));shift=_mm_blendv_epi8(shift,packed_byte(4),_mm_and_si128(ge_0,le_9));// shift for character '+' const__m128ieq_plus=_mm_cmpeq_epi8(input,packed_byte('+'));shift=_mm_blendv_epi8(shift,packed_byte(19),eq_plus);// shift for character '/' const__m128ieq_slash=_mm_cmpeq_epi8(input,packed_byte('/'));shift=_mm_blendv_epi8(shift,packed_byte(16),eq_slash);constautomask=_mm_movemask_epi8(_mm_cmpeq_epi8(shift,packed_byte(0)));if(mask){// report an error }return_mm_add_epi8(input,shift);}
Number of operations:
Number of constants: 9.
The total number of operations is 19, three less than the in base version. This is 1.1875 instructions per character. However, the instruction has both latency and throughput equal 2 cycles and it seems to be a problem. As experiments showed, sometimes use of the instruction gives small improvement, but sometimes makes things slower.
This section describes another lookup procedure that employs the technique described in a separate article. The lookup procedure uses instruction. I very like this instruction, but experiments showed that this version is slower. My heart sank.
Instead of checking all possible ranges as in all other methods, this algorithm determines possible range for a byte and the associated shift value. Then validates a byte against that single range and adds the selected shift.
To select both the range and the shift, the higher nibble of byte is used. Lets look at the rewritten lookup table:
| input [hex] | valid input [ASCII] | valid range | shift |
|---|---|---|---|
| 0x | invalid | ||
| 1x | |||
| 2x | '+' | 0x2b .. 0x2b | 0x3e |
| '/' | 0x2f .. 0x2f | 0x3f | |
| 3x | '0' .. '9' | 0x30 .. 0x39 | 0x34 |
| 4x | 'A' .. 'O' | 0x41 .. 0x4f | 0x00 |
| 5x | 'P' .. 'Z' | 0x50 .. 0x5a | 0x0f |
| 6x | 'a' .. 'o' | 0x61 .. 0x6f | 0x1a |
| 7x | 'p' .. 'z' | 0x70 .. 0x7a | 0x29 |
| 8x .. Fx | invalid | ||
The outline of algorithm (for a single byte):
higher_nibble=byte>>4;upper_limit=upper_limit_LUT[higher_nibble];lower_limit=lower_limit_LUT[higher_nibble];if(byte<lower_limit||byte>upper_limit){reporterror}decoded=byte+shift_LUT[higher_nibble];
For nibbles in set {0, 1, 8, 9, a, b, c, d, e, f} the LUTs define invalid ranges, always yielding errors. For nibbles {3, 4, 5, 6, 7} the ranges are well defined.
The only exception occurs for nibble 2, as there are two distinct values associated with these bytes. The special case is solved by choosing the first, one-element range 0x2b and introducing an additional code handling input equal to 0x2f. The extra code doesn't complicate the algorithm too much:
higher_nibble=byte>>4;upper_limit=upper_limit_LUT[higher_nibble];lower_limit=lower_limit_LUT[higher_nibble];if((byte<lower_limit||byte>upper_limit)&&byte!=0x2f){reporterror}decoded=byte+shift_LUT[higher_nibble];if(byte==0x2f){decoded-=3;// (0x2b - 0x2f) + (0x3f - 0x3e) = -4 + 1 = -3 }
The SIMD (SSE) code implementing above algorithm:
__m128ilookup_pshufb(const__m128iinput){const__m128ihigher_nibble=_mm_srli_epi32(input,4)&packed_byte(0x0f);constcharlinv=1;constcharhinv=0;const__m128ilower_bound_LUT=_mm_setr_epi8(/* 0 */linv,/* 1 */linv,/* 2 */0x2b,/* 3 */0x30,/* 4 */0x41,/* 5 */0x50,/* 6 */0x61,/* 7 */0x70,/* 8 */linv,/* 9 */linv,/* a */linv,/* b */linv,/* c */linv,/* d */linv,/* e */linv,/* f */linv);const__m128iupper_bound_LUT=_mm_setr_epi8(/* 0 */hinv,/* 1 */hinv,/* 2 */0x2b,/* 3 */0x39,/* 4 */0x4f,/* 5 */0x5a,/* 6 */0x6f,/* 7 */0x7a,/* 8 */hinv,/* 9 */hinv,/* a */hinv,/* b */hinv,/* c */hinv,/* d */hinv,/* e */hinv,/* f */hinv);// the difference between the shift and lower bound const__m128ishift_LUT=_mm_setr_epi8(/* 0 */0x00,/* 1 */0x00,/* 2 */0x3e-0x2b,/* 3 */0x34-0x30,/* 4 */0x00-0x41,/* 5 */0x0f-0x50,/* 6 */0x1a-0x61,/* 7 */0x29-0x70,/* 8 */0x00,/* 9 */0x00,/* a */0x00,/* b */0x00,/* c */0x00,/* d */0x00,/* e */0x00,/* f */0x00);const__m128iupper_bound=_mm_shuffle_epi8(upper_bound_LUT,higher_nibble);const__m128ilower_bound=_mm_shuffle_epi8(lower_bound_LUT,higher_nibble);const__m128ibelow=_mm_cmplt_epi8(input,lower_bound);const__m128iabove=_mm_cmpgt_epi8(input,upper_bound);const__m128ieq_2f=_mm_cmpeq_epi8(input,packed_byte(0x2f));// in_range = not (below or above) or eq_2f // outside = not in_range = below or above and not eq_2f (from deMorgan law) const__m128ioutside=_mm_andnot_si128(eq_2f,above|below);constautomask=_mm_movemask_epi8(outside);if(mask){reporterror}const__m128ishift=_mm_shuffle_epi8(shift_LUT,higher_nibble);const__m128it0=_mm_add_epi8(input,shift);const__m128iresult=_mm_add_epi8(t0,_mm_and_si128(eq_2f,packed_byte(-3)));returnresult;}
Number of operations:
Number of constants: 6.
Total number of operations is 14. This is 0.875 instructions per character.
This is a slightly improved version of the previous method, where the range checking is done using bit-masks.
First we group all shift values by the lower nibble; zero shift means invalid input. We can note that in each column (except the 3rd) there are two different values: zero or a specific value. The 3rd column needs a special care, but it is not difficult.
| lower nibble | higher nibble ('.' denotes zero) | |||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | a | b | c | d | e | f | |
| 0 | . | . | . | 4 | . | -65 | . | -71 | . | . | . | . | . | . | . | . |
| 1 | . | . | . | 4 | -65 | -65 | -71 | -71 | . | . | . | . | . | . | . | . |
| 2 | . | . | . | 4 | -65 | -65 | -71 | -71 | . | . | . | . | . | . | . | . |
| 3 | . | . | . | 4 | -65 | -65 | -71 | -71 | . | . | . | . | . | . | . | . |
| 4 | . | . | . | 4 | -65 | -65 | -71 | -71 | . | . | . | . | . | . | . | . |
| 5 | . | . | . | 4 | -65 | -65 | -71 | -71 | . | . | . | . | . | . | . | . |
| 6 | . | . | . | 4 | -65 | -65 | -71 | -71 | . | . | . | . | . | . | . | . |
| 7 | . | . | . | 4 | -65 | -65 | -71 | -71 | . | . | . | . | . | . | . | . |
| 8 | . | . | . | 4 | -65 | -65 | -71 | -71 | . | . | . | . | . | . | . | . |
| 9 | . | . | . | 4 | -65 | -65 | -71 | -71 | . | . | . | . | . | . | . | . |
| a | . | . | . | . | -65 | -65 | -71 | -71 | . | . | . | . | . | . | . | . |
| b | . | . | 19 | . | -65 | . | -71 | . | . | . | . | . | . | . | . | . |
| c | . | . | . | . | -65 | . | -71 | . | . | . | . | . | . | . | . | . |
| d | . | . | . | . | -65 | . | -71 | . | . | . | . | . | . | . | . | . |
| e | . | . | . | . | -65 | . | -71 | . | . | . | . | . | . | . | . | . |
| f | . | . | 16 | . | -65 | . | -71 | . | . | . | . | . | . | . | . | . |
Using the higher nibble we can pick a shift from vector of non-zero values from each column: [0, 0, 19, 4, -65, -65, -71, -71, 0, 0, 0, 0, 0, 0, 0, 0]. The lower nibble selects a bit-mask, representing a set of valid values for higher nibble. A bit-mask is derived directly from rows of the table, each non-zero value sets a bit in the mask. There are six different bit-masks.
We need an extra fix-up code to distinguish between the '+' and '/' character, i.e. choose either 16 or 19 from the 3rd column.
The core of the algorithm goes as follows:
The SIMD (SSE) code implementing above algorithm.
__m128ilookup_pshufb_bitmask(const__m128iinput){const__m128ihigher_nibble=_mm_srli_epi32(input,4)&packed_byte(0x0f);const__m128ilower_nibble=input&packed_byte(0x0f);const__m128ishiftLUT=_mm_setr_epi8(0,0,19,4,-65,-65,-71,-71,0,0,0,0,0,0,0,0);const__m128imaskLUT=_mm_setr_epi8(/* 0 */char(0b10101000),/* 1 .. 9 */char(0b11111000),char(0b11111000),char(0b11111000),char(0b11111000),char(0b11111000),char(0b11111000),char(0b11111000),char(0b11111000),char(0b11111000),/* 10 */char(0b11110000),/* 11 */char(0b01010100),/* 12 .. 14 */char(0b01010000),char(0b01010000),char(0b01010000),/* 15 */char(0b01010100));const__m128ibitposLUT=_mm_setr_epi8(0x01,0x02,0x04,0x08,0x10,0x20,0x40,char(0x80),0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00);// shift + fix-up for '/' and '+' const__m128ish=_mm_shuffle_epi8(shiftLUT,higher_nibble);const__m128ieq_2f=_mm_cmpeq_epi8(input,packed_byte(0x2f));const__m128ishift=_mm_blendv_epi8(sh,packed_byte(16),eq_2f);// load bit-masks const__m128iM=_mm_shuffle_epi8(maskLUT,lower_nibble);// (1 << higher_nibble[i]) & 0xff const__m128ibit=_mm_shuffle_epi8(bitposLUT,higher_nibble);const__m128inon_match=_mm_cmpeq_epi8(_mm_and_si128(M,bit),_mm_setzero_si128());constautomask=_mm_movemask_epi8(non_match);if(mask){// report error }const__m128iresult=_mm_add_epi8(input,shift);returnresult;}
Number of operations:
Number of constants: 6.
The total number of operations is 12. This is 0.750 instructions per character.
| operation | algorithm | ||||
|---|---|---|---|---|---|
| base | byte-blend | incremental | pshufb | pshufb-bitmask | |
| comparison (le/gt/eq) | 9 | 9 | 9 | 3 | 2 |
| bit-and | 8 | 4 | 0 | 2 | 3 |
| bit-andnot | 0 | 0 | 0 | 1 | 0 |
| bit-or | 4 | 0 | 0 | 1 | 0 |
| shift | 0 | 0 | 0 | 1 | 1 |
| add | 1 | 1 | 10 | 2 | 1 |
| sub | 0 | 0 | 0 | 0 | 0 |
| byte-blend | 0 | 3 | 0 | 0 | 1 |
| movemask | 1 | 1 | 1 | 1 | 1 |
| pshufb | 0 | 0 | 1 | 3 | 3 |
| total instructions | 23 | 19 | 21 | 14 | 12 |
| instructions/byte | 1.4375 | 1.1875 | 1.3125 | 0.875 | 0.75 |
| constants count | 9 | 9 | 10 | 6 | 6 |
The result of procedure is a vector fo 32-bit words having following bit-layout:
[00dddddd|00cccccc|00bbbbbb|00aaaaaa]
Bits , , and are data. The expected output is:
[00000000|aaaaaabb|bbbbcccc|ccdddddd]
Once bits are packed into 24-bit words, all bytes are shuffled in order to build a 12-byte result. Please note that also the order of bytes within 24-bit words is changed in this step, i.e. bytes 0 and 2 are swapped:
[ddddddcc|bbbbcccc|aaaaaabb]
Fortunately everything is done by a single invocation.
The naive variant isolates all 6-bit fields, shifts them to required positions and merges into one word.
#define packed_dword(x) _mm_set1_epi32(x) #define masked(x, mask) _mm_and_si128(x, _mm_set1_epi32(mask)) const__m128ica=masked(values,0x003f003f);const__m128idb=masked(values,0x3f003f00);
// t0 = [0000cccc|ccdddddd|0000aaaa|aabbbbbb] const__m128it0=_mm_or_si128(_mm_srli_epi32(db,8),_mm_slli_epi32(ca,6));
// t1 = [dddd0000|aaaaaabb|bbbbcccc|dddddddd] const__m128it1=_mm_or_si128(_mm_srli_epi32(t0,16),_mm_slli_epi32(t0,12));
// result: [00000000|aaaaaabb|bbbbcccc|ccdddddd] result=masked(t1,0x00ffffff);
Number of operations:
The first merge of adjacent bytes is performed by instruction . The instruction vertically multiplies signed bytes yielding 16-bit signed intermediate values and then the values are added. Input values are 6-bit, so obviously are non-negative and the instruction could be safely used.
The instruction is used to perform shift & bit-or in a single step.
Then merging 12-bit fields is done by another multiply-add instruction . The instruction also operates on signed values, but again --- the inputs are never negative.
// input: [00dddddd|00cccccc|00bbbbbb|00aaaaaa] // merge: [0000cccc|ccdddddd|0000aaaa|aabbbbbb] const__m128imerge_ab_and_bc=_mm_maddubs_epi16(values,packed_dword(0x01400140));// result: [00000000|aaaaaabb|bbbbcccc|ccdddddd] return_mm_madd_epi16(merge_ab_and_bc,packed_dword(0x00011000));
Number of operations:
The result is stoed in an XMM register, however only 12 lowest bytes are required. My experiments on Core i5 have showed that saving selected bytes using dedicated instruction (intrinsic ) is slower than ordinary write of 16 bytes which overwrites the previous 4 bytes.
With help of BMI2 instruction making a 24-bit word from 6-bit words is pretty simple. However, bytes 0th and 2nd of each word have to be swapped, that costs extra instructions. Additionally, the instruction works on CPU registers and this require extra instruction to transfer data from the SIMD registers.
Here is a procedure which processes 64-bit word
uint64_tpack_bytes(uint64_tv){constuint64_tp=_pext_u64(v,0x3f3f3f3f3f3f3f3f);constuint64_tb0=p&0x0000ff0000ff;constuint64_tb1=p&0x00ff0000ff00;constuint64_tb2=p&0xff0000ff0000;return(b0<<16)|b1|(b2>>16);}
It costs:
An AVX2 version of procedure is nearly the same as the SSE version. The only minor difference is caused by lack of comparison "less or equal", thus all range checking have to use the "grater" relation.
An output from the AVX2 could be either processed using standard SIMD operations or BMI2 instructions.
The sample code is available at github, there are three programs:
The CPU architecture: Westmere i5 M540 @ 2.53GHz
GCC: 6.2.0 (Debian)
| procedure | lookup | pack | time [s] | speedup | |
|---|---|---|---|---|---|
| improved scalar | N/A | N/A | 0.04722 | 1.00 | ██████████ |
| scalar | N/A | N/A | 0.06875 | 0.69 | ██████▊ |
| SSE | base | naive | 0.02892 | 1.63 | ████████████████▎ |
| SSE | byte blend | naive | 0.02970 | 1.59 | ███████████████▉ |
| SSE | incremental | naive | 0.03249 | 1.45 | ██████████████▌ |
| SSE | pshufb | naive | 0.02753 | 1.72 | █████████████████▏ |
| SSE | base | multiply-add | 0.02554 | 1.85 | ██████████████████▍ |
| SSE | byte blend | multiply-add | 0.02511 | 1.88 | ██████████████████▊ |
| SSE | incremental | multiply-add | 0.02862 | 1.65 | ████████████████▍ |
| SSE | pshufb | multiply-add | 0.02170 | 2.18 | █████████████████████▊ |
| SSE | pshufb bitmask | multiply-add | 0.02085 | 2.26 | ██████████████████████▋ |
Conclusions:
The CPU architecture: Haswell i7-4770 CPU @ 3.40GHz.
GCC: 5.4.1 (Ubuntu)
| procedure | lookup | pack | time [s] | speedup | |
|---|---|---|---|---|---|
| improved scalar | N/A | N/A | 0.02281 | 1.00 | ██████████ |
| scalar | N/A | N/A | 0.04965 | 0.46 | ████▌ |
| scalar & BMI2 | N/A | N/A | 0.04485 | 0.51 | █████ |
| SSE | base | naive | 0.01683 | 1.36 | █████████████▌ |
| SSE | byte blend | naive | 0.01783 | 1.28 | ████████████▊ |
| SSE | incremental | naive | 0.01790 | 1.27 | ████████████▋ |
| SSE | pshufb | naive | 0.01286 | 1.77 | █████████████████▋ |
| SSE | base | multiply-add | 0.01383 | 1.65 | ████████████████▍ |
| SSE | byte blend | multiply-add | 0.01365 | 1.67 | ████████████████▋ |
| SSE | incremental | multiply-add | 0.01500 | 1.52 | ███████████████▏ |
| SSE | pshufb | multiply-add | 0.00967 | 2.36 | ███████████████████████▌ |
| SSE | pshufb bitmask | multiply-add | 0.00927 | 2.46 | ████████████████████████▌ |
| SSE & BMI2 | base | N/A | 0.02184 | 1.04 | ██████████▍ |
| SSE & BMI2 | byte blend | N/A | 0.02288 | 1.00 | █████████▉ |
| SSE & BMI2 | incremental | N/A | 0.02190 | 1.04 | ██████████▍ |
| AVX2 | base | naive | 0.01108 | 2.06 | ████████████████████▌ |
| AVX2 | byte blend | naive | 0.01229 | 1.86 | ██████████████████▌ |
| AVX2 | pshufb | naive | 0.00897 | 2.54 | █████████████████████████▍ |
| AVX2 | base | multiply-add | 0.00973 | 2.34 | ███████████████████████▍ |
| AVX2 | byte blend | multiply-add | 0.00977 | 2.33 | ███████████████████████▎ |
| AVX2 | pshufb | multiply-add | 0.00849 | 2.69 | ██████████████████████████▊ |
| AVX2 | pshufb bitmask | multiply-add | 0.00848 | 2.69 | ██████████████████████████▉ |
| AVX2 & BMI2 | base | N/A | 0.02236 | 1.02 | ██████████▏ |
| AVX2 & BMI2 | byte blend | N/A | 0.02371 | 0.96 | █████████▌ |
Conclusions:
The CPU architecture: Skylake i7-6700 CPU @ 3.40GHz
GCC: 5.4.1 (Ubuntu)
| procedure | lookup | pack | time [s] | speedup | |
|---|---|---|---|---|---|
| improved scalar | N/A | N/A | 0.02084 | 1.00 | ██████████ |
| scalar | N/A | N/A | 0.04980 | 0.42 | ████▏ |
| scalar & BMI2 | N/A | N/A | 0.04494 | 0.46 | ████▋ |
| SSE | base | naive | 0.01502 | 1.39 | █████████████▊ |
| SSE | byte blend | naive | 0.01645 | 1.27 | ████████████▋ |
| SSE | incremental | naive | 0.01684 | 1.24 | ████████████▍ |
| SSE | pshufb | naive | 0.01229 | 1.70 | ████████████████▉ |
| SSE | base | multiply-add | 0.01216 | 1.71 | █████████████████▏ |
| SSE | byte blend | multiply-add | 0.01405 | 1.48 | ██████████████▊ |
| SSE | incremental | multiply-add | 0.01198 | 1.74 | █████████████████▍ |
| SSE | pshufb | multiply-add | 0.00888 | 2.35 | ███████████████████████▍ |
| SSE | pshufb bitmask | multiply-add | 0.00847 | 2.46 | ████████████████████████▌ |
| SSE & BMI2 | base | N/A | 0.02090 | 1.00 | █████████▉ |
| SSE & BMI2 | byte blend | N/A | 0.02097 | 0.99 | █████████▉ |
| SSE & BMI2 | incremental | N/A | 0.02098 | 0.99 | █████████▉ |
| AVX2 | base | naive | 0.01006 | 2.07 | ████████████████████▋ |
| AVX2 | byte blend | naive | 0.00998 | 2.09 | ████████████████████▉ |
| AVX2 | pshufb | naive | 0.00816 | 2.55 | █████████████████████████▌ |
| AVX2 | base | multiply-add | 0.00852 | 2.45 | ████████████████████████▍ |
| AVX2 | byte blend | multiply-add | 0.00842 | 2.48 | ████████████████████████▊ |
| AVX2 | pshufb | multiply-add | 0.00720 | 2.89 | ████████████████████████████▉ |
| AVX2 | pshufb bitmask | multiply-add | 0.00705 | 2.96 | █████████████████████████████▌ |
| AVX2 & BMI2 | base | N/A | 0.01994 | 1.05 | ██████████▍ |
| AVX2 & BMI2 | byte blend | N/A | 0.01982 | 1.05 | ██████████▌ |
The CPU architecture: Bulldozer FX-8150 CPU
GCC: 4.8.4 (Ubuntu)
| procedure | lookup | pack | time [s] | speedup | |
|---|---|---|---|---|---|
| improved scalar | N/A | N/A | 0.03223 | 1.00 | ██████████ |
| scalar | N/A | N/A | 0.06352 | 0.51 | █████ |
| SSE | base | naive | 0.02387 | 1.35 | █████████████▌ |
| SSE | byte blend | naive | 0.02316 | 1.39 | █████████████▉ |
| SSE | incremental | naive | 0.02626 | 1.23 | ████████████▎ |
| SSE | pshufb | naive | 0.02015 | 1.60 | ███████████████▉ |
| SSE | base | multiply-add | 0.02107 | 1.53 | ███████████████▎ |
| SSE | byte blend | multiply-add | 0.01971 | 1.64 | ████████████████▎ |
| SSE | incremental | multiply-add | 0.02215 | 1.46 | ██████████████▌ |
| SSE | pshufb | multiply-add | 0.01660 | 1.94 | ███████████████████▍ |
| SSE | pshufb bitmask | multiply-add | 0.01656 | 1.95 | ███████████████████▍ |
The AVX2 version wouldn't be possible without Nathan Kurz's help and Daniel Lemire who kindly gave me access to test machines having brand new CPUs (Haswell and Skylake, and also AMD Bulldozer).