VOOZH about

URL: http://0x80.pl/notesen/2016-01-17-sse-base64-decoding.html

⇱ Base64 decoding with SIMD instructions


Base64 decoding with SIMD instructions

Author: Wojciech Muła
Added on:2016-01-17
Last update:2017-11-29 (spelling)

Introduction

Surprisingly good results of base64 encoding with SIMD instructions forced me to check the opposite algorithm, i.e. the decoding. The decoding is slightly more complicated as it has to check the input's validity.

A decoder must also consider character '=' at the end of input, but since it's done once, I didn't bother with this in a sample code.

2016-12-18 note: in the initial version of this text I wrongly assumed order of input words, Alfred Klomp noted that the standard imposes a specific order. Today's change fixes this error.

Base algorithm

In a basic scalar version the minimum input quantity is 4 character, which are transformed into 3 bytes. The scalar version uses a lookup table translating from ASCII code into a 6-bit data. For non-base64 characters the lookup table contains a special marker. After decoding four characters and checking for errors, the 6-bit words are merged to form a 24-bit word.

The inner loop of the algorithm looks like this.

// 1. translate input
constuint8_tb0=lookup[input[i+0]];constuint8_tb1=lookup[input[i+1]];constuint8_tb2=lookup[input[i+2]];constuint8_tb3=lookup[input[i+3]];if(b0,b1,b2orb3invalid){reporterror}// 2. save output bytes
// input: [00dddddd|00cccccc|00bbbbbb|00aaaaaa]
// output: [00000000|ccdddddd|bbbbcccc|aaaaaabb]
*out++=(b1>>4)|(b0<<2);*out++=(b2>>2)|(b1<<4);*out++=b3|(b2<<6);

Forming the output is expansive, it requires:

  • 5 bit-shift,
  • 3 bit-ors.

Improved scalar version

It is possible to get rid off the bit-shift and use four lookup tables for each input byte. Each table contains already-shifted 6-bit data fields, thanks to that only bit-ors are required. It is at cost of bigger memory requirements. The improved version requires 4kB (4*4*256), while the basic one requires only 256 bytes.

Following code shows the idea.

// 1. lookup and build output word at the same time
constuint32_tdword=lookup32[0][input[i+0]]|lookup32[1][input[i+1]]|lookup32[2][input[i+2]]|lookup32[3][input[i+3]];if(dwordisinvalid){reporterror}// 2. save 32 bits
reinterpret_cast<uint32_t*>(out)=dword;out+=3;

Another negative feature is that the method saves 4 bytes instead of 3.

SSE version

Vector lookup (base)

In an SSE version the lookup table must be encoded as a sequence of instructions. Such program will perform parallel translation of all input bytes.

Following table describes how to map an input character into a 6-bit data ( is the character code), according to the base64 standard.

range expression after constant folding
A-Z i - ord('A') i - 65
a-z i - ord('a') + 26 i - 71
0-9 i - ord('0') + 52 i + 4
'+' i - ord('+') + 62 i + 19
'/' i - ord('/') + 63 i + 16
other error

The basic step of the vector version produces a vector of constants (from the third column) for characters matching given range. After checking all five ranges, the vectors are merged. Since none of constants is zero, then if the merged vector has any zero byte, it means that the input contains an invalid character.

If everything is OK, then the last step is to simple add the input vector and the merged one. The result vector contains 6-bit data in each byte.

Procedure doing the translation.

__m128ilookup_base(const__m128iinput){// shift for range 'A' - 'Z'
const__m128ige_A=_mm_cmpgt_epi8(input,packed_byte('A'-1));const__m128ile_Z=_mm_cmplt_epi8(input,packed_byte('Z'+1));const__m128irange_AZ=_mm_and_si128(packed_byte(-65),_mm_and_si128(ge_A,le_Z));// shift for range 'a' - 'z'
const__m128ige_a=_mm_cmpgt_epi8(input,packed_byte('a'-1));const__m128ile_z=_mm_cmplt_epi8(input,packed_byte('z'+1));const__m128irange_az=_mm_and_si128(packed_byte(-71),_mm_and_si128(ge_a,le_z));// shift for range '0' - '9'
const__m128ige_0=_mm_cmpgt_epi8(input,packed_byte('0'-1));const__m128ile_9=_mm_cmplt_epi8(input,packed_byte('9'+1));const__m128irange_09=_mm_and_si128(packed_byte(4),_mm_and_si128(ge_0,le_9));// shift for character '+'
const__m128ieq_plus=_mm_cmpeq_epi8(input,packed_byte('+'));const__m128ichar_plus=_mm_and_si128(packed_byte(19),eq_plus);// shift for character '/'
const__m128ieq_slash=_mm_cmpeq_epi8(input,packed_byte('/'));const__m128ichar_slash=_mm_and_si128(packed_byte(16),eq_slash);// merge partial results
const__m128ishift=_mm_or_si128(range_AZ,_mm_or_si128(range_az,_mm_or_si128(range_09,_mm_or_si128(char_plus,char_slash))));constautomask=_mm_movemask_epi8(_mm_cmpeq_epi8(shift,packed_byte(0)));if(mask){// report an error
}return_mm_add_epi8(input,shift);}

Number of operations:

  • comparison (le/gt/eq): 9,
  • bit-and: 8,
  • bit-or: 4,
  • add: 1,
  • movemask: 1.

Number of constants: 9.

Total number of operations is 23. This is 1.4375 instructions per character.

Note: the SSE compare for less or greater works on signed values, in this case it is not a problem.

Vector lookup (byte blend)

Instead of creating partial shifts (constants , and so on) and then merging them with a series of bit-ors, the shift vector could be updated using the byte blend instruction ().

Below is an updated procedure.

__m128ilookup_byte_blend(const__m128iinput){__m128ishift;// shift for range 'A' - 'Z'
const__m128ige_A=_mm_cmpgt_epi8(input,packed_byte('A'-1));const__m128ile_Z=_mm_cmplt_epi8(input,packed_byte('Z'+1));shift=_mm_and_si128(packed_byte(-65),_mm_and_si128(ge_A,le_Z));// shift for range 'a' - 'z'
const__m128ige_a=_mm_cmpgt_epi8(input,packed_byte('a'-1));const__m128ile_z=_mm_cmplt_epi8(input,packed_byte('z'+1));shift=_mm_blendv_epi8(shift,packed_byte(-71),_mm_and_si128(ge_a,le_z));// shift for range '0' - '9'
const__m128ige_0=_mm_cmpgt_epi8(input,packed_byte('0'-1));const__m128ile_9=_mm_cmplt_epi8(input,packed_byte('9'+1));shift=_mm_blendv_epi8(shift,packed_byte(4),_mm_and_si128(ge_0,le_9));// shift for character '+'
const__m128ieq_plus=_mm_cmpeq_epi8(input,packed_byte('+'));shift=_mm_blendv_epi8(shift,packed_byte(19),eq_plus);// shift for character '/'
const__m128ieq_slash=_mm_cmpeq_epi8(input,packed_byte('/'));shift=_mm_blendv_epi8(shift,packed_byte(16),eq_slash);constautomask=_mm_movemask_epi8(_mm_cmpeq_epi8(shift,packed_byte(0)));if(mask){// report an error
}return_mm_add_epi8(input,shift);}

Number of operations:

  • comparison (le/gt/eq): 9,
  • bit-and: 4,
  • bit-or: 0,
  • byte-blend: 3,
  • add: 1,
  • movemask: 1.

Number of constants: 9.

The total number of operations is 19, three less than the in base version. This is 1.1875 instructions per character. However, the instruction has both latency and throughput equal 2 cycles and it seems to be a problem. As experiments showed, sometimes use of the instruction gives small improvement, but sometimes makes things slower.

Vector lookup (incremental)

This section describes another lookup procedure that employs the technique described in a separate article. The lookup procedure uses instruction. I very like this instruction, but experiments showed that this version is slower. My heart sank.

Vector lookup (pshufb)

Instead of checking all possible ranges as in all other methods, this algorithm determines possible range for a byte and the associated shift value. Then validates a byte against that single range and adds the selected shift.

To select both the range and the shift, the higher nibble of byte is used. Lets look at the rewritten lookup table:

input [hex] valid input [ASCII] valid range shift
0x invalid
1x
2x '+' 0x2b .. 0x2b 0x3e
'/' 0x2f .. 0x2f 0x3f
3x '0' .. '9' 0x30 .. 0x39 0x34
4x 'A' .. 'O' 0x41 .. 0x4f 0x00
5x 'P' .. 'Z' 0x50 .. 0x5a 0x0f
6x 'a' .. 'o' 0x61 .. 0x6f 0x1a
7x 'p' .. 'z' 0x70 .. 0x7a 0x29
8x .. Fx invalid

The outline of algorithm (for a single byte):

higher_nibble=byte>>4;upper_limit=upper_limit_LUT[higher_nibble];lower_limit=lower_limit_LUT[higher_nibble];if(byte<lower_limit||byte>upper_limit){reporterror}decoded=byte+shift_LUT[higher_nibble];

For nibbles in set {0, 1, 8, 9, a, b, c, d, e, f} the LUTs define invalid ranges, always yielding errors. For nibbles {3, 4, 5, 6, 7} the ranges are well defined.

The only exception occurs for nibble 2, as there are two distinct values associated with these bytes. The special case is solved by choosing the first, one-element range 0x2b and introducing an additional code handling input equal to 0x2f. The extra code doesn't complicate the algorithm too much:

higher_nibble=byte>>4;upper_limit=upper_limit_LUT[higher_nibble];lower_limit=lower_limit_LUT[higher_nibble];if((byte<lower_limit||byte>upper_limit)&&byte!=0x2f){reporterror}decoded=byte+shift_LUT[higher_nibble];if(byte==0x2f){decoded-=3;// (0x2b - 0x2f) + (0x3f - 0x3e) = -4 + 1 = -3
}

The SIMD (SSE) code implementing above algorithm:

__m128ilookup_pshufb(const__m128iinput){const__m128ihigher_nibble=_mm_srli_epi32(input,4)&packed_byte(0x0f);constcharlinv=1;constcharhinv=0;const__m128ilower_bound_LUT=_mm_setr_epi8(/* 0 */linv,/* 1 */linv,/* 2 */0x2b,/* 3 */0x30,/* 4 */0x41,/* 5 */0x50,/* 6 */0x61,/* 7 */0x70,/* 8 */linv,/* 9 */linv,/* a */linv,/* b */linv,/* c */linv,/* d */linv,/* e */linv,/* f */linv);const__m128iupper_bound_LUT=_mm_setr_epi8(/* 0 */hinv,/* 1 */hinv,/* 2 */0x2b,/* 3 */0x39,/* 4 */0x4f,/* 5 */0x5a,/* 6 */0x6f,/* 7 */0x7a,/* 8 */hinv,/* 9 */hinv,/* a */hinv,/* b */hinv,/* c */hinv,/* d */hinv,/* e */hinv,/* f */hinv);// the difference between the shift and lower bound
const__m128ishift_LUT=_mm_setr_epi8(/* 0 */0x00,/* 1 */0x00,/* 2 */0x3e-0x2b,/* 3 */0x34-0x30,/* 4 */0x00-0x41,/* 5 */0x0f-0x50,/* 6 */0x1a-0x61,/* 7 */0x29-0x70,/* 8 */0x00,/* 9 */0x00,/* a */0x00,/* b */0x00,/* c */0x00,/* d */0x00,/* e */0x00,/* f */0x00);const__m128iupper_bound=_mm_shuffle_epi8(upper_bound_LUT,higher_nibble);const__m128ilower_bound=_mm_shuffle_epi8(lower_bound_LUT,higher_nibble);const__m128ibelow=_mm_cmplt_epi8(input,lower_bound);const__m128iabove=_mm_cmpgt_epi8(input,upper_bound);const__m128ieq_2f=_mm_cmpeq_epi8(input,packed_byte(0x2f));// in_range = not (below or above) or eq_2f
// outside = not in_range = below or above and not eq_2f (from deMorgan law)
const__m128ioutside=_mm_andnot_si128(eq_2f,above|below);constautomask=_mm_movemask_epi8(outside);if(mask){reporterror}const__m128ishift=_mm_shuffle_epi8(shift_LUT,higher_nibble);const__m128it0=_mm_add_epi8(input,shift);const__m128iresult=_mm_add_epi8(t0,_mm_and_si128(eq_2f,packed_byte(-3)));returnresult;}

Number of operations:

  • comparison (le/gt/eq): 3,
  • shift: 1,
  • bit-and: 2,
  • bit-andnot: 1,
  • bit-or: 1,
  • add: 2,
  • movemask: 1
  • pshufb: 3.

Number of constants: 6.

Total number of operations is 14. This is 0.875 instructions per character.

Vector lookup (pshufb with bitmask)

This is a slightly improved version of the previous method, where the range checking is done using bit-masks.

First we group all shift values by the lower nibble; zero shift means invalid input. We can note that in each column (except the 3rd) there are two different values: zero or a specific value. The 3rd column needs a special care, but it is not difficult.

lower nibble higher nibble ('.' denotes zero)
0 1 2 3 4 5 6 7 8 9 a b c d e f
0 . . . 4 . -65 . -71 . . . . . . . .
1 . . . 4 -65 -65 -71 -71 . . . . . . . .
2 . . . 4 -65 -65 -71 -71 . . . . . . . .
3 . . . 4 -65 -65 -71 -71 . . . . . . . .
4 . . . 4 -65 -65 -71 -71 . . . . . . . .
5 . . . 4 -65 -65 -71 -71 . . . . . . . .
6 . . . 4 -65 -65 -71 -71 . . . . . . . .
7 . . . 4 -65 -65 -71 -71 . . . . . . . .
8 . . . 4 -65 -65 -71 -71 . . . . . . . .
9 . . . 4 -65 -65 -71 -71 . . . . . . . .
a . . . . -65 -65 -71 -71 . . . . . . . .
b . . 19 . -65 . -71 . . . . . . . . .
c . . . . -65 . -71 . . . . . . . . .
d . . . . -65 . -71 . . . . . . . . .
e . . . . -65 . -71 . . . . . . . . .
f . . 16 . -65 . -71 . . . . . . . . .

Using the higher nibble we can pick a shift from vector of non-zero values from each column: [0, 0, 19, 4, -65, -65, -71, -71, 0, 0, 0, 0, 0, 0, 0, 0]. The lower nibble selects a bit-mask, representing a set of valid values for higher nibble. A bit-mask is derived directly from rows of the table, each non-zero value sets a bit in the mask. There are six different bit-masks.

We need an extra fix-up code to distinguish between the '+' and '/' character, i.e. choose either 16 or 19 from the 3rd column.

The core of the algorithm goes as follows:

  1. Using the lower nibble we select a bit-mask ().
  2. Then we build a vector ( or AMD XOP ).
  3. Bitwise and is used to check if all higher nibbles match corresponding bit-masks. If any doesn't match, we report an error.
  4. Finally, using the higher nibble we select shifts and add them to the input.

The SIMD (SSE) code implementing above algorithm.

__m128ilookup_pshufb_bitmask(const__m128iinput){const__m128ihigher_nibble=_mm_srli_epi32(input,4)&packed_byte(0x0f);const__m128ilower_nibble=input&packed_byte(0x0f);const__m128ishiftLUT=_mm_setr_epi8(0,0,19,4,-65,-65,-71,-71,0,0,0,0,0,0,0,0);const__m128imaskLUT=_mm_setr_epi8(/* 0 */char(0b10101000),/* 1 .. 9 */char(0b11111000),char(0b11111000),char(0b11111000),char(0b11111000),char(0b11111000),char(0b11111000),char(0b11111000),char(0b11111000),char(0b11111000),/* 10 */char(0b11110000),/* 11 */char(0b01010100),/* 12 .. 14 */char(0b01010000),char(0b01010000),char(0b01010000),/* 15 */char(0b01010100));const__m128ibitposLUT=_mm_setr_epi8(0x01,0x02,0x04,0x08,0x10,0x20,0x40,char(0x80),0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00);// shift + fix-up for '/' and '+'
const__m128ish=_mm_shuffle_epi8(shiftLUT,higher_nibble);const__m128ieq_2f=_mm_cmpeq_epi8(input,packed_byte(0x2f));const__m128ishift=_mm_blendv_epi8(sh,packed_byte(16),eq_2f);// load bit-masks
const__m128iM=_mm_shuffle_epi8(maskLUT,lower_nibble);// (1 << higher_nibble[i]) & 0xff
const__m128ibit=_mm_shuffle_epi8(bitposLUT,higher_nibble);const__m128inon_match=_mm_cmpeq_epi8(_mm_and_si128(M,bit),_mm_setzero_si128());constautomask=_mm_movemask_epi8(non_match);if(mask){// report error
}const__m128iresult=_mm_add_epi8(input,shift);returnresult;}

Number of operations:

  • comparison (le/gt/eq): 2,
  • shift: 1,
  • bit-and: 3,
  • bit-or: 0,
  • bit-blend: 1,
  • add: 1,
  • movemask: 1
  • pshufb: 3.

Number of constants: 6.

The total number of operations is 12. This is 0.750 instructions per character.

Comparison of vector lookups

operation algorithm
base byte-blend incremental pshufb pshufb-bitmask
comparison (le/gt/eq) 9 9 9 3 2
bit-and 8 4 0 2 3
bit-andnot 0 0 0 1 0
bit-or 4 0 0 1 0
shift 0 0 0 1 1
add 1 1 10 2 1
sub 0 0 0 0 0
byte-blend 0 3 0 0 1
movemask 1 1 1 1 1
pshufb 0 0 1 3 3
total instructions 23 19 21 14 12
instructions/byte 1.4375 1.1875 1.3125 0.875 0.75
constants count 9 9 10 6 6

Gathering data

The result of procedure is a vector fo 32-bit words having following bit-layout:

[00dddddd|00cccccc|00bbbbbb|00aaaaaa]

Bits , , and are data. The expected output is:

[00000000|aaaaaabb|bbbbcccc|ccdddddd]

Once bits are packed into 24-bit words, all bytes are shuffled in order to build a 12-byte result. Please note that also the order of bytes within 24-bit words is changed in this step, i.e. bytes 0 and 2 are swapped:

[ddddddcc|bbbbcccc|aaaaaabb]

Fortunately everything is done by a single invocation.

Pack — naive variant

The naive variant isolates all 6-bit fields, shifts them to required positions and merges into one word.

  1. Isolate fields , and , :
#define packed_dword(x) _mm_set1_epi32(x)
#define masked(x, mask) _mm_and_si128(x, _mm_set1_epi32(mask))
const__m128ica=masked(values,0x003f003f);const__m128idb=masked(values,0x3f003f00);
  1. Swap order of , and , within 16-bit words.
// t0 = [0000cccc|ccdddddd|0000aaaa|aabbbbbb]
const__m128it0=_mm_or_si128(_mm_srli_epi32(db,8),_mm_slli_epi32(ca,6));
  1. Swap order of 12-bit fields and within a 32-bit word.
// t1 = [dddd0000|aaaaaabb|bbbbcccc|dddddddd]
const__m128it1=_mm_or_si128(_mm_srli_epi32(t0,16),_mm_slli_epi32(t0,12));
  1. Mask out garbage
// result: [00000000|aaaaaabb|bbbbcccc|ccdddddd]
result=masked(t1,0x00ffffff);

Number of operations:

  • 3 bit-and,
  • 4 bit-shift,
  • 2 bit-or.

Pack — multiply-add variant

The first merge of adjacent bytes is performed by instruction . The instruction vertically multiplies signed bytes yielding 16-bit signed intermediate values and then the values are added. Input values are 6-bit, so obviously are non-negative and the instruction could be safely used.

The instruction is used to perform shift & bit-or in a single step.

Then merging 12-bit fields is done by another multiply-add instruction . The instruction also operates on signed values, but again --- the inputs are never negative.

// input: [00dddddd|00cccccc|00bbbbbb|00aaaaaa]
// merge: [0000cccc|ccdddddd|0000aaaa|aabbbbbb]
const__m128imerge_ab_and_bc=_mm_maddubs_epi16(values,packed_dword(0x01400140));// result: [00000000|aaaaaabb|bbbbcccc|ccdddddd]
return_mm_madd_epi16(merge_ab_and_bc,packed_dword(0x00011000));

Number of operations:

  • 2 multiply-add.

Saving result (SSE variant)

The result is stoed in an XMM register, however only 12 lowest bytes are required. My experiments on Core i5 have showed that saving selected bytes using dedicated instruction (intrinsic ) is slower than ordinary write of 16 bytes which overwrites the previous 4 bytes.

BMI2

With help of BMI2 instruction making a 24-bit word from 6-bit words is pretty simple. However, bytes 0th and 2nd of each word have to be swapped, that costs extra instructions. Additionally, the instruction works on CPU registers and this require extra instruction to transfer data from the SIMD registers.

Here is a procedure which processes 64-bit word

uint64_tpack_bytes(uint64_tv){constuint64_tp=_pext_u64(v,0x3f3f3f3f3f3f3f3f);constuint64_tb0=p&0x0000ff0000ff;constuint64_tb1=p&0x00ff0000ff00;constuint64_tb2=p&0xff0000ff0000;return(b0<<16)|b1|(b2>>16);}

It costs:

  • 1 ,
  • 2 shift,
  • 3 bit-and,
  • 2 bit-or.

AVX2 version

An AVX2 version of procedure is nearly the same as the SSE version. The only minor difference is caused by lack of comparison "less or equal", thus all range checking have to use the "grater" relation.

An output from the AVX2 could be either processed using standard SIMD operations or BMI2 instructions.

Sample code

The sample code is available at github, there are three programs:

  • — verifies all decoding procedures, both scalar, SSE and AVX2;
  • — checks if base64 decoders work correctly.
  • — allows to measure speed of all or selected base64 decoder; it decodes 64 MiB of artificial data 10 times, and then print the smallest measurement.

Experiments

Core i5 results (Westmere)

The CPU architecture: Westmere i5 M540 @ 2.53GHz

GCC: 6.2.0 (Debian)

procedure lookup pack time [s] speedup  
improved scalar N/A N/A 0.04722 1.00 ██████████
scalar N/A N/A 0.06875 0.69 ██████▊
SSE base naive 0.02892 1.63 ████████████████▎
SSE byte blend naive 0.02970 1.59 ███████████████▉
SSE incremental naive 0.03249 1.45 ██████████████▌
SSE pshufb naive 0.02753 1.72 █████████████████▏
SSE base multiply-add 0.02554 1.85 ██████████████████▍
SSE byte blend multiply-add 0.02511 1.88 ██████████████████▊
SSE incremental multiply-add 0.02862 1.65 ████████████████▍
SSE pshufb multiply-add 0.02170 2.18 █████████████████████▊
SSE pshufb bitmask multiply-add 0.02085 2.26 ██████████████████████▋

Conclusions:

  • Eliminating 5 shifts from the scalar version boosted code by 1.45. This is impressive.
  • Using the byte-blend instruction gives significant boost.
  • Packing algorithm using two multiply-add instruction is, surprisingly, the best. Multiply-add instructions have pretty high latencies, 4-5 cycles.

Core i7 results (Haswell)

The CPU architecture: Haswell i7-4770 CPU @ 3.40GHz.

GCC: 5.4.1 (Ubuntu)

procedure lookup pack time [s] speedup  
improved scalar N/A N/A 0.02281 1.00 ██████████
scalar N/A N/A 0.04965 0.46 ████▌
scalar & BMI2 N/A N/A 0.04485 0.51 █████
SSE base naive 0.01683 1.36 █████████████▌
SSE byte blend naive 0.01783 1.28 ████████████▊
SSE incremental naive 0.01790 1.27 ████████████▋
SSE pshufb naive 0.01286 1.77 █████████████████▋
SSE base multiply-add 0.01383 1.65 ████████████████▍
SSE byte blend multiply-add 0.01365 1.67 ████████████████▋
SSE incremental multiply-add 0.01500 1.52 ███████████████▏
SSE pshufb multiply-add 0.00967 2.36 ███████████████████████▌
SSE pshufb bitmask multiply-add 0.00927 2.46 ████████████████████████▌
SSE & BMI2 base N/A 0.02184 1.04 ██████████▍
SSE & BMI2 byte blend N/A 0.02288 1.00 █████████▉
SSE & BMI2 incremental N/A 0.02190 1.04 ██████████▍
AVX2 base naive 0.01108 2.06 ████████████████████▌
AVX2 byte blend naive 0.01229 1.86 ██████████████████▌
AVX2 pshufb naive 0.00897 2.54 █████████████████████████▍
AVX2 base multiply-add 0.00973 2.34 ███████████████████████▍
AVX2 byte blend multiply-add 0.00977 2.33 ███████████████████████▎
AVX2 pshufb multiply-add 0.00849 2.69 ██████████████████████████▊
AVX2 pshufb bitmask multiply-add 0.00848 2.69 ██████████████████████████▉
AVX2 & BMI2 base N/A 0.02236 1.02 ██████████▏
AVX2 & BMI2 byte blend N/A 0.02371 0.96 █████████▌

Conclusions:

  • BMI2 doesn't help at all.

Core i7 results (Skylake)

The CPU architecture: Skylake i7-6700 CPU @ 3.40GHz

GCC: 5.4.1 (Ubuntu)

procedure lookup pack time [s] speedup  
improved scalar N/A N/A 0.02084 1.00 ██████████
scalar N/A N/A 0.04980 0.42 ████▏
scalar & BMI2 N/A N/A 0.04494 0.46 ████▋
SSE base naive 0.01502 1.39 █████████████▊
SSE byte blend naive 0.01645 1.27 ████████████▋
SSE incremental naive 0.01684 1.24 ████████████▍
SSE pshufb naive 0.01229 1.70 ████████████████▉
SSE base multiply-add 0.01216 1.71 █████████████████▏
SSE byte blend multiply-add 0.01405 1.48 ██████████████▊
SSE incremental multiply-add 0.01198 1.74 █████████████████▍
SSE pshufb multiply-add 0.00888 2.35 ███████████████████████▍
SSE pshufb bitmask multiply-add 0.00847 2.46 ████████████████████████▌
SSE & BMI2 base N/A 0.02090 1.00 █████████▉
SSE & BMI2 byte blend N/A 0.02097 0.99 █████████▉
SSE & BMI2 incremental N/A 0.02098 0.99 █████████▉
AVX2 base naive 0.01006 2.07 ████████████████████▋
AVX2 byte blend naive 0.00998 2.09 ████████████████████▉
AVX2 pshufb naive 0.00816 2.55 █████████████████████████▌
AVX2 base multiply-add 0.00852 2.45 ████████████████████████▍
AVX2 byte blend multiply-add 0.00842 2.48 ████████████████████████▊
AVX2 pshufb multiply-add 0.00720 2.89 ████████████████████████████▉
AVX2 pshufb bitmask multiply-add 0.00705 2.96 █████████████████████████████▌
AVX2 & BMI2 base N/A 0.01994 1.05 ██████████▍
AVX2 & BMI2 byte blend N/A 0.01982 1.05 ██████████▌

AMD Bulldozer results

The CPU architecture: Bulldozer FX-8150 CPU

GCC: 4.8.4 (Ubuntu)

procedure lookup pack time [s] speedup  
improved scalar N/A N/A 0.03223 1.00 ██████████
scalar N/A N/A 0.06352 0.51 █████
SSE base naive 0.02387 1.35 █████████████▌
SSE byte blend naive 0.02316 1.39 █████████████▉
SSE incremental naive 0.02626 1.23 ████████████▎
SSE pshufb naive 0.02015 1.60 ███████████████▉
SSE base multiply-add 0.02107 1.53 ███████████████▎
SSE byte blend multiply-add 0.01971 1.64 ████████████████▎
SSE incremental multiply-add 0.02215 1.46 ██████████████▌
SSE pshufb multiply-add 0.01660 1.94 ███████████████████▍
SSE pshufb bitmask multiply-add 0.01656 1.95 ███████████████████▍

Acknowledgments

The AVX2 version wouldn't be possible without Nathan Kurz's help and Daniel Lemire who kindly gave me access to test machines having brand new CPUs (Haswell and Skylake, and also AMD Bulldozer).

See also

Changelog

  • 2016-12-23 — improved pshufb-based lookup procedure
  • 2016-12-18 — align order of output words with the base64 spec)
  • 2016-12-04 — refreshed measurements, added results from AMD Bulldozer
  • 2016-12-01 — removed a superfluous instruction
  • 2016-11-07 — syntax highlighting
  • 2016-03-30 — new lookup algorithm, updated results
  • 2016-03-19 — new packing methods, updated results
  • 2016-03-13 — two new lookup algorithms, updated results
  • 2016-03-10 — AVX2 notes, Skylake results
  • 2016-01-25 — minor fixes thanks to HN comments
  • 2016-01-18 — Core i7 results