Base64 decoding with SIMD instructions

Author:	Wojciech Muła
Added on:	2016-01-17
Last update:	2017-11-29 (spelling)

Contents

Introduction

Surprisingly good results of base64 encoding with SIMD instructions forced me to check the opposite algorithm, i.e. the decoding. The decoding is slightly more complicated as it has to check the input's validity.

A decoder must also consider character '=' at the end of input, but since it's done once, I didn't bother with this in a sample code.

2016-12-18 note: in the initial version of this text I wrongly assumed order of input words, Alfred Klomp noted that the standard imposes a specific order. Today's change fixes this error.

Base algorithm

In a basic scalar version the minimum input quantity is 4 character, which are transformed into 3 bytes. The scalar version uses a lookup table translating from ASCII code into a 6-bit data. For non-base64 characters the lookup table contains a special marker. After decoding four characters and checking for errors, the 6-bit words are merged to form a 24-bit word.

The inner loop of the algorithm looks like this.

// 1. translate input
constuint8_tb0=lookup[input[i+0]];constuint8_tb1=lookup[input[i+1]];constuint8_tb2=lookup[input[i+2]];constuint8_tb3=lookup[input[i+3]];if(b0,b1,b2orb3invalid){reporterror}// 2. save output bytes
// input: [00dddddd|00cccccc|00bbbbbb|00aaaaaa]
// output: [00000000|ccdddddd|bbbbcccc|aaaaaabb]
*out++=(b1>>4)|(b0<<2);*out++=(b2>>2)|(b1<<4);*out++=b3|(b2<<6);

Forming the output is expansive, it requires:

5 bit-shift,
3 bit-ors.

Improved scalar version

It is possible to get rid off the bit-shift and use four lookup tables for each input byte. Each table contains already-shifted 6-bit data fields, thanks to that only bit-ors are required. It is at cost of bigger memory requirements. The improved version requires 4kB (4*4*256), while the basic one requires only 256 bytes.

Following code shows the idea.

// 1. lookup and build output word at the same time
constuint32_tdword=lookup32[0][input[i+0]]|lookup32[1][input[i+1]]|lookup32[2][input[i+2]]|lookup32[3][input[i+3]];if(dwordisinvalid){reporterror}// 2. save 32 bits
reinterpret_cast<uint32_t*>(out)=dword;out+=3;

Another negative feature is that the method saves 4 bytes instead of 3.

SSE version

Vector lookup (base)

In an SSE version the lookup table must be encoded as a sequence of instructions. Such program will perform parallel translation of all input bytes.

Following table describes how to map an input character into a 6-bit data ( is the character code), according to the base64 standard.

range	expression	after constant folding
A-Z	i - ord('A')	i - 65
a-z	i - ord('a') + 26	i - 71
0-9	i - ord('0') + 52	i + 4
'+'	i - ord('+') + 62	i + 19
'/'	i - ord('/') + 63	i + 16
other	error

The basic step of the vector version produces a vector of constants (from the third column) for characters matching given range. After checking all five ranges, the vectors are merged. Since none of constants is zero, then if the merged vector has any zero byte, it means that the input contains an invalid character.

If everything is OK, then the last step is to simple add the input vector and the merged one. The result vector contains 6-bit data in each byte.

Procedure doing the translation.

__m128ilookup_base(const__m128iinput){// shift for range 'A' - 'Z'
const__m128ige_A=_mm_cmpgt_epi8(input,packed_byte('A'-1));const__m128ile_Z=_mm_cmplt_epi8(input,packed_byte('Z'+1));const__m128irange_AZ=_mm_and_si128(packed_byte(-65),_mm_and_si128(ge_A,le_Z));// shift for range 'a' - 'z'
const__m128ige_a=_mm_cmpgt_epi8(input,packed_byte('a'-1));const__m128ile_z=_mm_cmplt_epi8(input,packed_byte('z'+1));const__m128irange_az=_mm_and_si128(packed_byte(-71),_mm_and_si128(ge_a,le_z));// shift for range '0' - '9'
const__m128ige_0=_mm_cmpgt_epi8(input,packed_byte('0'-1));const__m128ile_9=_mm_cmplt_epi8(input,packed_byte('9'+1));const__m128irange_09=_mm_and_si128(packed_byte(4),_mm_and_si128(ge_0,le_9));// shift for character '+'
const__m128ieq_plus=_mm_cmpeq_epi8(input,packed_byte('+'));const__m128ichar_plus=_mm_and_si128(packed_byte(19),eq_plus);// shift for character '/'
const__m128ieq_slash=_mm_cmpeq_epi8(input,packed_byte('/'));const__m128ichar_slash=_mm_and_si128(packed_byte(16),eq_slash);// merge partial results
const__m128ishift=_mm_or_si128(range_AZ,_mm_or_si128(range_az,_mm_or_si128(range_09,_mm_or_si128(char_plus,char_slash))));constautomask=_mm_movemask_epi8(_mm_cmpeq_epi8(shift,packed_byte(0)));if(mask){// report an error
}return_mm_add_epi8(input,shift);}

Number of operations:

comparison (le/gt/eq): 9,
bit-and: 8,
bit-or: 4,
add: 1,
movemask: 1.

Number of constants: 9.

Total number of operations is 23. This is 1.4375 instructions per character.

Note: the SSE compare for less or greater works on signed values, in this case it is not a problem.

Vector lookup (byte blend)

Instead of creating partial shifts (constants , and so on) and then merging them with a series of bit-ors, the shift vector could be updated using the byte blend instruction ().

Below is an updated procedure.

__m128ilookup_byte_blend(const__m128iinput){__m128ishift;// shift for range 'A' - 'Z'
const__m128ige_A=_mm_cmpgt_epi8(input,packed_byte('A'-1));const__m128ile_Z=_mm_cmplt_epi8(input,packed_byte('Z'+1));shift=_mm_and_si128(packed_byte(-65),_mm_and_si128(ge_A,le_Z));// shift for range 'a' - 'z'
const__m128ige_a=_mm_cmpgt_epi8(input,packed_byte('a'-1));const__m128ile_z=_mm_cmplt_epi8(input,packed_byte('z'+1));shift=_mm_blendv_epi8(shift,packed_byte(-71),_mm_and_si128(ge_a,le_z));// shift for range '0' - '9'
const__m128ige_0=_mm_cmpgt_epi8(input,packed_byte('0'-1));const__m128ile_9=_mm_cmplt_epi8(input,packed_byte('9'+1));shift=_mm_blendv_epi8(shift,packed_byte(4),_mm_and_si128(ge_0,le_9));// shift for character '+'
const__m128ieq_plus=_mm_cmpeq_epi8(input,packed_byte('+'));shift=_mm_blendv_epi8(shift,packed_byte(19),eq_plus);// shift for character '/'
const__m128ieq_slash=_mm_cmpeq_epi8(input,packed_byte('/'));shift=_mm_blendv_epi8(shift,packed_byte(16),eq_slash);constautomask=_mm_movemask_epi8(_mm_cmpeq_epi8(shift,packed_byte(0)));if(mask){// report an error
}return_mm_add_epi8(input,shift);}

Number of operations:

comparison (le/gt/eq): 9,
bit-and: 4,
bit-or: 0,
byte-blend: 3,
add: 1,
movemask: 1.

Number of constants: 9.

The total number of operations is 19, three less than the in base version. This is 1.1875 instructions per character. However, the instruction has both latency and throughput equal 2 cycles and it seems to be a problem. As experiments showed, sometimes use of the instruction gives small improvement, but sometimes makes things slower.

Vector lookup (incremental)

This section describes another lookup procedure that employs the technique described in a separate article. The lookup procedure uses instruction. I very like this instruction, but experiments showed that this version is slower. My heart sank.

Vector lookup (pshufb)

Instead of checking all possible ranges as in all other methods, this algorithm determines possible range for a byte and the associated shift value. Then validates a byte against that single range and adds the selected shift.

To select both the range and the shift, the higher nibble of byte is used. Lets look at the rewritten lookup table:

input [hex]	valid input [ASCII]	valid range	shift
0x	invalid
1x	invalid
2x	'+'	0x2b .. 0x2b	0x3e
2x	'/'	0x2f .. 0x2f	0x3f
3x	'0' .. '9'	0x30 .. 0x39	0x34
4x	'A' .. 'O'	0x41 .. 0x4f	0x00
5x	'P' .. 'Z'	0x50 .. 0x5a	0x0f
6x	'a' .. 'o'	0x61 .. 0x6f	0x1a
7x	'p' .. 'z'	0x70 .. 0x7a	0x29
8x .. Fx	invalid

The outline of algorithm (for a single byte):

higher_nibble=byte>>4;upper_limit=upper_limit_LUT[higher_nibble];lower_limit=lower_limit_LUT[higher_nibble];if(byte<lower_limit||byte>upper_limit){reporterror}decoded=byte+shift_LUT[higher_nibble];

For nibbles in set {0, 1, 8, 9, a, b, c, d, e, f} the LUTs define invalid ranges, always yielding errors. For nibbles {3, 4, 5, 6, 7} the ranges are well defined.

The only exception occurs for nibble 2, as there are two distinct values associated with these bytes. The special case is solved by choosing the first, one-element range 0x2b and introducing an additional code handling input equal to 0x2f. The extra code doesn't complicate the algorithm too much:

higher_nibble=byte>>4;upper_limit=upper_limit_LUT[higher_nibble];lower_limit=lower_limit_LUT[higher_nibble];if((byte<lower_limit||byte>upper_limit)&&byte!=0x2f){reporterror}decoded=byte+shift_LUT[higher_nibble];if(byte==0x2f){decoded-=3;// (0x2b - 0x2f) + (0x3f - 0x3e) = -4 + 1 = -3
}

The SIMD (SSE) code implementing above algorithm:

__m128ilookup_pshufb(const__m128iinput){const__m128ihigher_nibble=_mm_srli_epi32(input,4)&packed_byte(0x0f);constcharlinv=1;constcharhinv=0;const__m128ilower_bound_LUT=_mm_setr_epi8(/* 0 */linv,/* 1 */linv,/* 2 */0x2b,/* 3 */0x30,/* 4 */0x41,/* 5 */0x50,/* 6 */0x61,/* 7 */0x70,/* 8 */linv,/* 9 */linv,/* a */linv,/* b */linv,/* c */linv,/* d */linv,/* e */linv,/* f */linv);const__m128iupper_bound_LUT=_mm_setr_epi8(/* 0 */hinv,/* 1 */hinv,/* 2 */0x2b,/* 3 */0x39,/* 4 */0x4f,/* 5 */0x5a,/* 6 */0x6f,/* 7 */0x7a,/* 8 */hinv,/* 9 */hinv,/* a */hinv,/* b */hinv,/* c */hinv,/* d */hinv,/* e */hinv,/* f */hinv);// the difference between the shift and lower bound
const__m128ishift_LUT=_mm_setr_epi8(/* 0 */0x00,/* 1 */0x00,/* 2 */0x3e-0x2b,/* 3 */0x34-0x30,/* 4 */0x00-0x41,/* 5 */0x0f-0x50,/* 6 */0x1a-0x61,/* 7 */0x29-0x70,/* 8 */0x00,/* 9 */0x00,/* a */0x00,/* b */0x00,/* c */0x00,/* d */0x00,/* e */0x00,/* f */0x00);const__m128iupper_bound=_mm_shuffle_epi8(upper_bound_LUT,higher_nibble);const__m128ilower_bound=_mm_shuffle_epi8(lower_bound_LUT,higher_nibble);const__m128ibelow=_mm_cmplt_epi8(input,lower_bound);const__m128iabove=_mm_cmpgt_epi8(input,upper_bound);const__m128ieq_2f=_mm_cmpeq_epi8(input,packed_byte(0x2f));// in_range = not (below or above) or eq_2f
// outside = not in_range = below or above and not eq_2f (from deMorgan law)
const__m128ioutside=_mm_andnot_si128(eq_2f,above|below);constautomask=_mm_movemask_epi8(outside);if(mask){reporterror}const__m128ishift=_mm_shuffle_epi8(shift_LUT,higher_nibble);const__m128it0=_mm_add_epi8(input,shift);const__m128iresult=_mm_add_epi8(t0,_mm_and_si128(eq_2f,packed_byte(-3)));returnresult;}

Number of operations:

comparison (le/gt/eq): 3,
shift: 1,
bit-and: 2,
bit-andnot: 1,
bit-or: 1,
add: 2,
movemask: 1
pshufb: 3.

Number of constants: 6.

Total number of operations is 14. This is 0.875 instructions per character.

Vector lookup (pshufb with bitmask)

This is a slightly improved version of the previous method, where the range checking is done using bit-masks.

First we group all shift values by the lower nibble; zero shift means invalid input. We can note that in each column (except the 3rd) there are two different values: zero or a specific value. The 3rd column needs a special care, but it is not difficult.

lower nibble	higher nibble ('.' denotes zero)
lower nibble	0	1	2	3	4	5	6	7	8	9	a	b	c	d	e	f
0	.	.	.	4	.	-65	.	-71	.	.	.	.	.	.	.	.
1	.	.	.	4	-65	-65	-71	-71	.	.	.	.	.	.	.	.
2	.	.	.	4	-65	-65	-71	-71	.	.	.	.	.	.	.	.
3	.	.	.	4	-65	-65	-71	-71	.	.	.	.	.	.	.	.
4	.	.	.	4	-65	-65	-71	-71	.	.	.	.	.	.	.	.
5	.	.	.	4	-65	-65	-71	-71	.	.	.	.	.	.	.	.
6	.	.	.	4	-65	-65	-71	-71	.	.	.	.	.	.	.	.
7	.	.	.	4	-65	-65	-71	-71	.	.	.	.	.	.	.	.
8	.	.	.	4	-65	-65	-71	-71	.	.	.	.	.	.	.	.
9	.	.	.	4	-65	-65	-71	-71	.	.	.	.	.	.	.	.
a	.	.	.	.	-65	-65	-71	-71	.	.	.	.	.	.	.	.
b	.	.	19	.	-65	.	-71	.	.	.	.	.	.	.	.	.
c	.	.	.	.	-65	.	-71	.	.	.	.	.	.	.	.	.
d	.	.	.	.	-65	.	-71	.	.	.	.	.	.	.	.	.
e	.	.	.	.	-65	.	-71	.	.	.	.	.	.	.	.	.
f	.	.	16	.	-65	.	-71	.	.	.	.	.	.	.	.	.

Using the higher nibble we can pick a shift from vector of non-zero values from each column: [0, 0, 19, 4, -65, -65, -71, -71, 0, 0, 0, 0, 0, 0, 0, 0]. The lower nibble selects a bit-mask, representing a set of valid values for higher nibble. A bit-mask is derived directly from rows of the table, each non-zero value sets a bit in the mask. There are six different bit-masks.

We need an extra fix-up code to distinguish between the '+' and '/' character, i.e. choose either 16 or 19 from the 3rd column.

The core of the algorithm goes as follows:

Using the lower nibble we select a bit-mask ().
Then we build a vector ( or AMD XOP ).
Bitwise and is used to check if all higher nibbles match corresponding bit-masks. If any doesn't match, we report an error.
Finally, using the higher nibble we select shifts and add them to the input.

The SIMD (SSE) code implementing above algorithm.

__m128ilookup_pshufb_bitmask(const__m128iinput){const__m128ihigher_nibble=_mm_srli_epi32(input,4)&packed_byte(0x0f);const__m128ilower_nibble=input&packed_byte(0x0f);const__m128ishiftLUT=_mm_setr_epi8(0,0,19,4,-65,-65,-71,-71,0,0,0,0,0,0,0,0);const__m128imaskLUT=_mm_setr_epi8(/* 0 */char(0b10101000),/* 1 .. 9 */char(0b11111000),char(0b11111000),char(0b11111000),char(0b11111000),char(0b11111000),char(0b11111000),char(0b11111000),char(0b11111000),char(0b11111000),/* 10 */char(0b11110000),/* 11 */char(0b01010100),/* 12 .. 14 */char(0b01010000),char(0b01010000),char(0b01010000),/* 15 */char(0b01010100));const__m128ibitposLUT=_mm_setr_epi8(0x01,0x02,0x04,0x08,0x10,0x20,0x40,char(0x80),0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00);// shift + fix-up for '/' and '+'
const__m128ish=_mm_shuffle_epi8(shiftLUT,higher_nibble);const__m128ieq_2f=_mm_cmpeq_epi8(input,packed_byte(0x2f));const__m128ishift=_mm_blendv_epi8(sh,packed_byte(16),eq_2f);// load bit-masks
const__m128iM=_mm_shuffle_epi8(maskLUT,lower_nibble);// (1 << higher_nibble[i]) & 0xff
const__m128ibit=_mm_shuffle_epi8(bitposLUT,higher_nibble);const__m128inon_match=_mm_cmpeq_epi8(_mm_and_si128(M,bit),_mm_setzero_si128());constautomask=_mm_movemask_epi8(non_match);if(mask){// report error
}const__m128iresult=_mm_add_epi8(input,shift);returnresult;}

Number of operations:

comparison (le/gt/eq): 2,
shift: 1,
bit-and: 3,
bit-or: 0,
bit-blend: 1,
add: 1,
movemask: 1
pshufb: 3.

Number of constants: 6.

The total number of operations is 12. This is 0.750 instructions per character.

Comparison of vector lookups

operation	algorithm
operation	base	byte-blend	incremental	pshufb	pshufb-bitmask
comparison (le/gt/eq)	9	9	9	3	2
bit-and	8	4	0	2	3
bit-andnot	0	0	0	1	0
bit-or	4	0	0	1	0
shift	0	0	0	1	1
add	1	1	10	2	1
sub	0	0	0	0	0
byte-blend	0	3	0	0	1
movemask	1	1	1	1	1
pshufb	0	0	1	3	3
total instructions	23	19	21	14	12
instructions/byte	1.4375	1.1875	1.3125	0.875	0.75
constants count	9	9	10	6	6

Gathering data

The result of procedure is a vector fo 32-bit words having following bit-layout:

[00dddddd|00cccccc|00bbbbbb|00aaaaaa]

Bits , , and are data. The expected output is:

[00000000|aaaaaabb|bbbbcccc|ccdddddd]

Once bits are packed into 24-bit words, all bytes are shuffled in order to build a 12-byte result. Please note that also the order of bytes within 24-bit words is changed in this step, i.e. bytes 0 and 2 are swapped:

[ddddddcc|bbbbcccc|aaaaaabb]

Fortunately everything is done by a single invocation.

Pack — naive variant

The naive variant isolates all 6-bit fields, shifts them to required positions and merges into one word.

Isolate fields , and , :

#define packed_dword(x) _mm_set1_epi32(x)
#define masked(x, mask) _mm_and_si128(x, _mm_set1_epi32(mask))
const__m128ica=masked(values,0x003f003f);const__m128idb=masked(values,0x3f003f00);

Swap order of , and , within 16-bit words.

// t0 = [0000cccc|ccdddddd|0000aaaa|aabbbbbb]
const__m128it0=_mm_or_si128(_mm_srli_epi32(db,8),_mm_slli_epi32(ca,6));

Swap order of 12-bit fields and within a 32-bit word.

// t1 = [dddd0000|aaaaaabb|bbbbcccc|dddddddd]
const__m128it1=_mm_or_si128(_mm_srli_epi32(t0,16),_mm_slli_epi32(t0,12));

Mask out garbage

// result: [00000000|aaaaaabb|bbbbcccc|ccdddddd]
result=masked(t1,0x00ffffff);

Number of operations:

3 bit-and,
4 bit-shift,
2 bit-or.

Pack — multiply-add variant

The first merge of adjacent bytes is performed by instruction . The instruction vertically multiplies signed bytes yielding 16-bit signed intermediate values and then the values are added. Input values are 6-bit, so obviously are non-negative and the instruction could be safely used.

The instruction is used to perform shift & bit-or in a single step.

Then merging 12-bit fields is done by another multiply-add instruction . The instruction also operates on signed values, but again --- the inputs are never negative.

// input: [00dddddd|00cccccc|00bbbbbb|00aaaaaa]
// merge: [0000cccc|ccdddddd|0000aaaa|aabbbbbb]
const__m128imerge_ab_and_bc=_mm_maddubs_epi16(values,packed_dword(0x01400140));// result: [00000000|aaaaaabb|bbbbcccc|ccdddddd]
return_mm_madd_epi16(merge_ab_and_bc,packed_dword(0x00011000));

Number of operations:

2 multiply-add.

Saving result (SSE variant)

The result is stoed in an XMM register, however only 12 lowest bytes are required. My experiments on Core i5 have showed that saving selected bytes using dedicated instruction (intrinsic ) is slower than ordinary write of 16 bytes which overwrites the previous 4 bytes.

BMI2

With help of BMI2 instruction making a 24-bit word from 6-bit words is pretty simple. However, bytes 0th and 2nd of each word have to be swapped, that costs extra instructions. Additionally, the instruction works on CPU registers and this require extra instruction to transfer data from the SIMD registers.

Here is a procedure which processes 64-bit word

uint64_tpack_bytes(uint64_tv){constuint64_tp=_pext_u64(v,0x3f3f3f3f3f3f3f3f);constuint64_tb0=p&0x0000ff0000ff;constuint64_tb1=p&0x00ff0000ff00;constuint64_tb2=p&0xff0000ff0000;return(b0<<16)|b1|(b2>>16);}

It costs:

1 ,
2 shift,
3 bit-and,
2 bit-or.

AVX2 version

An AVX2 version of procedure is nearly the same as the SSE version. The only minor difference is caused by lack of comparison "less or equal", thus all range checking have to use the "grater" relation.

An output from the AVX2 could be either processed using standard SIMD operations or BMI2 instructions.

Sample code

The sample code is available at github, there are three programs:

— verifies all decoding procedures, both scalar, SSE and AVX2;
— checks if base64 decoders work correctly.
— allows to measure speed of all or selected base64 decoder; it decodes 64 MiB of artificial data 10 times, and then print the smallest measurement.

Experiments

Core i5 results (Westmere)

The CPU architecture: Westmere i5 M540 @ 2.53GHz

GCC: 6.2.0 (Debian)

procedure	lookup	pack	time [s]	speedup
improved scalar	N/A	N/A	0.04722	1.00	██████████
scalar	N/A	N/A	0.06875	0.69	██████▊
SSE	base	naive	0.02892	1.63	████████████████▎
SSE	byte blend	naive	0.02970	1.59	███████████████▉
SSE	incremental	naive	0.03249	1.45	██████████████▌
SSE	pshufb	naive	0.02753	1.72	█████████████████▏
SSE	base	multiply-add	0.02554	1.85	██████████████████▍
SSE	byte blend	multiply-add	0.02511	1.88	██████████████████▊
SSE	incremental	multiply-add	0.02862	1.65	████████████████▍
SSE	pshufb	multiply-add	0.02170	2.18	█████████████████████▊
SSE	pshufb bitmask	multiply-add	0.02085	2.26	██████████████████████▋

Conclusions:

Eliminating 5 shifts from the scalar version boosted code by 1.45. This is impressive.
Using the byte-blend instruction gives significant boost.
Packing algorithm using two multiply-add instruction is, surprisingly, the best. Multiply-add instructions have pretty high latencies, 4-5 cycles.

Core i7 results (Haswell)

The CPU architecture: Haswell i7-4770 CPU @ 3.40GHz.

GCC: 5.4.1 (Ubuntu)

procedure	lookup	pack	time [s]	speedup
improved scalar	N/A	N/A	0.02281	1.00	██████████
scalar	N/A	N/A	0.04965	0.46	████▌
scalar & BMI2	N/A	N/A	0.04485	0.51	█████
SSE	base	naive	0.01683	1.36	█████████████▌
SSE	byte blend	naive	0.01783	1.28	████████████▊
SSE	incremental	naive	0.01790	1.27	████████████▋
SSE	pshufb	naive	0.01286	1.77	█████████████████▋
SSE	base	multiply-add	0.01383	1.65	████████████████▍
SSE	byte blend	multiply-add	0.01365	1.67	████████████████▋
SSE	incremental	multiply-add	0.01500	1.52	███████████████▏
SSE	pshufb	multiply-add	0.00967	2.36	███████████████████████▌
SSE	pshufb bitmask	multiply-add	0.00927	2.46	████████████████████████▌
SSE & BMI2	base	N/A	0.02184	1.04	██████████▍
SSE & BMI2	byte blend	N/A	0.02288	1.00	█████████▉
SSE & BMI2	incremental	N/A	0.02190	1.04	██████████▍
AVX2	base	naive	0.01108	2.06	████████████████████▌
AVX2	byte blend	naive	0.01229	1.86	██████████████████▌
AVX2	pshufb	naive	0.00897	2.54	█████████████████████████▍
AVX2	base	multiply-add	0.00973	2.34	███████████████████████▍
AVX2	byte blend	multiply-add	0.00977	2.33	███████████████████████▎
AVX2	pshufb	multiply-add	0.00849	2.69	██████████████████████████▊
AVX2	pshufb bitmask	multiply-add	0.00848	2.69	██████████████████████████▉
AVX2 & BMI2	base	N/A	0.02236	1.02	██████████▏
AVX2 & BMI2	byte blend	N/A	0.02371	0.96	█████████▌

Conclusions:

BMI2 doesn't help at all.

Core i7 results (Skylake)

The CPU architecture: Skylake i7-6700 CPU @ 3.40GHz

GCC: 5.4.1 (Ubuntu)

procedure	lookup	pack	time [s]	speedup
improved scalar	N/A	N/A	0.02084	1.00	██████████
scalar	N/A	N/A	0.04980	0.42	████▏
scalar & BMI2	N/A	N/A	0.04494	0.46	████▋
SSE	base	naive	0.01502	1.39	█████████████▊
SSE	byte blend	naive	0.01645	1.27	████████████▋
SSE	incremental	naive	0.01684	1.24	████████████▍
SSE	pshufb	naive	0.01229	1.70	████████████████▉
SSE	base	multiply-add	0.01216	1.71	█████████████████▏
SSE	byte blend	multiply-add	0.01405	1.48	██████████████▊
SSE	incremental	multiply-add	0.01198	1.74	█████████████████▍
SSE	pshufb	multiply-add	0.00888	2.35	███████████████████████▍
SSE	pshufb bitmask	multiply-add	0.00847	2.46	████████████████████████▌
SSE & BMI2	base	N/A	0.02090	1.00	█████████▉
SSE & BMI2	byte blend	N/A	0.02097	0.99	█████████▉
SSE & BMI2	incremental	N/A	0.02098	0.99	█████████▉
AVX2	base	naive	0.01006	2.07	████████████████████▋
AVX2	byte blend	naive	0.00998	2.09	████████████████████▉
AVX2	pshufb	naive	0.00816	2.55	█████████████████████████▌
AVX2	base	multiply-add	0.00852	2.45	████████████████████████▍
AVX2	byte blend	multiply-add	0.00842	2.48	████████████████████████▊
AVX2	pshufb	multiply-add	0.00720	2.89	████████████████████████████▉
AVX2	pshufb bitmask	multiply-add	0.00705	2.96	█████████████████████████████▌
AVX2 & BMI2	base	N/A	0.01994	1.05	██████████▍
AVX2 & BMI2	byte blend	N/A	0.01982	1.05	██████████▌

AMD Bulldozer results

The CPU architecture: Bulldozer FX-8150 CPU

GCC: 4.8.4 (Ubuntu)

procedure	lookup	pack	time [s]	speedup
improved scalar	N/A	N/A	0.03223	1.00	██████████
scalar	N/A	N/A	0.06352	0.51	█████
SSE	base	naive	0.02387	1.35	█████████████▌
SSE	byte blend	naive	0.02316	1.39	█████████████▉
SSE	incremental	naive	0.02626	1.23	████████████▎
SSE	pshufb	naive	0.02015	1.60	███████████████▉
SSE	base	multiply-add	0.02107	1.53	███████████████▎
SSE	byte blend	multiply-add	0.01971	1.64	████████████████▎
SSE	incremental	multiply-add	0.02215	1.46	██████████████▌
SSE	pshufb	multiply-add	0.01660	1.94	███████████████████▍
SSE	pshufb bitmask	multiply-add	0.01656	1.95	███████████████████▍

Acknowledgments

The AVX2 version wouldn't be possible without Nathan Kurz's help and Daniel Lemire who kindly gave me access to test machines having brand new CPUs (Haswell and Skylake, and also AMD Bulldozer).

Changelog

2016-12-23 — improved pshufb-based lookup procedure
2016-12-18 — align order of output words with the base64 spec)
2016-12-04 — refreshed measurements, added results from AMD Bulldozer
2016-12-01 — removed a superfluous instruction
2016-11-07 — syntax highlighting
2016-03-30 — new lookup algorithm, updated results
2016-03-19 — new packing methods, updated results
2016-03-13 — two new lookup algorithms, updated results
2016-03-10 — AVX2 notes, Skylake results
2016-01-25 — minor fixes thanks to HN comments
2016-01-18 — Core i7 results

URL: http://0x80.pl/notesen/2016-01-17-sse-base64-decoding.html