Algorithms: Fast Conversion of Hexadecimal Strings


This article presents a fast algorithm for converting the hexadecimal representation of byte strings into binary. I also show validation of hexadecimal strings using the same algorithm.

The Common Approach

Most programmers should know the drill very well:

#include <stddef.h>
#include <stdint.h>
#include <ctype.h>

static inline uint8_t
hex_char_to_binary ( const char hex_char )
{
	const char uppercase = toupper ( hex_char );

        return uppercase < 'A' ? uppercase - '0' : uppercase - 'A' + 10;
}

const uint8_t *
hex_string_to_bytes ( const char * const hex,
                      const uint8_t * const buffer, const size_t size )
{
        size_t i = 0;

        for ( i = 0; i < size; i ++ )
                buffer [ i ] = (hex_char_to_binary ( hex [ i * 2 ] ) << 4)
                               | hex_char_to_binary ( hex [ i * 2 + 1 ] );

        return buffer;
}

Every time when I write similar code, my inner self cringes. It is boring and slow; there seems to be no way out of its drudgery. I am longing for the merry days of Zilog Z80 circa 1990 when all the cool kids converted the ASCII code of a hexadecimal digit to binary using just four eight-bit machine instructions, abusing BCD arithmetics. Those days have long passed, BCD instructions are outlawed in the 64-bit mode by CPU manufacturers, and coding in assembly is a rare treat.

One winter evening, after having to write the above yet again, I decided to put myself out of this misery.

The SIMD Simulation of BCD Arithmetics Using 64-bit Registers

Below I assume that the reader is familiar with BCD, if that is not the case, please have a look at the link above. Also, I use the standard notation of binary numbers: the natural writing order (MSB-first), bits are counted right to left, the number of the LSB is zero.

I treat a 64-bit hardware register as a group of eight packed BCD registers: lower 4 bits of every byte of a hardware register is a packed BCD. The fourth bit of every byte is the carry flag (borrow, to be more precise, because I perform only subtraction).

I begin processing by loading the result of subtraction of the ASCII code of the character '0' (binary 0x30) from a hexadecimal digit into a BCD register. This operation brings the BCD simulation into the initial state, the one a BCD register would have after subtracting a value from its contents. In the case of hexadecimal digits from zero to nine, there is no borrow (the carry flag is zero). For the rest of hexadecimal digits (in any letter case, from A to F as well as from a to f), there will be borrow (the carry flag is one).

Next, I add the value of the carry flag to the BCD register and then I (possibly) invalidate the register by setting the the third bit to the value of the carry flag. This trick does not affect the register if it was loaded with the result of subtraction of 0x30 from hexadecimal digits from zero to nine, since the value of the carry flag is zero. For the rest of hexadecimal digits, the result equals to addition of nine to values from one to six which forms correct binary values in every lower nibble of the hardware 64-bit register that holds the group.

Since BCD regsiters are scattered throughout the hardware register, as the last step, I collect them into the resulting value. This process depends on the machine's byte order, which can be handled using the C preprocessor.

The following is an example of the algorithm described above for little-endian machines:

#include <stdint.h>

static inline uint32_t
hex_to_four_bytes ( const char * const s )
{
        const uint64_t x = * (uint64_t *) s - 0x3030303030303030;
        const uint64_t y = x & 0x0f0f0f0f0f0f0f0f;
        const uint64_t z = x & 0x1010101010101010;
        const uint64_t r = (y + (z >> 4)) | (z >> 1);

        return    ((r >>  8) & 0x0000000f) | ((r <<  4) & 0x000000f0)
                | ((r >> 16) & 0x00000f00) | ((r >>  4) & 0x0000f000)
                | ((r >> 24) & 0x000f0000) | ((r >> 12) & 0x00f00000)
                | ((r >> 32) & 0x0f000000) | ((r >> 20) & 0xf0000000);
}

In this example, the variable x holds the initial state of the BCD group, the variable y holds the BCD registers themselves without carry flags, the variable z holds carry flags, the variable r holds the final state.

This algorithm can be used as follows (assuming that the hexadecimal string describes the number of bytes that is divisible by four, this issue can be handled very easily by padding the beginning of the string with the character 0):

#include <stddef.h>
#include <stdint.h>

uint8_t
hex_string_to_bytes ( const char * const hex,
                      const uint8_t * const buffer, const size_t size )
{
        size_t i = 0;

        for ( i = 0; i < size; i += 4 )
                * (uint32_t *) (buf + i) = hex_to_four_bytes ( hex + i * 2 );

        return buf;
}

Performance

Depending on the use of inlining hints in the source code, the level of compiler optimisations, and possibly manual loop unrolling in the function hex_string_to_bytes (in the case of short hexadecimal strings of predefined length), this algorithm can be as much as 2.75 times faster than the common approach.

A pleasant surprise expects the programmer using high levels of compiler optimisation: the compiler may emit real SIMD vector instructions for the function hex_to_four_bytes. This is the case with GNU C with -O3 on Intel processors. In that case, the boost in performance can be 4 times or higher, when accounting for input validation as described below.

Validation of Hexadecimal Strings

In addition to conversion, this algorithm can be extended for validation of input. The validation process boils down to checking the state of BCD registers:

#include <stdint.h>

static inline uint32_t
hex_to_four_bytes ( const char * const s, int * const valid )
{
        const uint64_t x = * (uint64_t *) s - 0x3030303030303030;
        const uint64_t y = x & 0x0f0f0f0f0f0f0f0f;
        const uint64_t z = x & 0x1010101010101010;
        const uint64_t z4 = z >> 4;
        const uint64_t r = (y + z4) | (z >> 1);
        const uint64_t c1 = 0x0909090909090909 - ((z >> 3) | z4);
	const uint64_t c2 = x & 0xc0c0c0c0c0c0c0c0;
	const uint64_t c3 = ((x & 0x3030303030303030) + 0x1010101010101010)
                & 0x3030303030303030;
	
	if ( y > c1 || c2 != 0 || c3 > 0x2020202020202020 || y  < z4 ) 
		return *valid = 0;
               
        *valid = 1;

        return    ((r >>  8) & 0x0000000f) | ((r <<  4) & 0x000000f0)
                | ((r >> 16) & 0x00000f00) | ((r >>  4) & 0x0000f000)
                | ((r >> 24) & 0x000f0000) | ((r >> 12) & 0x00f00000)
                | ((r >> 32) & 0x0f000000) | ((r >> 20) & 0xf0000000);
}

In this program, the variables c1, c2, and c3 hold values required for validation. I leave figuring out the nature of the checks performed as an exercise to the reader.

Vadim Penzin, February 19th, 2019


I hereby place this article along with the accompanying source code into the public domain.
You are welcome to contact me by writing to howto at this domain.
I publish this information in the hope that it will be useful, but without ANY WARRANTY.
You are responsible for any and all consequences that may arise as the result of using this information.