This article presents a fast algorithm for converting the hexadecimal representation of byte strings into binary. I also show validation of hexadecimal strings using the same algorithm.

Most programmers should know the drill very well:

```
```#include <stddef.h>
#include <stdint.h>
#include <ctype.h>
static inline uint8_t
hex_char_to_binary ( const char hex_char )
{
const char uppercase = toupper ( hex_char );
return uppercase < 'A' ? uppercase - '0' : uppercase - 'A' + 10;
}
const uint8_t *
hex_string_to_bytes ( const char * const hex,
const uint8_t * const buffer, const size_t size )
{
size_t i = 0;
for ( i = 0; i < size; i ++ )
buffer [ i ] = (hex_char_to_binary ( hex [ i * 2 ] ) << 4)
| hex_char_to_binary ( hex [ i * 2 + 1 ] );
return buffer;
}

Every time when I write similar code, my inner self cringes. It is boring and slow; there seems to be no way out of its drudgery. I am longing for the merry days of Zilog Z80 circa 1990 when all the cool kids converted the ASCII code of a hexadecimal digit to binary using just four eight-bit machine instructions, abusing BCD arithmetics. Those days have long passed, BCD instructions are outlawed in the 64-bit mode by CPU manufacturers, and coding in assembly is a rare treat.

One winter evening, after having to write the above *yet again*,
I decided to put myself out of this misery.

Below I assume that the reader is familiar with BCD, if that is not the case, please have a look at the link above. Also, I use the standard notation of binary numbers: the natural writing order (MSB-first), bits are counted right to left, the number of the LSB is zero.

I treat a 64-bit hardware register as a group of eight packed BCD registers: lower 4 bits of every byte of a hardware register is a packed BCD. The fourth bit of every byte is the carry flag (borrow, to be more precise, because I perform only subtraction).

I begin processing by loading
the result of subtraction of the ASCII code of the character '0'
(binary `0x30`

) from a hexadecimal digit into a BCD register.
This operation
brings the BCD simulation into the initial state, the one a BCD register
would have after subtracting a value from its contents. In the case of
hexadecimal digits from zero to nine, there is no borrow (the carry flag is
zero). For the rest of hexadecimal digits (in any letter case, from
`A`

to
`F`

as well as from `a`

to `f`

), there
will be borrow (the carry flag is one).

Next, I add the value of the carry flag to the BCD register and then
I (possibly) invalidate the register by setting the the third bit
to the value of the carry flag. This trick does not affect the register
if it was loaded with the result of subtraction of `0x30`

from
hexadecimal digits from zero to nine, since the value
of the carry flag is zero. For the rest of hexadecimal digits, the result
equals to addition of nine to values from one to six which forms correct
binary values in every lower nibble of the hardware
64-bit register that holds the group.

Since BCD regsiters are scattered throughout the hardware register, as the last step, I collect them into the resulting value. This process depends on the machine's byte order, which can be handled using the C preprocessor.

The following is an example of the algorithm described above for little-endian machines:

#include <stdint.h>
static inline uint32_t
hex_to_four_bytes ( const char * const s )
{
const uint64_t x = * (uint64_t *) s - 0x3030303030303030;
const uint64_t y = x & 0x0f0f0f0f0f0f0f0f;
const uint64_t z = x & 0x1010101010101010;
const uint64_t r = (y + (z >> 4)) | (z >> 1);
return ((r >> 8) & 0x0000000f) | ((r << 4) & 0x000000f0)
| ((r >> 16) & 0x00000f00) | ((r >> 4) & 0x0000f000)
| ((r >> 24) & 0x000f0000) | ((r >> 12) & 0x00f00000)
| ((r >> 32) & 0x0f000000) | ((r >> 20) & 0xf0000000);
}

In this example, the variable `x`

holds the initial state
of the BCD group, the variable `y`

holds the BCD registers
themselves without carry flags, the variable `z`

holds
carry flags, the variable `r`

holds the final state.

This algorithm can be used as follows (assuming that the hexadecimal
string describes the number of bytes that is divisible by four, this
issue can be handled very easily by padding the beginning of the string
with the character `0`

):

`#include <stddef.h> #include <stdint.h> uint8_t hex_string_to_bytes ( const char * const hex, const uint8_t * const buffer, const size_t size ) { size_t i = 0; for ( i = 0; i < size; i += 4 ) * (uint32_t *) (buf + i) = hex_to_four_bytes ( hex + i * 2 ); return buf; }`

Depending on the use of inlining hints in the source code, the level
of compiler optimisations, and possibly manual loop unrolling
in the function `hex_string_to_bytes`

(in the case of short
hexadecimal strings of predefined length), this algorithm can be as much
as 2.75 times faster than the common approach.

A pleasant surprise expects the programmer using high levels of compiler
optimisation: the compiler may emit *real* SIMD vector instructions
for the function `hex_to_four_bytes`

. This is the case with
GNU C with `-O3`

on Intel processors. In that case,
the boost in performance can be 4 times or higher, when accounting for
input validation as described below.

In addition to conversion, this algorithm can be extended for validation of input. The validation process boils down to checking the state of BCD registers:

#include <stdint.h>
static inline uint32_t
hex_to_four_bytes ( const char * const s, int * const valid )
{
const uint64_t x = * (uint64_t *) s - 0x3030303030303030;
const uint64_t y = x & 0x0f0f0f0f0f0f0f0f;
const uint64_t z = x & 0x1010101010101010;
const uint64_t z4 = z >> 4;
const uint64_t r = (y + z4) | (z >> 1);
const uint64_t c1 = 0x0909090909090909 - ((z >> 3) | z4);
const uint64_t c2 = x & 0xc0c0c0c0c0c0c0c0;
const uint64_t c3 = ((x & 0x3030303030303030) + 0x1010101010101010)
& 0x3030303030303030;
if ( y > c1 || c2 != 0 || c3 > 0x2020202020202020 || y < z4 )
return *valid = 0;
*valid = 1;
return ((r >> 8) & 0x0000000f) | ((r << 4) & 0x000000f0)
| ((r >> 16) & 0x00000f00) | ((r >> 4) & 0x0000f000)
| ((r >> 24) & 0x000f0000) | ((r >> 12) & 0x00f00000)
| ((r >> 32) & 0x0f000000) | ((r >> 20) & 0xf0000000);
}

In this program, the variables `c1, c2,`

and `c3`

hold values required for validation. I leave figuring out the nature of
the checks performed as an exercise to the reader.

Vadim Penzin, February 19^{th}, 2019

I hereby place this article along with the accompanying source code into the public domain.

I publish this information in the hope that it will be useful,
but without ANY WARRANTY.

You are responsible for any and
all consequences that may arise as the result of using this
information.