Hugo Manrique

Varints

Variable length encoding is a technique used to compress down integers into a smaller number of bytes than is normally needed, but how does it work?

While coding Jacobin, a binary stream reader library, I had to learn about all the low-level details of serializing data such as endianness, character encodings... and one aspect I believe was fundamental to understand was how variable length encoding works.

Variable length encoding is a technique used to compress down integers into a smaller number of bytes than is normally needed. Generally, computers use fixed length types, meaning data is stored in a fixed number of bytes in memory, which makes performing arithmetic and casting require less CPU cycles. However, when transmitting data through the network or storing them on disk, it's important to compress them down in order to save bandwidth and decrease disk usage, which generally results in hefty performance gains and reduced infrastructure costs.

Let's imagine we are in charge of designing a thermometer app that periodically stores temperature delta values on disk. If our app stored these values as a fixed length integer type, say a 64-bit signed integer, we would be wasting space most of the time, because delta values tend to have a much smaller range. Specifically, if most delta values fit in a single byte, the distribution of delta values would tend to follow a Poisson distribution with λ1\lambda\approx1, where λ\lambda is the average non-zero bytes needed to represent a delta value.

Distribution of byte lengths

Varints are based on this same premise: smaller numbers in computing are more common than larger ones. The trade-off variable length encoding makes is to spend more bytes to represent larger numbers, and less to represent smaller ones. In this post we explore one of the most common ways of encoding varints: continuation bits.

Little Endian Base-128 encoding

RPC libraries like Google Protocol buffers or network protocols from games such as Minecraft use this technique, where the main idea is to split the number we want to encode into 7-bit blocks and use a byte (8 bits) to encode each one of them. The 7 lower bits of each byte are used to store the binary representation of the block, and the most significant bit, called the continuation bit, indicates whether there are further bytes to come, i.e. if its value is 1 it means the byte is a continuation block, whereas a 0 value means that the block is the termination block.

Coming back to the thermometer app example, all delta values in the range 0 through 127 would only need one byte to be encoded (this is why this encoding is base-128). Let's encode the number 23:

2310=24+22+21+20=101112binary representation23_{10} = 2^4+2^2+2^1+2^0 = \underbrace{10111_2}_{\mathrm{binary\ representation}}

This number only has 5 bits, so it fits in a single 7-bit block. This number would be padded with zeros and sent as 0001 0111, where the top (leftmost) bit is 0, indicating there are no more bytes coming. We now know how to encode the first 128 integers, but what about all the other ones? Let's encode the number 62129 step by step:

6212910=1111001010110001262129_{10} = 1111001010110001_2

We first split up the number into 7-bit blocks (starting from the least-significant block):

11
1100101
0110001

Then, we set the most significant bit of each block to 1, except for the leftmost block, the termination block, which we will pad with zeros instead:

00000011
11100101
10110001

You might think this doesn't make sense, because 1 indicates a continuation block and 0 the termination block, but remember this encoding is little endian, which means the least significant group comes first, so we encode the number by reversing the order of the blocks:

10110001
11100101
00000011

That's it! The number 62129 can be encoded using 3 bytes. Notice how the top bit of each byte now tells if there's more blocks coming: if it's a 1, we know we need to keep listening to receive the full number, whereas a 0 value means that the block is the termination block.

Decoding happens in reverse order: for each byte we receive, we need to take the 7 least-significant bits by using a bit mask, specifically block & 0x7F, multiply them by 27n2^{7n}, where nn is the current (0-based) index of the input type, and break out of the loop when we encounter a byte that has a continuation bit equal to 0.

Varints are really powerful since they can encode a number of arbitrary size! With a fixed length representation you are limited by a maximum value, but with varints you can encode big and small numbers alike. Now, this doesn't mean you should change all your integer types to varints, because size reduction comes at a performance cost. Before doing any type change you should ask yourself this questions:

  • Do the values I send/store follow a distribution varints would be beneficial for? If the probability of sending a larger integer is the same as of a smaller one, variable length encoding won't really decrease, and can even increase (remember larger integers take a larger number of bytes than their binary representation, specifically 8/78/7ths more) the space you need to represent your data.
  • Does the performance gained by decreasing network delay or disk read times outweight the varint encoding/decoding overhead?

Pros

Base-128 encoding has several advantages over other variable length encoding methods, here's some of them:

  • Varints are byte aligned, which means there's no need to pad encoded numbers to a byte boundary.
  • Highly space-efficient. Any 64-bit number can be encoded in at most 10 bytes.
  • Varints are self-delimiting, which means you can concatenate them. Because it is always clear where one value ends and another begins, it's possible to write them out in order. Note that you should still check for overflows in case you miss a UDP packet or a disk read is erroneous.

Java implementation

Here's a Java implementation of little endian base-128 encoding and decoding:

static final int VALUE_MASK = 0b01111111;
static final int CONTINUATION_BIT_MASK = 0b10000000;

public static int readVarInt(InputStream stream) throws IOException {
    int readBytes = 0;
    int result = 0;

    byte current;

    do {
        current = (byte) stream.read();

        // Get the block value
        byte value = (byte) (current & VALUE_MASK);

        // Shift block by 7 * readBytes and sum it to result
        result |= (value << (7 * readBytes));

        readBytes++;

        // Check if the value overflows
        if (readBytes > 5) {
            throw new RuntimeException("VarInt is too big");
        }
    } while ((current & CONTINUATION_BIT_MASK) != 0); // Break if `current` is a termination byte

    return result;
}
public static void writeVarInt(OutputStream stream, int value) throws IOException {
    do {
        byte current = (byte) (value & VALUE_MASK);

        // Remove the block (7 bits) read in `current`
        // The >>> operator also shifts the sign bit
        value >>>= 7;

        if (value != 0) {
            // We still have more blocks
            current |= CONTINUATION_BIT_MASK;
        }

        stream.write(current);
    } while (value != 0);
}

There is a large amount of other really cool encoding schemes with diverse trade-offs. Systematically analyzing the distribution of your numbers is the key to choosing or even creating your own encoding scheme.

Recommended reads

© 2019 Hugo Manrique