Thinking in Bits

As a big proponent of Tonsky’s Performance first philosophy, I believe every software engineer should pay attention to performance from the start. If used in the right places, bitwise operators can multiply(!) the speed of your programs without compromising on readability.

However, unlike the low-hanging performance fruit found in complex frameworks, code that works at the bit level should be introduced early in a project. Serialization protocols are mostly set in stone from day one, as updating them in a production environment normally requires downtime to perform the necessary migrations.

It’s good to practice fundamentals from time to time. You never know when these will come in handy. Sure, if you’re a frontend UI engineer, the & (different from &&!), ^ or ~ operators may be outright new to you. This is not to say that they are useless or “too primitive”: the Apollo Guidance Computer was entirely built from NOR gates.

Armstrong training in the lunar module simulator at Kennedy Space Center (Source)

To explore some use cases of these operators, we’re going to design a program to parse Minecraft’s Chunk Data packet. Video games require various data structures to minimize the number of bytes sent over the wire and thus reduce latency. In particular, this packet contains the game’s block data serialized as a compacted array of values. This data structure must be designed with performance in mind, since a Minecraft server can send up to 150 packets to each player in 50 ms (Alstad, T. et al. (2015). Minecraft Computer Game Performance Analysis and Network Traffic Emulation by a Custom Bot. Science and Information Conference (SAI), 227-236).

A compacted array holds a fixed number (length) of integers of bitsPerValue bits each. It has the following API:

public class CompactedArray {

    public CompactedArray(final int length, final int bitsPerValue) {
        // ...
    }

    public int get(final int index) {
        // ...
    }

    public void set(final int index, final int value) {
        // ...
    }
}

The main difference between CompactedArray and a regular int[] is that it supports words of any size less than or equal to 32. Indeed, if bitsPerValue were 8, 16 or 32, the primitive array counterpart (byte[], short[] or int[]) would outperform the CompactedArray implementation: method invocation and object memory overhead come to mind. In fact, Minecraft used to store block IDs as 4-bit values in a byte array, but they quickly ran out of IDs for new blocks and overhauled the game’s storage and protocol format.

It seems natural that CompactedArray should be backed by a primitive array given their similarities: they both hold a fixed number of values of a certain bit length. But what type should we employ?

In the worst-case scenario (bitsPerValue is 32), retrieving a CompactedArray value from a byte array requires at least 4 memory accesses. Assuming the values are in the processor’s L1 cache, that’s still 4 cycles per byte (Intel Optimization Reference Manual). Multiply that by 20 if the values need to be fetched from system memory. This rough estimate doesn’t even take the JVM’s bounds checking into account. Minimizing memory accesses yields better performance. Since bitsPerValue $\in \{1, \dots, 32\}$, a short array still requires 2 memory accesses if bitsPerValue is greater than 16. See the pattern? As the backing array word size increases (the amount of bits per value is known as the word size), the number of required memory accesses decreases. This means a long array is the best candidate:

public class CompactedArray {

    private final long[] data;
    private final int bitsPerValue;

    public CompactedArray(final int length, final int bitsPerValue) {
        this.bitsPerValue = bitsPerValue;
        // TODO Initialize data array
    }

    // ...
}

A new question arises: how big should the data array be? For brevity, let $l$ and $b$ be length and bitsPerValue respectively. The problem is equivalent to finding the least multiple of 64 that can store $lb$ bits. Mathematically, this is

$$\text{data array size} = \left\lceil \frac{lb}{64} \right\rceil,$$

where $\lceil \cdot \rceil$ denotes the ceiling function. For example, if we wanted to store 32 values of 5 bits each, the data array length would be $\left\lceil \frac{32 \times 5}{64} \right\rceil = 3$.

public class CompactedArray {

    public CompactedArray(final int length, final int bitsPerValue) {
        // Using Math.ceil is suboptimal, see Appendix 1
        this.data = new long[(int) Math.ceil(length * bitsPerValue / 64D)];
        // ...
    }

    // ...
}

A CompactedArray can now be constructed, but it’s pretty useless in its current form. Let’s first work on value retrieval. The choice of a long array as the backing structure means a value can be found within a single long; or overlapping between 2 longs, where the least significant bits (LSBs) can be found in the first long, and the remaining ones at the beginning of the second. More formally, given a value starting at position $0 \leq s < 64$ of a long, the number of bits of that value stored in the long is $1 \leq \min\{b, 64 - s\} \leq b$, meaning $\max\{b - (64 - s), 0\}$ bits are stored in the next long. As an exercise, find how many bytes a value of $b$ bits can overlap.

Given a CompactedArray value with index $i$, we know its first bit is stored at position $bi$, which lies within the $\left\lfloor \frac{bi}{64} \right\rfloor$-th long (the notation makes this seem worse than it is). In code,

public int get(final int index) {
    int bitOffset = index * bitsPerValue;
    // Integer division truncates the result
    int offset = bitOffset / 64;
    // ...
}

The right-shift operator

It’s time to introduce the first bitwise operator: the right shift (>>). Imagine we want to compute $a / b$. If $b$ can be expressed as a power of 2, i.e. $b = 2^x$ for some $x$, right shifting $a$ by $x$, a >> x, is equivalent to dividing $a$ by $b$.

In Algebra class you probably learnt that dividing powers with the same base is equivalent to subtracting their exponents. Notice that every bit of an int represents a power of 2. For example, letting bitOffset be 197, offset is

$$\left\lfloor \frac{197}{64} \right\rfloor = \left\lfloor \frac{2^7 + 2^6 + 2^2 + 2^0}{2^6} \right\rfloor = \left\lfloor 2^1 + 2^0 + 2^{-4} + 2^{-6} \right\rfloor = 2^1 + 2^0 = 3.$$

As 64 can be expressed as $2^6$, bitOffset / 64 is the same as bitOffset >> 6. We’re not interested in the fractional part (the negative powers of 2), so it is discarded.
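A quick sanity check (the class name and test values are arbitrary). Note the equivalence only holds for non-negative dividends: for negative ones, >> rounds toward negative infinity while / truncates toward zero.

```java
public class ShiftDivision {
    public static void main(String[] args) {
        int bitOffset = 197;
        // Both print 3: the right shift replaces the division
        System.out.println(bitOffset / 64);
        System.out.println(bitOffset >> 6);
        // Equivalent for any non-negative int
        System.out.println((1_000_000 / 64) == (1_000_000 >> 6)); // true
        // But not for negative ones: -1 / 64 is 0, while -1 >> 6 is -1
    }
}
```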

The bitwise AND operator

Having computed the data array index at which the LSBs of the value can be found, it’s time to figure out which bits we need to extract from it. The starting bit position within the long is $s = bi \bmod 64$. In code,

public int get(final int index) {
    int bitOffset = index * bitsPerValue;
    int offset = bitOffset >> 6;
    int startBit = bitOffset % 64;
    // ...
}

This works, but a modulo reduction involves a division, and divisions are much more expensive than bitwise operations (see “A fast alternative to the modulo reduction”). We introduce the AND (&) operator. Given two binary numbers $a, b$, we want to obtain which 1-bits they have in common. For each power of 2, this operator compares the bits from each argument: if both are 1, the resultant bit is 1. Otherwise, it is 0. For instance, $1001 \mathbin{\&} 0101 = 0001$.

Representing bitOffset in binary, i.e. as $\sum_{j=0}^{31} b_j 2^j$ where $b_j$ is the $j$-th bit of bitOffset, one can factorize and represent this value as $2^6(b_6 2^0 + \dots + b_{31} 2^{31-6}) + b_5 2^5 + \dots + b_0 2^0$. Reducing this value modulo $64 = 2^6$ results in the non-factorized part $b_5 2^5 + \dots + b_0 2^0$. This is the same as extracting the 6 first bits using bitOffset & 0b111111, which is much faster. This technique is known as bit masking.
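The same sanity check for the masked modulo (class name and values are arbitrary):

```java
public class MaskModulo {
    public static void main(String[] args) {
        int bitOffset = 197;
        // Both print 5: the mask keeps the 6 least significant bits
        System.out.println(bitOffset % 64);
        System.out.println(bitOffset & 0b111111);
    }
}
```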

The problem is now divided in two: if the value fits in the long (i.e. $b \leq 64 - (bi \bmod 64)$) we need to extract $b$ bits by applying a bit mask. However, if the value overlaps two longs, we need to join the least and most significant bits (MSBs).

The first case can be expressed as a “power-of-2 filter”. The value is equal to the sum of all bits that lie between $s$ and $s + b$. That is,

$$\text{value} = \sum_{j=s}^{s + b - 1} b_j 2^{j - s}.$$

The inequality $s \leq j < s + b$ is equivalent to $0 \leq j - s < b$. Applying this shift by $s$ in code is expressed as data[offset] >>> startBit (the extra > tells Java to also shift the sign bit; JLS §15.19), moving bits in positions $[s, s + b)$ to $[0, b)$. However, this right shift alone is insufficient, since it also moves the bits in positions $s + b$ and above down with it. To ignore these, we apply a bitmask that only keeps the first $b$ bits:

public class CompactedArray {

    // ...
    private final long bitmask;

    public CompactedArray(final int length, final int bitsPerValue) {
        // ...
        // Left shifting 0b1 by bitsPerValue results in 2^bitsPerValue.
        // If bitsPerValue were 5, 2^5 - 1 = 0b100000 - 0b1 = 0b011111,
        // the bitmask we're interested in.
        this.bitmask = (1L << bitsPerValue) - 1;
    }

    public int get(final int index) {
        int bitOffset = index * bitsPerValue;
        int offset = bitOffset >> 6;
        int startBit = bitOffset & 0b111111;

        return (int) ((data[offset] >>> startBit) & bitmask);
    }
}
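To see the shift-and-mask at work outside the class, here is a small sketch with a handcrafted long (5-bit values; the neighboring bits are set on purpose to show the mask discarding them):

```java
public class ShiftAndMask {
    public static void main(String[] args) {
        long bitmask = (1L << 5) - 1; // 0b11111
        // Place the 5-bit value 0b10110 at bit position 10,
        // surrounded by (irrelevant) neighboring bits
        long word = (0b10110L << 10) | 0b1111111111L | (0b101L << 15);
        long value = (word >>> 10) & bitmask;
        System.out.println(Long.toBinaryString(value)); // 10110
    }
}
```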

The left-shift operator

What about values overlapped between two longs? Since the LSBs of the value are stored in the left-most bits $s, \dots, 63$ of the first long, Java fills the MSBs of data[offset] >>> startBit with zeros after shifting. The first long contains $64 - s$ bits of the final value, meaning the next long contains the remaining $b - (64 - s)$ bits. Somehow, we need to join the LSBs from the first long (which we already have) with the MSBs stored as the LSBs of the second long, data[offset + 1].

Directly adding these bits would yield incorrect results, since the MSBs are misplaced: with a bitsPerValue of 18, the value at index 3 would span bits 54 through 63 of the first long and bits 0 through 7 of the second. However, if we shift the latter $64 - 54 = 10$ positions to the left, both pieces fit together like puzzle pieces.

More rigorously, the number of positions we have to shift the MSBs by is exactly the number of bits we already have from the first long, $64 - s$.

Contrary to its sibling, left shifting $a$ by $b$, a << b, is equivalent to multiplying $a$ by $2^b$. To sum up, given the bits $b_0, \dots, b_{b - (64 - s) - 1}$ from the second long,

$$\text{value} = \text{LSBs} + \sum_{j=0}^{b - (64 - s) - 1} 2^{j + 64 - s} b_j,$$

which in code is

int shiftBy = 64 - startBit;
long msbs = data[offset + 1] << shiftBy;
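The two puzzle pieces can be checked in isolation. The sketch below hand-places an arbitrary 18-bit value at startBit 54, as in the example above, and joins the pieces with an addition (matching the formula; the OR operator introduced next does the same job):

```java
public class OverlapJoin {
    public static void main(String[] args) {
        int value = 0b11_0000_1111_0101_0110; // arbitrary 18-bit value
        // Hand-place the 10 LSBs at bits 54..63 of the first long
        // and the 8 MSBs at bits 0..7 of the second long
        long first = ((long) (value & 0b11_1111_1111)) << 54;
        long second = value >>> 10;

        long lsbs = first >>> 54;        // 10 bits back in positions 0..9
        long msbs = second << (64 - 54); // 8 bits moved to positions 10..17
        System.out.println(lsbs + msbs == value); // true
    }
}
```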

The bitwise OR operator

Given two binary numbers $a, b$, the OR (|) operator tells us which bits of $a$ or $b$ are 1. For each position, the operator compares the bits from each argument: if either or both of them are 1, the resultant bit is 1. Otherwise, it is 0. For example, $1100 \mid 1010 = 1110$.

Since the LSBs and MSBs of value don’t overlap (all the bits to the left of the LSBs are zero, and all the bits to the right of the MSBs are zero), $\text{LSBs} \mid \text{MSBs}$ yields the value we’re looking for!

public int get(final int index) {
    // Apply the bitmask to zero out the bits of `msbs` we're not interested in
    return (int) ((lsbs | msbs) & bitmask);
}

Given an index, all that’s left is to check which case we’re in: does the value fit in a single long or does it overlap? This is easy: we only need to check whether the last bit $s + b$ is greater than 64. Another alternative is computing the second long’s offset and checking whether it is different from offset:

int nextOffset = (bitOffset + bitsPerValue) >> 6;

if (offset != nextOffset) {
    // The starting bit and ending bit are in different longs
}

On my machine, this is 1.7x slower than the greater-than comparison (JMH benchmark). Generally, integer equality and inequality comparisons take the same number of CPU cycles (this SO answer explains it in more detail). The second method introduces a right shift, which takes an extra clock cycle and confuses the branch predictor. I challenge you to create a (faster) branchless version: I’ve tried and failed.

Putting it all together,

public int get(final int index) {
    int bitOffset = index * bitsPerValue;
    int offset = bitOffset >> 6;
    int startBit = bitOffset & 0b111111;

    // Get least significant bits
    long value = data[offset] >>> startBit;

    if (startBit + bitsPerValue > 64) {
        int shiftBy = 64 - startBit;
        // OR shifted most significant bits
        value |= data[offset + 1] << shiftBy;
    }

    return (int) (value & bitmask);
}
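As a standalone check of get (the data array is hand-packed here, since set comes later; the 18-bit configuration, value and index are arbitrary):

```java
public class GetCheck {
    static final int BITS = 18;
    static final long BITMASK = (1L << BITS) - 1;
    static long[] data = new long[2];

    static int get(int index) {
        int bitOffset = index * BITS;
        int offset = bitOffset >> 6;
        int startBit = bitOffset & 0b111111;
        long value = data[offset] >>> startBit;
        if (startBit + BITS > 64) {
            value |= data[offset + 1] << (64 - startBit);
        }
        return (int) (value & BITMASK);
    }

    public static void main(String[] args) {
        int v = 99_999; // arbitrary value stored at index 3
        // Index 3 starts at bit 54: 10 LSBs in data[0], remaining bits in data[1]
        data[0] = ((long) v & 0b11_1111_1111) << 54;
        data[1] = v >>> 10;
        System.out.println(get(3)); // 99999
    }
}
```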

The set method involves the same offsets. Computing the updated data[offset] and data[offset + 1] values is a tad more tricky, though. Given the previous data, this method shall replace the relevant bits with the new integer value of bitsPerValue bits. Its LSBs go in positions $s$ to $\min\{s + b - 1, 63\}$ of the updated data[offset].

We already know how to align these bits. Namely, we apply the bitmask using the AND operator (to ensure the given value has exactly bitsPerValue bits: never trust the caller!) and left shift the LSBs by startBit positions. The resultant value is of the form $0 \dots 0\, b_{b-1} \dots b_0\, 0 \dots 0$. However, a new problem arises: how do we update these bits while leaving the others untouched?

The bitwise complement operator

By zeroing the bits of data[offset] that correspond to the given index, we get two non-overlapping longs we can OR to get the new value. For this, we construct a bitmask such that, when applied to the old value (by using the AND operator), only the bits we don’t want to update remain. If we left shift bitmask by startBit positions we get a long of the form $0 \dots 0 \underbrace{1 \dots 1}_{b \text{ times}} 0 \dots 0$. This is the opposite of what we want.

The bitwise complement (~) operator inverts the bits of its operand. It is a unary operator, meaning it only takes one operand: it makes every 0 bit a 1 and every 1 bit a 0, e.g. ${\sim}1011 = 0100$.
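Two small demonstrations (arbitrary values): inverting a 4-bit pattern, and counting the set bits of an inverted 5-bit mask:

```java
public class ComplementDemo {
    public static void main(String[] args) {
        // ~ flips every bit; masking to 4 bits shows 1011 -> 0100
        // (prints "100": toBinaryString drops the leading zero)
        System.out.println(Integer.toBinaryString(~0b1011 & 0b1111));
        // Inverting a shifted 5-bit mask leaves exactly 64 - 5 = 59 one-bits
        long bitmask = (1L << 5) - 1;
        System.out.println(Long.bitCount(~(bitmask << 10))); // 59
    }
}
```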

We apply the inverted mask and OR the result with the shifted LSBs to get the new value.

public void set(final int index, final int value) {
    int bitOffset = index * bitsPerValue;
    int offset = bitOffset >> 6;
    int startBit = bitOffset & 0b111111;
    long maskedValue = value & bitmask; // Ensure value has bitsPerValue bits

    data[offset] =
        // The previous long bits with b zero bits in positions s to min(s + b, 63)
        (data[offset] & ~(bitmask << startBit))
        // The new value shifted to replace the b zeros
        | (maskedValue << startBit);
}

The second long mask must clear the value’s MSBs, which occupy the low $b - (64 - s)$ bits, so it is of the form $1 \dots 1 \underbrace{0 \dots 0}_{b - (64 - s) \text{ times}}$. We could construct this bitmask and invert it as in the previous example. However, right shifting by $b - (64 - s)$ and then left shifting by this same value effectively clears those bits. The performance of both methods is comparable, but the former requires a bitmask construction (JMH benchmark).

We have assembled the complete set method:

public void set(final int index, final int value) {
    int bitOffset = index * bitsPerValue;
    int offset = bitOffset >> 6;
    int startBit = bitOffset & 0b111111;
    long maskedValue = value & bitmask;

    data[offset] = (data[offset] & ~(bitmask << startBit)) | (maskedValue << startBit);

    if (startBit + bitsPerValue > 64) {
        int shiftBy = 64 - startBit;
        // Clear only the low bits holding this value's MSBs
        int clearBits = bitsPerValue - shiftBy;
        data[offset + 1] = (data[offset + 1] >>> clearBits << clearBits) | (maskedValue >>> shiftBy);
    }
}
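Putting everything together in runnable form, here is a minimal, self-contained sketch (no bounds checks, a fixed bitsPerValue of 18, and illustrative names; the documented class lives in the gist) together with a round-trip check:

```java
import java.util.Random;

public class CompactedArrayDemo {
    static final int BITS = 18;
    static final long BITMASK = (1L << BITS) - 1;
    static final long[] data = new long[(int) Math.ceil(4096 * BITS / 64D)];

    static void set(int index, int value) {
        int bitOffset = index * BITS;
        int offset = bitOffset >> 6;
        int startBit = bitOffset & 0b111111;
        long maskedValue = value & BITMASK;

        data[offset] = (data[offset] & ~(BITMASK << startBit)) | (maskedValue << startBit);

        if (startBit + BITS > 64) {
            int shiftBy = 64 - startBit;
            // Clear only the low BITS - shiftBy bits, which hold this value's MSBs
            int clearBits = BITS - shiftBy;
            data[offset + 1] = (data[offset + 1] >>> clearBits << clearBits)
                    | (maskedValue >>> shiftBy);
        }
    }

    static int get(int index) {
        int bitOffset = index * BITS;
        int offset = bitOffset >> 6;
        int startBit = bitOffset & 0b111111;
        long value = data[offset] >>> startBit;
        if (startBit + BITS > 64) {
            value |= data[offset + 1] << (64 - startBit);
        }
        return (int) (value & BITMASK);
    }

    public static void main(String[] args) {
        // Write 4096 random 18-bit values, then read them all back
        Random rng = new Random(42);
        int[] expected = new int[4096];
        for (int i = 0; i < expected.length; i++) {
            expected[i] = rng.nextInt(1 << BITS);
            set(i, expected[i]);
        }
        boolean allMatch = true;
        for (int i = 0; i < expected.length; i++) {
            allMatch &= get(i) == expected[i];
        }
        System.out.println(allMatch); // true
    }
}
```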

For better readability, one should replace 6 and 0b111111 by named constants. The documented class (with bounds checks and additional methods) and associated tests are available on this gist.

Appendix 1: Fast ceiling division

Given $a, b \in \mathbb{Z}^+$, we want $\left\lceil \frac{a}{b} \right\rceil$. Note this is not the same as $\left\lfloor \frac{a}{b} \right\rfloor + 1$, which could be implemented with simple integer truncation. For $a = b$ the former yields 1 while the latter gives 2. In fact, the result is off by 1 iff $b$ divides $a$.

One could fiddle with Java’s Math.floorDiv function and the remainder operator, but I find this solution by Kavi Gupta quite elegant and fast:

/**
 * Returns the smallest integer that is greater than or equal to the given algebraic quotient.
 *
 * @param x the dividend
 * @param y the divisor
 * @return the smallest {@code int} value that is greater than or equal to the algebraic
 *         quotient.
 * @throws ArithmeticException if the divisor {@code y} is zero
 * @see Math#floorDiv(int, int) for special cases involving {@link Integer#MAX_VALUE}
 */
public static int ceilDiv(final int x, final int y) {
    return -Math.floorDiv(-x, y);
}
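A couple of spot checks (using the example from the main text; the class name is illustrative):

```java
public class CeilDivDemo {
    public static int ceilDiv(final int x, final int y) {
        return -Math.floorDiv(-x, y);
    }

    public static void main(String[] args) {
        System.out.println(ceilDiv(32 * 5, 64)); // 3: the data array example
        System.out.println(ceilDiv(64, 64));     // 1: exact division, no off-by-one
        System.out.println(ceilDiv(65, 64));     // 2
    }
}
```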