Monday, March 10, 2025

Unicode surrogates

At the highest level, Unicode is simply a list of symbols. But when you look closer you find that isn’t entirely true. Some of the symbols are sorta meta symbols. And while a list of symbols is not complicated, this list is adjacent to a lot of complexity.

I’ve explored various rabbit holes of Unicode in previous posts, and this time I’d like to go down the rabbit hole of surrogates. These came up in the recent post on how LLMs tokenize Unicode characters.

Incidentally, every time I hear “surrogates” I think of the Bruce Willis movie by that name.

Historical background

Once upon a time, text was represented in ASCII, and life was simple. Letters, numbers, and a few other symbols all fit into a single eight-bit byte with one bit left over. There wasn’t any need to distinguish ASCII and its encoding because they were synonymous. The nth ASCII character was represented the same way as the number n.

The A in ASCII stands for American, and ASCII was sufficient for American English, sorta. Then people played various tricks with the leftover bit to extend ASCII by adding 128 more symbols. This was enough, for example, to do things like include the tilde on the n in jalapeño, but representing anything like Russian letters, math symbols, or Chinese characters was out of the question.

So along came Unicode. Instead of using one byte to represent characters, Unicode used two bytes. This didn’t double the possibilities; it squared them. Instead of 128 ASCII symbols, or 256 extended ASCII symbols, there was the potential to encode 2¹⁶ = 65,536 symbols. Surely that would be enough! Except it wasn’t. Soon people wanted even more symbols. Currently Unicode has 154,998 code points.

Encoding

The most obvious way to represent Unicode characters, back when there were fewer than 2¹⁶ characters, was as a 16-bit number, i.e. to represent each character using two bytes rather than one. The nth Unicode character could be represented the same way as the number n, analogous to the case of ASCII.
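
We can see this directly in Python by encoding to UTF-16. For a character in the original 16-bit range, the encoded bytes are just the number n; the Greek letter λ below is an arbitrary example.

    >>> ch = "λ"    # U+03BB, within the original 16-bit range
    >>> int.from_bytes(ch.encode("utf-16-be"), "big") == ord(ch)
    True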

But while text could potentially require thousands of symbols, a lot of text can be represented just fine with ASCII. Using two bytes per character would make the text take up twice as much memory. UTF-8 encoding was a very clever solution to this problem. Pure ASCII (i.e. not extended) text is represented exactly as in ASCII, taking up no more memory. And if text is nearly all ASCII with occasional non-ASCII characters sprinkled in, the size of the text in memory increases in proportion to the number of non-ASCII characters. Nearly pure ASCII text takes up nearly the same amount of memory.
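
A quick check in Python (the sample strings are arbitrary): pure ASCII text encodes to one byte per character, and each occasional non-ASCII character adds only a little.

    >>> len("hello".encode("utf-8"))      # 5 ASCII characters, 5 bytes
    5
    >>> len("jalapeño".encode("utf-8"))   # 8 characters, but ñ takes 2 bytes
    9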

But the cost of UTF-8 is encoding. The bit representation of the nth character is no longer necessarily the bit representation of the integer n.
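
For example, ñ is code point U+00F1, i.e. 241, but its UTF-8 representation is a different pair of bytes:

    >>> ord("ñ")
    241
    >>> "ñ".encode("utf-8")
    b'\xc3\xb1'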

UTF-16, while wasteful when representing English prose, is a direct encoding. Or at least it was until Unicode spilled over its initial size limit. You need more than a single pair of bytes to represent more than 2¹⁶ characters.

Surrogate pairs

The way to represent more than 2¹⁶ symbols with 16-bit characters is to designate some of the “characters” as meta characters. Each of these special characters represents half of a pair of 16-bit units that together correspond to a single Unicode character.

There are 1024 characters reserved as high surrogates (U+D800 through U+DBFF) and 1024 reserved as low surrogates (U+DC00 through U+DFFF). High surrogates contain the high-order bits and low surrogates contain the low-order bits.

Let n ≥ 2¹⁶ be the code point of a character and let m = n − 2¹⁶, a 20-bit number. Then the last 10 bits of the high surrogate are the highest 10 bits of m, and the last 10 bits of the low surrogate are the lowest 10 bits of m.

In terms of math rather than bit representations, the pair of surrogates (H, L) represents code point

2¹⁶ + 2¹⁰(H − 55296) + (L − 56320)

where 55296 = 0xD800 and 56320 = 0xDC00.
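
Here is the formula and its inverse transcribed into Python, as a minimal sketch (the function names are my own):

    def from_surrogates(H, L):
        """Combine a surrogate pair into the code point it represents."""
        return 2**16 + 2**10*(H - 0xD800) + (L - 0xDC00)

    def to_surrogates(n):
        """Split a code point n >= 2**16 into its (high, low) surrogate pair."""
        m = n - 2**16                  # 20-bit number
        return 0xD800 + (m >> 10), 0xDC00 + (m & 0x3FF)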

Example

The rocket emoji has Unicode value U+1F680. The bit pattern representing the emoji in UTF-16 is 0xD83DDE80, the combination of high surrogate U+D83D and low surrogate U+DE80. We can verify this with the following Python.

    >>> H, L = 0xD83D, 0xDE80
    >>> 2**16 + 2**10*(H - 0xD800) + L - 0xDC00 == 0x1F680
    True

When we write out the high and low surrogates in binary we can visualize how they contain the high and low bits of the rocket’s code point.

    H  D83D   1101100000111101
    L  DE80   1101111010000000
       1F680  11111011010000000
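
This table is easy to reproduce with Python’s binary string formatting:

    >>> for v in (0xD83D, 0xDE80, 0x1F680):
    ...     print(f"{v:5X}  {v:b}")
    ...
     D83D  1101100000111101
     DE80  1101111010000000
    1F680  11111011010000000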

High private use surrogates

The high surrogate characters U+D800 through U+DBFF are divided into two ranges. The characters U+D800 through U+DB7F are simply called high surrogates, but the remaining characters U+DB80 through U+DBFF are called “high private use surrogates.”

These private use surrogates do not work any differently than the rest. They just correspond to code points in Unicode’s private use area.
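
We can confirm this with the formula above: the smallest high private use surrogate paired with the smallest low surrogate gives U+F0000, the start of a supplementary private use plane, and Python’s unicodedata module classifies that code point as private use (category “Co”).

    >>> import unicodedata
    >>> n = 2**16 + 2**10*(0xDB80 - 0xD800) + (0xDC00 - 0xDC00)
    >>> hex(n)
    '0xf0000'
    >>> unicodedata.category(chr(n))
    'Co'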
