When I’m in the mood to write, you can often follow a chain of thought in my posts. Recently, a post on LLM tokenization led to a post on how Unicode characters are tokenized, which led to a post on Unicode surrogates. The latter ended by touching on Unicode’s PUA (Private Use Area), which of course leads to J.R.R. Tolkien and privacy, as we shall see.
To back up a bit, Unicode started as an attempt to create a universal alphabet, one large enough to contain all the world’s writing systems. Initially it was thought that 2^16 = 65,536 symbols would be enough, but that didn’t last long. The Unicode standard now contains nearly 100,000 Chinese characters alone, along with myriad other symbols such as graphical drawing elements, mathematical symbols, and emoji.
Private Use Area
Although Unicode is vast, it doesn’t have room to include every symbol anyone might use. Unicode anticipated that contingency and reserved some areas for private use. The best-known example is probably Apple’s use of U+F8FF for their logo. The private use area is also being used for ancient and medieval scripts, as well as for fictional writing systems such as Tolkien’s Tengwar and Cirth.
Because anyone can use the private use area as they wish, there can be conflicts. For example, Apple intends U+F8FF to display their logo, but the code point is also used for a Klingon symbol. We’ll see another example of a conflict at the end of this post.
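To make "reserved for private use" concrete, here is a minimal Python sketch that asks the Unicode character database whether a code point is in a private use area. Private use code points carry the general category Co; they live in U+E000 through U+F8FF and in most of planes 15 and 16. The function name is_private_use is my own, chosen for illustration.

```python
import unicodedata


def is_private_use(code_point: int) -> bool:
    """Return True if the code point has general category Co (private use)."""
    return unicodedata.category(chr(code_point)) == "Co"


# U+F8FF is the code point Apple uses for its logo.
print(hex(0xF8FF), is_private_use(0xF8FF))
# U+0041 is LATIN CAPITAL LETTER A, an ordinary assigned character.
print(hex(0x0041), is_private_use(0x0041))
```

Note that the check says nothing about what the code point means. That is decided entirely by whichever font or application ends up rendering it.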
Privacy
Here’s the chain of thought that leads to privacy. When I started thinking about this post, I thought about creating an image of Tengwar writing, and that made me think about font fingerprinting. Browsers let web servers know what fonts you have installed, which serves a benign purpose: a site may want to send you text formatted in a particular font if it’s available but will fall back to a more common font if necessary.
However, the collection of fonts installed on a particular computer may be unique, or at least may be combined with other browser information to uniquely identify a user. Before installing a Tengwar font I thought about how it would be sure to increase the uniqueness of my font fingerprint.
Tolkien’s Tengwar
J. R. R. Tolkien’s Tengwar writing system has been mapped to the private use range U+E000 to U+E07F.
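As a small illustration of that range, the following Python sketch tests whether a character falls between U+E000 and U+E07F. The names TENGWAR_RANGE and in_tengwar_range are hypothetical, and the range reflects only the mapping mentioned above. Since this is private use territory, nothing stops other software from using the same code points for something else.

```python
# Code points U+E000 through U+E07F inclusive, per the mapping described above.
TENGWAR_RANGE = range(0xE000, 0xE07F + 1)


def in_tengwar_range(ch: str) -> bool:
    """Return True if the single character ch lies in U+E000..U+E07F."""
    return ord(ch) in TENGWAR_RANGE


print(in_tengwar_range("\ue000"))  # first code point in the range
print(in_tengwar_range("\uf8ff"))  # private use, but outside this range
```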
Here’s a sample. No telling what it looks like in your browser:
When I view the same text in Tecendil online transcriber I see this:
But when I open the text in Emacs I see this:
Both are legitimate renderings because nobody owns the private use areas. Apparently some font on my laptop uses the same code points that Tecendil uses for Tengwar to display Chinese characters.
I’m especially curious about the last character, U+E000. Tecendil interprets it as the Tengwar symbol tinco but something on my laptop interprets it as the Mozilla mascot.
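When two programs disagree about how to draw a character, it helps to look at the code points themselves rather than the glyphs. Here is a rough Python sketch that lists each character's code point, general category, and official name. The helper describe and the placeholder string "<unnamed>" are my own inventions. Private use characters have no official name, which is why a fallback is needed.

```python
import unicodedata


def describe(text: str) -> list[str]:
    """Describe each character as 'U+XXXX category name'."""
    lines = []
    for ch in text:
        # unicodedata.name raises ValueError for unnamed characters
        # unless a default is supplied; private use characters are unnamed.
        name = unicodedata.name(ch, "<unnamed>")
        category = unicodedata.category(ch)
        lines.append(f"U+{ord(ch):04X} {category} {name}")
    return lines


for line in describe("A\ue000"):
    print(line)
```

The second line of output reports category Co and no name for U+E000. That is all Unicode has to say about it; whether it appears as tinco or as something else is up to the font.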