Eric's Place
Welcome to the DEEP WEB. No... wait... still the regular web.

Everyone’s a đÎãçřÏŤĭċ

TLDR: You can just use a giant lookup table to ignore accented characters in a search. But it’s interesting, I swear.

If you’re reading this, you’ve probably encountered an accented character before. Or better yet, seen a name with accented characters in it (Éric). We code-slingers just know there’s this magic thing called Unicode that handles all this for us, so we don’t have to think too hard about it. That is, of course, until it jumped up and hit me.

Searching by Name

So you want to implement a search feature that looks someone up by name? Simple! Just filter the results that match the characters you typed in so far (let’s ignore the fancier methods such as Soundex for now). But you notice something peculiar when you search for your dear friend ‘Éric’: He’s nowhere to be found, because you typed in ‘Eric’. Well that’s easy enough to fix, just strip off the ‘´’ from the ‘E’ when matching the characters, right? Riiiiiight?
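To see the failure concretely, here’s a minimal sketch of that naive filter, assuming the names live in UTF-8 encoded std::strings (byte-wise comparison is all std::string gives us):

    #include <iostream>
    #include <string>
    #include <vector>

    int main() {
        // Naive prefix filter: std::string compares raw bytes with no
        // Unicode awareness, so 'É' (0xC3 0x89 in UTF-8) never matches 'E' (0x45).
        std::vector<std::string> names = {"Éric", "Eric", "Erich"};
        std::string query = "Eric";
        for (const auto& name : names) {
            if (name.compare(0, query.size(), query) == 0)
                std::cout << name << '\n'; // prints "Eric" and "Erich", never "Éric"
        }
    }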

Maybe

In the King’s English, the accent on a character is called a ‘diacritical mark’, or ‘diacritic’. This is important because we need to at least use the right lingo when searching Stack Overflow for what to do (“consulting the literature”, as it’s known in the medical field). With that in hand, it turns out that most frameworks and SDKs provide a way of ignoring diacritics when matching strings: iOS has NSDiacriticInsensitiveSearch, C# has culture context and CompareOptions.IgnoreNonSpace, and Android, as usual, likes to kick you in the nuts:

You can set a Collator’s strength property to determine the level of difference considered significant in comparisons. Four strengths are provided: PRIMARY, SECONDARY, TERTIARY, and IDENTICAL. The exact assignment of strengths to language features is locale dependant. For example, in Czech, “e” and “f” are considered primary differences, while “e” and “ě” are secondary differences, “e” and “E” are tertiary differences and “e” and “e” are identical. The following shows how both case and accents could be ignored for US English.

Source: Android Developer Docs

But what if we’re just writing a program in plain ol’ C++ without a nice framework to fall back on? Well, there’s the ICU project that provides what we’d need, but it kind of sucks to have to rope in a whole internationalization library just for one feature. Is there a lighter way?

Anatomy of an É

In order to find a solution, we as usual need to know exactly what’s going on. First of all, what exactly are Unicode and UTF-8, and why do we need them in order to display accented characters? The short explanation is that there are way too many characters across the world’s languages for poor old ASCII (7-bit - 128 chars) to handle, so some clever minds got together and assigned every character a much bigger number, a ‘code point’ (code points top out at U+10FFFF, which needs 21 bits, so they fit comfortably in a 32-bit value). Now we have enough space for all the languages on Earth (and a whole bunch of emojis), but suddenly your favorite blog with English-only characters takes four times as much bandwidth! Ack!

We need to transmit only the minimum number of bytes required to construct the desired code point - that way English characters can continue to be represented by a single byte, and other languages won’t take more space than they need. We find our solution in the fact that ASCII is 7 bits, yet a byte in a modern computer is 8 bits, which leaves the top bit free to act as a marker. If the top bit is clear, the byte is plain ASCII. If it’s set, the number of leading 1 bits in the first byte tells us how many bytes make up the whole sequence, and every subsequent byte is flagged as a continuation by starting with ‘10’:

Number of bytes  First code point  Last code point  Byte 1    Byte 2    Byte 3    Byte 4
1                U+0000            U+007F           0xxxxxxx
2                U+0080            U+07FF           110xxxxx  10xxxxxx
3                U+0800            U+FFFF           1110xxxx  10xxxxxx  10xxxxxx
4                U+10000           U+10FFFF         11110xxx  10xxxxxx  10xxxxxx  10xxxxxx

Source: Wikipedia
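As a quick illustration (my own sketch, not from any library), counting those leading 1 bits is all it takes to find a sequence’s length:

    #include <cstdint>

    // Returns how many bytes a UTF-8 sequence occupies, judging by its
    // lead byte, or -1 if the byte is a continuation byte (10xxxxxx).
    int utf8_sequence_length(uint8_t lead) {
        if ((lead & 0x80) == 0x00) return 1; // 0xxxxxxx: plain ASCII
        if ((lead & 0xE0) == 0xC0) return 2; // 110xxxxx
        if ((lead & 0xF0) == 0xE0) return 3; // 1110xxxx
        if ((lead & 0xF8) == 0xF0) return 4; // 11110xxx
        return -1;                           // 10xxxxxx: continuation byte
    }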
This gives us UTF-8, which is a way of encoding Unicode code points using a sequence of 1-4 bytes. UTF-16 encodes them using one or two 16-bit words (the two-word ‘surrogate pairs’ cover code points above U+FFFF), and UTF-32 encodes them directly with a single 32-bit value. The Unicode code points remain the same, but the manner of expressing them changes depending on the encoding scheme used. UTF-8 is currently the most common, especially on the web, but UTF-16 remains a holdout in some areas such as Windows APIs and the Java programming language (a char in Java is 2 bytes for this reason). UTF-32 is used…somewhere, by someone who insists it’s better than all the others, I’m sure.
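For instance, here’s the same code point, U+00E9 (‘é’), under each scheme (again, a sketch of my own):

    #include <cstdio>

    int main() {
        const unsigned char utf8[] = {0xC3, 0xA9}; // UTF-8: two bytes
        const char16_t utf16 = u'\u00E9';          // UTF-16: one 16-bit word
        const char32_t utf32 = U'\u00E9';          // UTF-32: one 32-bit value
        std::printf("UTF-8:  %02X %02X\n", utf8[0], utf8[1]); // C3 A9
        std::printf("UTF-16: %04X\n", (unsigned)utf16);       // 00E9
        std::printf("UTF-32: %08X\n", (unsigned)utf32);       // 000000E9
    }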

Another important piece of the puzzle is how exactly an accented character is represented by Unicode: Usually, a single character (e.g. ‘é’) will be represented by a specific code point (U+00E9), but you can also take a non-accented character (e.g. ‘e’) and add the diacritical mark after it as a separate, combining character (‘e’ (U+0065) + combining grave accent (U+0300) = è). The former (most common) is called a precomposed character, and the latter, incredibly, is called a decomposed character. Libraries like ICU let you normalize between the two forms, so removing a diacritic is as simple as converting a character to decomposed format and yeeting the diacritic. But we want to avoid ICU, so we’re not gonna do that.
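You can poke at this yourself without any library - the two forms render identically but compare as different byte strings (a sketch with hand-written UTF-8 bytes):

    #include <iostream>
    #include <string>

    int main() {
        std::string precomposed = "\xC3\xA9";  // U+00E9: 'é' as one code point
        std::string decomposed  = "e\xCC\x81"; // 'e' + U+0301 COMBINING ACUTE ACCENT
        // Both display as 'é', but byte-wise they are different strings.
        std::cout << precomposed << " vs " << decomposed << ": "
                  << (precomposed == decomposed ? "equal" : "not equal") << '\n';
    }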

A Giant Lookup Table

Yeah, my solution is really that boring, unfortunately. I noticed that there are quite a few solutions floating around GitHub specifically for node.js, which implement diacritics removers using a lookup table. We can use this as a starting point to generate a giant switch-case for C++ (just iterate through the table and console.log() the cases). Another interesting thing I noticed is that every accented character I cared about has a code point above the ASCII range but small enough to fit in a single 16-bit word - in other words, exactly two bytes in UTF-8. So we have a pretty elegant solution (sketched in code after the list):

  • For each UTF-8 character in an std::string, check whether its lead byte marks the start of a 2-byte sequence;
  • If it does, use std::codecvt_utf8_utf16<wchar_t> to convert it to a single UTF-16 word;
  • If that word is a combining diacritic (U+0300 -> U+036F, i.e. decomposed format), skip it;
  • Otherwise, replace it with the value returned by our lookup table.
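Here’s a minimal sketch of how that pipeline could look. To keep it self-contained it decodes the two-byte sequence by hand instead of going through std::codecvt_utf8_utf16, and remove_diacritic() is my own hypothetical stand-in for the generated switch-case (only a few illustrative cases shown):

    #include <cstdint>
    #include <string>

    // Illustrative slice of the generated switch-case; the real table
    // covers hundreds of accented code points.
    char16_t remove_diacritic(char16_t c) {
        switch (c) {
            case u'\u00C9': return 'E'; // É
            case u'\u00E8':             // è
            case u'\u00E9': return 'e'; // é
            case u'\u00F1': return 'n'; // ñ
            default:        return c;   // not in the table: leave unchanged
        }
    }

    std::string strip_diacritics(const std::string& utf8) {
        std::string out;
        for (std::size_t i = 0; i < utf8.size();) {
            uint8_t lead = static_cast<uint8_t>(utf8[i]);
            if ((lead & 0xE0) == 0xC0 && i + 1 < utf8.size()) {
                // 2-byte sequence: decode it into a single UTF-16 word.
                char16_t word = ((lead & 0x1F) << 6) |
                                (static_cast<uint8_t>(utf8[i + 1]) & 0x3F);
                i += 2;
                if (word >= 0x0300 && word <= 0x036F) continue; // combining mark: skip
                char16_t repl = remove_diacritic(word);
                if (repl < 0x80) {
                    out += static_cast<char>(repl); // mapped to plain ASCII
                } else {
                    out += utf8[i - 2];             // unknown: keep original bytes
                    out += utf8[i - 1];
                }
            } else {
                out += utf8[i++]; // ASCII (and longer sequences) pass through
            }
        }
        return out;
    }

Note how characters missing from the table fall through the default case and get copied back untouched, which is exactly the caveat below.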


The downside to this solution is that it assumes the lookup table is complete - any accented character it doesn’t cover passes through untouched. But the good news is we get a fast converter that can live in a single .cpp file. As pedestrian as the solution was, it forced me to finally learn how Unicode works, and honestly even then I seem to just be scratching the surface. Encoding characters, like handling dates and times, is one of those things in programming that seems simple, yet proceeds to lead you down a very deep rabbit hole.