Ask HN: What are your Unicode woes?

by Rendello on 6/14/2025, 2:48 PM with 11 comments

I've always worked with text, but I only started digging deep into understanding Unicode this year.

What do HN people have to say about Unicode and UTF-{8,16,32}? Are there parts you've never really understood? Have you had unexpected bugs due to misunderstood properties of text?

by Rendello on 6/14/2025, 2:58 PM

I (OP) have been working on some Unicode visualization tooling for a while now. The idea started when I had some buggy string-matching code. I was matching case-insensitively, then using those ranges to highlight the original text.

Turns out, sometimes changing case changes not only the number of bytes (in UTF-8), but the number of encoded characters! This led to my post "UTF-8 characters that behave oddly when the case is changed" [1], which inspired a lot of conversation that taught me a lot. After that, I started reading Unicode documentation in earnest, and building up an idea of what a new tool should show. I'm trying to make clear things I didn't (and sometimes still don't) understand, so I'd love to know what causes pains in the wild / gaps in people's understanding.

1. https://news.ycombinator.com/item?id=42014045
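To make the case-change surprise concrete, here is a minimal Python sketch. The three sample characters and the `describe` helper are illustrative choices, not taken from the linked post, and the results assume Python 3's built-in full case mappings.

```python
def describe(s):
    """Return (number of code points, number of UTF-8 bytes) for s."""
    return len(s), len(s.encode("utf-8"))

sharp_s = "\u00df"    # LATIN SMALL LETTER SHARP S
dotted_i = "\u0130"   # LATIN CAPITAL LETTER I WITH DOT ABOVE
a_stroke = "\u023a"   # LATIN CAPITAL LETTER A WITH STROKE

pairs = [
    ("sharp s, upper", sharp_s, sharp_s.upper()),    # becomes "SS"
    ("dotted I, lower", dotted_i, dotted_i.lower()), # becomes "i" + U+0307
    ("A stroke, lower", a_stroke, a_stroke.lower()), # becomes U+2C65
]

for label, before, after in pairs:
    print(label, describe(before), "->", describe(after))
```

The first two cases change the code point count, and the last two change the UTF-8 byte count, so ranges computed on the case-folded text cannot be applied to the original text without some kind of offset mapping.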

by NoahZuniga on 6/15/2025, 3:16 AM

I guess it's kind of annoying that letters with diacritics can be represented in multiple different ways.
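As a small illustration of the point above, the sketch below compares the precomposed and decomposed spellings of "é" using Python's standard `unicodedata` module. The `same_text` helper is a hypothetical name, and comparing after NFC normalization is one common approach rather than the only one.

```python
import unicodedata

composed = "\u00e9"      # single code point: LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT

def same_text(a, b):
    """Compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

print(composed == decomposed)           # False: different code point sequences
print(same_text(composed, decomposed))  # True: canonically equivalent
```

Normalizing both sides before comparing, hashing, or indexing avoids treating the two spellings as different strings.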

by solardev on 6/15/2025, 12:34 AM

I don't understand the difference between a character, a codepoint, a glyph, and whatever else makes up a single "thing" in Unicode.
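One rough way to see some of these layers is to count the same text in different units. The sketch below is only an illustration: the `units` helper and the sample strings are made up for this example. It covers code points and encoded code units only. Glyphs belong to the font and renderer, and counting user-perceived characters (grapheme clusters) would need a library that implements the Unicode segmentation rules, which the Python standard library does not include.

```python
def units(s):
    """Count s in code points, UTF-8 bytes, and UTF-16 code units."""
    return {
        "code_points": len(s),
        "utf8_bytes": len(s.encode("utf-8")),
        "utf16_code_units": len(s.encode("utf-16-le")) // 2,
    }

accented_e = "e\u0301"               # "e" + combining acute, displayed as one letter
thumbs_up = "\U0001F44D\U0001F3FD"   # thumbs up + skin tone modifier, displayed as one emoji

print(units(accented_e))
print(units(thumbs_up))
```

Both samples are usually displayed as a single "thing", yet each is two code points, and the byte and code unit counts differ again depending on the encoding.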

by 0xCE0 on 6/15/2025, 8:05 AM

The original intent of Unicode was great: a standard that maps a unique number (a codepoint) to a specific character of a language. Here "character" means only the abstract, non-visual symbol (its meaning), not a glyph rendered visually in some font style. Later Unicode versions added more languages, even dead ones, so it was basically also an effort to preserve historical knowledge.

Then came emojis, and now the Unicode Consortium's efforts on version updates seem to be about adding more kinds of poop emojis and shades of skin color. Well, maybe that accurately reflects the language and culture of modern times.

UTF-8 is great because it is a superset of ASCII, but because its byte width varies, encoding and decoding it is more complex (similar to fixed-width versus variable-width ISAs in CPUs).
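The variable width can be sketched in a few lines of Python. The function name `utf8_sequence_length` is hypothetical, and the check is deliberately simplified: it only inspects the bit pattern of the first byte and does not reject every invalid lead byte the way a real decoder must.

```python
def utf8_sequence_length(lead_byte):
    """Return how many bytes a UTF-8 sequence has, judging by its first byte.

    Simplified: a real decoder also rejects overlong and out-of-range forms.
    """
    if lead_byte & 0x80 == 0x00:   # 0xxxxxxx: ASCII, one byte
        return 1
    if lead_byte & 0xE0 == 0xC0:   # 110xxxxx: two bytes
        return 2
    if lead_byte & 0xF0 == 0xE0:   # 1110xxxx: three bytes
        return 3
    if lead_byte & 0xF8 == 0xF0:   # 11110xxx: four bytes
        return 4
    raise ValueError("continuation byte or invalid lead byte")

for ch in ["A", "\u00e9", "\u20ac", "\U0001F600"]:
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X}", len(encoded), utf8_sequence_length(encoded[0]))

# ASCII text has the same bytes in both encodings.
print("plain".encode("ascii") == "plain".encode("utf-8"))  # True
```

A decoder has to branch like this on every lead byte, which is the extra complexity compared with a fixed-width encoding such as UTF-32.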

Different languages have different concepts, e.g. text direction or flow (left/right, up/down), characters versus logograms, different kinds of visual cues, etc. Problems arise when people want to combine different languages at the same time. For example, mathematical notation is in my opinion 2D graphics, and it usually (or always) cannot be inlined with text glyphs in an aesthetically pleasing way. The same kind of problem may come up when trying to inline languages with different flow directions. It's like trying to combine native GUI widgets from Win32, Cocoa/SwiftUI, and GTK/Qt/wxWidgets: the (visual) languages don't share the same concepts, or their concepts conflict.
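For the text-direction part specifically, each character carries a bidirectional class that layout engines use when mixing scripts. A minimal sketch using the standard `unicodedata` module follows. The sample string and the `bidi_classes` helper are illustrative, and this only looks up per-character classes rather than running the full reordering algorithm.

```python
import unicodedata

def bidi_classes(text):
    """Pair each character with its Unicode bidirectional class."""
    return [(ch, unicodedata.bidirectional(ch)) for ch in text]

# Latin letter, Hebrew alef, Arabic alef, a digit, and a space.
mixed = "a\u05d0\u06271 "

for ch, cls in bidi_classes(mixed):
    print(f"U+{ord(ch):04X}", cls)
```

Here `L` is left-to-right, `R` and `AL` are right-to-left (Hebrew and Arabic letters respectively), `EN` is a European number, and `WS` is whitespace. The digit and the space take their final direction from context, which is one reason mixed-direction text is hard to lay out.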