My last name contains an ĂŒ, and the experience has been consistently horrible.
* When I try to preemptively replace ĂŒ with ue, many institutions and companies refuse to accept it because it does not match my passport
* Especially in France, clerks try to emulate ĂŒ with the trĂ©ma diacritic they know from Ă«. This makes it virtually impossible to find me in a system again
* Sometimes I can enter my name as-is with no apparent problem, only for some other system to mangle it into ïżœ or a box. This often triggers errors downstream that I have no way of fixing
* Sometimes, people print a u and add the diacritics by hand on the label. This is nice, but still somehow wrong.
I wonder what the solution is. Give up and ask people to consistently use an ASCII-only name? Allow everybody 1000+ Unicode characters as a name and match on that exact string? Officially change my name?
This article is about a failure to do normalization properly and is not really about an issue with Unicode. Regardless of what some comments allude to, an umlaut-ĂŒ should always render exactly the same, no matter how it is encoded.
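A quick way to see (and fix) the difference in code; a minimal sketch using only the Python standard library:

```python
import unicodedata

composed = "\u00fc"     # ĂŒ as one code point (NFC)
decomposed = "u\u0308"  # u + COMBINING DIAERESIS (NFD); renders identically

print(composed == decomposed)  # False: the code points differ
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
```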
There is, however, a real ĂŒ/ĂŒ conundrum regarding ĂŒ-umlaut and ĂŒ-diaeresis. The ĂŒ's in the words MĂŒll and aigĂŒe should render differently: as rendered on screen, the dots in the French word sit too close to the letter, which is usually not the case in printed French material.
Unfortunately, Unicode does not capture the nuance of the semantic difference between an umlaut and a tréma (diaeresis).
The umlaut is a letter in its own right with its own space in the alphabet. An ĂŒ-umlaut can never be replaced by a u alone. This would be just as wrong as replacing a p with a q. Just because they look similar does not mean they are interchangeable. [1]
The tréma, on the other hand, is a modifier that helps with proper pronunciation of letter combinations. It is not a letter in its own right, just additional information. It can even sometimes move over an adjacent letter (aiguë = aigüe; both are possible).
Some say this should be handled by the rendering system, similar to Han unification, but I strongly disagree: French words are often used in German and vice versa, and currently there is no way to render a German loanword with an umlaut (e.g. fĂŒhrer) properly in French text.
[1] The only acceptable replacement for ĂŒ-Umlaut is the combination ue.
One thing that is very unintuitive about normalization is that macOS is much more aggressive about normalizing Unicode than Windows or Linux distros. Even if you copy and paste non-normalized text into a text box in Safari on a Mac, it will be normalized before it gets posted to the server. This leads to strange string-matching issues.
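The usual defense is to normalize both sides before comparing; a sketch in plain Python:

```python
import unicodedata

def nfc_equal(a: str, b: str) -> bool:
    """Compare strings regardless of which normalization form the OS produced."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

# NFC "ĂŒ" vs. NFD "u" + combining diaeresis, as macOS might hand it back:
print("\u00fcber.txt" == "u\u0308ber.txt")           # False
print(nfc_equal("\u00fcber.txt", "u\u0308ber.txt"))  # True
```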
Should you really change the filenames of users' files and depend on their being valid UTF-8? Wouldn't it be better to keep the original filename and use it most of the time, reserving normalization for search and indexing?
Why don't you normalize Latin-alphabet filenames even further for indexing -- allowing searches for "FĂŒhrer" with queries like "Fuehrer" and "Fuhrer"?
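As a sketch of what that could look like at index time; the replacement table and helper names here are my own invention, not a standard API:

```python
import unicodedata

# Hypothetical German-specific expansion table.
GERMAN = {"Ă€": "ae", "ö": "oe", "ĂŒ": "ue", "Ă„": "Ae", "Ö": "Oe", "Ü": "Ue", "ß": "ss"}

def strip_marks(s: str) -> str:
    # NFKD splits "ĂŒ" into "u" + combining mark; drop the marks.
    decomposed = unicodedata.normalize("NFKD", s)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

def index_keys(word: str) -> set[str]:
    german = "".join(GERMAN.get(c, c) for c in word)
    return {word, german, strip_marks(word)}

print(index_keys("FĂŒhrer"))  # {'FĂŒhrer', 'Fuehrer', 'Fuhrer'} (set order varies)
```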
IMO, it was a mistake for Unicode to provide multiple ways to represent 100% identical-looking characters. After all, ASCII doesn't have separate "c"s for "hard c" and "soft c".
The more general solution is specified here: https://unicode.org/reports/tr10/#Searching
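For instance, an accent-insensitive comparison via ICU collation at primary strength; a sketch assuming the PyICU bindings are installed:

```python
from icu import Collator, Locale  # PyICU bindings for ICU

collator = Collator.createInstance(Locale("de_DE"))
collator.setStrength(Collator.PRIMARY)  # primary strength ignores accents and case

print(collator.compare("FĂŒhrer", "Fuhrer") == 0)  # True: equal at primary strength
```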
As a German macOS user with a US keyboard, I run into a related issue every now and then. What's nice about macOS is that I can easily type Umlaute and other common letters from European languages without any extra configuration. But some (web) applications stumble over the dead-key sequence while I'm typing, because the input arrives as: 1. š (Option-u) 2. ĂŒ (u pressed)
Clearly the author already knows this, but it highlights the importance of always normalizing your input, and of consistently using the same form instead of relying on OS defaults.
It's 2024, and we are still grappling with Unicode character encoding problems.
More like "because it's 2024." This wouldn't have been a problem before the complexity of Unicode became prevalent.
Sometimes it makes sense to reduce Unicode confusables to a single canonical form.
For example, the Greek capital Alpha looks just like the Latin uppercase A. And some characters are merely very similar, like the slash and the fraction slash. Yes, Unicode has separate scalar values for all of these.
There are open-source tools to handle confusables.
This is in addition to the search specified by Unicode.
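The underlying scalar values are easy to inspect; a small Python illustration using the examples above:

```python
import unicodedata

for ch in ("A", "Α", "/", "⁄"):  # Latin A, Greek Alpha, solidus, fraction slash
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
# U+0041 LATIN CAPITAL LETTER A
# U+0391 GREEK CAPITAL LETTER ALPHA
# U+002F SOLIDUS
# U+2044 FRACTION SLASH
```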
For those intrigued by this sort of thing, check out the tech talk "Plain Text" by Dylan Beattie.
Absolute gem. His other talks are entertaining too.
I ran into this while building search for a family tree project. I found that Rails provides `ActiveSupport::Inflector.transliterate()`, which I could use for normalization.
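For those outside Rails, a rough approximation of what a transliterate step does for Latin scripts; my own sketch, not the Rails implementation:

```python
import unicodedata

def transliterate_ascii(s: str, replacement: str = "?") -> str:
    # Decompose, drop combining marks, then replace whatever is still non-ASCII.
    decomposed = unicodedata.normalize("NFKD", s)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return "".join(c if ord(c) < 128 else replacement for c in stripped)

print(transliterate_ascii("JĂŒrgen MĂŒller"))  # "Jurgen Muller"
```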
Reminded of this classic diveintomark post http://web.archive.org/web/20080209154953/http://diveintomar...
Isn't ĂŒ/ĂŒ-encoding a solved problem on Unix systems?
</joke>
The article suggests NFC normalization as a simple solution, but fails to mention that HFS+ always applies NFD normalization to file names, that APFS kind of does not but some layer above it actually does (https://eclecticlight.co/2021/05/08/explainer-unicode-normal...), and that ZFS has this behavior controlled by a dataset-level option. I don't see how applying the suggestion literally (just normalize to NFC before saving) can work.
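One defensive pattern is to treat the form a filesystem returns as unspecified and compare names under a single normalization; a Python sketch:

```python
import os
import unicodedata

def find_file(directory: str, name: str) -> str | None:
    """Match a filename regardless of the normalization form the filesystem stores."""
    target = unicodedata.normalize("NFC", name)
    for entry in os.listdir(directory):
        if unicodedata.normalize("NFC", entry) == target:
            return entry  # the name as the filesystem actually spells it
    return None
```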
Normalizing can help with search. For example, for Ruby I maintain this gem: https://rubygems.org/gems/sixarm_ruby_unaccent
Oh that Mötley Ünicöde.
I created a bunch of Unicode tools during development of ENSIP-15 for ENS (Ethereum Name Service)
ENSIP-15 Specification: https://docs.ens.domains/ensip/15
ENS Normalization Tool: https://adraffy.github.io/ens-normalize.js/test/resolver.htm...
Browser Tests: https://adraffy.github.io/ens-normalize.js/test/report-nf.ht...
0-dependency JS Unicode 15.1 NFC/NFD Implementation [10KB] https://github.com/adraffy/ens-normalize.js/blob/main/dist/n...
Unicode Character Browser: https://adraffy.github.io/ens-normalize.js/test/chars.html
Unicode Emoji Browser: https://adraffy.github.io/ens-normalize.js/test/emoji.html
Unicode Confusables: https://adraffy.github.io/ens-normalize.js/test/confused.htm...
> Can you spot any difference between “blöb” and “blöb”?
That's where Unicode lost its way and went into a ditch. Identical glyphs should always have the same code point (or sequence of code points).
Imagine all the coding time spent trying to deal with this nonsense.
It is really so awful that we have to deal with encoding issues in 2024.
ZFS can be configured to force the use of a particular normalized Unicode form for all filenames. Amazing filesystem.
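For reference, it's a create-time dataset property (it can't be changed on an existing dataset); something like:

```
# Normalization is set at dataset creation; valid values are
# none | formC | formD | formKC | formKD.
zfs create -o normalization=formD tank/home
zfs get normalization tank/home
```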
ASCII should be enough for anyone.
This isn't an encoding problem. It's a search problem.
I ran into encoding problems so many times that I just use ASCII aggressively now. There is still kanji, Hanzi, etc., but at least for Western alphabets it's not worth the hassle.
I try to avoid Unicode in filenames (I'm on Linux). It seems that a lot of ordinary users have the same intuition? I get the sense that many will instinctively transliterate to ASCII, like they do for URLs.
> Can you spot any difference between “blöb” and “blöb”?
It's tricky to determine this because normalization can end up getting applied unexpectedly (for instance, on a Mac, Firefox appears to normalize copied text to NFC while Chrome does not), but by downloading the page with cURL and checking the raw bytes I can confirm that there is no difference between those two words :) Something in the author's editing or publishing pipeline is applying normalization and not giving her the end result she was going for.
Let's see if I can get HN to preserve the different forms:
Composed: ĂŒ
Decomposed: ü
Edit: Looks like that worked!
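If you want to verify what actually survived, without the clipboard normalizing things behind your back, inspecting the code points directly works; e.g. in Python:

```python
composed, decomposed = "\u00fc", "u\u0308"
print([f"U+{ord(c):04X}" for c in composed])    # ['U+00FC']
print([f"U+{ord(c):04X}" for c in decomposed])  # ['U+0075', 'U+0308']
```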