My last name contains an ĂŒ, and the experience has been consistently horrible.
* When I try to preemptively replace ĂŒ with ue, many institutions and companies refuse to accept it because it does not match my passport
* Especially in France, clerks try to emulate ĂŒ with the trĂ©ma diacritic they know from Ă«. This makes it virtually impossible to find me in a system again
* Sometimes I can enter my name as-is with no apparent problem, only for some other system to mangle it into ïżœ or a box. This often triggers errors downstream that I have no way of fixing
* Sometimes, people print a u and add the diacritics by hand on the label. This is nice, but still somehow wrong.
I wonder what the solution is. Give up and ask people to consistently use an ASCII-only name? Allow everybody 1000+ Unicode characters as a name and match on that exact string? Officially change my name?
This article is about a failure to do normalization properly and is not really about an issue with Unicode. Regardless of what some comments allude to, an umlaut-ĂŒ should always render exactly the same, no matter how it is encoded.
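A quick way to see (and fix) the difference in code; a minimal sketch using only the Python standard library:

```python
import unicodedata

composed = "\u00fc"     # ĂŒ as one code point (NFC)
decomposed = "u\u0308"  # u + COMBINING DIAERESIS (NFD); renders identically

print(composed == decomposed)  # False: the code points differ
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
```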
There is, however, a real ĂŒ/ĂŒ conundrum regarding ĂŒ-umlaut and ĂŒ-diaeresis. The ĂŒ's in the words MĂŒll and aigĂŒe should render differently: as rendered on screen, the dots in the French word sit too close to the letter, which is usually not the case in printed French material.
Unfortunately, Unicode does not capture the nuance of the semantic difference between an umlaut and a tréma (diaeresis).
The umlaut is a letter in its own right with its own space in the alphabet. An ĂŒ-umlaut can never be replaced by a u alone. This would be just as wrong as replacing a p with a q. Just because they look similar does not mean they are interchangeable. [1]
The tréma, on the other hand, is a modifier that helps with proper pronunciation of letter combinations. It is not a letter in its own right, just additional information. It can even sometimes move over an adjacent letter (aiguë = aigüe; both are possible).
Some say this should be handled by the rendering system, similar to Han unification, but I strongly disagree: French words are often used in German and vice versa, and currently there is no way to render a German loanword with an umlaut (e.g. fĂŒhrer) properly in French text.
[1] The only acceptable replacement for ĂŒ-Umlaut is the combination ue.
One thing that is very unintuitive about normalization is that macOS is much more aggressive about normalizing Unicode than Windows or Linux distros. Even if you copy and paste non-normalized text into a text box in Safari on a Mac, it will be normalized before it gets posted to the server. This leads to strange string-matching issues.
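The usual defense is to normalize both sides before comparing; a sketch in plain Python:

```python
import unicodedata

def nfc_equal(a: str, b: str) -> bool:
    """Compare strings regardless of which normalization form the OS produced."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

# NFC "ĂŒ" vs. NFD "u" + combining diaeresis, as macOS might hand it back:
print("\u00fcber.txt" == "u\u0308ber.txt")           # False
print(nfc_equal("\u00fcber.txt", "u\u0308ber.txt"))  # True
```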
Should you really change the filenames of users' files and depend on their being valid UTF-8? Wouldn't it be better to keep the original filename and use it most of the time, reserving normalization for search and indexing?
Why don't you normalize Latin-alphabet filenames even further for indexing -- allowing searches for "FĂŒhrer" with queries like "Fuehrer" and "Fuhrer"?
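As a sketch of what that could look like at index time; the replacement table and helper names here are my own invention, not a standard API:

```python
import unicodedata

# Hypothetical German-specific expansion table.
GERMAN = {"Ă€": "ae", "ö": "oe", "ĂŒ": "ue", "Ă„": "Ae", "Ö": "Oe", "Ü": "Ue", "ß": "ss"}

def strip_marks(s: str) -> str:
    # NFKD splits "ĂŒ" into "u" + combining mark; drop the marks.
    decomposed = unicodedata.normalize("NFKD", s)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

def index_keys(word: str) -> set[str]:
    german = "".join(GERMAN.get(c, c) for c in word)
    return {word, german, strip_marks(word)}

print(index_keys("FĂŒhrer"))  # {'FĂŒhrer', 'Fuehrer', 'Fuhrer'} (set order varies)
```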
IMO, it was a mistake for Unicode to provide multiple ways to represent 100% identical-looking characters. After all, ASCII doesn't have separate "c"s for "hard c" and "soft c".
The more general solution is specified here: https://unicode.org/reports/tr10/#Searching
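For instance, an accent-insensitive comparison via ICU collation at primary strength; a sketch assuming the PyICU bindings are installed:

```python
from icu import Collator, Locale  # PyICU bindings for ICU

collator = Collator.createInstance(Locale("de_DE"))
collator.setStrength(Collator.PRIMARY)  # primary strength ignores accents and case

print(collator.compare("FĂŒhrer", "Fuhrer") == 0)  # True: equal at primary strength
```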
As a German macOS user with a US keyboard, I run into a related issue every now and then. What's nice about macOS is that I can easily type Umlaute and other common letters from European languages without any extra configuration. But some (web) applications stumble over the dead-key sequence while I'm typing, because the input arrives as: 1. š (Option-u) 2. ĂŒ (u pressed)
Clearly the author already knows this, but it highlights the importance of always normalizing your input, and of consistently using the same form instead of relying on OS defaults.
It's 2024, and we are still grappling with Unicode character encoding problems.
More like "because it's 2024." This wouldn't have been a problem before the complexity of Unicode became prevalent.
Sometimes it makes sense to reduce Unicode confusables to a single canonical form.
For example, the Greek capital Alpha looks just like the Latin uppercase A. And some characters are merely very similar, like the slash and the fraction slash. Yes, Unicode has separate scalar values for all of these.
There are open-source tools to handle confusables.
This is in addition to the search specified by Unicode.
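The underlying scalar values are easy to inspect; a small Python illustration using the examples above:

```python
import unicodedata

for ch in ("A", "Α", "/", "⁄"):  # Latin A, Greek Alpha, solidus, fraction slash
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
# U+0041 LATIN CAPITAL LETTER A
# U+0391 GREEK CAPITAL LETTER ALPHA
# U+002F SOLIDUS
# U+2044 FRACTION SLASH
```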
For those intrigued by this sort of thing, check out the tech talk "Plain Text" by Dylan Beattie.
Absolute gem. His other talks are entertaining too.
I ran into this while building search for a family tree project. I found that Rails provides `ActiveSupport::Inflector.transliterate()`, which I could use for normalization.
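For those outside Rails, a rough approximation of what a transliterate step does for Latin scripts; my own sketch, not the Rails implementation:

```python
import unicodedata

def transliterate_ascii(s: str, replacement: str = "?") -> str:
    # Decompose, drop combining marks, then replace whatever is still non-ASCII.
    decomposed = unicodedata.normalize("NFKD", s)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return "".join(c if ord(c) < 128 else replacement for c in stripped)

print(transliterate_ascii("JĂŒrgen MĂŒller"))  # "Jurgen Muller"
```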
Reminded of this classic diveintomark post http://web.archive.org/web/20080209154953/http://diveintomar...
Isn't ĂŒ/ĂŒ-encoding a solved problem on Unix systems?
</joke>
The article suggests NFC normalization as a simple solution, but fails to mention that HFS+ always applies NFD normalization to file names, that APFS kind of does not but some layer above it actually does (https://eclecticlight.co/2021/05/08/explainer-unicode-normal...), and that ZFS has this behavior controlled by a dataset-level option. I don't see how applying the suggestion literally (just normalize to NFC before saving) can work.
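One defensive pattern is to treat the form a filesystem returns as unspecified and compare names under a single normalization; a Python sketch:

```python
import os
import unicodedata

def find_file(directory: str, name: str) -> str | None:
    """Match a filename regardless of the normalization form the filesystem stores."""
    target = unicodedata.normalize("NFC", name)
    for entry in os.listdir(directory):
        if unicodedata.normalize("NFC", entry) == target:
            return entry  # the name as the filesystem actually spells it
    return None
```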
Normalizing can help with search. For example, for Ruby I maintain this gem: https://rubygems.org/gems/sixarm_ruby_unaccent
Oh that Mötley Ünicöde.
I created a bunch of Unicode tools during development of ENSIP-15 for ENS (Ethereum Name Service)
ENSIP-15 Specification: https://docs.ens.domains/ensip/15
ENS Normalization Tool: https://adraffy.github.io/ens-normalize.js/test/resolver.htm...
Browser Tests: https://adraffy.github.io/ens-normalize.js/test/report-nf.ht...
0-dependency JS Unicode 15.1 NFC/NFD Implementation [10KB] https://github.com/adraffy/ens-normalize.js/blob/main/dist/n...
Unicode Character Browser: https://adraffy.github.io/ens-normalize.js/test/chars.html
Unicode Emoji Browser: https://adraffy.github.io/ens-normalize.js/test/emoji.html
Unicode Confusables: https://adraffy.github.io/ens-normalize.js/test/confused.htm...
> Can you spot any difference between “blöb” and “blöb”?
That's where Unicode lost its way and went into a ditch. Identical glyphs should always have the same code point (or sequence of code points).
Imagine all the coding time spent trying to deal with this nonsense.
It is really so awful that we have to deal with encoding issues in 2024.
ZFS can be configured to force the use of a particular normalized Unicode form for all filenames. Amazing filesystem.
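For reference, it's a create-time dataset property (it can't be changed on an existing dataset); something like:

```
# Normalization is set at dataset creation; valid values are
# none | formC | formD | formKC | formKD.
zfs create -o normalization=formD tank/home
zfs get normalization tank/home
```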
ASCII should be enough for anyone.
This isn't an encoding problem. It's a search problem.
I ran into encoding problems so many times that I just use ASCII aggressively now. There is still kanji, Hanzi, etc., but at least for Western alphabets it's not worth the hassle.
I try to avoid Unicode in filenames (I'm on Linux). It seems that a lot of ordinary users have the same intuition? I get the sense that many will instinctively transliterate to ASCII, like they do for URLs.
> Can you spot any difference between “blöb” and “blöb”?
It's tricky to determine this because normalization can end up getting applied unexpectedly (for instance, on a Mac, Firefox appears to normalize copied text to NFC while Chrome does not), but by downloading the page with cURL and checking the raw bytes I can confirm that there is no difference between those two words :) Something in the author's editing or publishing pipeline is applying normalization and not giving her the end result she was going for.
Let's see if I can get HN to preserve the different forms:
Composed: ĂŒ
Decomposed: ü
Edit: Looks like that worked!
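If you want to verify what actually survived, without the clipboard normalizing things behind your back, inspecting the code points directly works; e.g. in Python:

```python
composed, decomposed = "\u00fc", "u\u0308"
print([f"U+{ord(c):04X}" for c in composed])    # ['U+00FC']
print([f"U+{ord(c):04X}" for c in decomposed])  # ['U+0075', 'U+0308']
```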