Minor note: on Debian (and possibly other distros), you don't have to use `locale-gen` to dynamically build things into `$complocaledir/locale-archive` (which, incidentally, can cause random breakage for programs that happen to start during system upgrades).
The `locales-all` package works more like macOS. It's only a ~10MB download but unpacks to take ~250MB of disk space (these numbers will vary based on your libc version and packaging format).
There are a lot of sparse arrays and UTF-32 character data in compiled locales.
Incidentally, the command to dump a locale's data is:
LC_ALL=whatever locale -ck `locale | sed 's/=.*//; /LANG\|LC_ALL/d'`

Updated link to the file, as https://opensource.apple.com/source/adv_cmds/adv_cmds-118/us... doesn't work anymore: https://github.com/apple-oss-distributions/adv_cmds/blob/adv...
In my Zsh startup on Mac I had to worry about collation, as I expected ~ to sort last (I have a directory prefixed with ~ to load plugins that need to be loaded last). Idk why a locale of utf-8 has it sorting differently, but I needed LC_COLLATE=C to have it sort as expected:
# source all shell config
export LC_COLLATE=C # ensure consistent sort, ~ at end
for file in ~/bin/shell/**/*.(z|)sh; do
source "$file";
done

When I updated the Darwin SDK and source releases in nixpkgs last year, I tried using the FreeBSD locale data. It worked in a technical sense, but it broke things that depended on the quirks in Apple’s locale data. That statement about compatibility is unfortunately true.
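A quick way to check the `LC_COLLATE=C` behavior from the Zsh snippet above, as a minimal sketch: it uses `sort` rather than Zsh globbing, and the file names and the `sort_bytewise` helper are made up for illustration. In the C locale, comparison is by byte value, and `~` (0x7E) comes after every ASCII letter and digit.

```shell
# Hypothetical helper: sort stdin by raw byte value, ignoring the user's locale.
sort_bytewise() {
  LC_ALL=C sort
}

# Made-up names; only the relative order matters here.
printf '%s\n' '~last' zeta Alpha alpha | sort_bytewise
# Byte order: uppercase, then lowercase, then ~
#   Alpha
#   alpha
#   zeta
#   ~last
```

Under a UTF-8 locale the same input may come out in a different order, which matches what the comment above ran into.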
Ask anyone who did a postgres upgrade. The words "collate" and "glibc" are enough to cause me to pause now. Learnt loads, never going to really use it again, but man do I understand the pain that causes now.
Now I'm remembering all the fun we had a long time ago with PHP websites that used an AS/400 as a data source. The two didn't sort the same way, and the mom-and-pop web dev shop that was hired to create the website didn't understand the issue, hacked around it, and failed.
So the ISO way is the right way, right?
(2020)
Yet another one of those POSIX and ISO things that most people don't bother to know about.
https://pubs.opengroup.org/onlinepubs/9799919799/basedefs/V1...
It's not a stable sort?
Sorting is language specific even if you're restricted to languages using Latin characters. E.g., how do you sort N relative to Ñ? How do you treat the Turkish variations on the letter I?
Doing a dumb sort by character or byte values is obviously the wrong call for any diacritics, but the right call may also depend on the language.
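To make the "dumb sort by byte values" concrete, here is a small sketch with made-up sample words. It only shows the byte-order result; what a language-aware collation would produce depends on the locale and libc in use, so that part is left out.

```shell
# Hypothetical sample words. In UTF-8, Ñ is the two bytes 0xC3 0x91,
# both larger than any ASCII letter.
bytewise=$(printf '%s\n' 'Ñandú' 'Nube' 'Oso' | LC_ALL=C sort)
printf '%s\n' "$bytewise"
# Byte order puts Ñandú after Oso:
#   Nube
#   Oso
#   Ñandú
```

The Spanish alphabet places Ñ between N and O, so the byte-order result is wrong for Spanish readers even though it is perfectly deterministic.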