The macOS LC_COLLATE hunt: Or why does sort order differently on macOS and Linux (2020)

by g0xA52A2A on 10/19/2025, 1:01 PM with 22 comments

by asveikau on 10/19/2025, 5:52 PM

Sorting is language-specific even if you're restricted to languages using Latin characters. E.g., how do you sort N relative to Ñ? How do you treat the Turkish variations on the letter I?

Doing a dumb sort by character or byte values is obviously the wrong call for any diacritics, but the right call may also depend on the language.
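To make the byte-value case concrete, here is a minimal sketch using `sort` in the C locale. The `sort_bytes` helper is a made-up name for illustration, and the input is assumed to be UTF-8. The locale-aware variant is left commented out because it assumes `es_ES.UTF-8` is installed, and its result differs between platforms.

```shell
# sort_bytes: sort the arguments by raw byte value (C locale).
# Hypothetical helper, not part of any tool mentioned in the thread.
sort_bytes() {
  printf '%s\n' "$@" | LC_ALL=C sort
}

# In UTF-8, Ñ is encoded as 0xC3 0x91, so it lands after every ASCII letter.
sort_bytes N Ñ O
# N
# O
# Ñ

# Locale-aware comparison (assumes es_ES.UTF-8 exists on the system;
# the resulting order depends on the platform's locale data):
# printf '%s\n' N Ñ O | LC_ALL=es_ES.UTF-8 sort
```

A Spanish speaker would expect Ñ between N and O, which is the kind of ordering byte comparison cannot give you.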

by o11c on 10/19/2025, 7:57 PM

Minor note: on Debian (and possibly other distros), you don't have to use `locale-gen` to dynamically build things into `$complocaledir/locale-archive` (which, incidentally, can cause random breakage for programs that happen to start during system upgrades).

The `locales-all` package works more like macOS. It's only a ~10MB download but unpacks to take ~250MB of disk space (these numbers will vary based on your libc version and packaging format).

There are a lot of sparse arrays and UTF32 character data in compiled locales.

Incidentally, the command to dump a locale's data is:

  LC_ALL=whatever locale -ck `locale | sed 's/=.*//; /LANG\|LC_ALL/d'`

by kbd on 10/20/2025, 2:00 AM

In my Zsh startup on Mac I had to worry about collation, as I expected ~ to sort last (I have a directory prefixed with ~ to load plugins that need to be loaded last). Idk why a locale of utf-8 has it sorting differently, but I needed LC_COLLATE=C to have it sort as expected:

    # source all shell config
    export LC_COLLATE=C # ensure consistent sort, ~ at end
    for file in ~/bin/shell/**/*.(z|)sh; do
      source "$file";
    done
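
The effect can be reproduced outside zsh. This is a rough sketch in POSIX sh, not the setup above: `list_c_order` and the file names are invented for illustration. It relies on glob results being sorted by the collation in effect, which under `LC_ALL=C` means plain byte order, so `~` (0x7E) comes after all ASCII letters.

```shell
# list_c_order DIR: print DIR's entries in the order a glob yields under LC_ALL=C.
# Hypothetical helper; a child sh is used so the locale is set before it starts.
list_c_order() {
  LC_ALL=C sh -c 'cd "$1" && printf "%s\n" *' sh "$1"
}

dir=$(mktemp -d)
touch "$dir/alpha.sh" "$dir/Beta.sh" "$dir/~last.sh"
list_c_order "$dir"
# Beta.sh
# alpha.sh
# ~last.sh
rm -rf "$dir"
```

Note the side effect: in byte order every uppercase ASCII letter sorts before every lowercase one, so `Beta.sh` precedes `alpha.sh`.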

by kenada on 10/19/2025, 11:09 PM

When I updated the Darwin SDK and source releases in nixpkgs last year, I tried using the FreeBSD locale data. It worked in a technical sense, but it broke things that depended on the quirks in Apple's locale data. That statement about compatibility is unfortunately true.

by 1a527dd5 on 10/19/2025, 8:46 PM

Ask anyone who did a Postgres upgrade. The words "collate" and "glibc" are enough to make me pause now. Learnt loads, never going to really use it again, but man do I understand the pain it causes now.

by bluedino on 10/19/2025, 10:41 PM

Now I'm remembering all the fun we had a long time ago with PHP websites that used an AS/400 as a data source. The two didn't sort the same way, and the mom-and-pop web dev shop that was hired to create the website didn't understand the issue, tried to hack around it, and failed.

by skopje on 10/19/2025, 5:50 PM

So the ISO way is the right way, right?

by loeg on 10/19/2025, 5:10 PM

(2020)

by pjmlp on 10/19/2025, 6:06 PM

Yet another one of those POSIX and ISO things that most people don't bother to know about.

https://pubs.opengroup.org/onlinepubs/9799919799/basedefs/V1...

by greesil on 10/19/2025, 7:47 PM

It's not a stable sort?