Recommendations for designing magic numbers of binary file formats

by _Microft on 3/14/2025, 8:05 PM with 86 comments

by petertodd on 3/17/2025, 6:56 PM

That's basically how I designed the magic bytes for the OpenTimestamps proof files:

    $ hexdump -C foo.ots 
    00000000  00 4f 70 65 6e 54 69 6d  65 73 74 61 6d 70 73 00  |.OpenTimestamps.|
    00000010  00 50 72 6f 6f 66 00 bf  89 e2 e8 84 e8 92 94 01  |.Proof..........|
0) Magic is at the beginning of the file.

1) Starts with a null-byte to make it clear this is binary, not text.

2) Includes a human-readable part to make it easy to figure out what the file is in hex dumps.

3) 8 randomly chosen bytes, all of which are greater than 0x7F to ensure they're not ASCII.

4) Finally, a one-byte major version number.

5) Total length (including major version) is 32 bytes to fit nicely in a hex dump.
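
The layout described in this comment can be checked mechanically. Below is a minimal Python sketch with the byte values copied from the hex dump; the names `MAGIC` and `parse_header` are illustrative assumptions and are not taken from the OpenTimestamps code.

```python
# Magic bytes copied from the hex dump above (31 bytes before the version).
MAGIC = (
    b"\x00OpenTimestamps\x00"            # leading NUL + human-readable name
    b"\x00Proof\x00"                     # human-readable file kind
    b"\xbf\x89\xe2\xe8\x84\xe8\x92\x94"  # 8 random bytes, all > 0x7F
)


def parse_header(data: bytes) -> int:
    """Return the major version if data starts with the magic, else raise."""
    if len(data) < len(MAGIC) + 1:
        raise ValueError("file too short to hold the header")
    if not data.startswith(MAGIC):
        raise ValueError("magic bytes do not match")
    return data[len(MAGIC)]  # the one-byte major version follows the magic


header = MAGIC + b"\x01"
print(len(header), parse_header(header))  # 32 1
```

Keeping the version byte outside `MAGIC` lets a reader distinguish "wrong file type" from "right file type, unsupported version".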

by conaclos on 3/17/2025, 10:21 AM

   SHOULD include a zero byte
I guess it is expected to be at the end of the magic number so that it acts as a null-terminated string?

   MUST include a byte sequence that is invalid UTF-8
I guess it is to differentiate a text file from a specific format?

   MUST include at least one byte with the high bit set
Any reason?

by gardaani on 3/17/2025, 5:05 PM

Wikipedia has a good explanation of why the PNG magic number is 89 50 4e 47 0d 0a 1a 0a. It has some good features, such as the end-of-file character for DOS and detection of line ending conversions. https://en.wikipedia.org/wiki/PNG#File_header
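
To illustrate the line-ending detection, here is a small Python sketch. The helpers `crlf_to_lf` and `lf_to_crlf` are hypothetical stand-ins for a text-mode transfer that rewrites line endings; only the eight signature bytes come from the comment above.

```python
PNG_MAGIC = bytes.fromhex("89504e470d0a1a0a")


def crlf_to_lf(data: bytes) -> bytes:
    """Simulate a DOS-to-Unix text-mode conversion."""
    return data.replace(b"\r\n", b"\n")


def lf_to_crlf(data: bytes) -> bytes:
    """Simulate a Unix-to-DOS conversion (normalize first to avoid CR CR LF)."""
    return data.replace(b"\r\n", b"\n").replace(b"\n", b"\r\n")


def looks_like_png(data: bytes) -> bool:
    return data.startswith(PNG_MAGIC)


sample = PNG_MAGIC + b"rest of file"
print(looks_like_png(sample))              # True
print(looks_like_png(crlf_to_lf(sample)))  # False: the 0d 0a pair was rewritten
print(looks_like_png(lf_to_crlf(sample)))  # False: the final lone 0a was rewritten
```

Because the signature contains both a CRLF pair and a lone LF, a conversion in either direction damages it.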

by shagie on 3/17/2025, 6:10 PM

The magic file (man magic / man file) is a neat one to read. On my Mac, this is located in /usr/share/file/magic/, while I recall that on a Unix distribution I worked on it was /etc/magic.

The magic file is read by the file command. Its format describes tests that identify a file and can also extract further useful information from it.

    # Various dictionary images used by OpenFirware FORTH environment

    0       lelong  0xe1a00000
    >8      lelong  0xe1a00000
    # skip raspberry pi kernel image kernel7.img by checking for positive text length
    >>24    lelong  >0              ARM OpenFirmware FORTH Dictionary,
    >>>24   lelong  x               Text length: %d bytes,
    >>>28   lelong  x               Data length: %d bytes,
    >>>32   lelong  x               Text Relocation Table length: %d bytes,
    >>>36   lelong  x               Data Relocation Table length: %d bytes,
    >>>40   lelong  x               Entry Point: %#08X,
    >>>44   lelong  x               BSS length: %d bytes
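
As a rough illustration of what these nested rules do, here is a Python sketch that applies the same tests by hand. It assumes `lelong` means a little-endian 32-bit value and that the `>0` test is a signed comparison; `MARKER` and `describe` are names made up for this example.

```python
import struct

MARKER = 0xE1A00000  # value the rules expect at offsets 0 and 8


def describe(data: bytes):
    """Mimic the nested magic rules above; return a description or None."""
    if len(data) < 48:
        return None
    # "0 lelong" and ">8 lelong": both 32-bit little-endian values must match.
    if struct.unpack_from("<I", data, 0)[0] != MARKER:
        return None
    if struct.unpack_from("<I", data, 8)[0] != MARKER:
        return None
    # ">>24 lelong >0": the text length must be positive.
    text_len = struct.unpack_from("<i", data, 24)[0]
    if text_len <= 0:
        return None
    # ">>>28" to ">>>44": fields printed only once the tests above passed.
    data_len, text_rel, data_rel, entry, bss = struct.unpack_from("<iiiIi", data, 28)
    return (
        "ARM OpenFirmware FORTH Dictionary, "
        f"Text length: {text_len} bytes, Data length: {data_len} bytes, "
        f"Text Relocation Table length: {text_rel} bytes, "
        f"Data Relocation Table length: {data_rel} bytes, "
        f"Entry Point: {entry:#08X}, BSS length: {bss} bytes"
    )
```

Each extra `>` in the magic file corresponds to one more level of nesting here: a deeper rule is evaluated only when the rule above it matched.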

by ajross on 3/17/2025, 5:39 PM

Unpopular opinion: this is all needless pedantry. At best this gives parsers like file managers a cleaner path to recognizing the specific version of the specific format you're designing. Your successors won't evolve the format with the same rigor you think you're applying now. They just won't. They'll make a "compatible" change at some point in the future which will (1) be actually backwards compatible! yet (2) need to be detected in some affirmative way. Which it won't be. And your magic number will just end up being a wart like all the rest.

This isn't a solvable problem. File formats evolve in messy ways, they always have and always will, and "magic numbers" just aren't an important enough part of the solution to be worth freaking out about.

Just make it unique; read some bytes out of /dev/random, whatever. Arguments like the one here about making them a safe nul-terminated string that is guaranteed to be utf-8 invalid are not going to help anyone in the long term.
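
For what it's worth, the "just read some random bytes" approach is a few lines of Python. This is only a sketch; `random_magic` is a made-up helper, and the idea is to run it once and hard-code the result in the format's spec.

```python
import secrets


def random_magic(n: int = 8) -> bytes:
    """Return n bytes from the OS random source to use as a magic number."""
    return secrets.token_bytes(n)


# Run once, then paste the printed hex into the format definition.
print(random_magic().hex(" "))
```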

by badmintonbaseba on 3/17/2025, 4:38 PM

Then there is mkv/webm, where strictly speaking you need to implement at least part of an EBML parser to distinguish them. That is possibly why no other file format adopts EBML; everything just recognizes a file as either mkv or webm based on dodgy heuristics.
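
To show how much parsing that involves, here is a minimal Python sketch that walks the EBML header and returns the DocType string. It relies on the EBML header ID (1A 45 DF A3) and the DocType element ID (0x4282); the function names are made up, and it handles only the header, not the rest of the file.

```python
EBML_MAGIC = b"\x1a\x45\xdf\xa3"  # EBML header element ID
DOCTYPE_ID = 0x4282               # DocType element ID


def read_vint(buf: bytes, pos: int, keep_marker: bool):
    """Read one EBML variable-length integer; return (value, next_pos)."""
    if pos >= len(buf) or buf[pos] == 0:
        raise ValueError("truncated or unsupported vint")
    first = buf[pos]
    length = 9 - first.bit_length()  # number of leading zero bits, plus one
    if pos + length > len(buf):
        raise ValueError("truncated vint")
    # Element IDs keep the length marker bit; sizes drop it.
    value = first if keep_marker else first & (0xFF >> length)
    for b in buf[pos + 1 : pos + length]:
        value = (value << 8) | b
    return value, pos + length


def ebml_doctype(buf: bytes):
    """Return the DocType (e.g. 'webm' or 'matroska'), or None if absent."""
    if not buf.startswith(EBML_MAGIC):
        return None
    size, pos = read_vint(buf, len(EBML_MAGIC), keep_marker=False)
    end = min(pos + size, len(buf))
    while pos < end:
        elem_id, pos = read_vint(buf, pos, keep_marker=True)
        elem_size, pos = read_vint(buf, pos, keep_marker=False)
        if elem_id == DOCTYPE_ID:
            return buf[pos : pos + elem_size].rstrip(b"\x00").decode("ascii")
        pos += elem_size  # skip elements we do not care about
    return None
```

The four magic bytes alone only say "this is EBML"; telling the two formats apart requires the loop above.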

by weinzierl on 3/17/2025, 9:24 AM

Why is ELF a good example?

    7F 45 4C 46
- MUST be the very first N bytes in the file -> check

- MUST be at least four bytes long, eight is better -> check, but only four

- MUST include at least one byte with the high bit set -> nope

- MUST include a byte sequence that is invalid UTF-8 -> nope

- SHOULD include a zero byte -> nope

So, just 1.5 out of 5. Not good.
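
The byte-level part of this checklist can be automated. Here is a minimal Python sketch; `check_magic` is a made-up helper, and it cannot test the "very first N bytes" rule because that depends on the file layout rather than on the magic itself.

```python
def check_magic(magic: bytes) -> dict:
    """Evaluate the checklist items that depend only on the magic bytes."""
    try:
        magic.decode("utf-8")
        invalid_utf8 = False
    except UnicodeDecodeError:
        invalid_utf8 = True
    return {
        "at least 4 bytes": len(magic) >= 4,
        "at least 8 bytes": len(magic) >= 8,
        "high bit set": any(b & 0x80 for b in magic),
        "invalid UTF-8": invalid_utf8,
        "zero byte": 0 in magic,
    }


for name, ok in check_magic(bytes.fromhex("7f454c46")).items():
    print(f"{name}: {'check' if ok else 'nope'}")
```

For the ELF bytes only the four-byte rule passes, which matches the tally above.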

By the way, does anyone know the reason it starts with DEL (7F) specifically?

by xg15 on 3/18/2025, 7:32 AM

Most of those make intuitive sense, except this one:

> MUST include a byte sequence that is invalid UTF-8

Making the magic number UTF-8 (or ASCII, which would still break the rule) would effectively turn it into a "magic string". Isn't that the better method for distinguishability? It's easier to pick unique memorable strings than unique memorable numbers, and you can also read it in a hex editor.

What would be the downsides?

Or is the idea of the requirement to distinguish the format from plaintext files? I'd think that the version number or the rest of the format would likely already contain some invalid UTF-8 to ensure that.

by secondcoming on 3/17/2025, 9:36 PM

`0xcafebabe` is the ultimate winner and follows none of these rules.

by kazinator on 3/18/2025, 4:24 PM

If there is any foreseeable chance that the format will benefit from being executable, I would make the magic bytes look like this:

  #!/usr/bin/whatever^@^@^@^@^@[HDR]
A hash bang path terminated by a null, followed by some (aligned) binary material with version information and whatnot, all fitting into around 32 bytes.

The header format could allow for variability in the path; the #! and [HDR] parts could be enough to identify it.
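
A rough Python sketch of that idea follows. Everything concrete here is an assumption for illustration: the literal `[HDR]` tag stands in for the real binary material, the header is fixed at 32 bytes, and `build_header` and `identify` are made-up names.

```python
TAG = b"[HDR]"  # placeholder for the aligned binary header material
HEADER_LEN = 32


def build_header(interpreter: bytes) -> bytes:
    """Lay out '#!' + path, NUL padding, then the tag, in HEADER_LEN bytes."""
    line = b"#!" + interpreter
    pad = HEADER_LEN - len(line) - len(TAG)
    if pad < 1:
        raise ValueError("interpreter path too long for the header")
    return line + b"\x00" * pad + TAG


def identify(data: bytes) -> bool:
    """Accept any interpreter path; only '#!', a NUL and the tag are fixed."""
    head = data[:HEADER_LEN]
    return (
        len(head) == HEADER_LEN
        and head.startswith(b"#!")
        and head.endswith(TAG)
        and b"\x00" in head
    )


print(identify(build_header(b"/usr/bin/whatever")))  # True
```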

by hgomersall on 3/17/2025, 10:22 AM

Is anyone able to break down why those requirements are desirable?

by ks2048 on 3/17/2025, 11:30 PM

Do his "good examples" even follow his recommendations? e.g. I think they don't contain a 0x00 byte.

by eternityforest on 3/17/2025, 8:35 PM

Why not just a zero followed by a UUID? UUIDs are the obvious standard everyone knows for identifying stuff.

Maybe a zero, the UUID as ASCII, then another zero, then a human readable description for debugging and search, or a structured metadata header.
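
Both variants are easy to sketch in Python. The UUID value below is a placeholder, not a real format identifier, and the variable names are made up for this example.

```python
import uuid

# Placeholder value; a real format would generate its own UUID once and fix it.
FORMAT_UUID = uuid.UUID("12345678-1234-5678-1234-567812345678")

# Variant 1: a zero byte followed by the 16 raw UUID bytes (17 bytes total).
binary_magic = b"\x00" + FORMAT_UUID.bytes

# Variant 2: zero, the UUID as ASCII text, then another zero (38 bytes total).
ascii_magic = b"\x00" + str(FORMAT_UUID).encode("ascii") + b"\x00"

print(len(binary_magic), len(ascii_magic))  # 17 38
```

The ASCII variant is longer but shows up readably in a hex dump and can be found with a plain text search.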

But first, ask yourself why you are designing a binary format, unless maybe it's a new media container.

When would someone ever want a binary file that's not zip, SQLite, or version controllable text?

by baggy_trough on 3/17/2025, 1:54 PM

Solve this forever by choosing a header that adheres to these properties, then add a UUID for the actual format.

by _ce5e on 3/17/2025, 7:25 PM

Honestly, I just use an arbitrary uint64; it's good enough for the majority of use cases.

Sometimes I like to have fun and encode a 1337-code easter egg in the hexadecimal representation.