The walkthrough is very nice; it shows how to do this if you're going to do it yourself.
If you're going for pure performance in a production environment you might take a look at Daniel Lemire's work: https://github.com/simdjson/simdjson. Or the MinIO port of it to Go: https://github.com/minio/simdjson-go.
My own lessons from writing fast JSON parsers are mostly language-specific, but here are some generalisations:
Avoid heap allocations in tokenising. Have a tokeniser that is a function returning a stack-allocated struct, or an int64 token that is a packed field describing the token's start offset, length and type (a minimal sketch follows this list).
Avoid heap allocations in parsing: support a getString(key) style interface for clients that want to chop up a buffer themselves.
For deserialising to objects whose fields you know at compile time, generate a switch on key length before comparing string values.
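To make the first point concrete, here is a minimal Go sketch of the packed-token idea. The names and bit layout are made up for illustration (3 bits of kind, 29 bits of start offset, 32 bits of length), not taken from any particular library:

    package token

    // Kind is the token type.
    type Kind uint8

    const (
        KindString Kind = iota
        KindNumber
        KindTrue
        KindFalse
        KindNull
        KindDelim
    )

    // Token packs kind, start offset and length into a single word, so the
    // tokeniser can return it by value and never touch the heap.
    type Token int64

    func Pack(kind Kind, start, length int) Token {
        return Token(int64(kind)<<61 | int64(start)<<32 | int64(length))
    }

    func (t Token) Kind() Kind  { return Kind(uint64(t) >> 61) }
    func (t Token) Start() int  { return int(uint64(t) >> 32 & 0x1fffffff) }
    func (t Token) Length() int { return int(uint32(t)) }

    // Bytes returns the token's text as a sub-slice of the original input,
    // without copying.
    func (t Token) Bytes(input []byte) []byte {
        return input[t.Start() : t.Start()+t.Length()]
    }

A getString(key) style accessor (the second point) can then hand back a sub-slice of the original buffer by offset and length instead of allocating a copy.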
My experience in data pipelines that process lots of JSON is that the choice of JSON library can make a 3-10x performance difference, and that all the main parsers want to allocate objects.
If the classes you are serialising or deserialising are known at compile time then Jackson (Java) does a good job, but you can get a 2x boost with careful coding and profiling.
Whereas if you are parsing arbitrary JSON, all the mainstream parsers want to do lots of allocations that a more intrusive parser you write yourself can avoid, and that can be a massive performance win if you are processing thousands or millions of objects per second.
I've taken a very similar approach and built a GraphQL tokenizer and parser (amongst many other things) that's also zero memory allocations and quite fast. In case you'd like to check out the code: https://github.com/wundergraph/graphql-go-tools
In n2[1] I needed a fast tokenizer and had the same "garbage factory" problem, which is basically that there's a set of constant tokens (like json.Delim in this post) and then strings which cause allocations.
I came up with what I think is a kind of neat solution, which is that the tokenizer is generic over some T and takes a function from byteslice to T and uses T in place of the strings. This way, when the caller has some more efficient representation available (like one that allocates less) it can provide one, but I can still unit test the tokenizer with the identity function for convenience.
In a sense this is like fusing the tokenizer with the parser at build time, but the generic allows layering the tokenizer such that it doesn't know about the parser's representation.
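Translated into Go for this thread (not my actual code, which is in Rust; names made up, heavily simplified: no escape handling, anything that isn't a string or a space is a one-byte delimiter), the shape is roughly:

    package lex

    // Kind distinguishes the constant-ish tokens from string tokens.
    type Kind int

    const (
        EOF Kind = iota
        Delim
        String
    )

    // Token is generic over T, the caller-chosen representation of string data.
    type Token[T any] struct {
        Kind  Kind
        Delim byte // set when Kind == Delim
        Value T    // set when Kind == String
    }

    type Tokenizer[T any] struct {
        input  []byte
        pos    int
        intern func([]byte) T // turns a byte span into the caller's representation
    }

    func New[T any](input []byte, intern func([]byte) T) *Tokenizer[T] {
        return &Tokenizer[T]{input: input, intern: intern}
    }

    // Next returns the next token. Grossly simplified: strings contain no
    // escapes, and anything that isn't a string or a space is a one-byte delimiter.
    func (t *Tokenizer[T]) Next() Token[T] {
        for t.pos < len(t.input) && t.input[t.pos] == ' ' {
            t.pos++
        }
        if t.pos >= len(t.input) {
            return Token[T]{Kind: EOF}
        }
        if c := t.input[t.pos]; c != '"' {
            t.pos++
            return Token[T]{Kind: Delim, Delim: c}
        }
        start := t.pos + 1
        end := start
        for end < len(t.input) && t.input[end] != '"' {
            end++
        }
        t.pos = end + 1
        return Token[T]{Kind: String, Value: t.intern(t.input[start:end])}
    }

A unit test can instantiate it with New(data, func(b []byte) string { return string(b) }), while a real parser can pass an interner that returns, say, indices into its own string table and allocates nothing per token.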
It's possible to improve over the standard library with better API design, but it's not really possible to do a fully streaming parser that doesn't half-fill structures before finding an error and bailing out in the middle, and avoiding that is another explicit design constraint for the standard library.
Maybe I overlooked something, but the author keeps repeating that they wrote a "streaming" parser without ever explaining what that actually means. In particular, they never explained how they deal with repeated keys in "hash tables". What does their parser do? Call the "sink" code twice with the repeated key? Wait until the entire "hash table" is read and then call the "sink" code?
In my mind, JSON is inherently inadequate for streaming because of its hierarchical structure, no length known upfront and, most importantly, repeated keys. You could probably make a subset of JSON more streaming-friendly, but at that point, why bother? I mean, if the solution is to modify JSON, then a better solution would be something that's not JSON at all.
Great to see a shout out to Phil Pearl! Also worth looking at https://github.com/bytedance/sonic
I'm surprised there's no way to say 'I really mean it, inline this function' for the stuff that didn't inline because it was too big.
The baseline whitespace count/search operation seems like it would be MUCH faster if you vectorized it with SIMD, but I can understand that being out of scope for the author.
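Even without real SIMD (which in Go means assembly or an external package), a SWAR ("SIMD within a register") version shows the direction. A rough sketch I'd expect to work, assuming JSON whitespace is only space, tab, CR and LF:

    package main

    import (
        "encoding/binary"
        "fmt"
        "math/bits"
    )

    const (
        ones = 0x0101010101010101
        high = 0x8080808080808080
        low7 = 0x7f7f7f7f7f7f7f7f
    )

    // eqMask returns a word with 0x80 in every byte of w that equals b
    // (the exact zero-byte test, no false positives from borrow propagation).
    func eqMask(w uint64, b byte) uint64 {
        x := w ^ (ones * uint64(b)) // bytes equal to b become zero
        return ^(((x & low7) + low7) | x | low7)
    }

    // skipWhitespace returns the index of the first non-whitespace byte of s,
    // examining 8 bytes per iteration instead of 1.
    func skipWhitespace(s []byte) int {
        i := 0
        for ; i+8 <= len(s); i += 8 {
            w := binary.LittleEndian.Uint64(s[i:])
            ws := eqMask(w, ' ') | eqMask(w, '\t') | eqMask(w, '\n') | eqMask(w, '\r')
            if ws != high {
                // At least one non-whitespace byte in this word: locate the first.
                return i + bits.TrailingZeros64(^ws&high)/8
            }
        }
        for ; i < len(s); i++ {
            switch s[i] {
            case ' ', '\t', '\n', '\r':
            default:
                return i
            }
        }
        return i
    }

    func main() {
        fmt.Println(skipWhitespace([]byte("  \t\n   {\"key\": 1}"))) // 7
    }

A production version would hand the hot loop to assembly or a SIMD library, but the word-at-a-time idea is the same.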
"It’s unrealistic to expect to have the entire input in memory" -- wrong for most applications
You might want to take a look at https://github.com/ohler55/ojg. It takes a different approach with a single pass parser. There are some performance benchmarks included on the README.md landing page.
> Any (useful) JSON decoder code cannot go faster than this.
That line feels like a troll. Cunningham’s Law in action.
You can definitely go faster than 2 GB/sec. In a word, SIMD.
"But there is a better trick that we can use that is more space efficient than this table, and is sometimes called a computed goto."
From 1989:
https://raw.githubusercontent.com/spitbol/x32/master/docs/sp...
"Indirection in the Goto field is a more powerful version of the computed Goto which appears in some languages. It allows a program to quickly perform a multi-way control branch based on an item of data."
I remember reading an SO question that asked for a C library to parse JSON. A comment was like: C developers won't use a library for JSON, they will write one themselves.
I don't know how "true" that comment is but I thought I should try to write a parser myself to get a feel :D
So I wrote one, in Python - https://arunmani.in/articles/silly-json-parser/
It was a delightful experience though, writing and testing to break your own code with a wide variety of inputs. :)
I remember this JSON benchmark page from RapidJSON [1].
I recently gave a talk (https://youtu.be/a7VBbbcmxyQ?si=0fGVxfc4qmKMVCXk) about github.com/romshark/jscan, which I've been working on. It's a performance-oriented JSON iterator/tokenizer you might want to take a look at if you're interested in high-performance, zero-allocation JSON parsing in Go.
This is fantastically useful.
Funnily enough, I stumbled upon your article just yesterday through a Google search.
This is a very poor and overly simplified text on writing basic JSON parsers; it doesn't touch any of the topics involved in writing actually fast JSON parsers, such as non-copying tokenizers (e.g. jsmn), word-wise tokenizers (simdjson) and fast numeric conversions (fast_double_parser et al.).
A person who helped me out a lot when I was learning to code wrote his own .NET JSON library because the MS-provided one had a rough API and was quite slow.
His lib became the de facto JSON lib in .NET dev and, naturally, MS head-hunted him.
Fast JSON is that important these days.
Writing a json parser is definitely an educational experience. I wrote one this summer for my own purposes that is decently fast: https://github.com/nwpierce/jsb
Can someone explain to me why JSON can't have comments or trailing commas? I really hope the performance gains are worth it, because I've lost 100s of man-hours to those things, and had to resort to stuff like this in package.json:
"IMPORTANT: do not run the scripts below this line, they are for CICD only": true,
I wrote a Rust library that works similarly to the author's byteReader: https://crates.io/crates/fixed-buffer
These are always interesting to read because you get to see runtime quirks. I'm surprised there was so much function call overhead, for example. And it's interesting that you can bypass range checking.
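On the range-check point: the Go compiler eliminates bounds checks it can prove are redundant, and the usual idiom (the one encoding/binary uses) is a single early index expression that establishes the slice length. A rough sketch:

    package main

    import "fmt"

    // le64 decodes 8 little-endian bytes. The `_ = b[7]` line is the standard
    // bounds-check-elimination hint: it panics early if b is too short, and
    // lets the compiler drop the checks on the eight accesses below.
    func le64(b []byte) uint64 {
        _ = b[7]
        return uint64(b[0]) | uint64(b[1])<<8 | uint64(b[2])<<16 | uint64(b[3])<<24 |
            uint64(b[4])<<32 | uint64(b[5])<<40 | uint64(b[6])<<48 | uint64(b[7])<<56
    }

    func main() {
        fmt.Println(le64([]byte{1, 0, 0, 0, 0, 0, 0, 0})) // 1
    }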
The most important thing, though, is the process: measure then optimize.
This was a very good read, and I did learn some nice tricks, thank you very much.
I notice 'sample.json' contains quite a few escaped nulls \u0000 inside quoted strings.
Is "\u0000" legal JSON?
P.S. ... and many other control characters < \u0020
My favorite bit about this is his reference to John Ousterhout's "define errors out of existence": youtu.be/bmSAYlu0NcY?si=WjC1ouEN1ad2OWjp&t=1312
Note the distinct lack of:
if err != nil {
How is this compared to Daniel Lemire's simdjson? https://github.com/simdjson/simdjson
In what cases would an application need to regularly parse gigabytes of JSON? Wouldn't it be advantageous for the app to get that data into a DB?
Also interesting: https://youtu.be/a7VBbbcmxyQ
Wish I wasn't 4 or 5 uncompleted projects deep right now and had the time to rewrite a monkey parser using all these tricks.
what does this bring over goccy's json encoder?
Nowadays I'm more interested in a "forgiving" JSON/YAML parser that would recover from LLM errors. Is there such a thing?
Looks pretty good! Even though I've written far too many JSON parsers already in my career, it's really nice to have a reference for how to think about making a reasonable, fast JSON parser, going through each step individually.
That said, I will say one thing: you don't really need to have an explicit tokenizer for JSON. You can get rid of the concept of tokens and integrate parsing and tokenization entirely. This is what I usually do, since it makes everything simpler. It's a lot harder to do for the rest of ECMAScript, where you wind up needing look-ahead (sometimes arbitrarily large look-ahead... consider arrow functions: their parameter list is mostly a subset of the grammar of a parenthesized expression, comma is an operator, and for default values, equals is an operator. It isn't until the => does or does not appear that you know for sure!)
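To illustrate the JSON case, here's a hedged, heavily simplified sketch in Go (arrays, strings and literals only; no numbers or objects, no escape sequences, minimal error handling, not anyone's production code). The parser just peeks at the next byte and recurses, and no token values ever exist:

    package main

    import (
        "bytes"
        "fmt"
    )

    type parser struct {
        in  []byte
        pos int
    }

    func (p *parser) skipSpace() {
        for p.pos < len(p.in) {
            switch p.in[p.pos] {
            case ' ', '\t', '\r', '\n':
                p.pos++
            default:
                return
            }
        }
    }

    // parseValue dispatches on the first byte of the value; no token stream needed.
    func (p *parser) parseValue() (any, error) {
        p.skipSpace()
        if p.pos >= len(p.in) {
            return nil, fmt.Errorf("unexpected end of input")
        }
        switch p.in[p.pos] {
        case '[':
            return p.parseArray()
        case '"':
            return p.parseString()
        case 't':
            return p.parseLit("true", true)
        case 'f':
            return p.parseLit("false", false)
        case 'n':
            return p.parseLit("null", nil)
        default:
            return nil, fmt.Errorf("unexpected byte %q at %d", p.in[p.pos], p.pos)
        }
    }

    func (p *parser) parseArray() (any, error) {
        p.pos++ // consume '['
        var out []any
        for {
            p.skipSpace()
            if p.pos < len(p.in) && p.in[p.pos] == ']' {
                p.pos++
                return out, nil
            }
            if len(out) > 0 {
                if p.pos >= len(p.in) || p.in[p.pos] != ',' {
                    return nil, fmt.Errorf("expected ',' at %d", p.pos)
                }
                p.pos++
            }
            v, err := p.parseValue()
            if err != nil {
                return nil, err
            }
            out = append(out, v)
        }
    }

    func (p *parser) parseString() (any, error) {
        end := bytes.IndexByte(p.in[p.pos+1:], '"') // no escape handling
        if end < 0 {
            return nil, fmt.Errorf("unterminated string at %d", p.pos)
        }
        s := string(p.in[p.pos+1 : p.pos+1+end])
        p.pos += end + 2
        return s, nil
    }

    func (p *parser) parseLit(lit string, val any) (any, error) {
        if !bytes.HasPrefix(p.in[p.pos:], []byte(lit)) {
            return nil, fmt.Errorf("invalid literal at %d", p.pos)
        }
        p.pos += len(lit)
        return val, nil
    }

    func main() {
        p := &parser{in: []byte(` ["a", true, null, ["b"]] `)}
        v, err := p.parseValue()
        fmt.Println(v, err) // [a true <nil> [b]] <nil>
    }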