The walkthrough is very nice; it shows how to do this if you're going to do it yourself.
If you're going for pure performance in a production environment you might take a look at Daniel Lemire's work: https://github.com/simdjson/simdjson. Or the MinIO port of it to Go: https://github.com/minio/simdjson-go.
My own lessons from writing fast JSON parsers are mostly language-specific, but here are some generalisations:
Avoid heap allocations in tokenising. Have a tokeniser that is a function returning a stack-allocated struct, or an int64 token that is a packed field describing the token's start offset, length and type (a minimal sketch follows this list).
Avoid heap allocations in parsing: support a getString(key) style interface for clients that want to chop up a buffer themselves.
For deserialising to objects whose fields you know at compile time, generate a switch on key length before comparing string values.
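To make the first point concrete, here is a minimal Go sketch of the packed-token idea. The names and bit layout are made up for illustration (3 bits of kind, 29 bits of start offset, 32 bits of length), not taken from any particular library:

    package token

    // Kind is the token type.
    type Kind uint8

    const (
        KindString Kind = iota
        KindNumber
        KindTrue
        KindFalse
        KindNull
        KindDelim
    )

    // Token packs kind, start offset and length into a single word, so the
    // tokeniser can return it by value and never touch the heap.
    type Token int64

    func Pack(kind Kind, start, length int) Token {
        return Token(int64(kind)<<61 | int64(start)<<32 | int64(length))
    }

    func (t Token) Kind() Kind  { return Kind(uint64(t) >> 61) }
    func (t Token) Start() int  { return int(uint64(t) >> 32 & 0x1fffffff) }
    func (t Token) Length() int { return int(uint32(t)) }

    // Bytes returns the token's text as a sub-slice of the original input,
    // without copying.
    func (t Token) Bytes(input []byte) []byte {
        return input[t.Start() : t.Start()+t.Length()]
    }

A getString(key) style accessor (the second point) can then hand back a sub-slice of the original buffer by offset and length instead of allocating a copy.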
My experience in data pipelines that process lots of JSON is that the choice of JSON library can make a 3-10x performance difference, and that all the main parsers want to allocate objects.
If the classes you are serialising or deserialising are known at compile time then Jackson (Java) does a good job, but you can get a 2x boost with careful coding and profiling.
Whereas if you are parsing arbitrary JSON, all the mainstream parsers want to do lots of allocations that a more intrusive parser you write yourself can avoid, and that can be a massive performance win if you are processing thousands or millions of objects per second.
I've taken a very similar approach and built a GraphQL tokenizer and parser (amongst many other things) that's also zero memory allocations and quite fast. In case you'd like to check out the code: https://github.com/wundergraph/graphql-go-tools
In n2[1] I needed a fast tokenizer and had the same "garbage factory" problem, which is basically that there's a set of constant tokens (like json.Delim in this post) and then strings which cause allocations.
I came up with what I think is a kind of neat solution, which is that the tokenizer is generic over some T and takes a function from byteslice to T and uses T in place of the strings. This way, when the caller has some more efficient representation available (like one that allocates less) it can provide one, but I can still unit test the tokenizer with the identity function for convenience.
In a sense this is like fusing the tokenizer with the parser at build time, but the generic allows layering the tokenizer such that it doesn't know about the parser's representation.
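Translated into Go for this thread (not my actual code, which is in Rust; names made up, heavily simplified: no escape handling, anything that isn't a string or a space is a one-byte delimiter), the shape is roughly:

    package lex

    // Kind distinguishes the constant-ish tokens from string tokens.
    type Kind int

    const (
        EOF Kind = iota
        Delim
        String
    )

    // Token is generic over T, the caller-chosen representation of string data.
    type Token[T any] struct {
        Kind  Kind
        Delim byte // set when Kind == Delim
        Value T    // set when Kind == String
    }

    type Tokenizer[T any] struct {
        input  []byte
        pos    int
        intern func([]byte) T // turns a byte span into the caller's representation
    }

    func New[T any](input []byte, intern func([]byte) T) *Tokenizer[T] {
        return &Tokenizer[T]{input: input, intern: intern}
    }

    // Next returns the next token. Grossly simplified: strings contain no
    // escapes, and anything that isn't a string or a space is a one-byte delimiter.
    func (t *Tokenizer[T]) Next() Token[T] {
        for t.pos < len(t.input) && t.input[t.pos] == ' ' {
            t.pos++
        }
        if t.pos >= len(t.input) {
            return Token[T]{Kind: EOF}
        }
        if c := t.input[t.pos]; c != '"' {
            t.pos++
            return Token[T]{Kind: Delim, Delim: c}
        }
        start := t.pos + 1
        end := start
        for end < len(t.input) && t.input[end] != '"' {
            end++
        }
        t.pos = end + 1
        return Token[T]{Kind: String, Value: t.intern(t.input[start:end])}
    }

A unit test can instantiate it with New(data, func(b []byte) string { return string(b) }), while a real parser can pass an interner that returns, say, indices into its own string table and allocates nothing per token.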
It's possible to improve over the standard library with better API design, but it's not really possible to do a fully streaming parser that doesn't half-fill structures before finding an error and bailing out in the middle, and avoiding that is another explicit design constraint for the standard library.
Maybe I overlooked something, but the author keeps repeating that they wrote a "streaming" parser without ever explaining what that actually means. In particular, they never explained how they deal with repeated keys in "hash tables". What does their parser do? Call the "sink" code twice with the repeated key? Wait until the entire "hash table" is read and then call the "sink" code?
In my mind, JSON is inherently inadequate for streaming because of its hierarchical structure, no length known upfront and, most importantly, repeated keys. You could probably make a subset of JSON more streaming-friendly, but at that point, why bother? I mean, if the solution is to modify JSON, then a better solution would be something that's not JSON at all.
Great to see a shout out to Phil Pearl! Also worth looking at https://github.com/bytedance/sonic
I'm surprised there's no way to say 'I really mean it, inline this function' for the stuff that didn't inline because it was too big.
The baseline whitespace count/search operation seems like it would be MUCH faster if you vectorized it with SIMD, but I can understand that being out of scope for the author.
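Even without real SIMD (which in Go means assembly or an external package), a SWAR ("SIMD within a register") version shows the direction. A rough sketch I'd expect to work, assuming JSON whitespace is only space, tab, CR and LF:

    package main

    import (
        "encoding/binary"
        "fmt"
        "math/bits"
    )

    const (
        ones = 0x0101010101010101
        high = 0x8080808080808080
        low7 = 0x7f7f7f7f7f7f7f7f
    )

    // eqMask returns a word with 0x80 in every byte of w that equals b
    // (the exact zero-byte test, no false positives from borrow propagation).
    func eqMask(w uint64, b byte) uint64 {
        x := w ^ (ones * uint64(b)) // bytes equal to b become zero
        return ^(((x & low7) + low7) | x | low7)
    }

    // skipWhitespace returns the index of the first non-whitespace byte of s,
    // examining 8 bytes per iteration instead of 1.
    func skipWhitespace(s []byte) int {
        i := 0
        for ; i+8 <= len(s); i += 8 {
            w := binary.LittleEndian.Uint64(s[i:])
            ws := eqMask(w, ' ') | eqMask(w, '\t') | eqMask(w, '\n') | eqMask(w, '\r')
            if ws != high {
                // At least one non-whitespace byte in this word: locate the first.
                return i + bits.TrailingZeros64(^ws&high)/8
            }
        }
        for ; i < len(s); i++ {
            switch s[i] {
            case ' ', '\t', '\n', '\r':
            default:
                return i
            }
        }
        return i
    }

    func main() {
        fmt.Println(skipWhitespace([]byte("  \t\n   {\"key\": 1}"))) // 7
    }

A production version would hand the hot loop to assembly or a SIMD library, but the word-at-a-time idea is the same.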
"It’s unrealistic to expect to have the entire input in memory" -- wrong for most applications
You might want to take a look at https://github.com/ohler55/ojg. It takes a different approach with a single pass parser. There are some performance benchmarks included on the README.md landing page.
> Any (useful) JSON decoder code cannot go faster than this.
That line feels like a troll. Cunningham’s Law in action.
You can definitely go faster than 2 GB/sec. In a word, SIMD.
"But there is a better trick that we can use that is more space efficient than this table, and is sometimes called a computed goto."
From 1989:
https://raw.githubusercontent.com/spitbol/x32/master/docs/sp...
"Indirection in the Goto field is a more powerful version of the computed Goto which appears in some languages. It allows a program to quickly perform a multi-way control branch based on an item of data."
I remember reading an SO question that asked for a C library to parse JSON. A comment was like: C developers won't use a library for JSON, they will write one themselves.
I don't know how "true" that comment is but I thought I should try to write a parser myself to get a feel :D
So I wrote one, in Python - https://arunmani.in/articles/silly-json-parser/
It was a delightful experience though, writing and testing to break your own code with a wide variety of inputs. :)
I remember this JSON benchmark page from RapidJSON [1].
I recently gave a talk (https://youtu.be/a7VBbbcmxyQ?si=0fGVxfc4qmKMVCXk) about github.com/romshark/jscan, which I've been working on. It's a performance-oriented JSON iterator/tokenizer you might want to take a look at if you're interested in high-performance, zero-allocation JSON parsing in Go.
This is fantastically useful.
Funnily enough, I stumbled upon your article just yesterday through a Google search.
This is a very poor and overly simplified text on writing basic JSON parsers; it doesn't touch any of the topics involved in writing actually fast JSON parsers, such as non-copying tokenizers (e.g. jsmn), word-wise tokenizers (simdjson) and fast numeric conversions (fast_double_parser et al.).
A person who helped me out a lot when I was learning to code wrote his own .NET JSON library because the MS-provided one had a rough API and was quite slow.
His lib became the de facto JSON lib in .NET dev and, naturally, MS head-hunted him.
Fast JSON is that important these days.
Writing a json parser is definitely an educational experience. I wrote one this summer for my own purposes that is decently fast: https://github.com/nwpierce/jsb
Can someone explain to me why JSON can't have comments or trailing commas? I really hope the performance gains are worth it, because I've lost 100s of man-hours to those things, and had to resort to stuff like this in package.json:
"IMPORTANT: do not run the scripts below this line, they are for CICD only": true,
I wrote a Rust library that works similarly to the author's byteReader: https://crates.io/crates/fixed-buffer
These are always interesting to read because you get to see runtime quirks. I'm surprised there was so much function call overhead, for example. And it's interesting that you can bypass range checking.
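On the range-check point: the Go compiler eliminates bounds checks it can prove are redundant, and the usual idiom (the one encoding/binary uses) is a single early index expression that establishes the slice length. A rough sketch:

    package main

    import "fmt"

    // le64 decodes 8 little-endian bytes. The `_ = b[7]` line is the standard
    // bounds-check-elimination hint: it panics early if b is too short, and
    // lets the compiler drop the checks on the eight accesses below.
    func le64(b []byte) uint64 {
        _ = b[7]
        return uint64(b[0]) | uint64(b[1])<<8 | uint64(b[2])<<16 | uint64(b[3])<<24 |
            uint64(b[4])<<32 | uint64(b[5])<<40 | uint64(b[6])<<48 | uint64(b[7])<<56
    }

    func main() {
        fmt.Println(le64([]byte{1, 0, 0, 0, 0, 0, 0, 0})) // 1
    }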
The most important thing, though, is the process: measure then optimize.
This was a very good read, and I did learn some nice tricks, thank you very much.
I notice 'sample.json' contains quite a few escaped nulls \u0000 inside quoted strings.
Is "\u0000" legal JSON?
P.S. ... and many other control characters < \u0020
My favorite bit about this is his reference to John Ousterhout's "define errors out of existence": youtu.be/bmSAYlu0NcY?si=WjC1ouEN1ad2OWjp&t=1312
Note the distinct lack of:
if err != nil {
How is this compared to Daniel Lemire's simdjson? https://github.com/simdjson/simdjson
In what cases would an application need to regularly parse gigabytes of JSON? Wouldn't it be advantageous for the app to get that data into a DB?
Also interesting: https://youtu.be/a7VBbbcmxyQ
Wish I wasn't 4 or 5 uncompleted projects deep right now and had the time to rewrite a monkey parser using all these tricks.
what does this bring over goccy's json encoder?
Nowadays I'm more interested in a "forgiving" JSON/YAML parser that would recover from LLM errors. Is there such a thing?
Looks pretty good! Even though I've written far too many JSON parsers already in my career, it's really nice to have a reference for how to think about making a reasonable, fast JSON parser, going through each step individually.
That said, I will say one thing: you don't really need to have an explicit tokenizer for JSON. You can get rid of the concept of tokens and integrate parsing and tokenization entirely. This is what I usually do, since it makes everything simpler. It's a lot harder to do for the rest of ECMAScript, where you wind up needing look-ahead (sometimes arbitrarily large look-ahead... consider arrow functions: their parameter list is mostly a subset of the grammar of a parenthesized expression, comma is an operator, and for default values, equals is an operator. It isn't until the => does or does not appear that you know for sure!)
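To illustrate the JSON case, here's a hedged, heavily simplified sketch in Go (arrays, strings and literals only; no numbers or objects, no escape sequences, minimal error handling, not anyone's production code). The parser just peeks at the next byte and recurses, and no token values ever exist:

    package main

    import (
        "bytes"
        "fmt"
    )

    type parser struct {
        in  []byte
        pos int
    }

    func (p *parser) skipSpace() {
        for p.pos < len(p.in) {
            switch p.in[p.pos] {
            case ' ', '\t', '\r', '\n':
                p.pos++
            default:
                return
            }
        }
    }

    // parseValue dispatches on the first byte of the value; no token stream needed.
    func (p *parser) parseValue() (any, error) {
        p.skipSpace()
        if p.pos >= len(p.in) {
            return nil, fmt.Errorf("unexpected end of input")
        }
        switch p.in[p.pos] {
        case '[':
            return p.parseArray()
        case '"':
            return p.parseString()
        case 't':
            return p.parseLit("true", true)
        case 'f':
            return p.parseLit("false", false)
        case 'n':
            return p.parseLit("null", nil)
        default:
            return nil, fmt.Errorf("unexpected byte %q at %d", p.in[p.pos], p.pos)
        }
    }

    func (p *parser) parseArray() (any, error) {
        p.pos++ // consume '['
        var out []any
        for {
            p.skipSpace()
            if p.pos < len(p.in) && p.in[p.pos] == ']' {
                p.pos++
                return out, nil
            }
            if len(out) > 0 {
                if p.pos >= len(p.in) || p.in[p.pos] != ',' {
                    return nil, fmt.Errorf("expected ',' at %d", p.pos)
                }
                p.pos++
            }
            v, err := p.parseValue()
            if err != nil {
                return nil, err
            }
            out = append(out, v)
        }
    }

    func (p *parser) parseString() (any, error) {
        end := bytes.IndexByte(p.in[p.pos+1:], '"') // no escape handling
        if end < 0 {
            return nil, fmt.Errorf("unterminated string at %d", p.pos)
        }
        s := string(p.in[p.pos+1 : p.pos+1+end])
        p.pos += end + 2
        return s, nil
    }

    func (p *parser) parseLit(lit string, val any) (any, error) {
        if !bytes.HasPrefix(p.in[p.pos:], []byte(lit)) {
            return nil, fmt.Errorf("invalid literal at %d", p.pos)
        }
        p.pos += len(lit)
        return val, nil
    }

    func main() {
        p := &parser{in: []byte(` ["a", true, null, ["b"]] `)}
        v, err := p.parseValue()
        fmt.Println(v, err) // [a true <nil> [b]] <nil>
    }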