Fast columnar JSON decoding with arrow-rs

by necubi on 3/23/2025, 5:10 PM with 7 comments

by jdf on 3/26/2025, 8:08 PM

It would be great if someone could implement the schema discovery algorithm from the DB research GOAT, Thomas Neumann, and add it to Apache Arrow: https://db.in.tum.de/~durner/papers/json-tiles-sigmod21.pdf

by vjerancrnjak on 3/24/2025, 11:31 PM

Given that the schema is known, it should be possible to avoid general JSON parsing. That would be much faster.

by at0mic22 on 3/24/2025, 8:44 PM

How does it compare with serde, which AFAIK uses the same approach?

by atombender on 3/24/2025, 11:46 PM

The benchmark section ("But is it fast?") contains a common error when trying to represent ratios as percentages.

For the "Tweets" case, it reports a speedup of 229%. The old value is 11.73 and the new is 5.108. That is a speedup of 2.293 (i.e. the new measurement is 2.293 times faster), but that is a difference of -56%, not 229%, so it's 129% faster, if you really want to use a comparative percentage.

Because using percentages to express a ratio of change can be confusing or misleading, I always recommend using speedup instead, which is a simple ratio. A speedup of 2 is twice as fast. A speedup of 1 is the same. 0.5 is half as fast.

Formulas:

    speedup(old, new) = old / new

    relativePercent(old, new) = ((new / old) - 1) * 100

    differenceInPercent(old, new) = (new - old) / old * 100
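
For anyone who wants to plug in their own numbers, here is a minimal Python sketch of the three formulas above, applied to the "Tweets" measurements quoted earlier (old = 11.73, new = 5.108). The snake_case function names are just my transliteration of the formula names. Note that, as written, relativePercent and differenceInPercent are algebraically the same expression, so they return the same value.

```python
def speedup(old: float, new: float) -> float:
    """Plain ratio: 2.0 means twice as fast, 1.0 unchanged, 0.5 half as fast."""
    return old / new


def relative_percent(old: float, new: float) -> float:
    """New measurement relative to the old one, as a signed percentage."""
    return ((new / old) - 1) * 100


def difference_in_percent(old: float, new: float) -> float:
    """Change from old to new as a percentage of old.

    Algebraically identical to relative_percent as the formulas are written.
    """
    return (new - old) / old * 100


if __name__ == "__main__":
    # "Tweets" numbers quoted in the comment above.
    old, new = 11.73, 5.108
    print(f"speedup:             {speedup(old, new):.3f}")
    print(f"relativePercent:     {relative_percent(old, new):.1f}%")
    print(f"differenceInPercent: {difference_in_percent(old, new):.1f}%")
```

With these inputs the speedup comes out at roughly 2.3 and the percentage difference at roughly -56%, matching the figures above to within rounding.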