Using machine learning to choose compression algorithms

by kootenpv on 12/8/2019, 6:49 PM with 58 comments

by pedrocx486 on 12/8/2019, 7:39 PM

Is there a way to do the reverse?

There's a rather "legendary" Game Boy Advance game out there (Klonoa - Densetsu no Star Medal) that never got an English translation because Namco applied some sort of in-house compression to it so the game could fit into a GBA cartridge. AFAIK no one has ever been able to crack it open and release code to de/compress it.

A while ago I had a "bounty" of USD 100 for anyone who could do it (just the decompression and re-compression, not the translation), but there aren't many people who want to fiddle with low-level GBA coding.

by mappu on 12/8/2019, 11:19 PM

Are all compressors simply being run with their default arguments? There's a lot of scope for speed/filesize tradeoffs within a single compressor.

EDIT: You are missing `csv+zstd`? It should obsolete `csv+gzip` at all speeds and compression levels.

There is a Pareto-optimality frontier here - I ran my own testing back in 2016 https://code.ivysaur.me/compression-performance-test/ but the numbers are now a bit obsolete (e.g. zstd and brotli have both seen a lot of improvements since).
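
For what it's worth, here is a quick sketch of the kind of sweep I mean (not from the article; it assumes the third-party `zstandard` package is installed and uses a made-up DataFrame as stand-in data):

```python
import gzip
import time

import pandas as pd
import zstandard  # third-party; `pip install zstandard` (assumed available)

# Hypothetical stand-in for a csv-serialized DataFrame.
df = pd.DataFrame({"x": range(100_000), "y": [i * 0.5 for i in range(100_000)]})
raw = df.to_csv(index=False).encode()

for level in (1, 3, 6, 9):
    t0 = time.perf_counter()
    gz = gzip.compress(raw, compresslevel=level)
    t1 = time.perf_counter()
    zs = zstandard.ZstdCompressor(level=level).compress(raw)
    t2 = time.perf_counter()
    print(f"level {level}: gzip {len(gz):>8} B in {t1 - t0:.3f}s | "
          f"zstd {len(zs):>8} B in {t2 - t1:.3f}s")
```

The point is just that "gzip vs zstd" is really a family of points on the speed/size curve, not a single comparison.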

by jedimastert on 12/8/2019, 8:05 PM

A somewhat different but related example (at least by name): I recall an article from Chris Wellons' blog, Null Program, where he wanted to "discover" a new hashing algorithm, so he randomly generated candidates, JIT-compiled them to native code, and then tested them.

There was, however, no machine learning or optimization. Instead, he called it "prospecting" and just generated a new one from scratch each time until he found something interesting.

https://nullprogram.com/blog/2018/07/31/
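
For anyone curious, here is a very rough pure-Python sketch of the "prospecting" idea - not Wellons' actual approach, which generated and JIT-compiled native code - just randomly generating xorshift-multiply candidates and scoring their avalanche behaviour:

```python
import random

def make_candidate():
    """Build a random 32-bit hash from a couple of (shift, multiply) pairs."""
    params = [(random.randint(13, 19), random.getrandbits(32) | 1) for _ in range(2)]
    def h(x):
        for shift, mult in params:
            x ^= x >> shift
            x = (x * mult) & 0xFFFFFFFF
        x ^= x >> 16
        return x
    return h, params

def avalanche_bias(h, trials=1000):
    """Average |P(output bit flips) - 0.5| when a single input bit is flipped."""
    total = 0.0
    for _ in range(trials):
        x = random.getrandbits(32)
        bit = 1 << random.randrange(32)
        diff = h(x) ^ h(x ^ bit)
        total += abs(bin(diff).count("1") / 32 - 0.5)
    return total / trials

# "Prospect": generate candidates from scratch and keep the least biased one.
best = min((make_candidate() for _ in range(100)), key=lambda c: avalanche_bias(c[0]))
print("best params:", best[1])
```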

by md_ on 12/8/2019, 8:06 PM

When and why would you want to apply ML to this problem?

Off the cuff, this seems to compete with the alternative of simply running every candidate compression algorithm and choosing the best one. I guess the ML approach would be advantageous if the RF classifier is meaningfully faster to run than the compression algorithms themselves. Is it?
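
For reference, here is roughly what the brute-force alternative looks like (a sketch using only a few stdlib codecs and csv serialization; the submission covers more combinations):

```python
import bz2
import gzip
import lzma

import pandas as pd

def best_compressor(df: pd.DataFrame) -> str:
    """Brute-force baseline: serialize to csv, try each codec, keep the smallest."""
    raw = df.to_csv(index=False).encode()
    candidates = {
        "csv+gzip": gzip.compress(raw),
        "csv+bz2": bz2.compress(raw),
        "csv+lzma": lzma.compress(raw),
    }
    return min(candidates, key=lambda name: len(candidates[name]))
```

The prediction only pays off if it is much cheaper than running a loop like this on large data.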

by gwern on 12/8/2019, 7:31 PM

Isn't one of the benefits of time-series databases more compact storage at minimal overhead?

by WoodenChair on 12/9/2019, 2:45 AM

We did a fun experiment in Chapter 5 of Classic Computer Science Problems in Python: we ran a genetic algorithm to find which ordering of a list of names (assuming the order doesn't matter) led to the best compression using one of the Python standard library's built-in compression algorithms. This article has made me interested in re-running the experiment with multiple compression algorithms.
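
A minimal sketch of the same idea (a plain hill climber rather than the book's genetic algorithm, with a made-up list of names and zlib as the built-in codec):

```python
import random
import zlib

names = ["Michael", "Sarah", "Joshua", "Narine", "David", "Sajid",
         "Melanie", "Daniel", "Wei", "Dean", "Brian", "Murat", "Lisa"]

def compressed_size(order):
    return len(zlib.compress(", ".join(order).encode()))

# Repeatedly swap two names and keep the swap only if it doesn't hurt.
order = names[:]
best = compressed_size(order)
for _ in range(20_000):
    i, j = random.sample(range(len(order)), 2)
    order[i], order[j] = order[j], order[i]
    size = compressed_size(order)
    if size <= best:
        best = size
    else:
        order[i], order[j] = order[j], order[i]  # revert the swap

print(best, order)
```

Swapping in a different codec per run (bz2, lzma, ...) is a one-line change, which is what makes the multi-algorithm version tempting.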

by d_burfoot on 12/9/2019, 5:44 AM

Why not use ML to build a statistical model of your data, and then use that model to compress it? You will likely get better codelengths for your specific dataset than off-the-shelf algorithms can achieve. And if you figure out a good model that describes the fluctuations in the cryptocurrency prices very well, then you can use that model pretty directly to make money via automated trading.

I've got an easy-to-use library for arithmetic encoding in Java; it would be easy to port to other languages: https://github.com/comperical/MirrorEncode
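
To make the codelength point concrete: the number of bits an arithmetic coder needs is essentially the negative log-probability your model assigns to the data, so a better model directly means a shorter output. A toy Python sketch (made-up quantized price moves; a real ML model's predictive probabilities would slot in where the empirical counts are):

```python
import math
from collections import Counter

# Hypothetical quantized price moves (ticks); any integer symbol stream works.
symbols = [0, 1, 0, -1, 0, 0, 2, -1, 0, 1, 0, 0, -2, 0, 1] * 100

def ideal_codelength_bits(data, probs):
    """Bits an ideal arithmetic coder would need under the model `probs`."""
    return sum(-math.log2(probs[s]) for s in data)

# Model A: uniform over the observed alphabet (a weak model).
alphabet = set(symbols)
uniform = {s: 1 / len(alphabet) for s in alphabet}

# Model B: empirical frequencies (a better fit, hence fewer bits).
counts = Counter(symbols)
empirical = {s: c / len(symbols) for s, c in counts.items()}

print("uniform model:  ", round(ideal_codelength_bits(symbols, uniform)), "bits")
print("empirical model:", round(ideal_codelength_bits(symbols, empirical)), "bits")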

by js8 on 12/8/2019, 7:55 PM

How does it compare to context-mixing algorithms such as ZPAQ (https://en.wikipedia.org/wiki/ZPAQ)?

by proverbialbunny on 12/8/2019, 8:33 PM

Is it possible to run data through a DNN, making sure the output is the same as the input (or at least close to it), i.e. an autoencoder variant, record the 'compressed' middle representation, and then on the other end use the other half of the DNN to decompress it?

I'm pretty ignorant on the topic, so I get that this may be off, but if so, why wouldn't that be a valid solution to compressing data?
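
Concretely, what I'm imagining is something like this rough Keras sketch (library choice, layer sizes, and the random stand-in data are arbitrary); as I understand it, the catch for lossless use is that the reconstruction is only approximate, so the residual would still need to be entropy-coded on top of this:

```python
import numpy as np
import tensorflow as tf

dim, bottleneck = 64, 8

encoder = tf.keras.Sequential([
    tf.keras.Input(shape=(dim,)),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(bottleneck),
])
decoder = tf.keras.Sequential([
    tf.keras.Input(shape=(bottleneck,)),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(dim),
])
autoencoder = tf.keras.Sequential([encoder, decoder])
autoencoder.compile(optimizer="adam", loss="mse")

x = np.random.rand(10_000, dim).astype("float32")  # stand-in data
autoencoder.fit(x, x, epochs=5, batch_size=256, verbose=0)

compressed = encoder.predict(x)              # the small 'compressed' representation
reconstructed = decoder.predict(compressed)  # the other half "decompresses" it
```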

by numbol on 12/9/2019, 4:54 PM

Funny, it deeply resembles the last episodes of "Silicon Valley".

More seriously, a generalized compression algorithm can be seen as one of the keys to, or even one of the definitions of, general artificial intelligence.

And Kolmogorov complexity got a subtle reference in the very last episode ...

by mebr on 12/9/2019, 6:33 PM

The wording could be improved so that it's quicker to understand what is being offered.

This uses machine learning to predict, for given heterogeneous tabular data, which compression algorithm will yield the best compression ratio.

There are multiple other ways to do this approximation.
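
For example, here is a rough sketch of that prediction setup (not the submitted library's actual code; the features, codecs, and the hypothetical `training_frames` collection are all placeholders):

```python
import bz2
import gzip
import lzma

import pandas as pd
from sklearn.ensemble import RandomForestClassifier

CODECS = {"gzip": gzip.compress, "bz2": bz2.compress, "lzma": lzma.compress}

def features(df: pd.DataFrame) -> list:
    """Cheap summary features of a DataFrame."""
    return [len(df), df.shape[1], df.select_dtypes("number").shape[1],
            float(df.nunique().mean())]

def label(df: pd.DataFrame) -> str:
    """Ground truth: whichever codec actually compresses the csv best."""
    raw = df.to_csv(index=False).encode()
    return min(CODECS, key=lambda name: len(CODECS[name](raw)))

def train(training_frames):
    """Fit a random forest that predicts the winning codec from the features."""
    X = [features(df) for df in training_frames]
    y = [label(df) for df in training_frames]
    return RandomForestClassifier(n_estimators=100).fit(X, y)
```

At prediction time you only compute the cheap features instead of running every codec, which is where the approach could save time on large data.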

by mattnewport on 12/8/2019, 10:21 PM

From the data shown it's not obvious that some of the options would ever be best; it would be interesting to see whether any of the non-csv options ever beat the csv options, and for what type of data.