Using Parquet's Bloom Filters

by pauldixon on 5/28/2024, 7:28 PM with 7 comments

by appplication on 5/29/2024, 1:22 PM

One thing I have wondered: would it make sense to reduce file size? The advice I've generally seen is to keep files to around 250 MB to 1 GB, but if you're leaning heavily on bloom filters, it feels like it could make sense to reduce the number of files to reduce the amount that would trigger the per-file filter.

by darkflame91 on 5/29/2024, 5:30 AM

With large datasets, wouldn't partitioning the data on low cardinality columns give the same benefit without the space overhead?
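
For a rough sense of the space overhead being asked about, here is a back-of-the-envelope sketch using the textbook Bloom filter sizing formula (bits per value = -ln(p) / (ln 2)^2 at the optimal number of hash functions). Parquet uses a split-block variant, so actual on-disk sizes will differ somewhat; the function names and the `ndv` (number of distinct values) and `fpp` (false positive probability) parameters are just names chosen for this illustration.

```python
import math


def bloom_bits_per_value(fpp: float) -> float:
    """Bits needed per distinct value for a classic Bloom filter
    with the optimal number of hash functions."""
    return -math.log(fpp) / (math.log(2) ** 2)


def bloom_size_bytes(ndv: int, fpp: float) -> int:
    """Approximate filter size in bytes for `ndv` distinct values."""
    return math.ceil(ndv * bloom_bits_per_value(fpp) / 8)


if __name__ == "__main__":
    # Assumed example: one million distinct values in a column.
    for fpp in (0.05, 0.01, 0.001):
        size = bloom_size_bytes(1_000_000, fpp)
        print(
            f"fpp={fpp}: {bloom_bits_per_value(fpp):.2f} bits/value, "
            f"~{size / 1024:.0f} KiB per 1M distinct values"
        )
```

The takeaway is that the overhead scales with the number of distinct values rather than with row count, which is why filters tend to be discussed for high-cardinality columns, where partitioning would produce too many partitions to be practical.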