tl;dr: we built an embeddable stream processing engine in Rust using Apache DataFusion. Check us out at https://github.com/probably-nothing-labs/denormalized
Hey HN,
We’d like to showcase a very early version of our embeddable stream processing engine called Denormalized. The rise of DuckDB has made it abundantly clear that even for many terabyte-scale workloads, a single-node system outshines the distributed query engines of the previous generation, such as Spark and Snowflake, in both performance and cost.
Many of the workloads DuckDB is used for today were considered “big data” in the previous generation, but no more. This is even more true of streaming. A streaming system is designed to incrementally process large amounts of data over a period of time. Even at the upper end of scale, production stream processing use cases rarely perform compute on more than tens of gigabytes of data at a given time.
Even so, standard stream processing solutions such as Flink require spinning up a distributed JVM cluster to compute against even the simplest of event streams. To that end, we’re building Denormalized, which is designed to be embeddable in your applications and to scale up to hundreds of thousands of events per second with a Flink-like dataflow API. While we currently only support Rust, we plan to add Python and TypeScript bindings soon.
We’re built atop the DataFusion and Arrow ecosystems and currently support streaming joins as well as windowed aggregations on Kafka topics.
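For readers new to the term, here is a rough sketch of what a windowed aggregation does, written in plain Rust with only the standard library. It is not Denormalized's API: `Event`, `Agg`, `TumblingWindow`, and the watermark-based `flush` are hypothetical names chosen for illustration, and a real engine would operate on Arrow record batches rather than individual structs.

```rust
use std::collections::BTreeMap;

/// Hypothetical event: a timestamp in milliseconds, a grouping key, and a reading.
#[derive(Debug, Clone, PartialEq)]
struct Event {
    ts_ms: u64,
    key: String,
    value: f64,
}

/// Running aggregate for one (window start, key) pair.
#[derive(Debug, Clone, PartialEq)]
struct Agg {
    count: u64,
    sum: f64,
    max: f64,
}

/// Assigns events to fixed-size, non-overlapping (tumbling) windows
/// and keeps one running aggregate per window and key.
struct TumblingWindow {
    size_ms: u64,
    state: BTreeMap<(u64, String), Agg>,
}

impl TumblingWindow {
    fn new(size_ms: u64) -> Self {
        assert!(size_ms > 0, "window size must be positive");
        TumblingWindow { size_ms, state: BTreeMap::new() }
    }

    /// Incrementally folds one event into the aggregate of its window.
    fn push(&mut self, ev: &Event) {
        // Round the timestamp down to the start of its window.
        let start = ev.ts_ms - ev.ts_ms % self.size_ms;
        let agg = self
            .state
            .entry((start, ev.key.clone()))
            .or_insert(Agg { count: 0, sum: 0.0, max: f64::NEG_INFINITY });
        agg.count += 1;
        agg.sum += ev.value;
        agg.max = agg.max.max(ev.value);
    }

    /// Emits and removes every window whose end is at or before the watermark.
    fn flush(&mut self, watermark_ms: u64) -> Vec<(u64, String, Agg)> {
        let mut closed = Vec::new();
        for (start, key) in self.state.keys() {
            if *start + self.size_ms <= watermark_ms {
                closed.push((*start, key.clone()));
            }
        }
        let mut out = Vec::new();
        for k in closed {
            if let Some(agg) = self.state.remove(&k) {
                out.push((k.0, k.1, agg));
            }
        }
        out
    }
}
```

A caller would `push` each event as it arrives from the stream and call `flush` with the current watermark to receive the finished windows. Only the aggregates for still-open windows are held in memory, which is why this style of processing can keep up with a continuous stream on one machine.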
Please check out our repo at: https://github.com/probably-nothing-labs/denormalized
We’d love to hear your feedback.
Other founder here -- we've been working on this for several months now and have had a lot of fun building on top of Arrow and DataFusion.
Are you going to support OLAP use cases as well? I haven't yet found a really nice hybrid batch/streaming query engine with dataframe support.
Ideally, you'd support an API similar to Polars (which I have found to be the nicest thus far).
It'd also be important/useful to support Python UDFs (think numpy/jax/etc.).
It'd be very cool if you could collaborate with or even tap into the Polars frontend. If you could execute Polars logical plans but with a streaming source, that would be huge.
I'd be curious to know what your thoughts on differential/timely dataflow are. Superficially it seems that it might be possible to integrate the existing Rust infrastructure from those libraries with DataFusion and Arrow, which could give you quite a few operators for free, and provide your users with the very nice incremental query/streaming-as-view-maintenance model.
Neat! Founder of https://tonbo.io/ here. I am excited to see someone bring stream processing to DataFusion. We are working on an Arrow-native embedded DB and plan to support DataFusion in the next release. We’re interested in building the streaming feature on Denormalized.
Interesting. What use cases are you guys targeting with this?
Congratulations on launching your project! We spoke back in March at a Kafka Summit London social meetup and talked all things Python and Kafka (I work on https://github.com/quixio/quix-streams). Always great to see a new stream processing project tackle a new segment.
For someone not deep in the topic, what is a "Streaming Processing Engine"?
All the descriptions of Denormalized use the term, so if you don't know it, it's kind of impossible to understand what Denormalized is or what it is trying to solve.
This looks totally awesome! Easy to set up, memory-efficient, streaming, real-time data aggregation, compilable to a single self-contained binary: that is a dream come true.
Bookmarked for future projects!
Will be excited to see the TypeScript bindings once they're out. We may be able to use this to handle some of our workloads at Embra.
Will reach out! Congrats on the ship.
What differentiates you from, e.g., Arroyo and Fluvio?
Can't wait for the Python SDK!
Do you have plans to make the data sources pluggable instead of being Kafka-specific?
Nice! How feature-complete is this compared with current industry standards like Flink?
Looks cool! I’ll try it out for my ambitious project :)
This looks super interesting. I built https://github.com/finos/perspective in a past life but have been out of the streaming analytics game for some time. Nice to see single-machine efficiency be a focus. I will give this a try and post feedback on GitHub.