This looks awesome. One of my biggest gripes personally with Iceberg (less so Delta Lake, but similar) is how difficult it is to just try out on a laptop. Delta Lake has vanilla Python implementations, but those are fragmented and buggy IME. Iceberg has just never worked locally; you need a JVM cluster and a ton of setup. I went down a similar road of trying to use sqlite/postgres + duckdb + parquet files in blob storage, but it was a lot of work.
It seems like this will just work out of the box, and just scale up to very reasonable data sizes. And the work from the DuckDB folks is typically excellent. It's clear they understand this space. Excited to try it out!
They make a really good criticism of Iceberg: if we have a database anyway, why are we bothering to store metadata in files?
I don’t think DuckLake itself will succeed in getting adopted beyond DuckDB, but I would not be surprised if over time the catalog just absorbs the metadata, and the original Iceberg format fades into history as a transitional form.
Hopefully this clarifies the value proposition for others:
Existing Lakehouse systems like Iceberg store crucial table information (like schema and file lists) as many small "metadata files" in cloud object storage (like S3). Accessing these files requires numerous network calls, making operations like query planning and updating tables inefficient and prone to conflicts. DuckLake solves this by putting all that metadata into a fast, transactional SQL database, using a single query to get what's needed, which is much quicker and more reliable.
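As a rough sketch of what that looks like in practice (this assumes the ducklake extension and the ATTACH syntax from the announcement, which may differ from what finally ships):

    # Sketch only: the extension name, ATTACH string, and DATA_PATH option
    # are taken from the announcement and may change.
    import duckdb

    con = duckdb.connect()
    con.sql("INSTALL ducklake")
    con.sql("LOAD ducklake")
    # Table metadata lives in one transactional database (a local DuckDB file
    # here, but it could be Postgres/MySQL/SQLite); only data files go to
    # blob storage or, in this sketch, a local directory.
    con.sql("ATTACH 'ducklake:metadata.ducklake' AS lake (DATA_PATH 'lake_files/')")
    con.sql("CREATE TABLE lake.events (ts TIMESTAMP, user_id BIGINT, payload VARCHAR)")
    con.sql("SELECT count(*) FROM lake.events").show()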
And a manifesto: https://ducklake.select/manifesto/
Iceberg competitor, addressing some of its shortcomings, like blown-up metadata:
https://quesma.com/blog-detail/apache-iceberg-practical-limi...
Even Snowflake was using FoundationDB for metadata, whereas Iceberg attempts to use blob storage even for the metadata layer.
I'm building a poor man's data lake at work, basically putting parquet files in blob storage using deltalake-rs's Python bindings and DuckDB for querying.
However, I constantly run into problems with concurrent writes. I have a cloud function triggered every x minutes to pull data from an API, and that's fine.
But if I need to run a backfill, I risk that process running at the same time as the timer-triggered function, especially if I load my backfill queue with hundreds of runs that need to be pulled and they start saturating the workers in the cloud function.
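One workaround I'm considering is a retry-with-backoff wrapper around the write. A rough sketch using the deltalake Python package (the exact commit-conflict exception varies by version, so this catches broadly, and the names are made up):

    # Hypothetical retry wrapper; the specific conflict exception depends on
    # the deltalake version, so this catches broadly for illustration.
    import random
    import time

    import pandas as pd
    from deltalake import write_deltalake

    def append_with_retry(table_uri: str, df: pd.DataFrame, attempts: int = 5) -> None:
        for attempt in range(attempts):
            try:
                write_deltalake(table_uri, df, mode="append")
                return
            except Exception:  # ideally catch only the commit-conflict error
                if attempt == attempts - 1:
                    raise
                # Exponential backoff with jitter so the timer-triggered
                # function and the backfill stop colliding on every commit.
                time.sleep((2 ** attempt) + random.random())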
We've come full circle. If you want to build a database, then you need to build it like a database. Thank you DuckDB folks!
I wonder how this relates to MotherDuck (https://motherduck.com/)? They do "DuckDB-powered data warehousing" but predate this substantially.
Announcement post https://duckdb.org/2025/05/27/ducklake.html (https://news.ycombinator.com/item?id=44106691)
There is a lot to like here, but once metadata is in the novel Ducklake format, it is hard to picture how you can get good query parallelism, which you need for large datasets. Iceberg already is well supported by lots of heavy-duty query engines and that support is important once you have lots and lots and lots of data.
My understanding was that MotherDuck was focusing on providing the "multiplayer mode" for DuckDB. It's interesting to see DuckDB Labs supporting data lakes natively. I guess MotherDuck is potentially moving to the UI layer by providing the notebook interface for DuckDB.
How do I integrate DuckLake with Apache Spark? Is it a format or a catalog?
Same question for presto, trino, dremio, snowflake, bigquery, etc.
I am a huge fan of what they are doing, particularly putting local compute front and center. However, for "BigCorp" it's going to be an uphill battle. The incumbents are entrenched, and many decision makers will make decisions based on non-technical reasons (i.e., did my sales exec get me to the F1 Grand Prix).
This looks very cool!
One thing I noticed is that the `table_stats` and `column_stats` tables aren't snapshot versioned. What are these used for and isn't that going to hurt timetravel queries (`SELECT COUNT(*) FROM tbl WHERE snapshot_id=<old_id>` as a simple example)?
It looks very promising, especially knowing the DuckDB team is behind it. However, I really don't understand how to insert data into it. Are we supposed to use a DuckDB INSERT statement combined with a function that reads external files, or some other mechanism? Looks very cool though.
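My guess (untested) is that you write through DuckDB like any other attached catalog, with read_parquet or read_csv doing the file ingestion. Something along these lines, where the ATTACH syntax, catalog name, and file names are all assumptions:

    # Sketch only: catalog name, file names, and ATTACH syntax are assumptions.
    import duckdb

    con = duckdb.connect()
    con.sql("INSTALL ducklake")
    con.sql("LOAD ducklake")
    con.sql("ATTACH 'ducklake:metadata.ducklake' AS lake")
    # First load: create the table straight from an external file.
    con.sql("CREATE TABLE lake.events AS SELECT * FROM read_parquet('events.parquet')")
    # Subsequent loads: plain INSERT ... SELECT appends new data files.
    con.sql("INSERT INTO lake.events SELECT * FROM read_parquet('more_events.parquet')")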
The manifesto [1] is the most interesting thing. I agree that DuckDB has the largest potential to disrupt the current order with Iceberg.
However, this mostly reads to me as a thought experiment: what if the backend service of an Iceberg catalog were just a SQL database?
The manifesto says that maintaining a data lake catalog is easier, which I agree with in theory. s3-files-as-information-schema presents real challenges!
But, what I most want to know is what's the end-user benefit?
What does someone get with this if they're already using Apache Polaris or Lakekeeper as their Iceberg REST catalog?
I love DuckDB and this looks just absolutely brilliant!
One question for me: let's say I want to start using this today, and at work we are running Snowflake. I get that each analytics person would run DuckDB + this extension on their local machine and point it at the blob store and the database that is acting as the DuckLake catalog; for now that would be, say, a VM running DuckDB. When I run the actual query, where does the computation happen? And what if I want a lot of computation?
Is the solution currently to host a huge DuckDB VM that everyone SSHes into to run their queries, or how does that part work?
Using SQL as a catalog is not new (Iceberg has supported a JDBC catalog from the very beginning).
The main difference is storing the metadata and stats directly in SQL databases as well, which makes perfect sense for smaller-scale data. In fact, we were doing something similar in https://github.com/Mooncake-Labs/pg_mooncake: metadata is stored in pg tables and only periodically flushed to actual formats like Iceberg.
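For reference, here is roughly what the existing SQL-backed Iceberg catalog looks like from pyiceberg; the point being that only the catalog pointers live in SQL, while Iceberg still writes its metadata and manifest files out to storage (paths and names below are made up):

    # Sketch of Iceberg's existing SQL-backed catalog via pyiceberg.
    from pyiceberg.catalog.sql import SqlCatalog

    catalog = SqlCatalog(
        "local",
        uri="sqlite:///iceberg_catalog.db",  # catalog pointers in a SQL database
        warehouse="file:///tmp/warehouse",   # metadata + data files still land here
    )
    catalog.create_namespace("analytics")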
How will DuckLake work with other Iceberg clients - like Python (Polars), Spark, Flink, etc?
Do you need to put a REST API in front of this DuckDB instance to make it an Iceberg catalog?
Is there any information about updates to existing rows? The FAQ says "Similarly to other data lakehouse technologies, DuckLake does not support constraints, keys, or indexes."
However, in Iceberg there are Copy-on-Write and Merge-on-Read strategies for dealing with updates.
"DuckLake is also able to improve the two biggest performance problems of data lakes: small changes and many concurrent changes."
These I'd argue are not the natural use cases for a data lake, especially a design which uses multiple writers to a given table.
Not exactly sure what it's for? Is it to stream your data to Parquet files on (e.g.) S3 and keep the exact schema at each point in time somewhere? Or is it something else?
Would be nice to have some tutorials/use cases in the docs :)
Absolutely love DuckDB. We use it a ton in our product.
You can check out our sandbox at https://yeet.cx/play
Strangely I can't get to this domain. We have ZScaler at work with DGA Blocking enabled and it prevents me from loading the page.
What is a data lake?
Great idea, poor naming. If you’re aiming for a standard of sorts, tying it to a specific software by reusing its name feels counter productive.
“Ducklake DuckDB extension” really rolls off the tongue /s.
I didn't delve into the product, but how is it really different from a Hive catalog searching Parquet or CSV files? What's the improvement?
I have a personal pet peeve about Parquet that is solved, incompatibly, by basically every "data lake / lakehouse" layer on top, and I'd love to see it become compatible: ranged partitioning.
I have an application which ought to be a near-perfect match for Parquet. I have a source of timestamped data (basically a time series, except that the intervals might not be evenly spaced -- think log files). A row is a timestamp and a bunch of other columns, and all the columns have data types that Parquet handles just fine [0]. The data accumulates, and it's written out in batches, and the batches all have civilized sizes. The data is naturally partitioned on some partition column, and there is only one writer for each value of the partition column. So far, so good -- the operation of writing a batch is a single file creation or create call to any object store. The partition column maps to the de-facto sort-of-standard Hive partitioning scheme.
Except that the data is (obviously) also partitioned on the timestamp -- each batch covers a non-overlapping range of timestamps. And Hive partitioning can't represent this. So none of the otherwise excellent query tools can naturally import the data unless I engage in a gross hack:
I could also partition on a silly column like "date". This involves aligning batches to date boundaries and also makes queries uglier.
I could just write the files and import *.parquet. This kills performance and costs lots of money.
I could use Iceberg or Delta Lake or whatever for the sole benefit that their client tools can handle ranged partitions. Gee thanks. I don't actually need any of the other complexity.
It would IMO be really really nice if everyone could come up with a directory-name or filename scheme for ranged partitioning.
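Even a strawman would go a long way, e.g. encoding each batch's min/max timestamp in the file name and pruning by name before the query engine ever opens a file (the naming scheme and the ts column below are made up for illustration):

    # Strawman only: the naming scheme, directory layout, and column name
    # ("ts") are all invented for illustration.
    from datetime import datetime
    from pathlib import Path

    import duckdb

    FMT = "%Y%m%dT%H%M%S"  # e.g. events_20240101T000000_20240131T235959.parquet

    def files_overlapping(root: str, start: datetime, end: datetime) -> list[str]:
        """Prune batches by the [min_ts, max_ts] range encoded in the file name."""
        keep = []
        for p in Path(root).glob("events_*_*.parquet"):
            _, lo, hi = p.stem.split("_")
            if datetime.strptime(lo, FMT) < end and datetime.strptime(hi, FMT) >= start:
                keep.append(str(p))
        return keep

    files = files_overlapping("data/", datetime(2024, 1, 1), datetime(2024, 2, 1))
    if files:
        # Only the surviving files are handed to the engine for the final filter.
        duckdb.read_parquet(files).filter("ts >= TIMESTAMP '2024-01-15'").show()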
[0] My other peeve is that a Parquet row and an Arrow row and a Thrift message and a protobuf message, etc., are almost but not quite the same thing. It would be awesome if there were a companion binary format for a single Parquet row or a stream of rows so that tools could cooperate more easily on producing the data that eventually gets written into Parquet files.