On the data warehousing side, I think the story looks like this:
1) Cloud data warehouses like Redshift, Snowflake, and BigQuery proved to be quite good at handling very large datasets (petabytes) with very fast querying.
2) Customers of these proprietary solutions didn't want to be locked in. So many are drifting toward Iceberg tables on top of Parquet (columnar) data files.
Another "hidden" motive here is that Cloud object stores give you regional (multi-zonal) redundancy without having to pay extra inter-zonal fees. An OLTP database would likely have to pay this cost, as it likely won't be based purely on object stores - it'll need a fast durable medium (disk), if at least for the WAL or the hot pages. So here we see the topology of Cloud object stores being another reason forcing the split between OLTP and OLAP.
But what does this new world of open OLTP/OLAP technologies look like? Pretty complicated.
1) You'd probably run Postgres as your OLTP DB, as it's the default these days and scales quite well.
2) You'd set up an Iceberg/Parquet system for OLAP, probably on Cloud object stores.
3) Now you need to stream the changes from Postgres to Iceberg/Parquet. The canonical OSS way to do this is to set up a Kafka cluster with Kafka Connect: the Debezium CDC connector for Postgres pulls deltas, and the Iceberg sink connector writes them out to Iceberg/Parquet (roughly as sketched below). This incurs extra compute, memory, network, and disk.
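To make the moving parts concrete, here's a rough sketch of what registering that pipeline against Kafka Connect's REST API might look like. It is illustrative only: the endpoint, credentials, topic/table names are placeholders, and the Iceberg sink's connector class and config keys in particular vary by version, so check the Debezium and Iceberg sink docs for whatever you actually deploy.

```python
# Sketch: register a Debezium Postgres source and an Iceberg sink with Kafka Connect.
# Endpoints, credentials, and table names are placeholders; the Iceberg sink config
# keys are illustrative and should be verified against the connector docs.
import json
import requests

CONNECT_URL = "http://kafka-connect:8083"  # placeholder Kafka Connect REST endpoint

debezium_source = {
    "name": "orders-cdc-source",
    "config": {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "plugin.name": "pgoutput",              # Postgres' built-in logical decoding
        "database.hostname": "postgres.internal",
        "database.port": "5432",
        "database.user": "cdc_user",
        "database.password": "********",
        "database.dbname": "app",
        "topic.prefix": "app",                  # topics become app.public.orders, etc.
        "table.include.list": "public.orders",
    },
}

iceberg_sink = {
    "name": "orders-iceberg-sink",
    "config": {
        # Illustrative only: exact class/keys depend on which Iceberg sink you run.
        "connector.class": "org.apache.iceberg.connect.IcebergSinkConnector",
        "topics": "app.public.orders",
        "iceberg.tables": "analytics.orders",
        "iceberg.catalog.type": "rest",
        "iceberg.catalog.uri": "http://iceberg-rest:8181",
    },
}

for connector in (debezium_source, iceberg_sink):
    resp = requests.post(
        f"{CONNECT_URL}/connectors",
        headers={"Content-Type": "application/json"},
        data=json.dumps(connector),
        timeout=30,
    )
    resp.raise_for_status()
    print(f"registered {connector['name']}")
```

And that's before you operate the Kafka brokers, the Connect workers, and the schema drift between them.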
There are so many moving parts here. The ideal is likely a direct Postgres->Iceberg write flow built into Postgres. The pg_mooncake extension this company is offering also adds DuckDB-based querying, but that's likely unnecessary if you plan to use Iceberg-compatible query engines anyway.
Ideally, you'd have one plugin that purely streams Postgres writes to Iceberg with some defined lag. That would cut out the third bullet above.
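As a toy sketch of what that "one plugin" could look like from the outside today: consume a logical replication slot (assumed here to be decoded with wal2json) and append micro-batches to an Iceberg table with pyiceberg. Slot names, table names, and the schema mapping are hypothetical; a real extension would do this inside Postgres and handle updates/deletes, schema evolution, and exactly-once concerns.

```python
# Toy Postgres -> Iceberg streamer: consume a logical replication slot and append
# micro-batches to an Iceberg table. Connection strings, slot/table names, and the
# schema mapping are hypothetical; only inserts are handled, and the "defined lag"
# is just a batch-size threshold.
import json

import psycopg2
import psycopg2.extras
import pyarrow as pa
from pyiceberg.catalog import load_catalog

table = load_catalog("lake").load_table("analytics.orders")  # catalog config assumed external

conn = psycopg2.connect(
    "dbname=app user=cdc_user",
    connection_factory=psycopg2.extras.LogicalReplicationConnection,
)
cur = conn.cursor()
# The slot is assumed to already exist and to use the wal2json output plugin.
cur.start_replication(slot_name="iceberg_slot", decode=True)

buffer = []

def flush():
    """Write the buffered rows to Iceberg as one Arrow batch (one snapshot)."""
    global buffer
    if buffer:
        table.append(pa.Table.from_pylist(buffer))
        buffer = []

def consume(msg):
    payload = json.loads(msg.payload)                # wal2json v1-style payload
    for change in payload.get("change", []):
        if change["kind"] == "insert":
            buffer.append(dict(zip(change["columnnames"], change["columnvalues"])))
    if len(buffer) >= 1000:                          # the "defined lag", crudely
        flush()
    msg.cursor.send_feedback(flush_lsn=msg.data_start)

cur.consume_stream(consume)
```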
Terrible scrolling aside:
> pg_mooncake is a PostgreSQL extension adding columnstore tables with DuckDB execution for 1000x faster analytics. Columnstore tables are stored as Iceberg or Delta Lake tables in your Object Store. Maintained by Mooncake Labs, it is available on Neon Postgres.
Seems to summarise the reason this article exists.
Not that I really disagree with the premise or conclusion of the article itself.
I'm skeptical of this. The cost of maintaining the "disaggregated data stack" can be immense at scale. A database that can handle replication from a row-based transactional store to, for example, a columnar one that can support aggregations could really reduce the load on engineering teams.
My work involves a "disaggregated data stack" and a ton of work goes into orchestrating all the streaming, handling drift, etc. between the transactional stores (HBase) and the various indexes like ES. For low-latency OLAP queries, the data lakes can't always meet the need either. I haven't gotten the chance to see an HTAP database in action at scale, but it sounds very promising.
> Back in the '70s, one relational database did everything. Transactions (OLTP) during the day and reports after hours (OLAP). Databases like Oracle V2 and IBM DB2 ran OLTP and OLAP on the same system; largely because data sets still fit on a few disks and compute was costly.
The timeline is a bit off - Oracle V2 was released in the second half of 1979, so although it technically came out at the very end of the 1970s, it isn't really representative of 1970s databases. Oracle V1 was never released commercially; it was an internal name used while the product was under development starting circa 1977 inside SDL (which renamed itself RSI in 1979, and then Oracle in 1983). Larry Ellison also wanted the first release to be version 2 because some people are hesitant to buy version 1 software. Oracle was named after a database project Ellison worked on for the CIA while employed at Ampex, although I'm not sure anyone can really know how much the abandoned CIA system had in common with Oracle V1/V2 - it definitely took some ideas from the CIA project, but I'm not sure it took any of the actual code.
The original DB2 for MVS (later OS/390 and now z/OS) was released in 1983. The first IBM RDBMS to ship as a generally available commercial product was SQL/DS in 1981 (for VM/CMS), which this century was renamed DB2 for VM/VSE. I believe DB2/400 (now renamed DB2 for IBM i) came out with the AS/400 and OS/400 in 1988, although possibly there was already some SQL support in S/38 in the preceding years. The DB2 most people would encounter nowadays, the Linux/AIX/Windows edition (DB2 LUW), is a descendant of OS/2 EE Database Manager, which I think came out in 1987. Anyway, my point: the various editions of DB2 all saw their initial releases in the 1980s, not the 1970s.
While relational technology was invented as a research concept in the 1970s (including the SQL query language, and several now largely forgotten competitors), in that decade its use was largely limited to research, along with a handful of commercial pilots. General commercial adoption of RDBMS technology didn't happen until the 1980s.
The most common database technologies in the 1970s were flat-file databases (such as ISAM and VSAM databases on IBM mainframes), hierarchical databases (such as IBM IMS), the CODASYL network model (e.g. IDS, IDMS), MUMPS (a key-value store with hierarchical keys), early versions of PICK, and inverted-list databases (ADABAS, Model 204, Datacom). I think many (or even all) of these were more popular in the 1970s than any RDBMS. The first release of dBase came out in 1978 (albeit then called Vulcan; it wasn't named dBase until 1980), but like Oracle it falls into the category of "technically released in the late 1970s but didn't become popular until the 1980s".
> Most workloads don't need distributed OLTP. Hardware got faster and cheaper. A single beefy machine can handle the majority of transactional workloads. Cursor and OpenAI are powered by a single-box Postgres instance. You'll be just fine.
I thought this was such an important point. Sooooo many dev hours were spent figuring out how to do distributed writes, and for a lot of companies that work was never needed.
Don't worry. All architectures get recycled eventually. Everything is new again.
One of the biggest problems with having more data is it's just hard to manage. That's why cloud data warehouses are here to stay. They enable the "utility computing" of cloud compute providers, but for data. I don't think architecture is a serious consideration for most people using it, other than the idea that "we can just throw everything at it".
NewSQL didn't thrive because it isn't sexy enough. A thing doesn't succeed because it's a "superior technology", it survives if it's overwhelmingly more appealing than existing solutions. None of the NewSQL solutions are sufficiently sexier than old boring stable databases. This is the problem with every new database. I mean, sure, they're fun for a romp in the sheets; but are they gonna support your kids? Interest drops off once everyone realizes it's not overwhelmingly better than the old stuff. Humans are trend-seekers, but they also seek familiarity and safety.
I think people need to realize that HTAP isn't a technology but a set of database features, while relational is the real database technology.
It seems people are now converging on this pseudo-math database solution, namely PostgreSQL with its battle-hardened object-relational technology, which is IMHO a local minimum [1].
The world needs a proper math-based universal solution for database technology, similar to relational. But this time around we need many more features - we want it all, including analytical, transactional, spreadsheet, graph, vector, signal, etc. On top of that we want a reliable distributed architecture. We simply cannot keep adding onto PostgreSQL indefinitely, because the complexity will be humongous and the solutions become sub-optimal [2].
We need a strong database foundation with a solid mathematical basis, not unlike the original relational database technology.
The best candidate available now is D4M by the fine folks at MIT, which has been implemented in MATLAB, Python, and Julia [3]. Perhaps someone needs to write a C++, Dlang, or Rust version of it for it to be widely adopted.
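For the curious, the core idea in D4M is the associative array: a sparse matrix whose rows and columns are keyed by strings, over which you get linear-algebra-style operations. Here is a toy, non-D4M sketch of that concept (the real libraries are far richer and handle distributed storage, Accumulo backends, etc.):

```python
# Toy illustration of the associative-array idea behind D4M: a sparse "matrix"
# keyed by strings, with element-wise addition and matrix multiplication.
# This is not the D4M API, just a sketch of the concept.
from collections import defaultdict

class Assoc:
    def __init__(self, triples=()):
        # data[row][col] = value, storing only nonzero entries
        self.data = defaultdict(dict)
        for row, col, val in triples:
            self.data[row][col] = val

    def __add__(self, other):
        out = Assoc()
        for a in (self, other):
            for r, cols in a.data.items():
                for c, v in cols.items():
                    out.data[r][c] = out.data[r].get(c, 0) + v
        return out

    def __matmul__(self, other):
        # (A @ B)[r][c] = sum over k of A[r][k] * B[k][c]
        out = Assoc()
        for r, cols in self.data.items():
            for k, v in cols.items():
                for c, w in other.data.get(k, {}).items():
                    out.data[r][c] = out.data[r].get(c, 0) + v * w
        return out

# Edges of a tiny graph as (row, col, value) triples; A @ A counts 2-hop paths.
A = Assoc([("alice", "bob", 1), ("bob", "carol", 1)])
print(dict((A @ A).data))   # {'alice': {'carol': 1}}
```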
It's funny that the article starts by mentioning that its inspiration was the popular "Big Data Is Dead" article, and in doing so prematurely dismisses the problem. The book on D4M, however, embraces the big data problem head-on by putting the exact terminology in the title [4].
[1] Whatâs the Difference Between MySQL and PostgreSQL?
https://aws.amazon.com/compare/the-difference-between-mysql-...
[2] Just Use Postgres!
https://www.manning.com/books/just-use-postgres
[3] D4M: Dynamic Distributed Dimensional Data Model:
[4] Mathematics of Big Data: Spreadsheets, Databases, Matrices, and Graphs (MIT Lincoln Laboratory Series):
https://mitpress.mit.edu/9780262038393/mathematics-of-big-da...
That is the worst smooth scrolling hijack I've ever seen, and the whole site breaks if you disable javascript.
I think the upcoming CedarDB is HTAP? https://cedardb.com/
Clickhouse performance for Postgres workloads?
From a modern startup's POV - fast pivots, fast feedback - it's fair to say HTAP is "dead." The market is sticky and slow-moving. But I'd argue that's precisely why it's still interesting: fewer teams can survive the long game, but the payoff can be disproportionate.
>Cursor is powered by a single-box Postgres instance
Why wouldn't it be? The resources needed to run the backend of Cursor come from the compute for the AI models. Updating someone's quota in a database every few minutes is not going to be causing issues.
The HTAP vision was essentially built on the traditional notion that a database is a single 'place' where both transactions happen and complex queries run.
Rich Hickey argued [0] that place-orientation is bad and that a database should actually just be an immutable value which can be passed around freely. That's fairly in line with the conclusions of the post, although I think much more simplification of the disaggregated stack is possible.
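To make that "value, not place" idea concrete, here is a toy sketch (purely illustrative, not Datomic's model or API): every transaction returns a new database value, and any older value can still be passed around and queried.

```python
# Toy sketch of "the database as an immutable value": each transaction returns a
# new snapshot; old snapshots remain valid and queryable. Illustrative only.
from types import MappingProxyType

def transact(db, facts):
    """Return a NEW database value with the facts applied; the old one is untouched."""
    return MappingProxyType({**db, **facts})

db_v1 = MappingProxyType({})                              # the empty database value
db_v2 = transact(db_v1, {("user:1", "name"): "Ada"})
db_v3 = transact(db_v2, {("user:1", "name"): "Ada Lovelace"})

# Readers can keep querying the old value; there is no "place" being mutated.
print(db_v2[("user:1", "name")])   # Ada
print(db_v3[("user:1", "name")])   # Ada Lovelace
```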
[0] https://www.infoq.com/presentations/Deconstructing-Database/
> Cursor and OpenAI are powered by a single-box Postgres instance. You'll be just fine.
Well no, not according to your own source:
> This setup consists of one primary database and dozens of replicas.
Are they just fine? There have been several instances in the past where issues related to PostgreSQL have led to outages of ChatGPT.
OK, but let's pretend it's acceptable to have outages. Is it fine apart from that?
> However, "write requests" have become a major bottleneck. OpenAI has implemented numerous optimizations in this area, such as offloading write loads wherever possible and avoiding the addition of new services to the primary database.
I feel that! I've been part of projects where we finished building a feature but didn't let customers have it because it affected the write path and broke other features. It's been less than a week since someone in the company posted in Slack: "we tried scaling up the db (Azure MSSQL) but it didn't fix the performance issues."
https://learn.microsoft.com/en-us/sql/relational-databases/i...
HTAP in SQL Server, for reference.
The second-to-last line is the summary: "The HTAP challenge of our time comes down to making the lakehouse real-time ready."
We are building this platform as well. There are two aspects to it: the "enterprise way" and the "greenfield way". The greenfield way will win out in 10-15 years, but unless you have capital to last that long, as a startup you need to go the enterprise way first, until you are big enough to go the unified HTAP-style way. The lakehouse - open columnar data - is here to stay. It needs a better connection to OLTP than Kafka, but getting from A to B will take time.
I would say compute and storage separation is the way to go, especially for hyperscaler offerings à la Aurora/Cosmos/AlloyDB. Later, more open-source alternatives will catch up.
One thing no one seems to notice is the rise of "operational warehouses" such as RisingWave or Materialize. A big "problem" in OLAP, as the article mentions, is that people expect aggregations or analytic views on live data. These solutions solve it. In principle, this shows that just having incrementally maintained materialised views really goes a long way towards achieving the HTAP dream on a single DB (a toy illustration of the idea follows).
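A minimal sketch of what "incrementally maintained" means, under the assumption of a CDC-style change feed - not RisingWave or Materialize code, just the concept: the aggregate is updated from each change event instead of being recomputed from scratch.

```python
# Toy incremental view maintenance: keep revenue-per-customer up to date by
# applying deltas from a change stream, instead of re-running the aggregation.
# Purely conceptual; real systems do this with SQL plans and far more rigor.
from collections import defaultdict

revenue_by_customer = defaultdict(float)   # the "materialised view"

def apply_change(event):
    """Apply one CDC-style event to the maintained aggregate."""
    if event["op"] == "insert":
        revenue_by_customer[event["customer"]] += event["amount"]
    elif event["op"] == "delete":
        revenue_by_customer[event["customer"]] -= event["amount"]
    elif event["op"] == "update":
        revenue_by_customer[event["customer"]] += event["new_amount"] - event["old_amount"]

changes = [
    {"op": "insert", "customer": "acme", "amount": 100.0},
    {"op": "insert", "customer": "acme", "amount": 50.0},
    {"op": "update", "customer": "acme", "old_amount": 50.0, "new_amount": 75.0},
]
for event in changes:
    apply_change(event)

print(dict(revenue_by_customer))   # {'acme': 175.0}
```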
I've always been impressed by the architecture of the Hyperscale service tier of MSSQL in Azure. It is arguably a competitor in this area.
https://learn.microsoft.com/en-us/azure/azure-sql/database/h...
The article really messes up my browser, so I couldn't read it on my phone. But HTAP never made sense to me because, in my experience, it's very rare that you'd need analytics on a single database. It's often a confluence of multiple data sources: streams, databases, CSVs, vendor-provided data.
I have compiled the following table to compare OLTP and OLAP
I found the title amusing. This died.. right after inception.
Clearly, the objectives and limitations of OLAP and OLTP differ so much that merging the two domains is a fantasy.
It's like asking two people to view through the same lens.
My takeaway about all this is that nobody really cares much about consistency or the cost to build and run lambda-like architectures.
I stopped reading early, when the article said that in the 1970s one big relational database did everything.
In fact, relational databases did nothing in the 1970s. They didn't even exist yet in commercial form.
My first prediction as an analyst from 1982 onwards was that "index-based" DBMS would take over from linked-list DBMS and flat files. (That was meant to cover both inverted-list and relational systems; I expected inverted-list DBMS to outperform relational ones for longer than they did.)
Never heard of it. Maybe it's a good thing it's being considered dead... /s
You cannot say HTAP is dead when the alternative is so much complexity and so many moving parts. Most enterprises are burning huge amounts of resources literally just shuffling data around for zero business value.
The dream is a single data mesh presenting an SQL userland where I can write and join data from across the business with high throughput and low latency. With that, I can kill off basically every microservice that exists and work on stuff that matters at pace, instead of half of all projects being infrastructure churn. We are close, but we are not there yet, and I will be furious if people stop trying to reach this endgame.