DELETEs Are Difficult

by radimm on 11/25/2024, 6:46 AM with 115 comments

by jandrewrogers on 12/1/2024, 8:07 AM

DELETE is expensive at a deep, fundamental level that we don't think about much in computer science because we are more worried about losing data. The article is about Postgres, but it generalizes. We don't actually have any computer science for DELETE-optimized databases. I've idly looked into delete-optimization in databases as a thought experiment, since there isn't much literature on it, and it is far more difficult than many people intuit. The article is partly a manifestation of this reality.

I think the nature of DELETE is one of the more interesting open problems in computer science. It is one of those things that, when we require precision, turns out to be very difficult to define.

by xg15 on 12/1/2024, 9:44 AM

> For example, deleting 1 million rows in a single transaction is a textbook case of what not to do. Instead, splitting the operation into smaller batches, such as deleting 10,000 rows across 100 iterations, is far more effective.

Why do I as a user have to do that? Why can't the database implement batching internally and automatically transform my 1-million-rows query into an appropriate list of batched queries?

(Edit: Thanks a lot for the answers, that makes more sense - in particular the point that this would also lock one million rows at once)
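The client-side batching the quoted article recommends can be sketched roughly like this, here against SQLite via Python's `sqlite3` purely for illustration (the `events` table and batch size are hypothetical; in Postgres you would commit between batches the same way):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany("INSERT INTO events (payload) VALUES (?)",
                 [("x",)] * 25_000)
conn.commit()

BATCH = 10_000
deleted = 0
while True:
    # Delete one bounded batch per transaction so each commit holds
    # locks on at most BATCH rows, instead of one huge DELETE that
    # locks everything at once.
    cur = conn.execute(
        "DELETE FROM events WHERE id IN "
        "(SELECT id FROM events LIMIT ?)", (BATCH,))
    conn.commit()
    if cur.rowcount == 0:
        break
    deleted += cur.rowcount

print(deleted)  # 25000
```

This is also a partial answer to the question above: because each batch is its own transaction, the database cannot do this transformation for you transparently without changing the atomicity you asked for.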

by aurareturn on 11/25/2024, 6:50 AM

  DELETE FROM films;
I'm surprised databases make it so easy to just delete an entire table. I think the command should be

  DELETE FROM films YES-I-KNOW-WHAT-I-AM-DOING;

by Svip on 12/1/2024, 8:40 AM

> Unlike DELETEs, UPDATEs don’t trigger cascaded actions - they only involve triggers that are explicitly defined.

That's not entirely true. ON UPDATE CASCADE is a thing for foreign keys, at least in PostgreSQL (which this article is talking about), meaning that rows referencing the updated row get updated as well. Though, personally, I would never use ON UPDATE CASCADE, as it seems kind of funky.
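The cascade the comment describes can be seen in a few lines. This sketch uses SQLite through Python's `sqlite3` (which supports the same clause once foreign keys are switched on); the `films`/`screenings` tables are made up for the example:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite requires this opt-in
conn.execute("CREATE TABLE films (code TEXT PRIMARY KEY, title TEXT)")
conn.execute("""
    CREATE TABLE screenings (
        id        INTEGER PRIMARY KEY,
        film_code TEXT REFERENCES films(code) ON UPDATE CASCADE
    )""")
conn.execute("INSERT INTO films VALUES ('b-101', 'Brazil')")
conn.execute("INSERT INTO screenings (film_code) VALUES ('b-101')")

# Updating the referenced key cascades into the referencing row:
conn.execute("UPDATE films SET code = 'b-202' WHERE code = 'b-101'")
print(conn.execute("SELECT film_code FROM screenings").fetchone()[0])
# b-202
```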

by Terr_ on 11/25/2024, 8:50 AM

For many of the most painful deletion questions, the root problem is that when the software was first made the stakeholders/product-org didn't think about use-cases for deleting things. At best, they assume a "do not show" property can be placed onto things, which falls apart when you get to legal issues that compel actual removal.

by yen223 on 12/1/2024, 7:44 AM

A big asterisk that should be added to the article is that all that applies to Postgres.

Other databases have their own deletion mechanisms (and deletion quirks).

It's a very good article otherwise.

by physicsguy on 12/1/2024, 8:55 AM

They are difficult, but the cost of trying to keep consistency with the alternatives shouldn't be underestimated either.

I sometimes think that people ask the wrong question on this sort of thing - rather than thinking “what technical solution should I come up with” you should be thinking “what is the business requirement here, and what consistency guarantees are needed?”

In many cases you want to soft delete anyway and mark rows as stale rather than deleting them wholesale. Cascade deletes need to be very carefully thought about, as while they’re very handy they can be quite destructive if the relationships are not mapped out.

Personally having spent some time now in the microservices hole, I miss all the power SQL databases give you for this sort of thing. I think everyone should spend some time reading and digesting ‘Designing Data Intensive Applications’ and evaluating the trade offs in detail.

by junto on 12/1/2024, 7:55 PM

Reminds me of this old post about deleting large amounts of data efficiently at MySpace. The page is gone now, but it was archived:

https://web.archive.org/web/20090525233504/http://blogs.msdn...

Brent Ozar talked about this back in 2018 in reference to working with large tables in MSSQL Server:

https://www.brentozar.com/archive/2018/04/how-to-delete-just...

by pmarreck on 12/1/2024, 3:44 PM

My only recommendation would be, no matter which strategy you go with, cover it with tests to make sure the right information stays and the right information gets physically (or marked) deleted, and that data marked for deletion is invisible to the UI except via admin access.

But, indeed, proper deletion is surprisingly difficult, especially when you consider cascades on a complex table containing many defined foreign-key relationships.
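The kind of test coverage the comment recommends might look like this minimal sketch (the `notes` table, `deleted_at` column, and the "UI"/"admin" query split are all hypothetical stand-ins):

```python
import sqlite3

def visible_rows(conn):
    # "UI" view: soft-deleted rows must never appear here.
    return conn.execute(
        "SELECT id FROM notes WHERE deleted_at IS NULL").fetchall()

def admin_rows(conn):
    # Admin view: sees everything, including soft-deleted rows.
    return conn.execute("SELECT id FROM notes").fetchall()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE notes (id INTEGER PRIMARY KEY, deleted_at TEXT)")
conn.execute("INSERT INTO notes (id, deleted_at) VALUES (1, NULL)")
conn.execute("INSERT INTO notes (id, deleted_at) VALUES (2, '2024-12-01')")

# The checks worth keeping under test, per the comment above:
assert visible_rows(conn) == [(1,)]   # deleted row is hidden from the UI
assert len(admin_rows(conn)) == 2     # but still reachable by admins
```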

by tonymet on 12/1/2024, 6:40 PM

Why do DBs perform DELETE operations online? Wouldn't it be better to soft-delete (at the tablespace level) and then run a scheduled task to clean up the tablespaces?

Similar to git. When you "delete" files they are just removed from the tree. It isn't until later, once all refs and reflog references have expired and gc is run, that the objects are actually removed.

by orionblastar on 11/25/2024, 6:52 AM

Most databases I've used have a Status column we could mark as active, inactive, or deleted. That way, you can see which records were marked as deleted and change them back in case of accidental deletion.

Keep record retention with a Date_Modified column so you can use a SQL DELETE to remove deleted records that are older than a year or so.
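The status-column-plus-retention pattern described above can be sketched as follows, again with SQLite via Python's `sqlite3` for a runnable illustration (the `records` table, status values, and one-year window are assumptions from the comment, not a prescribed schema):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE records (
        id            INTEGER PRIMARY KEY,
        status        TEXT NOT NULL DEFAULT 'active',  -- active/inactive/deleted
        date_modified TEXT NOT NULL
    )""")
rows = [
    (1, 'active',  '2024-11-01'),
    (2, 'deleted', '2024-10-15'),  # recent soft delete, still recoverable
    (3, 'deleted', '2023-01-10'),  # past retention, eligible for purge
]
conn.executemany("INSERT INTO records VALUES (?, ?, ?)", rows)

# Accidental deletion? Just flip the status back:
conn.execute("UPDATE records SET status = 'active' WHERE id = 2")

# Retention job: hard-delete soft-deleted rows older than a year.
conn.execute(
    "DELETE FROM records WHERE status = 'deleted' "
    "AND date_modified < date('now', '-1 year')")
print(conn.execute("SELECT id FROM records ORDER BY id").fetchall())
# [(1,), (2,)]
```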

by cannibalXxx on 12/1/2024, 10:01 AM

This content reminds me of a post where you can get an idea of how to build complex and efficient queries: https://chat-to.dev/post?id=724

by redman25 on 12/1/2024, 11:55 AM

One solution for performance degradation with soft deletes is to partition the table monthly by some field like `created`. The main downside is that queries then need to include `created` in their WHERE clause to get partition pruning.
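In Postgres this would be declarative range partitioning, where expiring a month becomes a cheap DROP/DETACH of one partition instead of a row-by-row DELETE. SQLite has no native partitioning, so this rough sketch fakes a partition with one table per month just to show the shape of the idea (all names hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# One physical table per month stands in for a partition.
for month in ("2024_10", "2024_11", "2024_12"):
    conn.execute(
        f"CREATE TABLE logs_{month} (id INTEGER PRIMARY KEY, msg TEXT)")
    conn.execute(f"INSERT INTO logs_{month} (msg) VALUES ('hello')")

# Expiring October is a metadata operation, not a million-row DELETE:
conn.execute("DROP TABLE logs_2024_10")

remaining = [r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE name LIKE 'logs_%' ORDER BY name")]
print(remaining)  # ['logs_2024_11', 'logs_2024_12']
```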

by burntcaramel on 12/1/2024, 8:44 AM

If data isn’t actually removed until vacuuming, then are systems that perform SQL DELETES actually GDPR compliant? Because technically the private data is still there on disk and could be recovered. “Until the autovacuum process or a manual VACUUM operation reclaims the space, the “deleted” data remains.”