> For example, deleting 1 million rows in a single transaction is a textbook case of what not to do. Instead, splitting the operation into smaller batches, such as deleting 10,000 rows across 100 iterations, is far more effective.
Why do I as a user have to do that? Why can't the database implement batching internally and automatically transform my 1-million-rows query into an appropriate list of batched queries?
(Edit: Thanks a lot for the answers, that makes more sense - in particular the point that this would also lock one million rows at once)
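For reference, a batched delete of the sort the article describes looks roughly like this in Postgres - DELETE has no LIMIT clause, so the batch is picked out by ctid in a subquery. The films table and the predicate here are just placeholders:

    -- Delete at most 10,000 matching rows per statement.
    DELETE FROM films
    WHERE ctid IN (
        SELECT ctid
        FROM films
        WHERE release_year < 1990   -- placeholder predicate
        LIMIT 10000
    );
    -- Re-run (from application code or a scripted loop) until it reports DELETE 0.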
DELETE FROM films;
I'm surprised databases make it so easy to just delete an entire table. I think the command should be DELETE FROM films YES-I-KNOW-WHAT-I-AM-DOING;
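One small safeguard is to wrap destructive statements in an explicit transaction and check the reported row count before making it permanent - a rough sketch:

    BEGIN;
    DELETE FROM films;
    -- psql reports how many rows were affected; if that looks wrong:
    ROLLBACK;
    -- otherwise: COMMIT;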
> Unlike DELETEs, UPDATEs don't trigger cascaded actions - they only involve triggers that are explicitly defined.
That's not entirely true. ON UPDATE CASCADE is a thing for foreign keys, at least in PostgreSQL (which this article is talking about), meaning that rows referencing the updated row get their foreign-key columns updated as well. Though, personally, I would never use ON UPDATE CASCADE, as it seems kind of funky.
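For anyone who hasn't seen it, a minimal sketch of what it does (table names invented):

    CREATE TABLE films (
        code  char(5) PRIMARY KEY,
        title text NOT NULL
    );

    CREATE TABLE screenings (
        film_code char(5) REFERENCES films (code)
                          ON UPDATE CASCADE,
        showtime  timestamptz
    );

    -- Changing a referenced key now silently rewrites the referencing rows too:
    UPDATE films SET code = 'NEW01' WHERE code = 'OLD01';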
For many of the most painful deletion questions, the root problem is that when the software was first made the stakeholders/product-org didn't think about use-cases for deleting things. At best, they assume a "do not show" property can be placed onto things, which falls apart when you get to legal issues that compel actual removal.
A big asterisk that should be added to the article is that all that applies to Postgres.
Other databases have their own deletion mechanisms (and deletion quirks)
It's a very good article otherwise.
They are difficult, but the cost of trying to keep consistency with the alternatives shouldn't be underestimated either.
I sometimes think that people ask the wrong question on this sort of thing - rather than thinking "what technical solution should I come up with" you should be thinking "what is the business requirement here, and what consistency guarantees are needed?"
In many cases you want to soft delete anyway and mark rows as stale rather than deleting them wholesale. Cascade deletes need to be very carefully thought about, as while they're very handy they can be quite destructive if the relationships are not mapped out.
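A minimal soft-delete sketch along those lines (table and column names assumed):

    ALTER TABLE films ADD COLUMN deleted_at timestamptz;

    -- "Delete" by stamping the row instead of removing it:
    UPDATE films SET deleted_at = now() WHERE id = 42;   -- assumed key

    -- Live queries then have to exclude stale rows:
    SELECT * FROM films WHERE deleted_at IS NULL;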
Personally, having spent some time now in the microservices hole, I miss all the power SQL databases give you for this sort of thing. I think everyone should spend some time reading and digesting "Designing Data Intensive Applications" and evaluating the trade-offs in detail.
Reminds me of this old post about deleting large amounts of data efficiently at MySpace. The page is gone now, but it was archived.
https://web.archive.org/web/20090525233504/http://blogs.msdn...
Brent Ozar talked about this back in 2018 in reference to working with large tables in MSSQL Server:
https://www.brentozar.com/archive/2018/04/how-to-delete-just...
My only recommendation would be: no matter which strategy you go with, cover it with tests to make sure the right information stays, the right information gets physically deleted (or marked as deleted), and data marked for deletion is invisible to the UI except via admin access.
But, indeed, proper deletion is surprisingly difficult, especially when you consider cascades on a complex table containing many defined foreign-key relationships.
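As a rough starting point for mapping those out in Postgres, a catalog query like this (table name assumed) lists the foreign keys that point at a table:

    SELECT conname,
           conrelid::regclass AS referencing_table
    FROM pg_constraint
    WHERE contype = 'f'                      -- foreign-key constraints only
      AND confrelid = 'films'::regclass;     -- the table being referenced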
Why do DBs perform delete operations online? Wouldn't it be better to soft-delete (at the table-space level) and then run a scheduled task to clean up the table spaces?
Similar to git. When you "delete" files they are just removed from the tree. It isn't until later, when all refs and reflog references have expired and gc is run, that the objects are actually removed.
Most databases I have used have a Status column that can be marked as active, inactive, or deleted. That way, you can see which records were marked as deleted and change them back in case of accidental deletion.
Keep record retention with a Date_Modified column so you can use a SQL DELETE to remove deleted records that are older than a year or so.
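A retention job along those lines might look roughly like this (Postgres-flavoured; table name assumed, columns as described above):

    -- Purge rows that were soft-deleted more than a year ago.
    DELETE FROM orders
    WHERE Status = 'deleted'
      AND Date_Modified < now() - interval '1 year';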
This reminds me of a post that gives an idea of how to build complex and efficient queries: https://chat-to.dev/post?id=724
One solution for the performance degradation with soft deletes is to partition the table by some field like `created`, e.g. monthly. The main downside is that queries then need to include `created` in their predicates.
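Roughly what that looks like with declarative partitioning in Postgres (names assumed):

    CREATE TABLE events (
        id      bigint,
        created timestamptz NOT NULL,
        payload jsonb
    ) PARTITION BY RANGE (created);

    CREATE TABLE events_2024_05 PARTITION OF events
        FOR VALUES FROM ('2024-05-01') TO ('2024-06-01');

    -- Expiring a whole month is then a cheap metadata operation instead of a big DELETE:
    DROP TABLE events_2024_05;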
If data isn't actually removed until vacuuming, then are systems that perform SQL DELETEs actually GDPR compliant? Because technically the private data is still there on disk and could be recovered. "Until the autovacuum process or a manual VACUUM operation reclaims the space, the 'deleted' data remains."
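For illustration, the dead tuples only get dealt with by something like this (table name assumed):

    DELETE FROM users WHERE id = 42;   -- the row is only marked dead
    VACUUM users;        -- makes the dead space reusable, but old bytes can linger in the data files
    VACUUM FULL users;   -- rewrites the table entirely, dropping the old pages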
DELETE is expensive at a deep fundamental level that we don't think about much in computer science because we are more worried about losing data. The article is about Postgres but it generalizes. We don't actually have any computer science for DELETE-optimized databases. I've idly looked into delete-optimization in databases as thought experiments, since there isn't much in the way of literature on it, and it is far more difficult than I think many people intuit. The article is partly a manifestation of this reality.
I think the nature of DELETE is one of the more interesting open problems in computer science. It is one of those things that, when we require precision, turns out to be very difficult to define.