Best way to delete millions of rows by ID

I need to delete about 2 million rows from my PG database. I have a list of the IDs that need to be deleted. However, no matter how I go about it, it takes days.

I tried putting them in a table and doing it in batches of 100. Four days later, this is still running, with only 297,268 rows deleted. (I had to select 100 IDs from the ID table, delete WHERE IN that list, then delete those 100 from the ID table.)
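For reference, each batch looked roughly like this (using the same tbl and ids tables as in the query below; details simplified):

BEGIN;

CREATE TEMP TABLE batch ON COMMIT DROP AS
SELECT id FROM ids LIMIT 100;          -- pick the next 100 IDs

DELETE FROM tbl WHERE id IN (SELECT id FROM batch);   -- delete them from the main table
DELETE FROM ids WHERE id IN (SELECT id FROM batch);   -- drop them from the ID list

COMMIT;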

I also tried:

DELETE FROM tbl WHERE id IN (select * from ids)

That also takes forever. It's hard to tell how long, since I can't see its progress until it completes, but the query was still running after 2 days.

I'm just looking for the most effective way to delete from a table when I know the specific IDs to delete, and there are millions of them.


The easiest way to do this would be to drop all your constraints and then do the delete.

You may try copying all the data in the table except the IDs you want to delete onto a new table, then renaming and swapping the tables (provided you have enough resources to do it).
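A rough sketch of that approach, assuming the table is tbl and the ID list is in a table ids as in the question (note that foreign keys, views and other objects referencing tbl keep pointing at the old, renamed table and would have to be recreated against the new one):

BEGIN;

CREATE TABLE tbl_new (LIKE tbl INCLUDING ALL);   -- copy column definitions, indexes, defaults etc.

INSERT INTO tbl_new
SELECT t.*
FROM   tbl t
LEFT   JOIN ids d USING (id)
WHERE  d.id IS NULL;                             -- keep only the rows you are not deleting

ALTER TABLE tbl     RENAME TO tbl_old;
ALTER TABLE tbl_new RENAME TO tbl;               -- swap the names

COMMIT;

-- DROP TABLE tbl_old;                           -- once you are happy with the result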

This is not expert advice.

Two possible answers:

  1. Your table may have many constraints or triggers attached to it; when you try to delete a record, each one costs extra processor cycles and checks against other tables (see the catalog queries after this list).

  2. You may need to put this statement inside a transaction.
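To check what is attached to the table (assuming it is called tbl as in the question), the system catalogs can be queried, for example:

SELECT tgname                        -- user-defined triggers on the table
FROM   pg_trigger
WHERE  tgrelid = 'tbl'::regclass
AND    NOT tgisinternal;

SELECT conname, contype              -- constraints: f = foreign key, c = check, p = primary key
FROM   pg_constraint
WHERE  conrelid = 'tbl'::regclass;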

First make sure you have an index on the ID fields, both in the table you want to delete from and the table you are using for deletion IDs.

100 at a time seems too small. Try 1000 or 10000.

There's no need to delete anything from the deletion ID table. Add a new column for a batch number, assign batch 1 to the first 1000 IDs, batch 2 to the next 1000, and so on, and make sure the deletion query filters on the batch number.
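A sketch of that scheme with batches of 10,000, assuming the deletion IDs live in a table called ids as in the question:

CREATE INDEX ON ids (id);            -- index both sides of the lookup
CREATE INDEX ON tbl (id);            -- usually already covered by the primary key

ALTER TABLE ids ADD COLUMN batch int;

UPDATE ids i
SET    batch = b.batch
FROM  (SELECT id, ((row_number() OVER (ORDER BY id) - 1) / 10000)::int AS batch
       FROM   ids) b
WHERE  i.id = b.id;                  -- 10,000 IDs per batch number

DELETE FROM tbl
WHERE  id IN (SELECT id FROM ids WHERE batch = 0);   -- repeat for batch = 1, 2, ...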

It all depends ...

  • Assuming no concurrent write access to the involved tables; otherwise you may have to lock tables exclusively, or this route may not be for you at all.

  • Delete all indexes (possibly except the ones needed for the delete itself).
    Recreate them afterwards. That's typically much faster than incremental updates to indexes.

  • Check if you have triggers that can safely be deleted / disabled temporarily.

  • Do foreign keys reference your table? Can they be deleted? Temporarily deleted?

  • Depending on your autovacuum settings it may help to run VACUUM ANALYZE before the operation.

  • Some of the points listed in the related chapter of the manual Populating a Database may also be of use, depending on your setup.

  • If you delete large portions of the table and the rest fits into RAM, the fastest and easiest way may be this:

BEGIN;                             -- typically faster and safer wrapped in a single transaction

SET LOCAL temp_buffers = '1000MB'; -- enough to hold the temp table

CREATE TEMP TABLE tmp AS
SELECT t.*
FROM   tbl t
LEFT   JOIN del_list d USING (id)
WHERE  d.id IS NULL;      -- copy surviving rows into temporary table
-- ORDER BY ?             -- optionally order favorably while being at it

TRUNCATE tbl;             -- empty table - truncate is very fast for big tables

INSERT INTO tbl
TABLE tmp;                -- insert back surviving rows

COMMIT;

This way you don't have to recreate views, foreign keys or other depending objects. And you get a pristine (sorted) table without bloat.

Read about the temp_buffers setting in the manual. This method is fast as long as the table fits into memory, or at least most of it. The transaction wrapper defends against losing data if your server crashes in the middle of this operation.

Run VACUUM ANALYZE afterwards. Or (typically not necessary after going the TRUNCATE route) VACUUM FULL ANALYZE to bring it to minimum size (takes an exclusive lock). For big tables, consider the alternatives CLUSTER / pg_repack or similar.
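For example (pg_repack is a separate extension and command-line tool, not shown here; the index name used with CLUSTER is an assumption):

VACUUM ANALYZE tbl;             -- refresh statistics and visibility information

VACUUM FULL ANALYZE tbl;        -- only to reclaim disk space; takes an exclusive lock

CLUSTER tbl USING tbl_pkey;     -- rewrite the table in index order; also takes an exclusive lock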

For small tables, a simple DELETE instead of TRUNCATE is often faster:

DELETE FROM tbl t
USING  del_list d
WHERE  t.id = d.id;

Read the Notes section for TRUNCATE in the manual. In particular (as Pedro also pointed out in his comment):

TRUNCATE cannot be used on a table that has foreign-key references from other tables, unless all such tables are also truncated in the same command. [...]

And:

TRUNCATE will not fire any ON DELETE triggers that might exist for the tables.
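If the referencing tables can be emptied as well, the foreign-key restriction can be satisfied by truncating them in the same command, for example (hypothetical second table name):

TRUNCATE tbl, some_other_table;  -- truncate the referencing table in the same command

TRUNCATE tbl CASCADE;            -- or truncate everything that references tbl (use with care)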

We know that PostgreSQL's update/delete performance is not as strong as Oracle's. When we need to delete millions or tens of millions of rows, it's really difficult and takes a long time.

However, we can still do this in production dbs. The following is my idea:

First, we should create a log table with 2 columns - id & flag (id refers to the id you want to delete; flag can be Y or null, with Y signifying the record is successfully deleted).

Later, we create a function that does the delete in batches of 10,000 rows. You can see more details on my blog. Though it's in Chinese, you can still get the info you want from the SQL code there.

Make sure the id column of both tables is indexed, as that will make it run faster.
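A minimal sketch of such a function (not the exact code from the blog), assuming the target table is tbl, the log table is del_log(id, flag) and the ids are bigint:

CREATE OR REPLACE FUNCTION delete_in_batches() RETURNS void
LANGUAGE plpgsql AS $$
DECLARE
    batch bigint[];
BEGIN
    LOOP
        SELECT array_agg(id) INTO batch
        FROM  (SELECT id FROM del_log WHERE flag IS NULL LIMIT 10000) s;   -- next unprocessed batch

        EXIT WHEN batch IS NULL;                                           -- nothing left to do

        DELETE FROM tbl WHERE id = ANY(batch);
        UPDATE del_log SET flag = 'Y' WHERE id = ANY(batch);               -- mark the batch as done
    END LOOP;
END;
$$;

Note that a function runs all batches in one transaction; on PostgreSQL 11 or later this could be a PROCEDURE with a COMMIT after each batch instead (see the procedure further down).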

If the table you're deleting from is referenced by some_other_table (and you don't want to drop the foreign keys even temporarily), make sure you have an index on the referencing column in some_other_table!

I had a similar problem and used auto_explain with auto_explain.log_nested_statements = true, which revealed that the delete was actually doing seq_scans on some_other_table:

Query Text: SELECT 1 FROM ONLY "public"."some_other_table" x WHERE $1 OPERATOR(pg_catalog.=) "id" FOR KEY SHARE OF x
LockRows  (cost=[...])
  ->  Seq Scan on some_other_table x  (cost=[...])
        Filter: ($1 = id)

Apparently it's trying to lock the referencing rows in the other table (which shouldn't exist, or the delete will fail). After I created indexes on the referencing tables, the delete was orders of magnitude faster.
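For reference, the investigation and the fix looked roughly like this (the column name is taken from the plan above; your referencing column may differ):

LOAD 'auto_explain';                             -- per session; can also be set in postgresql.conf
SET auto_explain.log_min_duration = 0;
SET auto_explain.log_nested_statements = true;   -- log the FK-check queries fired by the delete

CREATE INDEX ON some_other_table (id);           -- index the referencing column the FK check scans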

I just hit this issue myself, and for me the fastest method by far was using WITH queries in combination with USING.

Basically, the WITH query builds a temporary list of the primary keys to delete from the table you want to delete from.

WITH to_delete AS (
    SELECT item_id FROM other_table WHERE condition_x = true
)
DELETE FROM tbl
USING  to_delete
WHERE  tbl.item_id = to_delete.item_id
AND    NOT to_delete.item_id IS NULL;

Of course the SELECT inside the WITH query can be as complex as any other select with multiple joins etc. It just has to return one or more columns that are used to identify the items in the target table that need to be deleted.

NOTE: AND NOT to_delete.item_id IS NULL most likely is not necessary, but I didn't dare to try.

Other things to consider are

  1. creating indexes on other tables that reference this one via foreign key, which can reduce a delete taking hours to mere seconds in certain situations
  2. deferring constraint checks: it's not clear how much improvement, if any, this achieves, but according to this it can increase performance. The downside is that if you have a foreign key violation, you will only learn about it at the very last moment.
  3. DANGEROUS but a big possible boost: disable constraint checks and triggers during the delete (see the sketch after this list)
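A rough sketch of points 2 and 3, using the tbl/ids names from the question (point 2 only affects constraints declared DEFERRABLE; point 3 also switches off foreign-key enforcement and needs superuser rights for the internal FK triggers):

BEGIN;
SET CONSTRAINTS ALL DEFERRED;                    -- postpone deferrable constraint checks until COMMIT
DELETE FROM tbl WHERE id IN (SELECT id FROM ids);
COMMIT;

ALTER TABLE tbl DISABLE TRIGGER ALL;             -- DANGEROUS: disables all triggers, including FK checks
DELETE FROM tbl WHERE id IN (SELECT id FROM ids);
ALTER TABLE tbl ENABLE TRIGGER ALL;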

I created a procedure to delete customers without orders in batches of 250k. A procedure is not faster per se, but you can start and stop it without losing deletions that are already committed, and you can resume it later (e.g. if you only have short maintenance windows).

CREATE OR REPLACE PROCEDURE delete_customer()
LANGUAGE plpgsql
AS $$
BEGIN
    ALTER TABLE customer DISABLE TRIGGER ALL;
    ALTER TABLE "order"  DISABLE TRIGGER ALL;    -- "order" is a reserved word and must be quoted
    WHILE EXISTS (SELECT FROM customer
                  WHERE NOT EXISTS (SELECT FROM "order" o WHERE o.customer_id = customer.id))
    LOOP
        DELETE FROM customer WHERE customer.id IN
            (SELECT customer.id FROM customer
             WHERE NOT EXISTS (SELECT FROM "order" o WHERE o.customer_id = customer.id)
             LIMIT 250000);
        COMMIT;                                  -- keep already deleted batches if the procedure is stopped
    END LOOP;
    ALTER TABLE customer ENABLE TRIGGER ALL;
    ALTER TABLE "order"  ENABLE TRIGGER ALL;
END;
$$;
CALL delete_customer();                                  -- start the procedure
SELECT * FROM pg_stat_activity WHERE state = 'active';  -- find the pid of the procedure
SELECT pg_cancel_backend(<pid>);                         -- stop the procedure

Make sure the triggers are re-enabled if you stop the procedure by hand. Disabling the triggers gives real performance improvements as mentioned by @Erwin Brandstetter, but was only possible for me in a short maintenance window.

I delete millions of rows incrementally in batches with minimal locks using one procedure, loop_execute(). It shows the progress of execution as a percentage and predicts the time when the work will end!
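loop_execute() itself is a separate procedure, but the idea looks roughly like this minimal sketch (not the real code; assumes PostgreSQL 11+ and the tbl/ids tables from the question, with bigint ids):

CREATE OR REPLACE PROCEDURE delete_in_batches_with_progress()
LANGUAGE plpgsql AS $$
DECLARE
    total   bigint;
    done    bigint := 0;
    batch   bigint[];
    started timestamptz := clock_timestamp();
BEGIN
    SELECT count(*) INTO total FROM ids;

    LOOP
        SELECT array_agg(id) INTO batch
        FROM  (SELECT id FROM ids LIMIT 10000) s;             -- next batch of IDs
        EXIT WHEN batch IS NULL;

        DELETE FROM tbl WHERE id = ANY(batch);
        DELETE FROM ids WHERE id = ANY(batch);
        COMMIT;                                               -- release locks between batches

        done := done + array_length(batch, 1);
        RAISE NOTICE '% %% done, estimated finish: %',
            round(100.0 * done / total, 1),
            started + (clock_timestamp() - started) * (total::numeric / done);
    END LOOP;
END;
$$;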