从 sqlite 数据库中删除重复行

在 SQLite3中有一个巨大的表——3600万行。在这个非常大的表中,有两列:

  • 短信
  • d-real

有些行是重复的。也就是说,hashd具有相同的值。如果两个散列是相同的,那么 d的值也是相同的。然而,两个相同的 d并不意味着两个相同的 hash

我想删除重复的行。我没有一个主键列。

最快的方法是什么?

65093 次浏览

If adding a primary key is not an option, then one approach would be to store the duplicates DISTINCT in a temp table, delete all of the duplicated records from the existing table, and then add the records back into the original table from the temp table.

For example (written for SQL Server 2008, but the technique is the same for any database):

DECLARE @original AS TABLE([hash] varchar(20), [d] float)
INSERT INTO @original VALUES('A', 1)
INSERT INTO @original VALUES('A', 2)
INSERT INTO @original VALUES('A', 1)
INSERT INTO @original VALUES('B', 1)
INSERT INTO @original VALUES('C', 1)
INSERT INTO @original VALUES('C', 1)


DECLARE @temp AS TABLE([hash] varchar(20), [d] float)
INSERT INTO @temp
SELECT [hash], [d] FROM @original
GROUP BY [hash], [d]
HAVING COUNT(*) > 1


DELETE O
FROM @original O
JOIN @temp T ON T.[hash] = O.[hash] AND T.[d] = O.[d]


INSERT INTO @original
SELECT [hash], [d] FROM @temp


SELECT * FROM @original

I'm not sure if sqlite has a ROW_NUMBER() type function, but if it does you could also try some of the approaches listed here: Delete duplicate records from a SQL table without a primary key

You need a way to distinguish the rows. Based on your comment, you could use the special rowid column for that.

To delete duplicates by keeping the lowest rowid per (hash,d):

delete   from YourTable
where    rowid not in
(
select  min(rowid)
from    YourTable
group by
hash
,       d
)

I guess the fastest would be to use the very database for it: add a new table with the same columns, but with proper constraints (a unique index on hash/real pair?), iterate through the original table and try to insert records in the new table, ignoring constraint violation errors (i.e. continue iterating when exceptions are raised).

Then delete the old table and rename the new to the old one.

The proposed solution was not working for me, so I ended up doing this:

CREATE TABLE temp_table as SELECT DISTINCT * FROM your_table
DROP TABLE your_table
ALTER TABLE temp_table RENAME TO your_table