Bulk/batch update/upsert in PostgreSQL

I'm writing a Django-ORM enhancement that attempts to cache models and postpone saving them until the end of the transaction. It's almost done, but I ran into an unexpected difficulty with SQL syntax.

I'm not much of a DBA, but from what I understand, databases don't really work efficiently with many small queries; a few bigger queries are much better. For example, it's better to use one large bulk insert (say, 100 rows at once) than 100 one-liners.

Now, as far as I can tell, SQL doesn't really offer any statement to perform a bulk update on a table. The term seems to be confusing, so I'll explain what I mean by it. I have an array of arbitrary data, each entry describing one row of a table. I'd like to update certain rows of the table, each one using the data from its corresponding entry in the array. The idea is very similar to a bulk insert.

For example: my table could have two columns, "id" and "some_col". Now the array describing the data for a batch update consists of three entries: (1, 'first updated'), (2, 'second updated') and (3, 'third updated'). Before the update, the table contains the rows (1, 'first'), (2, 'second') and (3, 'third').
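In other words, the batch update should have the same effect as the following row-by-row statements, only expressed as a single statement (just a sketch; the table name "table" is made up):

UPDATE "table" SET some_col = 'first updated'  WHERE id = 1;
UPDATE "table" SET some_col = 'second updated' WHERE id = 2;
UPDATE "table" SET some_col = 'third updated'  WHERE id = 3;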

I stumbled upon this post:

Why are batch inserts/updates faster? How do batch updates work?

which seems to do what I want, but I can't really figure out the syntax in the end.

I could also delete all the rows that need updating and re-insert them with a batch insert, but I find it hard to believe that this would actually perform better.

I work with PostgreSQL 8.4, so some stored procedures are also an option here. However, since I plan to open-source the project eventually, any more portable ideas or ways to do the same thing on a different RDBMS are most welcome.

Follow-up question: how do I perform a batch "insert-or-update"/"upsert" statement?

Test results

I performed 100 iterations of 10 insert operations spread over 4 different tables (1000 inserts in total). I tested on Django 1.3 with the PostgreSQL 8.4 backend.

Here are the results:

  • All operations done through the Django ORM: 2.45 seconds per pass,
  • The same operations, but without the Django ORM: 1.48 seconds per pass,
  • Only insert operations, without querying the database for sequence values: 0.72 seconds,
  • Only insert operations, executed in blocks of 10 (100 blocks in total): 0.19 seconds,
  • Only insert operations, one big execution block: 0.13 seconds,
  • Only insert operations, about 250 statements per block: 0.12 seconds.

Conclusion: execute as many operations as possible in a single connection.execute(). Django itself introduces substantial overhead.

Disclaimer: I didn't introduce any indexes apart from the default primary-key indexes, so the insert operations might run faster because of that.


I've used 3 strategies for batch transactional work:

  1. Generate SQL statements on the fly, concatenate them with semicolons, and then submit the statements in one shot (see the sketch after this list). I've done up to 100 inserts in this way, and it was quite efficient (done against Postgres).
  2. JDBC has batching capabilities built in, if configured. If you generate transactions, you can flush your JDBC statements so that they transact in one shot. This tactic requires fewer database calls, as the statements are all executed in one batch.
  3. Hibernate also supports JDBC batching along the lines of the previous example, but in this case you execute a flush() method against the Hibernate Session, not the underlying JDBC connection. It accomplishes the same thing as JDBC batching.
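
For example, a concatenated batch submitted in a single round trip might look like this (just a sketch; the table and column names are made up):

BEGIN;
INSERT INTO "table" (id, some_col) VALUES (1, 'first');
INSERT INTO "table" (id, some_col) VALUES (2, 'second');
INSERT INTO "table" (id, some_col) VALUES (3, 'third');
COMMIT;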

Incidentally, Hibernate also supports a batching strategy in collection fetching. If you annotate a collection with @BatchSize, when fetching associations, Hibernate will use IN instead of =, leading to fewer SELECT statements to load up the collections.

Turn off autocommit and just do one commit at the end. In plain SQL, this means issuing BEGIN at the start and COMMIT at the end. You would need to create a function in order to do an actual upsert.
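
On PostgreSQL 8.4, which has no built-in upsert statement, such a function can be modelled on the merge_db example in the PostgreSQL documentation. A minimal sketch, assuming a table "table" with a uniquely constrained column key and a column value:

CREATE FUNCTION upsert_row(k integer, v text) RETURNS void AS
$$
BEGIN
    LOOP
        -- First try to update an existing row.
        UPDATE "table" SET value = v WHERE key = k;
        IF FOUND THEN
            RETURN;
        END IF;
        -- Not there: try to insert. If another transaction inserts the
        -- same key concurrently, loop back and retry the UPDATE.
        BEGIN
            INSERT INTO "table" (key, value) VALUES (k, v);
            RETURN;
        EXCEPTION WHEN unique_violation THEN
            NULL;  -- fall through and retry the UPDATE
        END;
    END LOOP;
END;
$$ LANGUAGE plpgsql;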

Bulk inserts can be done as such:

INSERT INTO "table" ( col1, col2, col3)
VALUES ( 1, 2, 3 ) , ( 3, 4, 5 ) , ( 6, 7, 8 );

Will insert 3 rows.

Multiple updating is defined by the SQL standard, but not implemented in PostgreSQL.

Quote:

"According to the standard, the column-list syntax should allow a list of columns to be assigned from a single row-valued expression, such as a sub-select:

UPDATE accounts SET (contact_last_name, contact_first_name) = (SELECT last_name, first_name FROM salesmen WHERE salesmen.id = accounts.sales_id);"

Reference: http://www.postgresql.org/docs/9.0/static/sql-update.html

Bulk insert

You can modify @Ketema's three-column bulk insert:

INSERT INTO "table" (col1, col2, col3)
VALUES (11, 12, 13) , (21, 22, 23) , (31, 32, 33);

It becomes:

INSERT INTO "table" (col1, col2, col3)
VALUES (unnest(array[11,21,31]),
unnest(array[12,22,32]),
unnest(array[13,23,33]))

Replacing the values with placeholders:

INSERT INTO "table" (col1, col2, col3)
VALUES (unnest(?), unnest(?), unnest(?))

You have to pass arrays or lists as arguments to this query. This means you can do huge bulk inserts without doing string concatenation (and all its hassles and dangers: SQL injection and quoting hell).
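
For instance, using PostgreSQL's own PREPARE/EXECUTE (with $n placeholders instead of the driver-level ?), the idea looks roughly like this, assuming the three columns are integer arrays:

PREPARE bulk_insert(int[], int[], int[]) AS
INSERT INTO "table" (col1, col2, col3)
SELECT unnest($1), unnest($2), unnest($3);

EXECUTE bulk_insert(ARRAY[11,21,31], ARRAY[12,22,32], ARRAY[13,23,33]);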

Bulk update

PostgreSQL has added the FROM extension to UPDATE. You can use it in this way:

update "table"
set value = data_table.new_value
from
(select unnest(?) as key, unnest(?) as new_value) as data_table
where "table".key = data_table.key;

The manual is missing a good explanation, but there is an example on the postgresql-admin mailing list. I tried to elaborate on it:

create table tmp
(
id serial not null primary key,
name text,
age integer
);


insert into tmp (name,age)
values ('keith', 43),('leslie', 40),('bexley', 19),('casey', 6);


update tmp set age = data_table.age
from
(select unnest(array['keith', 'leslie', 'bexley', 'casey']) as name,
unnest(array[44, 50, 10, 12]) as age) as data_table
where tmp.name = data_table.name;
 

There are also other posts on StackExchange explaining UPDATE...FROM.. using a VALUES clause instead of a subquery. They might be easier to read, but are restricted to a fixed number of rows.
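
For reference, that VALUES-based variant applied to the tmp table above could look roughly like this:

update tmp
set age = data_table.age
from (values ('keith', 44),
             ('leslie', 50),
             ('bexley', 10),
             ('casey', 12)) as data_table(name, age)
where tmp.name = data_table.name;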

It is pretty fast to turn a JSON array into a recordset (PostgreSQL 9.3+):

import json

# Each tuple becomes one element of a JSON array: [id, new_value].
big_list_of_tuples = [
    (1, "123.45"),
    ...
    (100000, "678.90"),
]

# Send the whole list as a single JSON parameter and unpack it server-side.
connection.execute("""
    UPDATE mytable
    SET myvalue = Q.myvalue
    FROM (
        SELECT (value->>0)::integer AS id,
               (value->>1)::decimal AS myvalue
        FROM json_array_elements(%s)
    ) Q
    WHERE mytable.id = Q.id
    """,
    [json.dumps(big_list_of_tuples)]
)