使用 GROUPBY 与使用 DISTINCT 时的性能差异很大

我使用一个包含500000个条目的表在 HSQLDB服务器上执行一些测试。该表没有索引。有5000个不同的业务密钥。我需要他们的名单。

我自然而然地从 DISTINCT查询开始:

SELECT DISTINCT business_key
FROM memory
WHERE concept <> 'case'   OR
attrib  <> 'status' OR
value   <> 'closed';

大概需要90秒! ! !

然后我尝试使用 GROUP BY:

SELECT business_key
FROM memory
WHERE concept <> 'case'   OR
attrib  <> 'status' OR
value   <> 'closed';
GROUP BY business_key

只需要一秒钟! ! !

试图找出差异,我运行 EXLAIN PLAN FOR,但它似乎提供了相同的信息,为两个查询。

EXLAIN PLAN FOR DISTINCT ...

isAggregated=[false]
columns=[
COLUMN: PUBLIC.MEMORY.BUSINESS_KEY
]
[range variable 1
join type=INNER
table=MEMORY
alias=M
access=FULL SCAN
condition = [    index=SYS_IDX_SYS_PK_10057_10058
other condition=[
OR arg_left=[
OR arg_left=[
NOT_EQUAL arg_left=[
COLUMN: PUBLIC.MEMORY.CONCEPT] arg_right=[
VALUE = case, TYPE = CHARACTER]] arg_right=[
NOT_EQUAL arg_left=[
COLUMN: PUBLIC.MEMORY.ATTRIB] arg_right=[
VALUE = status, TYPE = CHARACTER]]] arg_right=[
NOT_EQUAL arg_left=[
COLUMN: PUBLIC.MEMORY.VALUE] arg_right=[
VALUE = closed, TYPE = CHARACTER]]]
]
]]
PARAMETERS=[]
SUBQUERIES[]
Object References
PUBLIC.MEMORY
PUBLIC.MEMORY.CONCEPT
PUBLIC.MEMORY.ATTRIB
PUBLIC.MEMORY.VALUE
PUBLIC.MEMORY.BUSINESS_KEY
Read Locks
PUBLIC.MEMORY
WriteLocks

EXLAIN PLAN FOR SELECT ... GROUP BY ...

isDistinctSelect=[false]
isGrouped=[true]
isAggregated=[false]
columns=[
COLUMN: PUBLIC.MEMORY.BUSINESS_KEY
]
[range variable 1
join type=INNER
table=MEMORY
alias=M
access=FULL SCAN
condition = [    index=SYS_IDX_SYS_PK_10057_10058
other condition=[
OR arg_left=[
OR arg_left=[
NOT_EQUAL arg_left=[
COLUMN: PUBLIC.MEMORY.CONCEPT] arg_right=[
VALUE = case, TYPE = CHARACTER]] arg_right=[
NOT_EQUAL arg_left=[
COLUMN: PUBLIC.MEMORY.ATTRIB] arg_right=[
VALUE = status, TYPE = CHARACTER]]] arg_right=[
NOT_EQUAL arg_left=[
COLUMN: PUBLIC.MEMORY.VALUE] arg_right=[
VALUE = closed, TYPE = CHARACTER]]]
]
]]
groupColumns=[
COLUMN: PUBLIC.MEMORY.BUSINESS_KEY]
PARAMETERS=[]
SUBQUERIES[]
Object References
PUBLIC.MEMORY
PUBLIC.MEMORY.CONCEPT
PUBLIC.MEMORY.ATTRIB
PUBLIC.MEMORY.VALUE
PUBLIC.MEMORY.BUSINESS_KEY
Read Locks
PUBLIC.MEMORY
WriteLocks

剪辑

我做了额外的测试。HSQLDB中有500000条记录,其中包含所有不同的业务密钥,因此 DISTINCT的性能现在更好了——3秒,而 GROUP BY只用了大约9秒。

MySQL中,两个查询的结果相同:

MySQL: 500000行-5000个不同的业务键: 两个查询都是0.5秒 MySQL: 500000行-所有不同的业务键: SELECT DISTINCT ...-11秒 SELECT ... GROUP BY business_key-13秒

所以这个问题只和 HSQLDB有关。

如果有人能解释一下为什么会有如此巨大的差异,我将非常感激。

142303 次浏览

The two queries express the same question. Apparently the query optimizer chooses two different execution plans. My guess would be that the distinct approach is executed like:

  • Copy all business_key values to a temporary table
  • Sort the temporary table
  • Scan the temporary table, returning each item that is different from the one before it

The group by could be executed like:

  • Scan the full table, storing each value of business key in a hashtable
  • Return the keys of the hashtable

The first method optimizes for memory usage: it would still perform reasonably well when part of the temporary table has to be swapped out. The second method optimizes for speed, but potentially requires a large amount of memory if there are a lot of different keys.

Since you either have enough memory or few different keys, the second method outperforms the first. It's not unusual to see performance differences of 10x or even 100x between two execution plans.