Is SQL IN bad for performance?

I have a query like this:

SELECT FieldX, FieldY FROM A
WHERE FieldW IN (108, 109, 113, 138, 146, 160,
307, 314, 370, 371, 441, 454, 457, 458, 479, 480,
485, 488, 490, 492, 519, 523, 525, 534, 539, 543,
546, 547, 550, 564, 573, 629, 642, 643, 649, 650,
651, 694, 698, 699, 761, 762, 768, 772, 773, 774,
775, 778, 784, 843, 844, 848, 851, 852, 853, 854,
855, 856, 857, 858, 859, 860, 861, 862, 863, 864,
865, 868, 869, 871, 872, 873, 891)

Is an IN clause with this many values bad for query performance? I'm getting a lot of timeouts in the application, and I believe this kind of thing could be one of the sources of the problem. Can I optimize the query without removing the numbers, using any good SQL hints?

Edit:

@KM those are keys to another table. This is a forum application; briefly: C# gets all the forums from the database and stores them in the application cache. Before C# calls the procedure that fetches the threads for these forums and this user, C# does some logical filtering of the "all forums" collection, taking permissions and some business logic into account. The timeouts happen on the database, not in the application itself. Doing all of that logic in the query would require a lot of inner joins, and I'm not 100% sure all of it could be done inside the procedure.

I'm using SQL Server 2000.


There are better ways to code it, but I doubt it's the cause of your timeouts, especially if it's only a SELECT. You should be able to determine that by looking at your query traces though. But recoding this would be optimization by guessing, and an unlikely guess at that.

Let's start with a query plan for the query that is actually timing out. Do you know for sure which query it is?

If you have a good index on FieldW, using that IN is perfectly right.

I have just tested and SQL 2000 does a Clustered Index Scan when using the IN.

Typically the IN clause is harmful to performance, but what is "bad" depends on the application, data, database size, etc. You need to test your own app to see what is best.

You can try creating a temporary table, insert your values to it and use the table instead in the IN predicate.
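A minimal sketch of that approach, assuming a throwaway temp table (the name #ids and the primary key are just illustrative):

CREATE TABLE #ids (id INT NOT NULL PRIMARY KEY)

-- one insert per value (or load them in bulk from the application)
INSERT INTO #ids (id) VALUES (108)
INSERT INTO #ids (id) VALUES (109)
INSERT INTO #ids (id) VALUES (113)
-- ...

SELECT FieldX, FieldY
FROM A
WHERE FieldW IN (SELECT id FROM #ids)

DROP TABLE #ids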

AFAIK, SQL Server 2000 cannot build a hash table out of a set of constants, which deprives the optimizer of the possibility of using a HASH SEMI JOIN.

This will help only if you don't have an index on FieldW (which you should have).

You can also try to include your FieldX and FieldY columns into the index:

CREATE INDEX ix_a_wxy ON a (FieldW, FieldX, FieldY)

so that the query can be served using the index alone.

SQL Server 2000 lacks the INCLUDE option for CREATE INDEX, so this may degrade DML performance a little, but it will improve query performance.

Update:

From your execution plan I see that you need a composite index on (SettingsID, SectionID).

SQL Server 2000 can indeed build a hash table out of a constant list (and does so), but a Hash Semi Join will most probably be less efficient than a Nested Loop for this query.

And just a side note: if you need to know the count of rows satisfying the WHERE condition, don't use COUNT(column), use COUNT(*) instead.

A COUNT(column) does not count the rows for which the column value is NULL.

This means that, first, you can get the results you didn't expect, and, second, the optimizer will need to do an extra Key Lookup / Bookmark Lookup if your column is not covered by an index that serves the WHERE condition.

Since ThreadId seems to be a CLUSTERED PRIMARY KEY, it's all right for this very query, but try to avoid it in general.
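For example, the two forms below are not interchangeable; SomeNullableColumn is just a stand-in for any nullable column in the table:

-- counts every row matching the WHERE condition
SELECT COUNT(*) FROM A WHERE FieldW = 108

-- counts only rows where SomeNullableColumn IS NOT NULL, and may force an
-- extra lookup if that column isn't covered by the index serving the WHERE
SELECT COUNT(SomeNullableColumn) FROM A WHERE FieldW = 108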

Basically what that where clause does is "FieldW = 108 OR FieldW = 109 OR FieldW = 113...". Sometimes you can get better performance by doing multiple selects, and combining them with union. For example:

SELECT FieldX, FieldY FROM A WHERE FieldW = 108
UNION ALL
SELECT FieldX, FieldY FROM A WHERE FieldW = 109

But of course that is impractical when you're comparing to so many values.

Another option might be to insert those values into a temporary table and then join the A table to that temp table.

The size of your table will determine the speed of this statement. If it's not a very large table, this statement isn't affecting your performance.

Performance can only be judged in the context of what you are trying to do. In this case you are requesting the retrieval of around 70 rows (assuming they are unique values), so you can expect something like 70 times the duration of retrieving a single value. It might be less due to caching, of course.

However, the query optimiser may need or choose to perform a full table scan in order to retrieve the values, in which case performance will be little different from retrieving a single value via the same access plan.

If you can use something other than IN, do it. (In some cases I was using IN when it wasn't really the right approach; I could easily replace it with EXISTS, and it was quicker.)

In your case, it doesn't seem so bad.
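To sketch what such a rewrite can look like once the values live in another table (here a hypothetical #ids table holding the id list):

SELECT a.FieldX, a.FieldY
FROM A a
WHERE EXISTS (SELECT 1 FROM #ids i WHERE i.id = a.FieldW)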

IN is exactly the same thing as writing a big list of ORs. And OR often makes queries unSARGable, so your indexes may be ignored and the plan goes for a full scan.

Depending on your data distribution, additional predicates in your WHERE clause may improve performance. For example, if the set of ids is small relative to the total number in the table, and you know that the ids are relatively close together (perhaps they will usually be recent additions, and therefore clustered at the high end of the range), you could try including the predicate "AND FieldW BETWEEN 109 AND 891" (after determining the min and max id in your set in the C# code). It may be that a range scan on that column (if indexed) works faster than what is currently being used.
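Roughly like this, with the bounds computed in the C# code (108 and 891 here are just the endpoints of the example list, and the IN list is shortened):

SELECT FieldX, FieldY
FROM A
WHERE FieldW BETWEEN 108 AND 891 -- min and max of the id set
  AND FieldW IN (108, 109, 113, /* ...rest of the list... */ 891)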

There are several considerations when writing a query using the IN operator that can have an effect on performance.

First, IN clauses are generally internally rewritten by most databases to use the OR logical connective. So col IN ('a','b','c') is rewritten to: (COL = 'a') OR (COL = 'b') OR (COL = 'c'). The execution plans for both queries will likely be equivalent, assuming that you have an index on col.

Second, when using either IN or OR with a variable number of arguments, you are causing the database to have to re-parse the query and rebuild an execution plan each time the arguments change. Building the execution plan for a query can be an expensive step. Most databases cache the execution plans for the queries they run using the EXACT query text as a key. If you execute a similar query but with different argument values in the predicate - you will most likely cause the database to spend a significant amount of time parsing and building execution plans. This is why bind variables are strongly recommended as a way to ensure optimal query performance.
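For illustration, this is the general shape of a parameterised call in T-SQL using sp_executesql; the single-parameter form is only to show the idea, not the exact query from the question:

-- the query text stays constant, so its cached plan can be reused;
-- only the parameter value changes between executions
EXEC sp_executesql
    N'SELECT FieldX, FieldY FROM A WHERE FieldW = @id',
    N'@id int',
    @id = 108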

Third, many databases have a limit on the complexity of the queries they can execute - one of those limits is the number of logical connectives that can be included in the predicate. In your case, a few dozen values are unlikely to reach the built-in limit of the database, but if you expect to pass hundreds or thousands of values to an IN clause, it can definitely happen. In that case the database will simply cancel the query request.

Fourth, queries that include IN and OR in the predicate cannot always be optimally rewritten in a parallel environment. There are various cases where parallel optimizations do not get applied - MSDN has a decent introduction to optimizing queries for parallelism. Generally though, queries that use the UNION ALL operator are trivially parallelizable in most databases - and are preferred to logical connectives (like OR and IN) when possible.

Here is your answer...

https://web.archive.org/web/20211020153409/https://www.4guysfromrolla.com/webtech/031004-1.shtml

Basically, you want to create a function that will split a string and populate a temp table with the split contents. Then you can join to that temp table and manipulate your data. The above explains things pretty well. I use this technique a lot.

In your specific case, use a join to the temp table instead of an IN clause; it's much faster.
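A rough sketch of that kind of splitter, written from memory rather than copied from the article (the function name and body are assumptions; the article's version differs in detail):

CREATE FUNCTION dbo.SplitIds (@list VARCHAR(8000))
RETURNS @ids TABLE (id INT NOT NULL)
AS
BEGIN
    DECLARE @pos INT, @item VARCHAR(20)
    WHILE LEN(@list) > 0
    BEGIN
        -- take the next comma-delimited token off the front of the list
        SET @pos = CHARINDEX(',', @list)
        IF @pos = 0
        BEGIN
            SET @item = @list
            SET @list = ''
        END
        ELSE
        BEGIN
            SET @item = LEFT(@list, @pos - 1)
            SET @list = SUBSTRING(@list, @pos + 1, LEN(@list))
        END
        SET @item = LTRIM(RTRIM(@item))
        IF LEN(@item) > 0
            INSERT INTO @ids (id) VALUES (CAST(@item AS INT))
    END
    RETURN
END

Once it exists, the query becomes a plain join (list shortened here):

SELECT a.FieldX, a.FieldY
FROM A a
JOIN dbo.SplitIds('108,109,113,138') ids ON a.FieldW = ids.id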

You might try something like:

select a.FieldX, a.FieldY
from (
select FieldW = 108 union
select FieldW = 109 union
select FieldW = 113 union
...
select FieldW = 891
) _a
join A a on a.FieldW = _a.FieldW

It may be appropriate for your situation, such as when you want to generate a single SQL statement dynamically. On my machine (SQL Server 2008 Express), testing with a small number (5) of FieldW values and a large number (100,000) of rows in A, this uses an index seek on A with a nested loops join between A and _a, which is probably what you're looking for.

I generally would use a user-defined table type for queries like this.

CREATE TYPE [dbo].[udt_int] AS TABLE (
[id] [int] NOT NULL
)

Using a table-variable and filling it with rows for each of your numbers, you can do:

DECLARE @myIds dbo.udt_int

-- fill the table variable with the ids you want (list shortened here)
INSERT INTO @myIds (id) VALUES (108), (109), (113)

SELECT
FieldX,
FieldY
FROM A
INNER JOIN @myIds B ON
A.FieldW = B.id