群集与非群集

我对 SQL (Server2008)的低级知识有限,现在正受到我们 DBA 的挑战。让我来解释一下(我已经提到了明显的陈述,希望我是对的,但如果你看到了什么错误,请告诉我)这个场景:

我们有一张桌子,上面放着“法院命令”。当我创建这个表时(名称: CourtOrder) ,我创建它的方式是:

CREATE TABLE dbo.CourtOrder
(
CourtOrderID INT NOT NULL IDENTITY(1,1), (Primary Key)
PersonId INT NOT NULL,
+ around 20 other fields of different types.
)

然后,我对主键应用了一个非聚集索引(为了提高效率)。我的理由是,它是一个唯一的字段(主键) ,应该进行索引,主要是为了选择目的,因为我们经常使用 Select from table where primary key = ...

然后,我对 PersonId 应用了一个丛集索引。原因是对特定的人物理订单进行分组,因为绝大多数工作是为一个人接受订单。那么,select from mytable where personId = ...

I have been pulled up on this now. I have been told that we should put the clustered index on the primary key, and the normal index on the personId. That seems very strange to me. First off, why would you put a clustered index on a unique column? what is it clustering? Surely that's a waste of the clustered index? I'd have believed a normal index would be used on a unique column. Also, clustering the index would mean we can't cluster a different column (One per table, right?).

我被告知我犯了一个错误的原因是他们认为在 PersonId 上放置一个聚集索引会使插入变慢。对于选择的速度增加5% ,插入和更新的速度将降低95% 。是这样吗?

他们说,因为我们集群了 PersonId,所以当我们插入或更改 PersonId 时,SQLServer 必须重新排列数据。

所以我问,如果 SQL 这么慢,为什么还要有 CLUSTERED INDEX 的概念呢?有他们说的那么慢吗?我应该如何设置索引以获得最佳性能?我本以为 SELECT 比 INSERT 用得更多... 但是他们说我们在 INSERTS 上有锁定问题..。

希望有人能帮我。

99570 次浏览

The distinction between a clustered vs. non-clustered index is that the clustered index determines the physical order of the rows in the database. In other words, applying the clustered index to PersonId means that the rows will be physically sorted by PersonId in the table, allowing an index search on this to go straight to the row (rather than a non-clustered index, which would direct you to the row's location, adding an extra step).

That said, it's unusual for the primary key not to be the clustered index, but not unheard of. The issue with your scenario is actually the opposite of what you're assuming: you want unique values in a clustered index, not duplicates. Because the clustered index determines the physical order of the row, if the index is on a non-unique column, then the server has to add a background value to rows who have a duplicate key value (in your case, any rows with the same PersonId) so that the combined value (key + background value) is unique.

The only thing I would suggest is not using a surrogate key (your CourtOrderId) column as the primary key, but instead use a compound primary key of the PersonId and some other uniquely-identifying column or set of columns. If that's not possible (or not practical), though, then put the clustered index on CourtOrderId.

I am by no means a SQL Expert...so take this as a developer's view rather than a DBA view..

Inserts on clustered (physically ordered) indexes that aren't in sequential order cause extra work for inserts/updates. Also, if you have many inserts happening at once and they are all occurring in the same location, you end up with contention. Your specific performance varies based on your data and how you access it. The general rule of thumb is to build your clustered index on the most unique narrow value in your table (typically the PK)

I'm assuming your PersonId won't be changing, so Updates don't come into play here. But consider a snapshot of a few rows with PersonId of 1 2 3 3 4 5 6 7 8 8

Now insert 20 new rows for PersonId of 3. First, since this is not a unique key, the server adds some extra bytes to your value (behind the scenes) to make it unique (which also adds extra space) and then the location where these will reside has to be altered. Compare that to inserting an auto-incrementing PK where the inserts happen at the end. The non technical explanation would likely come down to this: there is less 'leaf-shuffling' work to do if it's naturally progressing higher values at the end of the table versus reworking location of the existing items at that location while inserting your items.

Now, if you are having issues with Inserts then you are likely inserting a bunch of the same (or similar) PersonId values at once which is causing this extra work in various places throughout the table and the fragmentation is killing you. The downside of switching to the PK being clustered in your case, is if you are having insert issues today on PersonIds that vary in value spread throughout the table, if you switch your clustered index to the PK and all of the inserts now happen in one location then your problem may actually get worse due to increased contention concentration. (On the flip side, if your inserts today are not spread out all over, but are all typically bunched in similar areas, then your problem will likely ease by switching your clustered index away from PersonId to your PK because you'll be minimizing the fragmentation.)

Your performance problems should be analyzed to your unique situation and take these types of answers as general guidelines only. Your best bet is to rely on a DBA that can validate exactly where your problems lie. It sounds like you have resource contention issues that may be beyond a simple index tweak. This could be a symptom of a much larger problem. (Likely design issues...otherwise resource limitations.)

In any case, good luck!

Some authors do suggest not "wasting" the CI on an identity column if there is an alternative that would benefit range queries.

From MSDN Clustered Index Design Guidelines the key should be chosen according to the following criteria

  1. Can be used for frequently used queries.
  2. Provide a high degree of uniqueness.
  3. Can be used in range queries.

Your CourtOrderID column meets 2. Your PersonId meets 1 and 3. As most rows will end up with the uniqueifier added anyway you might as well just declare it as unique and use PersonId,CourtOrderID as this will be the same width but be more useful as the clustered index key is added to all NCIs as the row locator and this will allow them to cover more queries.

The main issue with using PersonId,CourtOrderID as the CI is that logical fragmentation will likely ensue (and this particularly affects the range queries you are trying to help) so you would need to monitor fill factor, and fragmentation levels and perform index maintenance more often.

Some db with some nasty selects, joins in a stored procedure - only diffrence is the index

INDEXES - clustered vs nonclustered

  891 rows
10 sec
NONCLUSTERED


OR


891 rows
14 sec
CLUSTERED

It's explained in the following link: https://msdn.microsoft.com/en-us/ms190457.aspx

Clustered

  • Clustered indexes sort and store the data rows in the table or view based on their key values. These are the columns included in the index definition. There can be only one clustered index per table, because the data rows themselves can be sorted in only one order.

  • The only time the data rows in a table are stored in sorted order is when the table contains a clustered index. When a table has a clustered index, the table is called a clustered table. If a table has no clustered index, its data rows are stored in an unordered structure called a heap.

Nonclustered

  • Nonclustered indexes have a structure separate from the data rows. A nonclustered index contains the nonclustered index key values and each key value entry has a pointer to the data row that contains the key value.

  • The pointer from an index row in a nonclustered index to a data row is called a row locator. The structure of the row locator depends on whether the data pages are stored in a heap or a clustered table. For a heap, a row locator is a pointer to the row. For a clustered table, the row locator is the clustered index key.

  • You can add nonkey columns to the leaf level of the nonclustered index to by-pass existing index key limits, 900 bytes and 16 key columns, and execute fully covered, indexed, queries.