在 MySQL 中的 UUID 性能? ?

我们正在考虑使用 UUID 值作为 MySQL 数据库的主键。插入的数据来自数十台、数百台甚至数千台远程计算机,插入速度为每秒100-40,000次,我们永远不会做任何更新。

在我们开始筛选数据之前,数据库本身通常会达到大约5000万条记录,所以不是一个大型数据库,但也不小。我们还计划在 InnoDB 上运行,不过如果有更好的引擎,我们愿意改变这一点。

我们已经准备好使用 Java 的 Type4UUID,但是在测试中发现了一些奇怪的行为。首先,我们以 varchar (36)的形式存储,我现在意识到我们最好使用二进制(16)——尽管我不确定这样会好到什么程度。

更大的问题是: 当我们拥有5000万条记录时,这些随机数据会把索引搞得多糟糕?例如,如果我们使用类型1的 UUID,其中最左边的位是时间戳,我们是否会更好?或者也许我们应该完全抛弃 UUID,考虑 auto _  增量主 key?

我正在寻找有关不同类型 UUID 在 MySQL 中作为索引/主键存储时的性能的一般性想法/提示。谢谢!

79303 次浏览

A UUID is a Universally Unique ID. It's the universally part that you should be considering here.

Do you really need the IDs to be universally unique? If so, then UUIDs may be your only choice.

I would strongly suggest that if you do use UUIDs, you store them as a number and not as a string. If you have 50M+ records, then the saving in storage space will improve your performance (although I couldn't say by how much).

If your IDs do not need to be universally unique, then I don't think that you can do much better then just using auto_increment, which guarantees that IDs will be unique within a table (since the value will increment each time)

Something to take into consideration is that Autoincrements are generated one at a time and cannot be solved using a parallel solution. The fight for using UUIDs eventually comes down to what you want to achieve versus what you potentially sacrifice.

On performance, briefly:

A UUID like the one above is 36 characters long, including dashes. If you store this VARCHAR(36), you're going to decrease compare performance dramatically. This is your primary key, you don't want it to be slow.

At its bit level, a UUID is 128 bits, which means it will fit into 16 bytes, note this is not very human readable, but it will keep storage low, and is only 4 times larger than a 32-bit int, or 2 times larger than a 64-bit int. I will use a VARBINARY(16) Theoretically, this can work without a lot of overhead.

I recommend reading the following two posts:

I reckon between the two, they answer your question.

What about some hand crafted UID? Give each of the thousands of servers an ID and make primary key a combo key of autoincrement,MachineID ???

Since the primary key is generated decentralised, you don't have the option of using an auto_increment anyway.

If you don't have to hide the identity of the remote machines, use Type 1 UUIDs instead of UUIDs. They are easier to generate and can at least not hurt the performance of the database.

The same goes for varchar (char, really) vs. binary: it can only help matters. Is it really important, how much performance is improved?

I would assign each server a numeric ID in a transactional manner. Then, each record inserted will just autoincrement its own counter. Combination of ServerID and RecordID will be unique. ServerID field can be indexed and future select performance based on ServerID (if needed) may be much better.

At my job, we use UUID as PKs. What I can tell you from experience is DO NOT USE THEM as PKs (SQL Server by the way).

It's one of those things that when you have less than 1000 records it;s ok, but when you have millions, it's the worst thing you can do. Why? Because UUID are not sequential, so everytime a new record is inserted MSSQL needs to go look at the correct page to insert the record in, and then insert the record. The really ugly consequence with this is that the pages end up all in different sizes and they end up fragmented, so now we have to do de-fragmentation periodic.

When you use an autoincrement, MSSQL will always go to the last page, and you end up with equally sized pages (in theory) so the performance to select those records is much better (also because the INSERTs will not block the table/page for so long).

However, the big advantage of using UUID as PKs is that if we have clusters of DBs, there will not be conflicts when merging.

I would recommend the following model: 1. PK INT Identity 2. Additional column automatically generated as UUID.

This way, the merge process is possible (UUID would be your REAL key, while the PK would just be something temporary that gives you good performance).

NOTE: That the best solution is to use NEWSEQUENTIALID (like I was saying in the comments), but for legacy app with not much time to refactor (and even worse, not controlling all inserts), it is not possible to do. But indeed as of 2017, I'd say the best solution here is NEWSEQUENTIALID or doing Guid.Comb with NHibernate.

Hope this helps

I tend to avoid UUID simply because it is a pain to store and a pain to use as a primary key but there are advantages. The main one is they are UNIQUE.

I usually solve the problem and avoid UUID by using dual key fields.

COLLECTOR = UNIQUE ASSIGNED TO A MACHINE

ID = RECORD COLLECTED BY THE COLLECTOR (auto_inc field)

This offers me two things. Speed of auto-inc fields and uniqueness of data being stored in a central location after it is collected and grouped together. I also know while browsing the data where it was collected which is often quite important for my needs.

I have seen many cases while dealing with other data sets for clients where they have decided to use UUID but then still have a field for where the data was collected which really is a waste of effort. Simply using two (or more if needed) fields as your key really helps.

I have just seen too many performance hits using UUID. They feel like a cheat...

Instead of centrally generating unique keys for each insertion, how about allocating blocks of keys to individual servers? When they run out of keys, they can request a new block. Then you solve the problem of overhead by connecting for each insert.

Keyserver maintains next available id

  • Server 1 requests id block.
  • Keyserver returns (1,1000)
    Server 1 can insert a 1000 records until it needs to request a new block
  • Server 2 requests index block.
  • Keyserver returns (1001,2000)
  • etc...

You could come up with a more sophisticated version where a server could request the number of needed keys, or return unused blocks to the keyserver, which would then of course need to maintain a map of used/unused blocks.

The short answer is that many databases have performance problems (in particular with high INSERT volumes) due to a conflict between their indexing method and UUIDs' deliberate entropy in the high-order bits. There are several common hacks:

  • choose a different index type (e.g. nonclustered on MSSQL) that doesn't mind it
  • munge the data to move the entropy to lower-order bits (e.g. reordering bytes of V1 UUIDs on MySQL)
  • make the UUID a secondary key with an auto-increment int primary key

... but these are all hacks--and probably fragile ones at that.

The best answer, but unfortunately the slowest one, is to demand your vendor improve their product so it can deal with UUIDs as primary keys just like any other type. They shouldn't be forcing you to roll your own half-baked hack to make up for their failure to solve what has become a common use case and will only continue to grow.

I realize this question is rather old but I did hit upon it in my research. Since than a number of things happened (SSD are ubiquitous InnoDB got updates etc).

In my research I found this rather interesting post on performance:

claiming that due to the randomness of a GUID/UUID index trees can get rather unbalanced. in the MariaDB KB I found another post suggested a solution. But since than the new UUID_TO_BIN takes care of this. This function is only available in MySQL (tested version 8.0.18) and not in MariaDB (version 10.4.10)

TL;DR: Store UUID as converted/optimized BINARY(16) values.

The main case where UUIDs cause miserable performance is ...

When the INDEX is too big to be cached in the buffer_pool, each lookup tends to be a disk hit. For HDD, this can slow down the access by 10x or worse. (No, that is not a typo for "10%".) With SSDs, the slowdown is less, but still significant.

This applies to any "hash" (MD5, SHA256, etc), with one exception: A type-1 UUID with its bits rearranged.

Background and manual optimization: UUIDs

MySQL 8.0: see UUID_TO_BIN() and BIN_TO_UUID()

MariaDB 10.7 carries this further with its UUID datatype.