对所有基于文本的字段使用通用 varchar (255)有什么缺点吗?

我有一个 contacts表,其中包含字段,如 postcodefirst namelast nametowncountryphone number等,所有这些字段都被定义为 VARCHAR(255),即使这些字段将永远不会接近有255个字符。(如果您想知道,这是因为 Ruby on Rails 迁移默认将 String 字段映射到 VARCHAR(255),而我从未费心去覆盖它)。

因为 VARCHAR 只存储字段的实际字符数(以及字段长度) ,所以使用 VARCHAR(16)比使用 VARCHAR(255)有什么明显的优势(性能或其他方面)吗?

此外,这些字段中的大多数都有索引。字段中较大的 VARCHAR 大小是否会影响索引的大小或性能?

仅供参考,我正在使用 MySQL5。

24704 次浏览

It's good practice to allocate only a little over what you need. Phone numbers would never go this large.

One reason is that unless you validate against large entries, no doubt someone will use all there is. Then you might run out of space in your row. I'm not sure about MySQL limit but 8060 is the max rowsize in MS SQL.

A more normal default would be 50 imho, and then increase where need proves it.

I'm with you. Fussy attention to detail is a pain in the neck and has limited value.

Once upon a time, disk was a precious commodity and we used to sweat bullets to optimize it. The price of storage has fallen by a factor of 1,000, making the time spent on squeezing every byte less valuable.

If you use only CHAR fields, you can get fixed-length rows. This can save some disk real-restate if you picked accurate sizes for fields. You might get more densely-packed data (fewer I/O's for table scans) and faster updates (easier to locate open spaces in a block for updates and inserts.)

However, if you over-estimate your sizes, or your actual data sizes are variable, you'll wind up wasting space with CHAR fields. The data will wind up less densely packed (leading to more I/O's for big retrievals).

Generally, the performance benefits from attempting to put a size on variable fields are minor. You can easily benchmark by using VARCHAR(255) compared with CHAR(x) to see if you can measure the difference.

However, sometimes, I need to provide a "small", "medium", "large" hint. So I use 16, 64, and 255 for the sizes.

Nowadays, i can't imagine it really matters any more.

There's a computational overhead to using variable length fields, but with the excesses of CPUs today, it's not even worth considering. The I/O system are so slow as to make any computational costs to handle varchars effectively non-existent. In fact, the price of a varchar computationally is probably a net win over the amount of diskspace saved by using variable length fields over fixed length fields. You most likely have greater row density.

Now, the complexity of varchar fields is that you can't easily locate a record via it's record number. When you have a fixed length row size (with fixed length fields), it's trivial to compute the disk block that a row id points to. With a variable length rowsize, that kind of goes out the window.

So, now you need to maintain some kind of record number index, just like any other primary key, OR you need to make a robust row identifier that encodes details (such as the block, etc.) in to the identifier. If you do that, though, the id would have to be recalculated if ever the row is moved on persistent storage. No big deal, just need to rewrite all of the index entries and make sure the you either a) never expose it to the consumer or b) never assert that the number is reliable.

But since we have varchar fields today, the only value of varchar(16) over varchar(255) is that the DB will enforce the 16 char limit on the varchar(16). If the DB model is supposed to be actually representative of the physical data model, then having fields lengths can be of value. If, however, it's simply "storage" rather than a "model AND storage", there's no need whatsoever.

Then you simply need to discern between a text field that is indexable (such varchar) vs something that is not (like a text or CLOB field). The indexable fields tend to have a limit on size to facilitate the index whereas the CLOB fields do not (within reason).

In storage, VARCHAR(255) is smart enough to store only the length you need on a given row, unlike CHAR(255) which would always store 255 characters.

But since you tagged this question with MySQL, I'll mention a MySQL-specific tip: as rows are copied from the storage engine layer to the SQL layer, VARCHAR fields are converted to CHAR to gain the advantage of working with fixed-width rows. So the strings in memory become padded out to the maximum length of your declared VARCHAR column.

When your query implicitly generates a temporary table, for instance while sorting or GROUP BY, this can use a lot of memory. If you use a lot of VARCHAR(255) fields for data that doesn't need to be that long, this can make the temporary table very large.

You may also like to know that this "padding out" behavior means that a string declared with the utf8 character set pads out to three bytes per character even for strings you store with single-byte content (e.g. ascii or latin1 characters). And likewise utf8mb4 character set causes the string to pad out to four bytes per character in memory.

So a VARCHAR(255) in utf8 storing a short string like "No opinion" takes 11 bytes on disk (ten lower-charset characters, plus one byte for length) but it takes 765 bytes in memory, and thus in temp tables or sorted results.

I have helped MySQL users who unknowingly created 1.5GB temp tables frequently and filled up their disk space. They had lots of VARCHAR(255) columns that in practice stored very short strings.

It's best to define the column based on the type of data that you intend to store. It has benefits to enforce application-related constraints, as other folks have mentioned. But it has the physical benefits to avoid the memory waste I described above.

It's hard to know what the longest postal address is, of course, which is why many people choose a long VARCHAR that is certainly longer than any address. And 255 is customary because it is the maximum length of a VARCHAR for which the length can be encoded with one byte. It was also the maximum VARCHAR length in MySQL older than 5.0.

In my experience, if you allow a datatype of 255 characters, some stupid user (or some experienced tester) will actually fill that up.

Then you have all sorts of problems, including how much space you allow for those fields in reports and on-screen displays in your application. Not to mention the possibility of exceeding the per-row limit for data in your database (if you had more than a few of these 255 character fields).

Much easier to pick a reasonable limit at the beginning, then enforce that through the application and database.

In addition to the size and performance considerations of setting the size of a varchar (and possibly more important, as storage and processing get cheaper every second), the disadvantage of using varchar(255) "just because" is reduced data integrity.

Defining maximum limits for strings is a good thing to do to prevent longer than expected strings from entering the RDBMS and causing buffer overruns or exceptions/errors later when retrieving and parsing values from the database that are longer (more bytes) than expected.

For example, if you have a field that accepts two-character strings for country abbreviations then you have no conceivable reason to expect your users (in this context, programmers) to input full country names. Since you don't want them to enter "Antigua and Barbuda" (AG) or "Heard Island and McDonald Islands" (HM), you don't allow it at the database layer. Also, it is likely some programmers have not yet RTFMed the design documentation (which surely exists) to know not to do this.

Set the field to accept two characters and let the RDBMS deal with it (either gracefully by truncating or ungracefully by rejecting their SQL with an error).

Examples of real data that has no reason to exceed a certain length:

  • Canadian Postal Codes are of the format A1A1A1 and are always 6 characters in length, even for Santa Claus (6 characters excludes the space that can be specified for legibility).
  • email addresses - up to 64 bytes before the @, up to 255 bytes after. Never more, lest you break the Internet.
  • North American Phone Numbers are never more than 10 digits (excluding the country code).
  • Computers running (recent versions of) Windows cannot have computer names longer than 63 bytes, though more than 15 is not recommended and will break your Windows NT server farm.
  • State abbreviations are 2 characters (like the country codes exampled above)
  • UPS tracking numbers are either 18-, 12-, 11-, or 9-characters long. The 18-character numbers start with "1Z" and the 11-character numbers start with "T" which makes you wonder how they deliver all those packages if they don't know the difference between letters and numbers.

And so on...

Take the time to think about your data and its limits. If you're a architect, developer, or programmer, it's your job, after all.

By using a varchar(n) instead of varchar(255) you eliminate the problem where users (end-users, programmers, other programs) enter unexpectedly long data that will come back to haunt your code later.

And I didn't say you shouldn't also implement this restriction in the business logic code used by your application.

In a mysql context it can get important when working with indexes on said varchar columns, as mysql has a max. limit of 767bytes per index-row.

This means that when adding a index across several varchar 255 columns you can get to this limit rather quickly / even faster on utf8 or utf8mb4 columns as pointed out in the answers above