UTF-8编码字符的最大字节数是多少?

一个 UTF-8编码字符的最大字节数是多少?

我将对用 UTF-8编码的 String 的字节进行加密,因此需要能够计算出 UTF-8编码的 String 的最大字节数。

请确认一个 UTF-8编码字符的最大字节数

67795 次浏览

The maximum number of bytes per character is 4 according to RFC3629 which limited the character table to U+10FFFF:

In UTF-8, characters from the U+0000..U+10FFFF range (the UTF-16 accessible range) are encoded using sequences of 1 to 4 octets.

(The original specification allowed for up to six byte character codes for code points past U+10FFFF.)

Characters with a code less than 128 will require 1 byte only, and the next 1920 character codes require 2 bytes only. Unless you are working with an esoteric language, multiplying the character count by 4 will be a significant overestimation.

Without further context, I would say that the maximum number of bytes for a character in UTF-8 is

answer: 6 bytes

The author of the accepted answer correctly pointed this out as the "original specification". That was valid through RFC-2279 1. As J. Cocoe pointed out in the comments below, this changed in 2003 with RFC-3629 2, which limits UTF-8 to encoding for 21 bits, which can be handled with the encoding scheme using four bytes.

answer if covering all unicode: 4 bytes

But, in Java <= v7, they talk about a 3-byte maximum for representing unicode with UTF-8? That's because the original unicode specification only defined the basic multi-lingual plane (BMP), i.e. it is an older version of unicode, or subset of modern unicode. So

answer if representing only original unicode, the BMP: 3 bytes

But, the OP talks about going the other way. Not from characters to UTF-8 bytes, but from UTF-8 bytes to a "String" of bytes representation. Perhaps the author of the accepted answer got that from the context of the question, but this is not necessarily obvious, so may confuse the casual reader of this question.

Going from UTF-8 to native encoding, we have to look at how the "String" is implemented. Some languages, like Python >= 3 will represent each character with integer code points, which allows for 4 bytes per character = 32 bits to cover the 21 we need for unicode, with some waste. Why not exactly 21 bits? Because things are faster when they are byte-aligned. Some languages like Python <= 2 and Java represent characters using a UTF-16 encoding, which means that they have to use surrogate pairs to represent extended unicode (not BMP). Either way that's still 4 bytes maximum.

answer if going UTF-8 -> native encoding: 4 bytes

So, final conclusion, 4 is the most common right answer, so we got it right. But, mileage could vary.

The maximum number of bytes to support US-ASCII, a standard English alphabet encoding, is 1. But limiting text to English is becoming less desirable or practical as time goes by.

Unicode was designed to represent the glyphs of all human languages, as well as many kinds of symbols, with a variety of rendering characteristics. UTF-8 is an efficient encoding for Unicode, although still biased toward English. UTF-8 is self-synchronizing: character boundaries are easily identified by scanning for well-defined bit patterns in either direction.

While the maximum number of bytes per UTF-8 character is 3 for supporting just the 2-byte address space of Plane 0, the Basic Multilingual Plane (BMP), which can be accepted as minimal support in some applications, it is 4 for supporting all 17 current planes of Unicode (as of 2019). It should be noted that many popular "emoji" characters are likely to be located in Plane 16, which requires 4 bytes.

However, this is just for basic character glyphs. There are also various modifiers, such as making accents appear over the previous character, and it is also possible to link together an arbitrary number of code points to construct one complex "grapheme". In real world programming, therefore, the use or assumption of a fixed maximum number of bytes per character will likely eventually result in a problem for your application.

These considerations imply that UTF-8 character strings should not "expanded" into arrays of fixed length prior to processing, as has sometimes been done. Instead, programming should be done directly, using string functions specifically designed for UTF-8.

Condidering just technical limitations - it's possible to have up to 7 bytes following current UTF8 encoding scheme. According to it - if first byte is not self-sufficient ASCII character, than it should have pattern: 1(n)0X(7-n), where n is <= 7.

Also theoretically it could be 8 but then first byte would have no zero bit at all. While other aspects, like continuation byte differing from leading, are still there (allowing error detection), I heared, that byte 11111111 could be invalid, but I can't be sure about that.

Limitatation for max 4 bytes is most likely for compatibility with UTF-16, which I tend to consider a legacy, because the only quality where it excels, is processing speed, but only if string byte order matches (i.e. we read 0xFEFF in the BOM).