Unicode encoding of string literals in C++11

Following up on a related question, I would like to understand the new character and string literal types in C++11. It looks like we now have four character types and five kinds of string literals. The character types:

char     a =  '\x30';         // character, no semantics
wchar_t  b = L'\xFFEF';       // wide character, no semantics
char16_t c = u'\u00F6';       // 16-bit, assumed UTF16?
char32_t d = U'\U0010FFFF';   // 32-bit, assumed UCS-4

And the string literals:

char     A[] =  "Hello\x0A";         // byte string, "narrow encoding"
wchar_t  B[] = L"Hell\xF6\x0A";      // wide string, impl-def'd encoding
char16_t C[] = u"Hell\u00F6";        // (1)
char32_t D[] = U"Hell\U000000F6\U0010FFFF"; // (2)
char     E[] = u8"\u00F6\U0010FFFF"; // (3)

The questions are:

- Are the \x/\u/\U character references freely combinable with all string types?
- Are all the string types fixed-width, i.e. do the arrays contain precisely as many elements as appear in the literal, or do \x/\u/\U references get expanded into a variable number of bytes?
- Do u"" and u8"" strings have encoding semantics? For example, can I say char16_t x[] = u"\U0010FFFF" and have the non-BMP code point encoded into a two-unit UTF-16 sequence? Does the same hold for u8?
- In (1), can I write lone surrogates with \u?
- Finally, are any of the string functions encoding-aware, i.e. are they character-aware and able to detect invalid byte sequences?

This is a somewhat open-ended question, but I would like to get as complete a picture as possible of the new UTF encoding and type facilities in C++11.


Are the \x/\u/\U character references freely combinable with all string types?

No. \x can be used in anything, but \u and \U can only be used in strings that are specifically UTF-encoded. However, for any UTF-encoded string, \u and \U can be used as you see fit.
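For illustration, a small sketch of those combinations (the values are chosen arbitrarily):

char     a[] =  "\xF6";        // raw byte value, no encoding semantics
char16_t b[] = u"\x00F6";      // raw 16-bit code unit value
char16_t c[] = u"\u00F6";      // code point U+00F6, converted to UTF-16
char32_t d[] = U"\U0010FFFF";  // code point U+10FFFF, converted to UTF-32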

Are all the string types fixed-width, i.e. the arrays contain precisely as many elements as appear in the literal, or do \x/\u/\U references get expanded into a variable number of bytes?

Not in the way you mean. \x, \u, and \U are converted based on the string's encoding. The number of resulting code units (to use the Unicode term; a char16_t holds one UTF-16 code unit) depends on the encoding of the containing string. The literal u8"\u1024" creates a string containing 3 chars plus a null terminator (U+1024 requires three UTF-8 code units), while the literal u"\u1024" creates a string containing 1 char16_t plus a null terminator.

The number of code units used is based on the Unicode encoding.
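A minimal C++11 sketch of those counts; the sizeof arithmetic simply counts array elements, excluding the terminator:

#include <cstdio>

int main() {
    const char     a[] = u8"\u1024"; // U+1024 -> 3 UTF-8 code units
    const char16_t b[] = u"\u1024";  // U+1024 -> 1 UTF-16 code unit

    std::printf("%zu\n", sizeof(a) - 1);                    // 3
    std::printf("%zu\n", sizeof(b) / sizeof(char16_t) - 1); // 1
    return 0;
}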

Do u"" and u8"" strings have encoding semantics, e.g. can I say char16_t x[] = u"\U0010FFFF", and the non-BMP codepoint gets encoded into a two-unit UTF16 sequence?

u"" creates a UTF-16 encoded string. u8"" creates a UTF-8 encoded string. They will be encoded per the Unicode specification.

In (1), can I write lone surrogates with \u?

Absolutely not. The specification expressly forbids using the UTF-16 surrogate code points (0xD800-0xDFFF) with \u or \U.
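For example (the exact diagnostic wording is compiler-specific):

char16_t ok[]  = u"\U0010FFFF";  // fine: the compiler emits the surrogate pair itself
// char16_t bad[] = u"\uD800";   // ill-formed: \u may not name a surrogate code point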

Finally, are any of the string functions encoding aware (i.e. they are character-aware and can detect invalid byte sequences)?

Absolutely not. Well, allow me to rephrase that.

std::basic_string doesn't deal with Unicode encodings. It can certainly store UTF-encoded strings, but it only sees them as sequences of char, char16_t, or char32_t; it cannot treat them as sequences of Unicode code points encoded with a particular mechanism. basic_string::length() returns the number of code units, not code points. And obviously, the C standard library string functions are totally useless here.
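A small sketch of that point, counting code units rather than characters:

#include <string>
#include <cassert>

int main() {
    std::u16string s = u"Hell\u00F6 \U0010FFFF";
    // 4 ("Hell") + 1 (U+00F6) + 1 (space) + 2 (surrogate pair) = 8 code units
    assert(s.length() == 8);

    std::string narrow = u8"\u00F6";
    assert(narrow.length() == 2);   // two UTF-8 bytes, even though it is one character
    return 0;
}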

It should be noted, however, that the "length" of a Unicode string does not mean the number of code points either. Some code points are combining "characters" (an unfortunate name) that attach to the preceding code point, so multiple code points can map to a single visual character.
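For example:

#include <string>
#include <cassert>

int main() {
    std::u16string s = u"e\u0301";  // 'e' + U+0301 COMBINING ACUTE ACCENT
    assert(s.length() == 2);        // two code units and two code points, rendered as one glyph
    return 0;
}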

Iostreams can in fact read/write Unicode-encoded values. To do so, you will have to use a locale to specify the encoding and properly imbue it into the various places. This is easier said than done, and I don't have any code on me to show you how.
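As a rough illustration only, here is a sketch of the related C++11 <codecvt> machinery (std::wstring_convert, later deprecated in C++17) rather than the stream-imbuing approach itself; it converts between UTF-8 bytes and a sequence of char32_t code points:

#include <codecvt>
#include <locale>
#include <string>
#include <cassert>

int main() {
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> conv;

    std::u32string cps = conv.from_bytes(u8"Hell\u00F6 \U0010FFFF");
    assert(cps.length() == 7);      // code points, not bytes

    std::string bytes = conv.to_bytes(cps);
    assert(bytes.length() == 11);   // 4 + 2 + 1 + 4 UTF-8 bytes
    return 0;
}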