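Trouble with UTF-8 characters; what I see is not what I stored

I tried to use UTF-8 and ran into trouble.

I have tried many things; here are the results I get:

  • ???? instead of Asian characters. Even for European text, I got Se?or for Señor.
  • Strange gibberish (Mojibake?) such as SeÃ±or for Señor, or æ–°æµªæ–°é—» for 新浪新闻.
  • Black diamonds, such as Se�or.
  • Finally, I got into a situation where the data was lost, or at least truncated: Se for Señor.
  • Even when I got the text to look right, it did not sort correctly.

What am I doing wrong? How can I fix the code? Can I recover the data, and if so, how?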


This problem plagues the participants of this site, and many others.

You have listed the five main cases of CHARACTER SET troubles.

Best Practice

Going forward, it is best to use CHARACTER SET utf8mb4 and COLLATION utf8mb4_unicode_520_ci. (There is a newer version of the Unicode collation in the pipeline.)

utf8mb4 is a superset of utf8 in that it handles 4-byte UTF-8 codes, which are needed by Emoji and some Chinese characters.

Outside of MySQL, "UTF-8" refers to all size encodings, hence effectively the same as MySQL's utf8mb4, not utf8.

I will try to use those spellings and capitalizations to distinguish inside versus outside MySQL in the following.
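To make the utf8 versus utf8mb4 distinction concrete, here is a minimal Python sketch (not part of the original answer) that reports how many bytes each character needs in UTF-8. The helper name needs_utf8mb4 is hypothetical; it only relies on the fact that code points above U+FFFF take 4 bytes.

```python
def needs_utf8mb4(text: str) -> bool:
    """Return True if any character needs 4 bytes in UTF-8 (code point above U+FFFF)."""
    return any(ord(ch) > 0xFFFF for ch in text)


if __name__ == "__main__":
    for sample in ["Señor", "新浪新闻", "👽"]:
        # Byte width of each character when encoded as UTF-8.
        widths = [len(ch.encode("utf-8")) for ch in sample]
        print(sample, widths, needs_utf8mb4(sample))
```

Text for which this returns True cannot be stored in a MySQL utf8 column; it needs utf8mb4.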

Overview of what you should do

  • Have your editor, etc. set to UTF-8.
  • HTML forms should start like <form accept-charset="UTF-8">.
  • Have your bytes encoded as UTF-8.
  • Establish UTF-8 as the encoding being used in the client.
  • Have the column/table declared CHARACTER SET utf8mb4 (Check with SHOW CREATE TABLE.)
  • <meta charset=UTF-8> at the beginning of HTML
  • Stored Routines acquire the current charset/collation. They may need rebuilding.

UTF-8 all the way through

More details for computer languages (and its following sections)

Test the data

Viewing the data with a tool or with SELECT cannot be trusted. Too many such clients, especially browsers, try to compensate for incorrect encodings, and show you correct text even if the database is mangled. So, pick a table and column that has some non-English text and do

SELECT col, HEX(col) FROM tbl WHERE ...

The HEX for correctly stored UTF-8 will be

  • For a blank space (in any language): 20
  • For English: 4x, 5x, 6x, or 7x
  • For most of Western Europe, accented letters should be Cxyy
  • Cyrillic, Hebrew, and Farsi/Arabic: Dxyy
  • Most of Asia: Exyyzz
  • Emoji and some of Chinese: F0yyzzww
  • More details
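If you want to know what hex to expect before looking at the table, a small Python sketch like the following (my own illustration, with a hypothetical helper utf8_hex) prints the UTF-8 bytes of a few sample characters. It should match MySQL's HEX(col) only when the column really holds correctly encoded utf8/utf8mb4.

```python
def utf8_hex(text: str) -> str:
    """Return the UTF-8 bytes of text as uppercase hex, similar in shape to MySQL's HEX()."""
    return text.encode("utf-8").hex().upper()


if __name__ == "__main__":
    samples = [
        " ",   # blank space      -> 20
        "S",   # English letter   -> 53
        "ñ",   # Western Europe   -> C3B1
        "Ж",   # Cyrillic         -> D096
        "新",  # Chinese          -> E696B0
        "👽",  # Emoji            -> F09F91BD
    ]
    for ch in samples:
        print(repr(ch), utf8_hex(ch))
```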

Specific causes and fixes of the problems seen

Truncated text (Se for Señor):

  • The bytes to be stored are not encoded as utf8mb4. Fix this.
  • Also, check that the connection during reading is UTF-8.
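The truncation can be imitated outside MySQL. The sketch below is only an approximation of what the server does (the function name simulate_truncation is made up): it keeps the longest prefix of the incoming bytes that is valid UTF-8 and drops the rest, which is roughly what happens when latin1 bytes arrive on a connection that claims utf8mb4.

```python
def simulate_truncation(raw: bytes) -> str:
    """Keep the longest prefix of raw that is valid UTF-8 and drop everything after it."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError as err:
        # err.start is the index of the first byte that is not valid UTF-8.
        return raw[:err.start].decode("utf-8")


if __name__ == "__main__":
    latin1_bytes = "Señor".encode("latin-1")   # b'Se\xf1or', not valid UTF-8
    print(simulate_truncation(latin1_bytes))   # Se
```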

Black Diamonds with question marks (Se�or for Señor); one of these cases exists:

Case 1 (original bytes were not UTF-8):

  • The bytes to be stored are not encoded as utf8. Fix this.
  • The connection (or SET NAMES) for the INSERT and the SELECT was not utf8/utf8mb4. Fix this.
  • Also, check that the column in the database is CHARACTER SET utf8 (or utf8mb4).

Case 2 (original bytes were UTF-8):

  • The connection (or SET NAMES) for the SELECT was not utf8/utf8mb4. Fix this.
  • Also, check that the column in the database is CHARACTER SET utf8 (or utf8mb4).

Black diamonds occur only when the browser is set to <meta charset=UTF-8>.
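A browser shows the black diamond (U+FFFD, the replacement character) when it is told the page is UTF-8 but receives bytes that are not. A minimal Python imitation, assuming the bytes came out of the database as latin1 (render_as_utf8 is a hypothetical name):

```python
def render_as_utf8(raw: bytes) -> str:
    """Decode raw as UTF-8 the way a browser would, replacing invalid bytes with U+FFFD."""
    return raw.decode("utf-8", errors="replace")


if __name__ == "__main__":
    latin1_bytes = "Señor".encode("latin-1")   # the single byte F1 stands for ñ
    print(render_as_utf8(latin1_bytes))        # Se\ufffdor, displayed as a black diamond
```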

Question Marks (regular ones, not black diamonds) (Se?or for Señor):

  • The bytes to be stored are not encoded as utf8/utf8mb4. Fix this.
  • The column in the database is not CHARACTER SET utf8 (or utf8mb4). Fix this. (Use SHOW CREATE TABLE.)
  • Also, check that the connection during reading is UTF-8.
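Regular question marks appear when a character has to be converted into a character set that cannot represent it. The following Python sketch imitates that conversion; store_in_charset is a hypothetical helper and the Python codec names stand in for the column's CHARACTER SET. Once the ? is stored, the original character cannot be recovered.

```python
def store_in_charset(text: str, charset: str) -> bytes:
    """Convert text to the target charset, substituting '?' for unrepresentable characters."""
    return text.encode(charset, errors="replace")


if __name__ == "__main__":
    print(store_in_charset("Señor", "ascii"))       # b'Se?or'
    print(store_in_charset("新浪新闻", "latin-1"))   # b'????'
```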

Mojibake (SeÃ±or for Señor): (This discussion also applies to Double Encoding, which is not necessarily visible.)

  • The bytes to be stored need to be UTF-8-encoded. Fix this.
  • The connection when INSERTing and SELECTing text needs to specify utf8 or utf8mb4. Fix this.
  • The column needs to be declared CHARACTER SET utf8 (or utf8mb4). Fix this.
  • HTML should start with <meta charset=UTF-8>.
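Mojibake is what you see when UTF-8 bytes are interpreted as latin1. A short Python sketch of the effect, using Python's cp1252 codec as a stand-in for MySQL's latin1 (the function name misread_as_latin1 is made up for this example):

```python
def misread_as_latin1(text: str) -> str:
    """Encode text as UTF-8, then wrongly decode those bytes as cp1252 (MySQL's latin1)."""
    return text.encode("utf-8").decode("cp1252")


if __name__ == "__main__":
    # ñ is the two bytes C3 B1 in UTF-8; read as cp1252 they become Ã and ±.
    print(misread_as_latin1("Señor"))   # SeÃ±or
```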

If the data looks correct, but won't sort correctly, then either you have picked the wrong collation, or there is no collation that suits your need, or you have Double Encoding.

Double Encoding can be confirmed by doing the SELECT .. HEX .. described above.

é should come back C3A9, but instead shows C383C2A9
The Emoji 👽 should come back F09F91BD, but comes back C3B0C5B8E28098C2BD

That is, the hex is about twice as long as it should be. This is caused by converting from latin1 (or whatever) to utf8, then treating those bytes as if they were latin1 and repeating the conversion. The sorting (and comparing) does not work correctly because it is, for example, sorting as if the string were SeÃ±or.
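The two hex values above can be reproduced in Python, which may help when checking your own data. This is a sketch under the assumption that the wrong intermediate charset was MySQL's latin1 (Python's cp1252 codec); double_encode and repair_double_encoding are hypothetical helper names, and the repair only works when every byte survived the round trip.

```python
def double_encode(text: str) -> bytes:
    """Encode as UTF-8, misread the bytes as cp1252, then encode as UTF-8 again."""
    return text.encode("utf-8").decode("cp1252").encode("utf-8")


def repair_double_encoding(stored: bytes) -> str:
    """Undo one round of double encoding: UTF-8 -> cp1252 bytes -> UTF-8 text."""
    return stored.decode("utf-8").encode("cp1252").decode("utf-8")


if __name__ == "__main__":
    print(double_encode("é").hex().upper())    # C383C2A9
    print(double_encode("👽").hex().upper())   # C3B0C5B8E28098C2BD
    print(repair_double_encoding(bytes.fromhex("C383C2A9")))   # é
```

Try any such repair on a copy of the data first, and compare the HEX before and after.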

Fixing the Data, where possible

For Truncation and Question Marks, the data is lost.

For Mojibake / Double Encoding, ...

For Black Diamonds, ...

The Fixes are listed here. (5 different fixes for 5 different situations; pick carefully): http://mysql.rjweb.org/doc.php/charcoll#fixes_for_various_cases

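  1. Set the encoding of your code editor/IDE to UTF-8.

  2. Add <meta charset="UTF-8"> to the header of the web page where you collect data from the form.

  3. Check that your MySQL table definition looks like this:

    CREATE TABLE your_table (
      ...
    ) ENGINE=InnoDB DEFAULT CHARSET=utf8

  4. If you are using PDO, make sure of:

    $options = array(PDO::MYSQL_ATTR_INIT_COMMAND=>'SET NAMES utf8');
    $dbL = new PDO($pdo, $user, $pass, $options);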

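If you already have a large database with the above problem, you can try SIDU to export it with the correct charset and then import it back as UTF-8.

Depending on how the server is set up, you have to change the encoding accordingly. UTF-8, as you said, should work best. However, if you are getting weird characters, it might help to change the web page encoding to ANSI.

This helped me when I was setting up PHP MySQLi. This might help you understand more: ANSI to UTF-8 in Notepad++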

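I was facing a similar issue with two of my projects after a server migration. After searching and trying a lot of solutions, I came across this one: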

mysqli_set_charset($con,"utf8mb4");

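After adding this line to my configuration file, everything works fine!

I found this solution for MySQLi, the PHP mysqli_set_charset() function, when I was looking to fix an insert coming from an HTML form.

I was also searching for a solution to the same issue. It took me nearly a month to find an appropriate one.

First of all, you will have to update your database so that every CHARACTER SET and COLLATION is utf8mb4, or at least one that supports UTF-8 data.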

For Java:

When making the JDBC connection, add useUnicode=yes&characterEncoding=UTF-8 as parameters to the connection URL and it will work.

For Python:

Before querying the database, try enforcing this on the cursor:

    cursor.execute('SET NAMES utf8mb4')
    cursor.execute("SET CHARACTER SET utf8mb4")
    cursor.execute("SET character_set_connection=utf8mb4")

If it does not work, happy hunting for the right solution.