JSON 字符编码-UTF-8是否被浏览器很好地支持,或者我应该使用数字转义序列?

我正在编写一个使用 json 来表示其资源的 Web 服务,我有点不知所措,想不出对 json 进行编码的最佳方法。阅读 json rfc (http://www.ietf.org/rfc/rfc4627.txt)可以清楚地看到,首选编码是 utf-8。但是 rfc 也描述了用于指定字符的字符串转义机制。我假设这通常用于转义 non-ascii 字符,从而产生 utf-8有效的 ascii。

假设我有一个 json 字符串,它包含非 ascii 的 unicode 字符(代码点)。我的网络服务应该只是 utf-8编码,并返回它,还是应该逃避所有这些非 ascii 字符,并返回纯 ascii?

我希望浏览器能够使用 jsonp 或 eval 执行结果。这会影响你的决定吗?我对各种浏览器对 utf-8的 javascript 支持的了解不足。

编辑: 我想澄清一下,我对于如何对结果进行编码的主要关注点实际上是浏览器对结果的处理。我读到的内容表明,浏览器在特别使用 JSONP 时可能对编码非常敏感。我还没有找到任何关于这个主题的有用信息,所以我将不得不开始做一些测试,看看会发生什么。理想情况下,我希望只转义那些所需的少量字符,并且只对结果进行 utf-8编码。

249421 次浏览

ASCII isn't in it any more. Using UTF-8 encoding means that you aren't using ASCII encoding. What you should use the escaping mechanism for is what the RFC says:

All Unicode characters may be placed within the quotation marks except for the characters that must be escaped: quotation mark, reverse solidus, and the control characters (U+0000 through U+001F)

The JSON spec requires UTF-8 support by decoders. As a result, all JSON decoders can handle UTF-8 just as well as they can handle the numeric escape sequences. This is also the case for Javascript interpreters, which means JSONP will handle the UTF-8 encoded JSON as well.

The ability for JSON encoders to use the numeric escape sequences instead just offers you more choice. One reason you may choose the numeric escape sequences would be if a transport mechanism in between your encoder and the intended decoder is not binary-safe.

Another reason you may want to use numeric escape sequences is to prevent certain characters appearing in the stream, such as <, & and ", which may be interpreted as HTML sequences if the JSON code is placed without escaping into HTML or a browser wrongly interprets it as HTML. This can be a defence against HTML injection or cross-site scripting (note: some characters MUST be escaped in JSON, including " and \).

Some frameworks, including PHP's json_encode() (by default), always do the numeric escape sequences on the encoder side for any character outside of ASCII. This is a mostly unnecessary extra step intended for maximum compatibility with limited transport mechanisms and the like. However, this should not be interpreted as an indication that any JSON decoders have a problem with UTF-8.

So, I guess you just could decide which to use like this:

  • Just use UTF-8, unless any software you are using for storage or transport between encoder and decoder isn't binary-safe.

  • Otherwise, use the numeric escape sequences.

I had a problem there. When I JSON encode a string with a character like "é", every browsers will return the same "é", except IE which will return "\u00e9".

Then with PHP json_decode(), it will fail if it find "é", so for Firefox, Opera, Safari and Chrome, I've to call utf8_encode() before json_decode().

Note : with my tests, IE and Firefox are using their native JSON object, others browsers are using json2.js.

I had a similar problem with é char... I think the comment "it's possible that the text you're feeding it isn't UTF-8" is probably close to the mark here. I have a feeling the default collation in my instance was something else until I realized and changed to utf8... problem is the data was already there, so not sure if it converted the data or not when i changed it, displays fine in mysql workbench. End result is that php will not json encode the data, just returns false. Doesn't matter what browser you use as its the server causing my issue, php will not parse the data to utf8 if this char is present. Like i say not sure if it is due to converting the schema to utf8 after data was present or just a php bug. In this case use json_encode(utf8_encode($string));

I was facing the same problem. It works for me. Please check this.

json_encode($array,JSON_UNESCAPED_UNICODE);

Reading the json rfc (http://www.ietf.org/rfc/rfc4627.txt) it is clear that the preferred encoding is utf-8.

FYI, RFC 4627 is no longer the official JSON spec. It was obsoleted in 2014 by RFC 7159, which was then obsoleted in 2017 by RFC 8259, which is the current spec.

RFC 8259 states:

8.1. Character Encoding

JSON text exchanged between systems that are not part of a closed ecosystem MUST be encoded using UTF-8 [RFC3629].

Previous specifications of JSON have not required the use of UTF-8 when transmitting JSON text. However, the vast majority of JSON-based software implementations have chosen to use the UTF-8 encoding, to the extent that it is the only encoding that achieves interoperability.

Implementations MUST NOT add a byte order mark (U+FEFF) to the beginning of a networked-transmitted JSON text. In the interests of interoperability, implementations that parse JSON texts MAY ignore the presence of a byte order mark rather than treating it as an error.