Is a slash ("/") equivalent to an encoded slash ("%2F") in the path portion of an HTTP URL?

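I have a site that treats "/" and "%2F" differently in the path portion (not the query string) of a URL. Is this a bad thing to do, according to either the RFCs or the real world?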

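I ask because I keep running into little surprises with the web framework I'm using (Ruby on Rails) as well as the layers below it (Passenger, Apache; e.g., I had to enable "ALLOW_ENCODED_SLASHES" for Apache). I am now leaning toward getting rid of the encoded slashes completely, but I wonder whether I should be filing bug reports where I see weird behavior involving encoded slashes.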

As to why I have the encoded slashes in the first place, basically I have routes such as this:

:controller/:foo/:bar

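where :foo is something like a path that can contain slashes. I thought the most straightforward thing to do would be to just URL-escape foo so that the slashes are ignored by the routing mechanism. Now I am having doubts, and it's pretty clear that the frameworks don't really support this, but according to the RFCs, is it wrong to do it this way?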

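Here is some information I have gathered:

RFC 1738 (URLs):

Usually a URL has the same interpretation when an octet is represented by a character and when it is encoded. However, this is not true for reserved characters: encoding a character reserved for a particular scheme may change the semantics of a URL.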

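RFC 2396 (URIs):

These characters are called "reserved" because their usage within the URI component is limited to their reserved purpose. If the data for a URI component would conflict with the reserved purpose, then the conflicting data must be escaped before forming the URI.

(Does "escaped" here mean something other than encoding the reserved character?)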

RFC 2616 (HTTP/1.1):

Characters other than those in the "reserved" and "unsafe" sets (see RFC 2396 [42]) are equivalent to their ""%" HEX HEX" encoding.

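There is also this bug report for Rails, where they seem to expect the encoded slash to behave differently:

Right, I'd expect different results because they're pointing at different resources.

It's looking for the literal file 'foo/bar' in the root directory. The non-escaped version is looking for the file bar within directory foo.

It's clear from the RFCs that raw and encoded forms are equivalent for unreserved characters, but what is the story for reserved characters?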


From the data you gathered, I would tend to say that an encoded "/" in a URI is meant to be seen as "/" again only at the application/CGI level.

That is to say, if you're using Apache with mod_rewrite, for instance, it will not match a pattern expecting slashes against a URI with encoded slashes in it. However, once the appropriate module/CGI/... is called to handle the request, it's up to that handler to do the decoding and, for instance, retrieve a parameter that includes slashes as the first component of the URI.
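To make the "different levels" idea concrete, here is a minimal Python sketch, assuming a hypothetical URL shaped like the :controller/:foo/:bar route from the question. The routing layer splits the raw path on literal slashes first, and only afterwards is each segment percent-decoded. The function name and URL are made up for illustration; real frameworks may well order these steps differently.

```python
from urllib.parse import urlsplit, unquote

def split_then_decode(url):
    """Split the raw path on literal '/' first, then percent-decode each segment."""
    raw_path = urlsplit(url).path  # urlsplit leaves percent-escapes untouched
    raw_segments = raw_path.strip("/").split("/")
    return [unquote(segment) for segment in raw_segments]

# Hypothetical URL: controller "files", foo "docs/readme.txt", bar "show".
segments = split_then_decode("http://example.com/files/docs%2Freadme.txt/show")
print(segments)  # ['files', 'docs/readme.txt', 'show']

# Decoding the whole path before splitting loses the distinction.
decoded_first = unquote("/files/docs%2Freadme.txt/show").strip("/").split("/")
print(decoded_first)  # ['files', 'docs', 'readme.txt', 'show']
```

If any layer in front of the application decodes the path before routing, you end up in the second case, which would explain the kind of surprises described in the question.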

If your application is then using this data to retrieve a file (whose filename contains a slash), that's probably a bad thing.

To sum up, I find it perfectly normal to see "/" and "%2F" behave differently, since they are interpreted at different levels.

I also have a site that has numerous URLs with URL-encoded characters. I am finding that many web APIs (including Google Webmaster Tools and several Drupal modules) trip over URL-encoded characters. Many APIs automatically decode URLs at some point in their processing and then use the result as a URL or as HTML. When I find one of these problems, I usually double-encode the value (which turns %2f into %252f) for that API. However, this will break other APIs that are not expecting double encoding, so it is not a universal solution.
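A small Python sketch of the double-encoding workaround, in case it helps. The slug is a made-up example, and note that Python's quote() emits uppercase hex (%2F, %252F) where the text above uses lowercase.

```python
from urllib.parse import quote, unquote

title = "my-amazing-blog/story"  # hypothetical slug containing a slash

encoded_once = quote(title, safe="")           # slash becomes %2F
encoded_twice = quote(encoded_once, safe="")   # the "%" itself becomes %25

print(encoded_once)   # my-amazing-blog%2Fstory
print(encoded_twice)  # my-amazing-blog%252Fstory

# An API that decodes exactly once still ends up with an encoded slash.
after_one_decode = unquote(encoded_twice)
print(after_one_decode)  # my-amazing-blog%2Fstory

# Only a second decode brings back the literal slash.
after_two_decodes = unquote(after_one_decode)
print(after_two_decodes)  # my-amazing-blog/story
```

This also shows why it is not universal: a consumer that does not decode at all just sees the literal text "%252F".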

Personally I am getting rid of as many special characters in my URLs as possible.

I also put ID numbers in my URLs, so that lookups do not depend on URL decoding:

example.com/blog/my-amazing-blog%2fstory/yesterday

becomes:

example.com/blog/12354/my-amazing-blog%2fstory/yesterday

In this case, my code only uses 12354 to look up the article, and the rest of the URL is ignored by my system (but is still used for SEO). The number should appear BEFORE the unused URL components. That way, the URL will still work even if the %2f gets decoded incorrectly.
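Roughly, the lookup works like the following Python sketch. The function name and the /blog/ prefix check are my own illustration here, not my production code.

```python
import re
from urllib.parse import urlsplit

def article_id(url):
    """Return the numeric ID that follows /blog/, ignoring the rest of the path."""
    segments = urlsplit(url).path.strip("/").split("/")
    if len(segments) < 2 or segments[0] != "blog":
        return None
    if not re.fullmatch(r"[0-9]+", segments[1]):
        return None
    return int(segments[1])

# The ID comes out the same whether or not something decoded the %2f on the way in.
print(article_id("http://example.com/blog/12354/my-amazing-blog%2fstory/yesterday"))  # 12354
print(article_id("http://example.com/blog/12354/my-amazing-blog/story/yesterday"))    # 12354

# Without the ID there is nothing stable to key on.
print(article_id("http://example.com/blog/my-amazing-blog%2fstory/yesterday"))  # None
```

Because the ID sits before the slug, a wrongly decoded slash only changes the number of trailing segments, which the lookup never reads.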

Also, be sure to use canonical tags so that URL mistakes don't turn into duplicate content.

If you use Tomcat, add '-Dorg.apache.tomcat.util.buf.UDecoder.ALLOW_ENCODED_SLASH=true' to the JVM options so that '%2F' is permitted in the path.

https://tomcat.apache.org/tomcat-7.0-doc/config/systemprops.html#Security

The story of %2F vs / was that, according to the initial W3C recommendations, slashes «must imply a hierarchical structure»:

The slash ("/", ASCII 2F hex) character is reserved for the delimiting of substrings whose relationship is hierarchical. This enables partial forms of the URI.

Example 2

The URIs

http://www.w3.org/albert/bertram/marie-claude

and

http://www.w3.org/albert/bertram%2Fmarie-claude

are NOT identical, as in the second case the encoded slash does not have hierarchical significance.
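One way to see this in code is the following Python sketch. It decodes a percent-escape only when it stands for an unreserved character and leaves everything else encoded. The unreserved set (letters, digits, "-", ".", "_", "~") is taken from RFC 3986, which is not quoted in this thread, so treat that choice and the function name as assumptions.

```python
import re
import string

# Unreserved characters per RFC 3986 (an assumption; the thread quotes older RFCs).
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")

def normalize_escapes(path):
    """Decode %XX only when it encodes an unreserved character; keep the rest."""
    def replace(match):
        char = chr(int(match.group(1), 16))
        if char in UNRESERVED:
            return char
        return "%" + match.group(1).upper()  # keep the escape, uppercase the hex
    return re.sub(r"%([0-9A-Fa-f]{2})", replace, path)

plain = normalize_escapes("/albert/bertram/marie-claude")
encoded = normalize_escapes("/albert/bertram%2Fmarie-claude")
print(plain == encoded)  # False: the encoded slash stays encoded

# An escaped unreserved character, by contrast, is equivalent to the raw one.
print(normalize_escapes("/%7Ealbert") == "/~albert")  # True
```

Under this reading, "%2F" never collapses into "/", so the two example URIs stay distinct after normalization.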

What should you do if :foo in its natural form contains slashes? You wouldn't want those slashes to be read as hierarchy delimiters. Isn't that the distinction the recommendation is attempting to preserve? It specifically notes,

The similarity to unix and other disk operating system filename conventions should be taken as purely coincidental, and should not be taken to indicate that URIs should be interpreted as file names.

If one were building an online interface to a backup program and wished to express the file path as part of the URL path, it would make sense to encode the slashes in the file path, since they are not really part of the hierarchy of the resource or, more importantly, of the route. /backups/2016-07-28content//home/dan/ loses the root of the filesystem in the double slash. Escaping the slashes is the appropriate way to make the distinction, as I read it.

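encodeURI()/decodeURI() and encodeURIComponent()/decodeURIComponent() are utility functions for handling this. Note that encodeURI() leaves "/" untouched, while encodeURIComponent() encodes it as "%2F". Read more here: https://stackabuse.com/javascripts-encodeuri-function/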
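For anyone not working in JavaScript, the same whole-path versus single-segment distinction exists in Python's urllib.parse. This is only a rough analogue as far as the slash is concerned (the two languages differ on which other characters they leave alone), and the value of foo and the /backups/.../show route are hypothetical.

```python
from urllib.parse import quote

foo = "home/dan/notes.txt"  # hypothetical value for the :foo route parameter

# Default safe="/" leaves slashes alone, loosely like encodeURI on a whole path.
as_path = quote(foo)
print(as_path)  # home/dan/notes.txt

# safe="" escapes the slash too, loosely like encodeURIComponent on one segment.
as_segment = quote(foo, safe="")
print(as_segment)  # home%2Fdan%2Fnotes.txt

# Only the separators between segments are written as literal slashes.
url_path = "/".join(["", "backups", as_segment, "show"])
print(url_path)  # /backups/home%2Fdan%2Fnotes.txt/show
```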