How to normalize a URL in Python

小开

Py3

from urllib.parse import urlparse, urlunparse, quote

def myquote(url):
    parts = urlparse(url)
    return urlunparse(parts._replace(path=quote(parts.path)))


>>> myquote('https://www.example.com/~user/with space/index.html?a=1&b=2')
'https://www.example.com/~user/with%20space/index.html?a=1&b=2'

Py2

import urlparse, urllib

def myquote(url):
    parts = urlparse.urlparse(url)
    return urlparse.urlunparse(parts[:2] + (urllib.quote(parts[2]),) + parts[3:])


>>> myquote('https://www.example.com/~user/with space/index.html?a=1&b=2')
'https://www.example.com/%7Euser/with%20space/index.html?a=1&b=2'

This quotes only the path component; the scheme, host, query string, and fragment are left untouched.
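One caveat worth illustrating: quote() also escapes "%", so running the function above on a path that is already percent-encoded turns "%20" into "%2520". A minimal sketch of one way around that, assuming it is acceptable to decode the path first (the name myquote_once is hypothetical, and decoding changes the meaning of paths that contain an encoded "/" such as "%2F"):

```python
from urllib.parse import urlparse, urlunparse, quote, unquote


def myquote_once(url):
    """Quote the path component without double-encoding existing escapes."""
    parts = urlparse(url)
    # Decode first so that an existing %20 is not re-escaped to %2520.
    path = quote(unquote(parts.path))
    return urlunparse(parts._replace(path=path))


raw = 'https://www.example.com/user/with space/index.html?a=1&b=2'
once = myquote_once(raw)
twice = myquote_once(once)
print(once)
print(once == twice)
```

With this variant, applying the function a second time should leave the URL unchanged.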

小开

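Use urllib.quote or urllib.quote_plus

From the urllib documentation:

quote(string[, safe])

Replace special characters in string using the "%xx" escape. Letters, digits, and the characters "_.-" are never quoted. The optional safe parameter specifies additional characters that should not be quoted; its default value is '/'.

Example: quote('/~connolly/') yields '/%7econnolly/'.

quote_plus(string[, safe])

Like quote(), but also replaces spaces by plus signs, as required for quoting HTML form values. Plus signs in the original string are escaped unless they are included in safe. It also does not have safe default to '/'.

EDIT: Using urllib.quote or urllib.quote_plus on the whole URL will mangle it, as a commenter pointed out: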

>>> quoted_url = urllib.quote('http://www.example.com/foo goo/bar.html')
>>> quoted_url
'http%3A//www.example.com/foo%20goo/bar.html'
>>> urllib2.urlopen(quoted_url)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "c:\python25\lib\urllib2.py", line 124, in urlopen
    return _opener.open(url, data)
  File "c:\python25\lib\urllib2.py", line 373, in open
    protocol = req.get_type()
  File "c:\python25\lib\urllib2.py", line 244, in get_type
    raise ValueError, "unknown url type: %s" % self.__original
ValueError: unknown url type: http%3A//www.example.com/foo%20goo/bar.html

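Another answer here provides a function that uses urlparse.urlparse and urlparse.urlunparse to parse the URL and encode only the path. That may be more useful for you, although if you are building the URL from a known protocol and host but with a suspect path, you could probably do just as well by avoiding urlparse, quoting only the suspect part of the URL, and concatenating it with the known safe parts.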
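The "quote only the suspect part" idea can be sketched in a few lines of Python 3. This is an illustration, not code from the answer: the helper name build_url and the base URL are assumptions, and it presumes the scheme and host are already known to be safe.

```python
from urllib.parse import quote

# Assumed to be trusted: a known scheme and host.
BASE = "http://www.example.com"


def build_url(base, suspect_path):
    """Join a trusted base with a path that may contain unsafe characters."""
    # quote() leaves "/" alone by default, so the path structure survives.
    return base + quote(suspect_path)


result = build_url(BASE, "/foo goo/bar.html")
print(result)  # http://www.example.com/foo%20goo/bar.html
```

Because the scheme and host never pass through quote(), the "http:" prefix is not mangled into "http%3A" as in the traceback above.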

小开

Have a look at this module: werkzeug.utils (now in werkzeug.urls)

The function you are looking for is called "url_fix" and works like this:

>>> from werkzeug.urls import url_fix
>>> url_fix(u'http://de.wikipedia.org/wiki/Elf (Begriffsklärung)')
'http://de.wikipedia.org/wiki/Elf%20%28Begriffskl%C3%A4rung%29'

It is implemented in Werkzeug as follows:

import urllib
import urlparse


def url_fix(s, charset='utf-8'):
    """Sometimes you get an URL by a user that just isn't a real
    URL because it contains unsafe characters like ' ' and so on.  This
    function can fix some of the problems in a similar way browsers
    handle data entered by the user:

    >>> url_fix(u'http://de.wikipedia.org/wiki/Elf (Begriffsklärung)')
    'http://de.wikipedia.org/wiki/Elf%20%28Begriffskl%C3%A4rung%29'

    :param charset: The target charset for the URL if the url was
                    given as unicode string.
    """
    if isinstance(s, unicode):
        s = s.encode(charset, 'ignore')
    scheme, netloc, path, qs, anchor = urlparse.urlsplit(s)
    path = urllib.quote(path, '/%')
    qs = urllib.quote_plus(qs, ':&=')
    return urlparse.urlunsplit((scheme, netloc, path, qs, anchor))
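The implementation above is Python 2 code (it relies on unicode, urllib.quote, and the urlparse module). A rough Python 3 translation using only the standard library might look like the following; the name url_fix_py3 is hypothetical, this is not Werkzeug's current code, and the charset handling is dropped because quote() encodes str input as UTF-8 by default.

```python
from urllib.parse import urlsplit, urlunsplit, quote, quote_plus


def url_fix_py3(s):
    """Percent-encode unsafe characters in the path and query of a URL."""
    scheme, netloc, path, qs, anchor = urlsplit(s)
    # Keep "/" as a separator and "%" so existing escapes are not re-encoded.
    path = quote(path, safe='/%')
    # Keep the characters that structure a query string.
    qs = quote_plus(qs, safe=':&=')
    return urlunsplit((scheme, netloc, path, qs, anchor))


fixed = url_fix_py3('http://de.wikipedia.org/wiki/Elf (Begriffsklärung)')
print(fixed)
```

For the Wikipedia URL from the docstring this should reproduce the same encoded result as the original function.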

小开

The real fix for that problem landed in Python 2.7

The right solution was:

# percent encode url, fixing lame server errors for e.g, like space
# within url paths.
fullurl = quote(fullurl, safe="%/:=&?~#+!$,;'@()*[]")

For more information, see Issue918368: "urllib doesn't correct server returned urls"
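For readers on Python 3, the same one-liner can be tried with urllib.parse.quote. The sketch below is an illustration rather than the standard library's own code: the constant name URL_SAFE and the wrapper fix_url are assumptions. Because "%" is in the safe set, existing escapes are left alone and the call can be repeated without double-encoding.

```python
from urllib.parse import quote

# The safe set from the snippet above: URL delimiters plus "%".
URL_SAFE = "%/:=&?~#+!$,;'@()*[]"


def fix_url(fullurl):
    """Percent-encode characters such as spaces without touching URL delimiters."""
    return quote(fullurl, safe=URL_SAFE)


fixed = fix_url('http://www.example.com/foo goo/bar.html?a=1&b=2')
print(fixed)  # http://www.example.com/foo%20goo/bar.html?a=1&b=2
```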

小开

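Because this page is a top Google result for searches on this topic, I think it is worth mentioning some work on URL normalization in Python that goes beyond urlencoding space characters: for example, dealing with default ports, character case, missing trailing slashes, and so on.

When the Atom syndication format was being developed, there was some discussion of how to normalize URLs into a canonical form; this is documented in the PaceCanonicalIds article on the Atom/Pie wiki. That article provides some good test cases.

I believe one result of that discussion was Mark Nottingham's urlnorm.py library, which I have used with good results on a couple of projects. That script does not handle the URL given in this question, however. A better choice might therefore be Sam Ruby's version of urlnorm.py, which handles that URL as well as all of the aforementioned test cases from the Atom wiki.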
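To make the kinds of normalization mentioned here concrete (default ports, character case, missing trailing slash), here is a small standard-library sketch. It is not urlnorm.py and covers far fewer cases; the function name normalize and the DEFAULT_PORTS table are assumptions, and as written it drops any user:password part and does not handle IPv6 literals.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumed table of default ports; extend as needed.
DEFAULT_PORTS = {"http": 80, "https": 443}


def normalize(url):
    """Lowercase scheme and host, drop default ports, ensure a non-empty path."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    netloc = host
    # Keep the port only when it differs from the scheme's default.
    if parts.port is not None and parts.port != DEFAULT_PORTS.get(scheme):
        netloc = f"{host}:{parts.port}"
    path = parts.path or "/"
    return urlunsplit((scheme, netloc, path, parts.query, parts.fragment))


print(normalize("HTTP://Example.COM:80"))           # http://example.com/
print(normalize("https://example.com:8443/a?b=1"))  # https://example.com:8443/a?b=1
```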

小开

FYI, urlnorm has moved to GitHub: http://gist.github.com/246089

小开

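I ran into a similar problem: I only needed to quote the spaces.

fullurl = quote(fullurl, safe="%/:=&?~#+!$,;'@()*[]") does work, but it is more complicated than I needed.

So I used a simple approach: url = url.replace(' ', '%20'). It is not perfect, but it is the simplest way and it works for this situation.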

小开

Valid for Python 3.5:

import urllib.parse


urllib.parse.quote([your_url], "./_-:")

Example:

import urllib.parse


print(urllib.parse.quote("http://www.example.com/foo goo/bar.html", "./_-:"))

The output is http://www.example.com/foo%20goo/bar.html

Source: https://docs.python.org/3.5/library/urllib.parse.html?highlight=quote#urllib.parse.quote

小开

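Many of the answers here talk about quoting URLs, not about normalizing them.

The best tool (IMO) for normalizing URLs in Python (for deduplication and the like) is w3lib's w3lib.url.canonicalize_url util.

Taken from the official documentation: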

Canonicalize the given url by applying the following procedures:


- sort query arguments, first by key, then by value
- percent encode paths ; non-ASCII characters are percent-encoded using UTF-8 (RFC-3986)
- percent encode query arguments ; non-ASCII characters are percent-encoded using passed encoding (UTF-8 by default)
- normalize all spaces (in query arguments) ‘+’ (plus symbol)
- normalize percent encodings case (%2f -> %2F)
- remove query arguments with blank values (unless keep_blank_values is True)
- remove fragments (unless keep_fragments is True)


The url passed can be bytes or unicode, while the url returned is always a native str (bytes in Python 2, unicode in Python 3).


>>> import w3lib.url
>>>
>>> # sorting query arguments
>>> w3lib.url.canonicalize_url('http://www.example.com/do?c=3&b=5&b=2&a=50')
'http://www.example.com/do?a=50&b=2&b=5&c=3'
>>>
>>> # UTF-8 conversion + percent-encoding of non-ASCII characters
>>> w3lib.url.canonicalize_url('http://www.example.com/r\u00e9sum\u00e9')
'http://www.example.com/r%C3%A9sum%C3%A9'

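I have used this util with great success when broad-crawling websites, to avoid duplicate requests caused by minor URL differences (different parameter order, anchors, etc.).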
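To show how canonicalization helps with deduplication without requiring w3lib, here is a standard-library approximation of two of the steps listed above (sorting query arguments and removing fragments). It is only a sketch: the names dedup_key and seen are hypothetical, and it does not reproduce the rest of canonicalize_url's behavior.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode


def dedup_key(url):
    """Build a comparison key: sorted query arguments, no fragment."""
    parts = urlsplit(url)
    # Sort (key, value) pairs first by key, then by value.
    pairs = sorted(parse_qsl(parts.query, keep_blank_values=True))
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(pairs), ""))


seen = set()
to_fetch = []
for url in [
    "http://www.example.com/do?c=3&b=5&b=2&a=50",
    "http://www.example.com/do?a=50&b=2&b=5&c=3#top",
]:
    key = dedup_key(url)
    if key not in seen:  # skip URLs that differ only in order or anchor
        seen.add(key)
        to_fetch.append(url)

print(to_fetch)
```

Both input URLs map to the same key, so only the first one would be requested.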