Changing the User-Agent for urllib.request.urlopen in Python 3

I want to open a URL using urllib.request.urlopen('someurl'):

    with urllib.request.urlopen('someurl') as url:
        b = url.read()

I keep getting the following error:

    urllib.error.HTTPError: HTTP Error 403: Forbidden

I understand this error to be due to the site not allowing Python to access it, to stop bots from wasting their network resources, which is understandable. I searched around and found that you need to change the user agent for urllib. However, all the guides and solutions I have found on how to change the user agent use urllib2, and I am using Python 3, so none of those solutions work properly.

How can I fix this problem in Python 3?


From the Python docs:

    import urllib.request

    req = urllib.request.Request(
        url,
        data=None,  # data=None makes this a GET request
        headers={
            'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'
        }
    )

    f = urllib.request.urlopen(req)
    print(f.read().decode('utf-8'))
A shorter version of the same idea:

    from urllib.request import urlopen, Request

    urlopen(Request(url, headers={'User-Agent': 'Mozilla'}))
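
To match the question's original with block, the same approach looks like this (a minimal sketch; 'someurl' stands in for the real address, and any reasonably browser-like User-Agent string should work):

    from urllib.request import urlopen, Request

    # A custom User-Agent replaces urllib's default 'Python-urllib/3.x'.
    req = Request('someurl', headers={'User-Agent': 'Mozilla/5.0'})
    with urlopen(req) as response:
        b = response.read()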

The host site's rejection is coming from the OWASP ModSecurity Core Rules for Apache mod_security. Rule 900002 has a list of "bad" user agents, one of which is "python-urllib2". That's why requests with the default user agent fail.
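
You can check what urllib sends by default by inspecting the default opener's headers (a quick sketch; the exact version suffix depends on your interpreter):

    import urllib.request

    # The default opener carries a User-Agent of the form 'Python-urllib/3.x',
    # exactly the kind of value such rule lists match against.
    opener = urllib.request.build_opener()
    print(opener.addheaders)  # e.g. [('User-agent', 'Python-urllib/3.9')]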

Unfortunately, if you use Python's urllib.robotparser module,

https://docs.python.org/3.5/library/urllib.robotparser.html?highlight=robotparser#module-urllib.robotparser

it uses the default Python user agent, and there is no parameter to change that. If the robotparser's attempt to read "robots.txt" is refused (not just URL not found), it then treats all URLs from that site as disallowed.
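
One possible workaround (a sketch, not a documented robotparser feature) is to fetch robots.txt yourself with a custom User-Agent and hand the lines to RobotFileParser.parse(), bypassing read() and its default agent; the URL below is hypothetical:

    import urllib.request
    import urllib.robotparser

    # Fetch robots.txt with a non-default User-Agent so the request isn't refused.
    req = urllib.request.Request(
        'https://example.com/robots.txt',  # hypothetical site
        headers={'User-Agent': 'Mozilla/5.0'}
    )
    with urllib.request.urlopen(req) as resp:
        lines = resp.read().decode('utf-8').splitlines()

    # parse() accepts the raw lines, so the default-agent read() is never used.
    rp = urllib.robotparser.RobotFileParser()
    rp.parse(lines)
    print(rp.can_fetch('Mozilla/5.0', 'https://example.com/some/page'))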

I just answered a similar question here: https://stackoverflow.com/a/43501438/206820

In case you not only want to open the URL but also want to download the resource (say, a PDF file), you can use the code below:

    from urllib.request import ProxyHandler, build_opener, install_opener, urlretrieve

    # proxy = ProxyHandler({'http': 'http://192.168.1.31:8888'})
    proxy = ProxyHandler({})  # an empty dict means no proxy: connect directly
    opener = build_opener(proxy)
    opener.addheaders = [('User-Agent', 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_4) AppleWebKit/603.1.30 (KHTML, like Gecko) Version/10.1 Safari/603.1.30')]
    install_opener(opener)

    result = urlretrieve(url=file_url, filename=file_name)
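
Note that install_opener() makes this opener the process-wide default, so every subsequent urllib.request call, including urlretrieve(), will carry these headers; file_url and file_name are placeholders for your own values.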

The reason I added the proxy is to monitor the traffic in Charles, and here is the traffic I got:

[Charles screenshot: the outgoing request shows the custom User-Agent header]