How to read HTML from a URL in Python 3

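I looked at previous similar questions and only got more confused.

In Python 3.4, I want to read an HTML page as a string, given the URL.

In Perl I do this with LWP::Simple, using get().

A matplotlib 1.3.1 example says: import urllib; u1=urllib.urlretrieve(url). Python 3 can't find urlretrieve.

I tried u1 = urllib.request.urlopen(url), which appears to return an HTTPResponse object, but I can't print it, get its length, or index it.

u1.body doesn't exist, and I can't find a description of HTTPResponse in Python 3.

Is there an attribute of the HTTPResponse object that gives me the raw bytes of the HTML page?

(Irrelevant material from other questions includes urllib2, which doesn't exist in my Python, CSV parsers, etc.)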

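Edit:

I found something in a prior question that partially (mostly) does the job:

import urllib.request

u2 = urllib.request.urlopen('http://finance.yahoo.com/q?s=aapl&ql=1')

for lines in u2.readlines():
    print(lines)

I say "partially" because I don't want to read separate lines, just one big string.

I could just concatenate the lines, but every line printed has a character "b" prepended to it.

Where does that come from?

Again, I suppose I could delete the first character before concatenating, but that really is a kludge.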


urllib.request.urlopen(url).read() returns the raw HTML page as a bytes object; call .decode() on the result to get a string.

For Python 2:

import urllib
some_url = 'https://docs.python.org/2/library/urllib.html'
filehandle = urllib.urlopen(some_url)
print filehandle.read()

Note that Python 3 does not return the HTML as a string but as a bytes object, so you need to convert it to a string with decode().

import urllib.request


fp = urllib.request.urlopen("http://www.python.org")
mybytes = fp.read()


mystr = mybytes.decode("utf8")
fp.close()


print(mystr)
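The snippet above hard-codes "utf8". As a rough sketch, the charset could instead be taken from the response's Content-Type header when the server declares one. The helper name fetch_html and the utf-8 fallback are assumptions made for illustration and are not part of the original answer.

```python
import urllib.request


def fetch_html(url, fallback_encoding="utf-8"):
    """Return the body at `url` as a str (hypothetical helper)."""
    # The with-block closes the response even if read() raises.
    with urllib.request.urlopen(url) as response:
        raw = response.read()  # bytes
        # get_content_charset() returns None when no charset is declared,
        # in which case we fall back to an assumed default.
        charset = response.headers.get_content_charset() or fallback_encoding
    # errors="replace" substitutes undecodable bytes instead of raising.
    return raw.decode(charset, errors="replace")
```

Calling fetch_html("http://www.python.org") should then give the whole page as one string, with no need to call close() by hand.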

Try the 'requests' module; it's much simpler.

# Install first with: pip install requests
import requests

url = 'https://www.google.com/'
r = requests.get(url)
print(r.text)  # the page body, already decoded to a str

More info here: http://docs.python-requests.org/en/master/

import requests

response = requests.get("http://yahoo.com")
htmltext = response.text
print(htmltext)

This works similarly to urllib.urlopen, except that .text is already a decoded string.

Reading an HTML page with urllib is fairly simple. Since you want to read it as a single string, I will show you how.

Import urllib.request:

#!/usr/bin/python3.5


import urllib.request

Prepare our request

request = urllib.request.Request('http://www.w3schools.com')

Always use a "try/except" when requesting a web page as things can easily go wrong. urlopen() requests the page.

try:
    response = urllib.request.urlopen(request)
except Exception as err:
    print("something wrong:", err)
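As a hedged sketch of what that error handling could look like in more detail, the example below catches the two exception classes urllib documents for failed requests rather than everything at once. The helper name try_fetch, the 10-second timeout, and the choice to return None on failure are assumptions for illustration.

```python
import urllib.error
import urllib.request


def try_fetch(url, timeout=10):
    """Return the response body as bytes, or None on failure (hypothetical helper)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read()
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status such as 404.
        # HTTPError is a subclass of URLError, so it must be caught first.
        print("HTTP error:", err.code, err.reason)
    except urllib.error.URLError as err:
        # No usable answer: DNS failure, refused connection, unknown scheme.
        print("Could not reach the server:", err.reason)
    return None
```

Returning None keeps the caller from touching a response that was never created; code that uses the result would check for None before calling decode().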

type() is a built-in function that tells us what type a value has. Here, response is an http.client.HTTPResponse object.

print(type(response))

The read() method of our response object returns the HTML as bytes, which we store in a variable. Again, type() will verify this.

htmlBytes = response.read()


print(type(htmlBytes))

Now we use the decode function for our bytes variable to get a single string.

htmlStr = htmlBytes.decode("utf8")


print(type(htmlStr))
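The "b" that shows up in front of each printed line in the question comes from the same bytes-versus-str distinction: print() on a bytes object shows its repr, prefix included. A minimal sketch with made-up sample data (no network access; the variable names are hypothetical) shows the prefix and how joining before decoding yields one string:

```python
# Sample data standing in for what readlines() returns: a list of bytes objects.
raw_lines = [b"<html>\n", b"<body>hello</body>\n", b"</html>\n"]

# print() on bytes displays the repr, including the b prefix and the quotes.
as_printed = repr(raw_lines[0])

# Join the bytes first, then decode once to get a single str.
html_text = b"".join(raw_lines).decode("utf-8")

print(as_printed)
print(html_text)
```

So there is no need to strip the first character by hand; the prefix is never part of the data, only of how bytes are displayed.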

If you do want to split up this string into separate lines, you can do so with the split() function. In this form we can easily iterate through to print out the entire page or do any other processing.

htmlSplit = htmlStr.split('\n')


print(type(htmlSplit))


for line in htmlSplit:
    print(line)
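One caveat with split('\n'): a page served with Windows-style \r\n line endings keeps a trailing \r on every line. A small sketch using str.splitlines() avoids that; the sample string is made up and the helper name to_lines is hypothetical.

```python
def to_lines(html_text):
    """Split text into lines without trailing line-break characters."""
    # splitlines() treats \n, \r\n and \r (among other line boundaries)
    # as separators, so no stray "\r" is left at the end of a line.
    return html_text.splitlines()


sample = "<html>\r\n<body>hi</body>\r\n</html>"
lines = to_lines(sample)

for line in lines:
    print(line)
```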

Hopefully this provides a more detailed answer. The Python documentation and tutorials are great; I would use them as a reference because they answer most questions you might have.