Retrieving links from a web page using Python and BeautifulSoup


import urllib2
import BeautifulSoup


request = urllib2.Request("http://www.gpsbasecamp.com/national-parks")
response = urllib2.urlopen(request)
soup = BeautifulSoup.BeautifulSoup(response)
for a in soup.findAll('a'):
    if 'national-park' in a['href']:
        print 'found a url with national-park in the link'


Best answer

Here is a short snippet using the SoupStrainer class in BeautifulSoup:

import httplib2
from bs4 import BeautifulSoup, SoupStrainer


http = httplib2.Http()
status, response = http.request('http://www.nytimes.com')


for link in BeautifulSoup(response, parse_only=SoupStrainer('a')):
    if link.has_attr('href'):
        print(link['href'])

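The BeautifulSoup documentation is actually quite good, and covers a number of typical scenarios:

https://www.crummy.com/software/beautifulsoup/bs4/doc/

Edit: Note that I used the SoupStrainer class because it is a bit more efficient (in both memory and speed) if you know in advance what you are parsing.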
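
To make the effect of SoupStrainer easier to see without a network request, here is a minimal Python 3 sketch that parses an inline document. The HTML string and the example.com URL are made up for illustration, and 'html.parser' is chosen only because it ships with Python.

```python
from bs4 import BeautifulSoup, SoupStrainer

# Hypothetical inline document, used instead of a live HTTP request.
html = """
<html><body>
  <p>Intro <a href="/about">About</a></p>
  <a name="top">anchor without href</a>
  <div><a href="https://example.com/news">News</a></div>
</body></html>
"""

# parse_only asks the parser to build tree nodes for <a> tags only.
only_links = SoupStrainer("a")
soup = BeautifulSoup(html, "html.parser", parse_only=only_links)

# href=True skips the <a name="top"> anchor, which has no href attribute.
hrefs = [a["href"] for a in soup.find_all("a", href=True)]
print(hrefs)
```

Because only the <a> elements are kept, the resulting tree is much smaller than the full document, which is where the memory and speed savings come from.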


Just for getting the links, without BeautifulSoup and without regular expressions:

import urllib2
url="http://www.somewhere.com"
page=urllib2.urlopen(url)
data=page.read().split("</a>")
tag="<a href=\""
endtag="\">"
for item in data:
    if "<a href" in item:
        try:
            ind = item.index(tag)
            item=item[ind+len(tag):]
            end=item.index(endtag)
        except ValueError:
            # str.index raises ValueError when tag or endtag is missing
            pass
        else:
            print item[:end]

For more complex operations, of course, BeautifulSoup is still preferred.
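
If the goal is simply to avoid third-party packages, the standard library's html.parser module is another option. The following is a Python 3 sketch of that alternative, not the approach used in the answer above; the LinkCollector and extract_links names and the sample markup are invented for illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> start tag it sees."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before this call.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value is not None:
                self.links.append(value)


def extract_links(html_text):
    collector = LinkCollector()
    collector.feed(html_text)
    collector.close()
    return collector.links


sample = ('<p><a class="x" href="/one">One</a> '
          "<A HREF='two.html'>Two</A> <a name=\"n\">no href</a></p>")
print(extract_links(sample))
```

Unlike the string-splitting approach, this copes with extra attributes before href, single-quoted values, and uppercase tag names.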


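Others have recommended BeautifulSoup, but it is much better to use lxml. Despite its name, it is also meant for parsing and scraping HTML. It is much, much faster than BeautifulSoup, and it even handles "broken" HTML better than BeautifulSoup (which is BeautifulSoup's claim to fame). It also offers a compatibility API for BeautifulSoup if you do not want to learn the lxml API.

Ian Bicking agrees.

There is no reason to use BeautifulSoup anymore, unless you are on Google App Engine or some other platform where anything that is not pure Python is not allowed.

lxml.html also supports CSS3 selectors, so this sort of thing is trivial.

An example with lxml and XPath looks like this: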

import urllib
import lxml.html
connection = urllib.urlopen('http://www.nytimes.com')


dom =  lxml.html.fromstring(connection.read())


for link in dom.xpath('//a/@href'): # select the url in href for all a tags(links)
    print link


Why not use regular expressions:

import urllib2
import re
url = "http://www.somewhere.com"
page = urllib2.urlopen(url)
page = page.read()
links = re.findall(r"<a.*?\s*href=\"(.*?)\".*?>(.*?)</a>", page)
for link in links:
    print('href: %s, HTML text: %s' % (link[0], link[1]))


Under the hood, BeautifulSoup can now use lxml as its parser. Using requests and lxml directly:

import requests
import lxml.html


dom = lxml.html.fromstring(requests.get('http://www.nytimes.com').content)


[x for x in dom.xpath('//a/@href') if '//' in x and 'nytimes.com' not in x]

In the list comprehension, the condition "if '//' in x and 'url.com' not in x" is a simple way to scrub the URL list of the site's internal navigation URLs and the like.
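
The filter can be tried on a plain list of strings. This is a small Python 3 sketch; the external_only helper and the sample hrefs are hypothetical. It also shows one limitation of a substring test: an external URL that merely mentions the domain is dropped as well.

```python
def external_only(hrefs, own_domain):
    """Keep hrefs that contain '//' (absolute or protocol-relative)
    and do not mention own_domain anywhere in the string."""
    return [h for h in hrefs if "//" in h and own_domain not in h]


hrefs = [
    "/section/world",                        # relative: dropped, no '//'
    "https://www.nytimes.com/pages/",        # same site: dropped
    "https://example.org/article",           # kept
    "//cdn.example.net/lib.js",              # protocol-relative: kept
    "https://example.org/?ref=nytimes.com",  # dropped by the substring match
]
print(external_only(hrefs, "nytimes.com"))
```

When that last case matters, comparing the parsed host name instead of searching the whole string is more precise.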


The following code retrieves all the links available on a web page using urllib2 and BeautifulSoup 4:

import urllib2
from bs4 import BeautifulSoup


url = urllib2.urlopen("http://www.espncricinfo.com/").read()
soup = BeautifulSoup(url)


for line in soup.find_all('a'):
    print(line.get('href'))


For completeness' sake, here is the BeautifulSoup 4 version, which also makes use of the encoding supplied by the server:

from bs4 import BeautifulSoup
import urllib.request


parser = 'html.parser'  # or 'lxml' (preferred) or 'html5lib', if installed
resp = urllib.request.urlopen("http://www.gpsbasecamp.com/national-parks")
soup = BeautifulSoup(resp, parser, from_encoding=resp.info().get_param('charset'))


for link in soup.find_all('a', href=True):
    print(link['href'])

Or the Python 2 version:

from bs4 import BeautifulSoup
import urllib2


parser = 'html.parser'  # or 'lxml' (preferred) or 'html5lib', if installed
resp = urllib2.urlopen("http://www.gpsbasecamp.com/national-parks")
soup = BeautifulSoup(resp, parser, from_encoding=resp.info().getparam('charset'))


for link in soup.find_all('a', href=True):
    print link['href']

And a version using the requests library, which as written works in both Python 2 and Python 3:

from bs4 import BeautifulSoup
from bs4.dammit import EncodingDetector
import requests


parser = 'html.parser'  # or 'lxml' (preferred) or 'html5lib', if installed
resp = requests.get("http://www.gpsbasecamp.com/national-parks")
http_encoding = resp.encoding if 'charset' in resp.headers.get('content-type', '').lower() else None
html_encoding = EncodingDetector.find_declared_encoding(resp.content, is_html=True)
encoding = html_encoding or http_encoding
soup = BeautifulSoup(resp.content, parser, from_encoding=encoding)


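for link in soup.find_all('a', href=True):
    print(link['href'])

The soup.find_all('a', href=True) call finds all <a> elements that have an href attribute; elements without the attribute are skipped.

BeautifulSoup 3 stopped development in March 2012; new projects should always use BeautifulSoup 4.

Note that you should leave decoding the HTML from bytes to BeautifulSoup. You can inform BeautifulSoup of the character set found in the HTTP response headers to assist in decoding, but this can be wrong and can conflict with <meta> header information found in the HTML itself. That is why the code above uses the BeautifulSoup internal class method EncodingDetector.find_declared_encoding() to make sure that such embedded encoding hints win over a misconfigured server.

With requests, the response.encoding attribute defaults to Latin-1 if the response has a text/* mimetype, even if no character set was returned. This is consistent with the HTTP RFCs but painful when used with HTML parsing, so you should ignore that attribute when no charset is set in the Content-Type header.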
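
The precedence described above can be isolated into a small helper. This is a Python 3 sketch using only the standard library; declared_charset and pick_encoding are hypothetical names, and html_declared stands in for whatever EncodingDetector.find_declared_encoding() returned for the document.

```python
from email.message import Message


def declared_charset(content_type):
    """Return the charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_param("charset")  # None when the parameter is absent


def pick_encoding(content_type, html_declared=None):
    """Prefer an encoding declared inside the HTML, then the header's."""
    return html_declared or declared_charset(content_type)


print(declared_charset("text/html; charset=utf-8"))
print(declared_charset("text/html"))
print(pick_encoding("text/html; charset=utf-8", html_declared="windows-1252"))
```

When both sources are silent the helper returns None, which leaves the detection entirely to BeautifulSoup instead of forcing a Latin-1 default.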


import urllib2
from bs4 import BeautifulSoup
a=urllib2.urlopen('http://dir.yahoo.com')
code=a.read()
soup=BeautifulSoup(code)
links=soup.findAll("a")
#To get href part alone
print links[0].attrs['href']


This script does what you are looking for, and also resolves relative links to absolute links.

import urllib
import lxml.html
import urlparse


def get_dom(url):
    connection = urllib.urlopen(url)
    return lxml.html.fromstring(connection.read())


def get_links(url):
    # xpath() returns a list, so it can be iterated twice: once by
    # guess_root() and once by resolve_links(). A generator would be partly
    # consumed by guess_root() and the leading links would be dropped.
    return resolve_links(get_dom(url).xpath('//a/@href'))


def guess_root(links):
    # Use the scheme and host of the first absolute link as the site root.
    for link in links:
        if link.startswith('http'):
            parsed_link = urlparse.urlparse(link)
            scheme = parsed_link.scheme + '://'
            netloc = parsed_link.netloc
            return scheme + netloc


def resolve_links(links):
    root = guess_root(links)
    for link in links:
        if not link.startswith('http'):
            link = urlparse.urljoin(root, link)
        yield link


for link in get_links('http://www.google.com'):
    print link


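To find all the links, this example uses the urllib2 module together with the re module. One of the most powerful functions in the re module is re.findall(). While re.search() finds the first match for a pattern, re.findall() finds all the matches and returns them as a list of strings, with each string representing one match.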

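import re
import urllib2

# placeholder URL; the original snippet left url undefined
url = "http://www.somewhere.com"

#connect to a URL
website = urllib2.urlopen(url)

#read html code
html = website.read()

#use re.findall to get all the links
# (?:...) is non-capturing, so findall returns plain strings rather than tuples
links = re.findall('"((?:http|ftp)s?://.*?)"', html)

print links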
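
The difference between re.search() and re.findall() is easy to check on an inline string. This Python 3 sketch uses made-up example URLs; it also shows that a pattern with more than one capturing group makes findall return tuples rather than strings, which is why a non-capturing group is used for the protocol.

```python
import re

html = ('<a href="http://a.example/x">A</a> '
        '<a href="https://b.example/y">B</a> '
        '<img src="ftp://c.example/z.png">')

# Non-capturing (?:...) keeps the protocol alternation out of the results.
pattern = r'"((?:http|ftp)s?://.*?)"'

first = re.search(pattern, html)       # stops at the first match
print(first.group(1))

all_urls = re.findall(pattern, html)   # every non-overlapping match
print(all_urls)

# With a second capturing group, findall yields one tuple per match.
tuples = re.findall(r'"((http|ftp)s?://.*?)"', html)
print(tuples[0])
```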


BeautifulSoup's own parser can be slow. It may be more practical to use lxml (http://lxml.de/), which can parse directly from a URL (with some limitations mentioned below).

import lxml.html


doc = lxml.html.parse(url)


links = doc.xpath('//a[@href]')


for link in links:
    print link.attrib['href']

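The code above returns the links as they are, and in most cases they will be relative links or links that are absolute from the site root. Since my use case was to extract only a certain type of link, below is a version that converts the links to full URLs and optionally accepts a glob pattern such as *.mp3. It does not handle single and double dots in relative paths, but so far I have not needed that. If you need to resolve URL fragments containing ../ or ./, then urlparse.urljoin may come in handy.

NOTE: Direct lxml URL parsing does not handle loading from https and does not follow redirects, so for this reason the version below uses urllib2 + lxml.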

#!/usr/bin/env python
import sys
import urllib2
import urlparse
import lxml.html
import fnmatch

try:
    import urltools as urltools
except ImportError:
    sys.stderr.write('To normalize URLs run: `pip install urltools --user`')
    urltools = None


def get_host(url):
    p = urlparse.urlparse(url)
    return "{}://{}".format(p.scheme, p.netloc)


if __name__ == '__main__':
    url = sys.argv[1]
    host = get_host(url)
    # optional second argument: glob pattern, defaults to matching everything
    glob_patt = len(sys.argv) > 2 and sys.argv[2] or '*'

    doc = lxml.html.parse(urllib2.urlopen(url))
    links = doc.xpath('//a[@href]')

    for link in links:
        href = link.attrib['href']

        if fnmatch.fnmatch(href, glob_patt):

            # the comma after 'https://' was missing, which silently
            # concatenated the last two prefixes into one string
            if not href.startswith(('http://', 'https://', 'ftp://')):

                if href.startswith('/'):
                    href = host + href
                else:
                    parent_url = url.rsplit('/', 1)[0]
                    href = urlparse.urljoin(parent_url, href)

                    if urltools:
                        href = urltools.normalize(href)

            print href

The usage is as follows:

getlinks.py http://stackoverflow.com/a/37758066/191246
getlinks.py http://stackoverflow.com/a/37758066/191246 "*users*"
getlinks.py http://fakedomain.mu/somepage.html "*.mp3"
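
For comparison, the glob filtering and the handling of ./ and ../ can be sketched in Python 3 with urllib.parse.urljoin doing the resolution. This is a simplified illustration rather than a port of the script: matching_links is a hypothetical helper, the hrefs are invented, and the glob is applied to the resolved URL instead of the raw href.

```python
from fnmatch import fnmatch
from urllib.parse import urljoin


def matching_links(page_url, hrefs, glob_patt="*"):
    """Resolve each href against page_url, then keep those matching glob_patt."""
    resolved = (urljoin(page_url, href) for href in hrefs)
    return [link for link in resolved if fnmatch(link, glob_patt)]


page = "http://fakedomain.mu/music/somepage.html"
hrefs = ["track1.mp3", "../covers/front.jpg", "./live/track2.mp3", "/about.html"]

print(matching_links(page, hrefs, "*.mp3"))
print(matching_links(page, hrefs, "*covers*"))
```

Passing the page URL itself as the base lets urljoin drop the last path segment and collapse the dot segments on its own.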


Here is an example that uses @ars's accepted answer together with the BeautifulSoup4, requests, and wget modules to handle the downloads.

import requests
import wget
import os


from bs4 import BeautifulSoup, SoupStrainer


url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/eeg-mld/eeg_full/'
file_type = '.tar.gz'


response = requests.get(url)


for link in BeautifulSoup(response.content, 'html.parser', parse_only=SoupStrainer('a')):
    if link.has_attr('href'):
        if file_type in link['href']:
            full_path = url + link['href']
            wget.download(full_path)


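I got the answer by @Blairg23 working after the following correction (it covers the scenario in which the original failed to work correctly):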

for link in BeautifulSoup(response.content, 'html.parser', parse_only=SoupStrainer('a')):
    if link.has_attr('href'):
        if file_type in link['href']:
            full_path = urlparse.urljoin(url, link['href'])  # the urlparse module needs to be imported
            wget.download(full_path)

For Python 3:

urllib.parse.urljoin has to be used instead to obtain the full URL.
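
The reason plain concatenation breaks can be seen by comparing both approaches on a few hrefs. This is a Python 3 sketch; the base URL is the one used above, while the three hrefs are hypothetical examples of a relative, a root-relative, and an absolute link.

```python
from urllib.parse import urljoin

base = "https://archive.ics.uci.edu/ml/machine-learning-databases/eeg-mld/eeg_full/"

# Hypothetical hrefs: relative, root-relative, and already absolute.
hrefs = ["sample.tar.gz", "/ml/index.html", "https://example.com/data.tar.gz"]

concatenated = [base + href for href in hrefs]
joined = [urljoin(base, href) for href in hrefs]

for bad, good in zip(concatenated, joined):
    print(bad, "->", good)
```

Only the first href gives the same result with both approaches; for the other two, concatenation produces a URL that does not exist.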


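Links can appear in a variety of attributes, so you can pass a list of those attributes to select.

For example, with the src and href attributes (here I am using the "starts with" ^ operator to specify that either of these attribute values starts with http):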

from bs4 import BeautifulSoup as bs
import requests
r = requests.get('https://stackoverflow.com/')
soup = bs(r.content, 'lxml')
links = [item['href'] if item.get('href') is not None else item['src'] for item in soup.select('[href^="http"], [src^="http"]') ]
print(links)

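Attribute = value selectors

[attr^=value]

Represents elements with an attribute named attr whose value is prefixed (preceded) by value.

There are also the commonly used $ (ends with) and * (contains) operators. For the full syntax list, see the attribute selectors reference.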
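
The three operators can be compared side by side on a small inline document. This is a Python 3 sketch; the markup and the example.com URLs are made up, and 'html.parser' is used so that no extra parser has to be installed.

```python
from bs4 import BeautifulSoup

html = """
<a href="https://example.com/report.pdf">Report</a>
<a href="/local/page.html">Local</a>
<img src="https://cdn.example.com/logo.png">
<script src="/static/app.js"></script>
"""
soup = BeautifulSoup(html, "html.parser")

# ^= : value starts with "http", on either href or src
starts_with_http = [
    tag.get("href") or tag.get("src")
    for tag in soup.select('[href^="http"], [src^="http"]')
]
# $= : value ends with ".pdf"
ends_with_pdf = [tag["href"] for tag in soup.select('a[href$=".pdf"]')]
# *= : value contains "local"
contains_local = [tag["href"] for tag in soup.select('a[href*="local"]')]

print(starts_with_http)
print(ends_with_pdf)
print(contains_local)
```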


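A page can contain many duplicate links, and both external and internal ones. To differentiate between the two and get only unique links, use sets: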

# Python 3.
# "import urllib" alone does not reliably load the request and parse submodules.
import urllib.parse
import urllib.request
from bs4 import BeautifulSoup


url = "http://www.espncricinfo.com/"
resp = urllib.request.urlopen(url)
# Get server encoding per recommendation of Martijn Pieters.
soup = BeautifulSoup(resp, 'html.parser', from_encoding=resp.info().get_param('charset'))
external_links = set()
internal_links = set()
for line in soup.find_all('a'):
    link = line.get('href')
    if not link:
        continue
    if link.startswith('http'):
        external_links.add(link)
    else:
        internal_links.add(link)


# Depending on usage, full internal links may be preferred.
full_internal_links = {
    urllib.parse.urljoin(url, internal_link)
    for internal_link in internal_links
}


# Print all unique external and full internal links.
for link in external_links.union(full_internal_links):
    print(link)
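
One caveat with the startswith('http') test is that an absolute link pointing back to the same site is counted as external. A possible variation, sketched here in Python 3 under the assumption that "internal" means "same host as the page", compares host names after resolving each href; split_links and the sample hrefs are hypothetical.

```python
from urllib.parse import urljoin, urlparse


def split_links(page_url, hrefs):
    """Return (internal, external) sets of absolute URLs, judged by host."""
    page_host = urlparse(page_url).netloc
    internal, external = set(), set()
    for href in hrefs:
        absolute = urljoin(page_url, href)
        if urlparse(absolute).netloc == page_host:
            internal.add(absolute)
        else:
            external.add(absolute)
    return internal, external


page = "http://www.espncricinfo.com/"
hrefs = [
    "/scores",                             # relative, same site
    "http://www.espncricinfo.com/scores",  # absolute, same site (duplicate)
    "https://example.com/ad",              # different host
    "/scores",                             # exact duplicate
]
internal, external = split_links(page, hrefs)
print(internal)
print(external)
```

Non-HTTP hrefs such as mailto: links have no host and would land in the external set, so they may need to be filtered out beforehand.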