How do I download a file over HTTP?

小开

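Best answer

Use urllib.request.urlopen():

    import urllib.request
    with urllib.request.urlopen('http://www.example.com/') as f:
        html = f.read().decode('utf-8')

This is the most basic way to use the library, without any error handling. You can also do more complex things, such as changing headers.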
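As a minimal sketch of what error handling and a header change might look like on Python 3 (the helper name fetch, the timeout value, and the choice to return None on failure are illustrative assumptions, not part of the answer above):

```python
import urllib.error
import urllib.request


def fetch(url, headers=None, timeout=10):
    """Return the response body as bytes, or None if the request fails."""
    # Request lets us attach custom headers (for example a User-Agent).
    request = urllib.request.Request(url, headers=headers or {})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.read()
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status such as 404 or 403.
        # HTTPError is a subclass of URLError, so it must be caught first.
        print("Server returned an error:", err.code, err.reason)
    except urllib.error.URLError as err:
        # The request never produced a usable response (DNS failure, refused
        # connection, missing local file, ...).
        print("Could not complete the request:", err.reason)
    return None
```

Calling fetch('http://www.example.com/', headers={'User-Agent': 'my-downloader/0.1'}) would send a custom User-Agent; the header value is only a placeholder.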

On Python 2, the method is in urllib2:

    import urllib2
    response = urllib2.urlopen('http://www.example.com/')
    html = response.read()

小开

    import urllib2
    mp3file = urllib2.urlopen("http://www.example.com/songs/mp3.mp3")
    with open('test.mp3', 'wb') as output:
        output.write(mp3file.read())

The wb in open('test.mp3', 'wb') opens the file in binary mode (and erases any existing file), so you can use it to save any data, not just text.
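To illustrate the point about 'wb', here is a small Python 3 sketch (save_binary is a hypothetical helper name): writing a second time replaces the earlier content, and arbitrary bytes are stored unchanged.

```python
def save_binary(path, payload):
    """Write payload (bytes) to path, replacing any existing file.

    Returns the number of bytes written.
    """
    # "w" truncates an existing file; "b" disables text encoding and
    # newline translation, so the bytes land on disk exactly as given.
    with open(path, "wb") as output:
        return output.write(payload)
```

For example, save_binary('test.mp3', data) followed by a second save_binary('test.mp3', other_data) leaves only other_data in the file.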

小开

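I agree with Corey: urllib2 is more complete than urllib and is probably the module to use if you want to do more complex things. But to make the answers more complete, urllib is a simpler module if you only want the basics:

    import urllib
    response = urllib.urlopen('http://www.example.com/sound.mp3')
    mp3 = response.read()

This will work fine. Or, if you don't want to deal with the "response" object, you can call read() directly:

    import urllib
    mp3 = urllib.urlopen('http://www.example.com/sound.mp3').read()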

小开

Yet another one, using urlretrieve:

    import urllib.request
    urllib.request.urlretrieve("http://www.example.com/songs/mp3.mp3", "mp3.mp3")

(for Python 2, use import urllib and urllib.urlretrieve)

小开

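In 2012, use the Python requests library

    >>> import requests
    >>>
    >>> url = "http://download.thinkbroadband.com/10MB.zip"
    >>> r = requests.get(url)
    >>> print(len(r.content))
    10485760

You can run pip install requests to get it.

Requests has many advantages over the alternatives because its API is much simpler. This is especially true if you have to do authentication; urllib and urllib2 are pretty unintuitive and painful in that case.

2015-12-30

People have expressed admiration for the progress bar. It's cool, sure. There are several off-the-shelf solutions now, including tqdm:

    from tqdm import tqdm
    import requests

    url = "http://download.thinkbroadband.com/10MB.zip"
    response = requests.get(url, stream=True)

    with open("10MB", "wb") as handle:
        # iter_content() defaults to 1-byte chunks; a larger chunk_size is much faster
        for data in tqdm(response.iter_content(chunk_size=8192)):
            handle.write(data)

This is essentially the implementation @kvance described 30 months ago.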

小开

An improved version of PabloG's code for Python 2/3:

    #!/usr/bin/env python
    # -*- coding: utf-8 -*-
    from __future__ import ( division, absolute_import, print_function, unicode_literals )

    import sys, os, tempfile, logging

    if sys.version_info >= (3,):
        import urllib.request as urllib2
        import urllib.parse as urlparse
    else:
        import urllib2
        import urlparse

    def download_file(url, dest=None):
        """
        Download and save a file specified by url to dest directory,
        """
        u = urllib2.urlopen(url)

        scheme, netloc, path, query, fragment = urlparse.urlsplit(url)
        filename = os.path.basename(path)
        if not filename:
            filename = 'downloaded.file'
        if dest:
            filename = os.path.join(dest, filename)

        with open(filename, 'wb') as f:
            meta = u.info()
            meta_func = meta.getheaders if hasattr(meta, 'getheaders') else meta.get_all
            meta_length = meta_func("Content-Length")
            file_size = None
            if meta_length:
                file_size = int(meta_length[0])
            print("Downloading: {0} Bytes: {1}".format(url, file_size))

            file_size_dl = 0
            block_sz = 8192
            while True:
                buffer = u.read(block_sz)
                if not buffer:
                    break

                file_size_dl += len(buffer)
                f.write(buffer)

                status = "{0:16}".format(file_size_dl)
                if file_size:
                    status += "   [{0:6.2f}%]".format(file_size_dl * 100 / file_size)
                status += chr(13)
                print(status, end="")
            print()

        return filename

    if __name__ == "__main__":  # Only run if this file is called directly
        print("Testing with 10MB download")
        url = "http://download.thinkbroadband.com/10MB.zip"
        filename = download_file(url)
        print(filename)

小开

I wrote the wget library in pure Python just for this purpose. As of version 2.0, it is a pumped-up urlretrieve with these features.

小开

This may be a little late, but I saw PabloG's code and couldn't help adding an os.system('cls') to make it look awesome! Check it out:

    import urllib2,os

    url = "http://download.thinkbroadband.com/10MB.zip"

    file_name = url.split('/')[-1]
    u = urllib2.urlopen(url)
    f = open(file_name, 'wb')
    meta = u.info()
    file_size = int(meta.getheaders("Content-Length")[0])
    print "Downloading: %s Bytes: %s" % (file_name, file_size)
    os.system('cls')
    file_size_dl = 0
    block_sz = 8192
    while True:
        buffer = u.read(block_sz)
        if not buffer:
            break

        file_size_dl += len(buffer)
        f.write(buffer)
        status = r"%10d  [%3.2f%%]" % (file_size_dl, file_size_dl * 100. / file_size)
        status = status + chr(8)*(len(status)+1)
        print status,

    f.close()

If you run this in an environment other than Windows, you will have to use something other than 'cls'. On Mac OS X and Linux it should be 'clear'.

小开

The source code can be:

    import urllib
    sock = urllib.urlopen("http://diveintopython.org/")
    htmlSource = sock.read()
    sock.close()
    print htmlSource

小开

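You can get progress feedback with urlretrieve as well (Python 2):

    import sys
    import urllib

    def report(blocknr, blocksize, size):
        # Called by urlretrieve after each block is read
        current = blocknr*blocksize
        sys.stdout.write("\r{0:.2f}%".format(100.0*current/size))

    def downloadFile(url):
        print "\n",url
        fname = url.split('/')[-1]
        print fname
        urllib.urlretrieve(url, fname, report)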
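A possible Python 3 variant of the same idea is sketched below. It is an illustration, not the answer author's code: the names percent_done and download_with_progress are made up here, and the sketch additionally handles two cases the hook can run into, a server that sends no Content-Length (urlretrieve then passes -1 as the size) and a final block that would push the figure past 100%.

```python
import sys
import urllib.request


def percent_done(block_num, block_size, total_size):
    """Return the progress as a percentage, or None if the size is unknown."""
    if total_size <= 0:
        return None
    # The last block is usually only partly filled, so cap the value at 100.
    return min(100.0, 100.0 * block_num * block_size / total_size)


def report(block_num, block_size, total_size):
    """Reporthook with the (block number, block size, total size) signature."""
    percent = percent_done(block_num, block_size, total_size)
    if percent is None:
        sys.stdout.write("\r{0} bytes".format(block_num * block_size))
    else:
        sys.stdout.write("\r{0:.2f}%".format(percent))
    sys.stdout.flush()


def download_with_progress(url, filename):
    """Download url to filename, printing progress on a single line."""
    urllib.request.urlretrieve(url, filename, report)
    sys.stdout.write("\n")
    return filename
```

Keeping the arithmetic in percent_done separate from the printing makes the calculation easy to check without downloading anything.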

小开

Use the wget module:

    import wget
    wget.download('url')

小开

Python 3

urllib.request.urlopen

    import urllib.request
    response = urllib.request.urlopen('http://www.example.com/')
    html = response.read()

urllib.request.urlretrieve
```
import urllib.request
urllib.request.urlretrieve('http://www.example.com/songs/mp3.mp3', 'mp3.mp3')
```
Note: according to the documentation, urllib.request.urlretrieve is a "legacy interface" and "might become deprecated in the future" (thanks gerrit)

Python 2

urllib2.urlopen (thanks Corey)

    import urllib2
    response = urllib2.urlopen('http://www.example.com/')
    html = response.read()

urllib.urlretrieve (thanks PabloG)

    import urllib
    urllib.urlretrieve('http://www.example.com/songs/mp3.mp3', 'mp3.mp3')

小开

If you have wget installed, you can use parallel_sync.

pip install parallel_sync

    from parallel_sync import wget
    urls = ['http://something.png', 'http://somthing.tar.gz', 'http://somthing.zip']
    wget.download('/tmp', urls)
    # or a single file:
    wget.download('/tmp', urls[0], filenames='x.zip', extract=True)

Docs: https://pythonhosted.org/parallel_sync/pages/examples.html

This is pretty powerful. It can download files in parallel, retry on failure, and even download files on a remote machine.
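If installing parallel_sync is not an option, the two ideas mentioned above, parallel downloads and retrying on failure, can be roughly approximated with the Python 3 standard library. This is only a sketch and not how parallel_sync itself works; download_one, download_many, the retry count, and the worker count are all assumptions made for illustration.

```python
import shutil
import urllib.request
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from urllib.parse import urlsplit


def download_one(url, dest_dir, retries=3):
    """Download url into dest_dir, trying up to `retries` times."""
    # Use the last path component of the URL as the local file name.
    name = Path(urlsplit(url).path).name or "downloaded.file"
    target = Path(dest_dir) / name
    last_error = None
    for _ in range(retries):
        try:
            with urllib.request.urlopen(url, timeout=30) as response, \
                    open(target, "wb") as out_file:
                shutil.copyfileobj(response, out_file)
            return str(target)
        except OSError as err:  # urllib.error.URLError is a subclass of OSError
            last_error = err
    raise RuntimeError("giving up on {0}: {1}".format(url, last_error))


def download_many(urls, dest_dir, workers=4):
    """Download all urls concurrently; results come back in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda u: download_one(u, dest_dir), urls))
```

Threads are a reasonable fit here because the work is dominated by waiting on the network rather than by computation.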

小开

Here are the most commonly used calls for downloading files in Python:

urllib.urlretrieve ('url_to_file', file_name)

urllib2.urlopen('url_to_file')

requests.get(url)

wget.download('url', file_name)

Note: urlopen and urlretrieve perform relatively poorly when downloading large files (size > 500 MB). requests.get stores the file in memory until the download is complete.
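One common way to keep memory use bounded for large files is to read the response in fixed-size chunks and write each chunk to disk immediately. The sketch below does this with urllib.request on Python 3; the function name download_in_chunks and the 64 KiB chunk size are arbitrary choices for illustration, and the sketch makes no claim about download speed.

```python
import urllib.request


def download_in_chunks(url, dest_path, chunk_size=64 * 1024):
    """Stream url to dest_path chunk by chunk; return the bytes written."""
    written = 0
    with urllib.request.urlopen(url) as response, open(dest_path, "wb") as out_file:
        while True:
            # At most chunk_size bytes are held in memory at any time.
            chunk = response.read(chunk_size)
            if not chunk:
                break
            out_file.write(chunk)
            written += len(chunk)
    return written
```

With requests, the comparable approach is to pass stream=True and iterate over iter_content(chunk_size=...), as some of the other answers in this thread do.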

小开

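urlretrieve and requests.get are simple, but reality is not. I have fetched data from a couple of sites, including text and images, and those two probably solve most tasks. But for a more universal solution I suggest using urlopen. Since it is included in the Python 3 standard library, your code can run on any machine that runs Python 3 without pre-installing site-packages.

    import urllib.request
    url_request = urllib.request.Request(url, headers=headers)
    url_connect = urllib.request.urlopen(url_request)

    #remember to open file in bytes mode
    with open(filename, 'wb') as f:
        while True:
            buffer = url_connect.read(buffer_size)
            if not buffer: break

            #an integer value of size of written data
            data_wrote = f.write(buffer)

    #you could probably use with-open-as manner
    url_connect.close()

This answer provides a solution to HTTP 403 Forbidden when downloading a file over HTTP using Python. I have only tried the requests and urllib modules; other modules may offer something better, but this is the one I used to solve most of the problems.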

小开

I wrote the following, which works in vanilla Python 2 or Python 3.

    import sys
    try:
        import urllib.request
        python3 = True
    except ImportError:
        import urllib2
        python3 = False


    def progress_callback_simple(downloaded,total):
        sys.stdout.write(
            "\r" +
            (len(str(total))-len(str(downloaded)))*" " + str(downloaded) + "/%d"%total +
            " [%3.2f%%]"%(100.0*float(downloaded)/float(total))
        )
        sys.stdout.flush()

    def download(srcurl, dstfilepath, progress_callback=None, block_size=8192):
        def _download_helper(response, out_file, file_size):
            if progress_callback is not None: progress_callback(0,file_size)
            if block_size is None:
                # Read the whole body in one go
                buffer = response.read()
                out_file.write(buffer)

                if progress_callback is not None: progress_callback(file_size,file_size)
            else:
                # Read block by block, reporting progress after each block
                file_size_dl = 0
                while True:
                    buffer = response.read(block_size)
                    if not buffer: break

                    file_size_dl += len(buffer)
                    out_file.write(buffer)

                    if progress_callback is not None: progress_callback(file_size_dl,file_size)
        with open(dstfilepath,"wb") as out_file:
            if python3:
                with urllib.request.urlopen(srcurl) as response:
                    file_size = int(response.getheader("Content-Length"))
                    _download_helper(response,out_file,file_size)
            else:
                response = urllib2.urlopen(srcurl)
                meta = response.info()
                file_size = int(meta.getheaders("Content-Length")[0])
                _download_helper(response,out_file,file_size)

    import traceback
    try:
        download(
            "https://geometrian.com/data/programming/projects/glLib/glLib%20Reloaded%200.5.9/0.5.9.zip",
            "output.zip",
            progress_callback_simple
        )
    except:
        traceback.print_exc()
        input()

Notes:

Supports a "progress bar" callback.
The download is a 4 MB test .zip from my website.

小开

A simple yet Python 2 & Python 3 compatible way comes with the six library:

    from six.moves import urllib
    urllib.request.urlretrieve("http://www.example.com/songs/mp3.mp3", "mp3.mp3")

小开

If speed matters to you, I ran a small performance test of the urllib and wget modules; for wget I tried once with the status bar and once without. I used three different 500 MB files for the test (different files, to rule out any caching going on under the hood). Tested on a Debian machine, with Python 2.

First, these are the results (they are similar across runs):

    $ python wget_test.py
    urlretrive_test : starting
    urlretrive_test : 6.56
    ==============
    wget_no_bar_test : starting
    wget_no_bar_test : 7.20
    ==============
    wget_with_bar_test : starting
    100% [......................................................................] 541335552 / 541335552
    wget_with_bar_test : 50.49
    ==============

I ran the test using a "profile" decorator. This is the full code:

    import wget
    import urllib
    import time
    from functools import wraps

    def profile(func):
        @wraps(func)
        def inner(*args):
            print func.__name__, ": starting"
            start = time.time()
            ret = func(*args)
            end = time.time()
            print func.__name__, ": {:.2f}".format(end - start)
            return ret
        return inner

    url1 = 'http://host.com/500a.iso'
    url2 = 'http://host.com/500b.iso'
    url3 = 'http://host.com/500c.iso'

    def do_nothing(*args):
        pass

    @profile
    def urlretrive_test(url):
        return urllib.urlretrieve(url)

    @profile
    def wget_no_bar_test(url):
        return wget.download(url, out='/tmp/', bar=do_nothing)

    @profile
    def wget_with_bar_test(url):
        return wget.download(url, out='/tmp/')

    urlretrive_test(url1)
    print '=============='
    time.sleep(1)

    wget_no_bar_test(url2)
    print '=============='
    time.sleep(1)

    wget_with_bar_test(url3)
    print '=============='
    time.sleep(1)

urllib seems to be the fastest.
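The timing decorator above is written for Python 2. A rough Python 3 equivalent could look like the sketch below; it swaps time.time() for time.perf_counter(), which is intended for measuring short durations, and the last_elapsed attribute is an addition made here for convenience, not something from the original test.

```python
import time
from functools import wraps


def profile(func):
    """Print how long each call to func takes, in seconds."""
    @wraps(func)
    def inner(*args, **kwargs):
        print(func.__name__, ": starting")
        start = time.perf_counter()
        result = func(*args, **kwargs)
        elapsed = time.perf_counter() - start
        print("{0} : {1:.2f}".format(func.__name__, elapsed))
        # Keep the most recent measurement available to the caller.
        inner.last_elapsed = elapsed
        return result
    inner.last_elapsed = None
    return inner
```

Any download function can then be decorated with @profile in the same way as the test functions in the answer.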

小开

In Python 3 you can use the urllib.request and shutil modules. Both are part of the standard library, so there is nothing to install with pip.

Run this code:

    import urllib.request
    import shutil

    url = "http://www.somewebsite.com/something.pdf"
    output_file = "save_this_name.pdf"
    with urllib.request.urlopen(url) as response, open(output_file, 'wb') as out_file:
        shutil.copyfileobj(response, out_file)

Note that the code uses urllib, not the third-party urllib3 package.

小开

You can use pycurl on Python 2 and 3.

    import pycurl

    FILE_DEST = 'pycurl.html'
    FILE_SRC = 'http://pycurl.io/'

    with open(FILE_DEST, 'wb') as f:
        c = pycurl.Curl()
        c.setopt(c.URL, FILE_SRC)
        c.setopt(c.WRITEDATA, f)
        c.perform()
        c.close()

小开

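For completeness, it is also possible to call any program for retrieving files using the subprocess package. Programs dedicated to retrieving files are more powerful than Python functions like urlretrieve. For example, wget can download directories recursively (-r), can deal with FTP, redirects, and HTTP proxies, and can avoid re-downloading existing files (-nc), and #5 can do multi-connection downloads, which can potentially speed up your downloads.

    import subprocess
    subprocess.check_output(['wget', '-O', 'example_output_file.html', 'https://example.com'])

In Jupyter Notebook, you can also call programs directly with the ! syntax: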

!wget -O example_output_file.html https://example.com
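When calling an external tool from a script, it can help to check that the tool is actually installed and to let a failed download raise an exception. The following Python 3 sketch shows one way to do that; wget_command and download_with_wget are hypothetical helper names, and only the -O flag already used above is passed to wget.

```python
import shutil
import subprocess


def wget_command(url, output_path):
    """Build the argument list for a single-file wget download."""
    return ["wget", "-O", output_path, url]


def download_with_wget(url, output_path):
    """Run wget and raise if it is missing or exits with an error."""
    if shutil.which("wget") is None:
        raise RuntimeError("wget is not installed or not on PATH")
    # check=True raises subprocess.CalledProcessError on a non-zero exit code.
    subprocess.run(wget_command(url, output_path), check=True)
    return output_path
```

Passing the arguments as a list, rather than as one shell string, avoids quoting problems with URLs that contain characters such as & or ?.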

小开

    import os,requests
    def download(url):
        get_response = requests.get(url,stream=True)
        file_name  = url.split("/")[-1]
        with open(file_name, 'wb') as f:
            for chunk in get_response.iter_content(chunk_size=1024):
                if chunk: # filter out keep-alive new chunks
                    f.write(chunk)


    download("https://example.com/example.jpg")

小开

Late answer, but for python>=3.6 you can use:

    import dload
    dload.save(url)

Install dload with:

pip3 install dload

小开

I wanted to download all the files from a webpage. I tried wget but it was failing, so I decided to go the Python route and found this thread.

After reading it, I made a little command-line application, soupget, expanding on the excellent answers of PabloG and Stan and adding some useful options.

It uses BeautifulSoup to collect all the URLs of the page and then downloads the ones with the desired extension(s). Finally, it can download multiple files in parallel.

Here it is:

    #!/usr/bin/env python3
    # -*- coding: utf-8 -*-
    from __future__ import (division, absolute_import, print_function, unicode_literals)
    import sys, os, argparse
    from bs4 import BeautifulSoup

    # --- insert Stan's script here ---
    # if sys.version_info >= (3,):
    #...
    #...
    # def download_file(url, dest=None):
    #...
    #...

    # --- new stuff ---
    def collect_all_url(page_url, extensions):
        """
        Recovers all links in page_url checking for all the desired extensions
        """
        conn = urllib2.urlopen(page_url)
        html = conn.read()
        soup = BeautifulSoup(html, 'lxml')
        links = soup.find_all('a')

        results = []
        for tag in links:
            link = tag.get('href', None)
            if link is not None:
                for e in extensions:
                    if e in link:
                        # Fallback for badly defined links
                        # checks for missing scheme or netloc
                        if bool(urlparse.urlparse(link).scheme) and bool(urlparse.urlparse(link).netloc):
                            results.append(link)
                        else:
                            new_url=urlparse.urljoin(page_url,link)
                            results.append(new_url)
        return results

    if __name__ == "__main__":  # Only run if this file is called directly
        # Command line arguments
        parser = argparse.ArgumentParser(
            description='Download all files from a webpage.')
        parser.add_argument(
            '-u', '--url',
            help='Page url to request')
        parser.add_argument(
            '-e', '--ext',
            nargs='+',
            help='Extension(s) to find')
        parser.add_argument(
            '-d', '--dest',
            default=None,
            help='Destination where to save the files')
        parser.add_argument(
            '-p', '--par',
            action='store_true', default=False,
            help="Turns on parallel download")
        args = parser.parse_args()

        # Recover files to download
        all_links = collect_all_url(args.url, args.ext)

        # Download
        if not args.par:
            for l in all_links:
                try:
                    filename = download_file(l, args.dest)
                    print(l)
                except Exception as e:
                    print("Error while downloading: {}".format(e))
        else:
            from multiprocessing.pool import ThreadPool
            results = ThreadPool(10).imap_unordered(
                lambda x: download_file(x, args.dest), all_links)
            for p in results:
                print(p)

An example of its usage is:

python3 soupget.py -p -e <list of extensions> -d <destination_folder> -u <target_webpage>

And an actual example, if you want to see it in action:

python3 soupget.py -p -e .xlsx .pdf .csv -u https://healthdata.gov/dataset/chemicals-cosmetics
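The link-collection step described above relies on BeautifulSoup and lxml. Where those are not available, a similar result can be sketched with the standard library's html.parser on Python 3. This is an alternative illustration, not the soupget implementation: LinkCollector and links_with_extensions are made-up names, and unlike the script above (which tests whether the extension appears anywhere in the link) this sketch checks only the end of the URL path.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def links_with_extensions(html, page_url, extensions):
    """Return absolute URLs from html whose path ends with a wanted extension."""
    collector = LinkCollector()
    collector.feed(html)
    wanted = tuple(ext.lower() for ext in extensions)
    results = []
    for href in collector.hrefs:
        # urljoin turns relative links into absolute ones and leaves
        # already-absolute links untouched.
        absolute = urljoin(page_url, href)
        if urlsplit(absolute).path.lower().endswith(wanted):
            results.append(absolute)
    return results
```

The html argument is the page source as a string, for example the decoded result of a urlopen(...).read() call.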

小开

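Another way is to call an external process such as curl.exe. By default, curl displays a progress bar, average download speed, time left, and more, all formatted neatly in a table. Put curl.exe in the same directory as your script.

    from subprocess import call
    url = ""
    call(["curl", url, '--output', "song.mp3"])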

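Note: --output takes the name of the file to write. If curl cannot write directly to the location you want, download into the current directory and move the file afterwards with os.rename.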

小开

A newer, urllib3-based implementation

    >>> import urllib3
    >>> http = urllib3.PoolManager()
    >>> r = http.request('GET', 'your_url_goes_here')
    >>> r.status
    200
    >>> r.data
    *****Response Data****

More info: https://pypi.org/project/urllib3/

小开

Using Python requests in 5 lines

    import requests as req

    remote_url = 'http://www.example.com/sound.mp3'
    local_file_name = 'sound.mp3'

    data = req.get(remote_url)

    # Save file data to local copy
    with open(local_file_name, 'wb') as file:
        file.write(data.content)

Now do something with the local copy of the remote file.

小开

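Another possibility is the built-in http.client:

    from http import HTTPStatus, client
    from shutil import copyfileobj

    # using https
    connection = client.HTTPSConnection("www.example.com")
    # request() returns nothing; the response comes from getresponse()
    connection.request("GET", "/noise.mp3")
    response = connection.getresponse()
    if response.status == HTTPStatus.OK:
        with open("noise.mp3", "wb") as out_file:
            copyfileobj(response, out_file)
    else:
        raise Exception("request needs work")
    connection.close()

The HTTPConnection object is considered "low-level" in that it performs the desired request once and assumes the developer will subclass it or write the script in a way that handles the nuances of HTTP. Libraries such as requests tend to handle more special cases, such as automatically following redirects.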
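To make the "low-level" point concrete, here is a sketch of what following redirects by hand might involve with http.client on Python 3. The function name get_following_redirects, the limit of five redirects, and the 10-second timeout are assumptions for illustration; real-world handling (cookies, method changes on 303, and so on) is more involved than this.

```python
from http import client
from urllib.parse import urljoin, urlsplit

REDIRECT_CODES = {301, 302, 303, 307, 308}


def get_following_redirects(url, max_redirects=5):
    """GET url, following redirects; return (final_url, body_bytes)."""
    for _ in range(max_redirects + 1):
        parts = urlsplit(url)
        if parts.scheme == "https":
            connection = client.HTTPSConnection(parts.netloc, timeout=10)
        else:
            connection = client.HTTPConnection(parts.netloc, timeout=10)
        try:
            path = parts.path or "/"
            if parts.query:
                path += "?" + parts.query
            connection.request("GET", path)
            response = connection.getresponse()
            if response.status in REDIRECT_CODES:
                location = response.getheader("Location")
                if not location:
                    raise RuntimeError("redirect without a Location header")
                # Location may be relative, so resolve it against the current URL.
                url = urljoin(url, location)
                continue
            if response.status != 200:
                raise RuntimeError("unexpected status {0}".format(response.status))
            return url, response.read()
        finally:
            connection.close()
    raise RuntimeError("too many redirects")
```

A library such as requests performs this kind of loop automatically, which is part of why it is usually more convenient than http.client for downloads.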

小开

You can use keras.utils.get_file to do this:

    from tensorflow import keras

    path_to_downloaded_file = keras.utils.get_file(
        fname="file name",
        origin="https://www.linktofile.com/link/to/file",
        extract=True,
        archive_format="zip",  # downloaded file format
        cache_dir=".",  # cache and extract in current directory
    )