Boto3 to download all files from an S3 bucket

I'm using boto3 to get files from an S3 bucket. I need something similar to aws s3 sync.

My current code is

#!/usr/bin/python
import boto3
s3 = boto3.client('s3')
list = s3.list_objects(Bucket='my_bucket_name')['Contents']
for key in list:
    s3.download_file('my_bucket_name', key['Key'], key['Key'])

This works fine as long as the bucket contains only files. If a folder is present inside the bucket, it throws an error:

Traceback (most recent call last):
  File "./test", line 6, in <module>
    s3.download_file('my_bucket_name', key['Key'], key['Key'])
  File "/usr/local/lib/python2.7/dist-packages/boto3/s3/inject.py", line 58, in download_file
    extra_args=ExtraArgs, callback=Callback)
  File "/usr/local/lib/python2.7/dist-packages/boto3/s3/transfer.py", line 651, in download_file
    extra_args, callback)
  File "/usr/local/lib/python2.7/dist-packages/boto3/s3/transfer.py", line 666, in _download_file
    self._get_object(bucket, key, filename, extra_args, callback)
  File "/usr/local/lib/python2.7/dist-packages/boto3/s3/transfer.py", line 690, in _get_object
    extra_args, callback)
  File "/usr/local/lib/python2.7/dist-packages/boto3/s3/transfer.py", line 707, in _do_get_object
    with self._osutil.open(filename, 'wb') as f:
  File "/usr/local/lib/python2.7/dist-packages/boto3/s3/transfer.py", line 323, in open
    return open(filename, mode)
IOError: [Errno 2] No such file or directory: 'my_folder/.8Df54234'

Is this a proper way to download a complete S3 bucket using boto3? How can folders be downloaded?


Amazon S3 does not have folders/directories. It is a flat file structure.

To maintain the appearance of directories, path names are stored as part of the object Key (filename). For example:

  • images/foo.jpg

In this case, the whole Key is images/foo.jpg, rather than just foo.jpg.

I suspect that your problem is that boto is returning a file called my_folder/.8Df54234 and is attempting to save it to the local filesystem. However, your local filesystem interprets the my_folder/ portion as a directory name, and that directory does not exist on your local filesystem.

You could either truncate the filename to only save the .8Df54234 portion, or you would have to create the necessary directories before writing files. Note that it could be multi-level nested directories.
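As a rough sketch of the second option, the directories can be created from each Key before the file is downloaded. This assumes Python 3 (for exist_ok=True) and reuses the bucket name from the question:

import os
import boto3

s3 = boto3.client('s3')
for obj in s3.list_objects(Bucket='my_bucket_name')['Contents']:
    key = obj['Key']
    if key.endswith('/'):
        continue  # skip zero-byte "folder" placeholder objects
    local_dir = os.path.dirname(key)
    if local_dir:
        os.makedirs(local_dir, exist_ok=True)  # create my_folder/ (possibly nested) before writing
    s3.download_file('my_bucket_name', key, key)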

An easier method would be to use the AWS Command-Line Interface (CLI), which will do all of this work for you, e.g.:

aws s3 cp --recursive s3://my_bucket_name local_folder

There is also a sync option that will only copy new and modified files.
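For example, something along these lines should mirror the cp command above (bucket and folder names are placeholders):

aws s3 sync s3://my_bucket_name local_folder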

I am currently achieving the task by using the following:

#!/usr/bin/python
import boto3
import os

s3 = boto3.client('s3')
list = s3.list_objects(Bucket='bucket')['Contents']
for s3_key in list:
    s3_object = s3_key['Key']
    if not s3_object.endswith("/"):
        s3.download_file('bucket', s3_object, s3_object)
    else:
        if not os.path.exists(s3_object):
            os.makedirs(s3_object)

Although it works, I'm not sure it's a good way to do it. I'm leaving it here to help other users and further answers, in the hope of a better way of achieving this.

I had the same needs and created the following function that downloads the files recursively.

The directories are created locally only if they contain files.

import boto3
import os


def download_dir(client, resource, dist, local='/tmp', bucket='your_bucket'):
    paginator = client.get_paginator('list_objects')
    for result in paginator.paginate(Bucket=bucket, Delimiter='/', Prefix=dist):
        if result.get('CommonPrefixes') is not None:
            for subdir in result.get('CommonPrefixes'):
                download_dir(client, resource, subdir.get('Prefix'), local, bucket)
        for file in result.get('Contents', []):
            dest_pathname = os.path.join(local, file.get('Key'))
            if not os.path.exists(os.path.dirname(dest_pathname)):
                os.makedirs(os.path.dirname(dest_pathname))
            if not file.get('Key').endswith('/'):
                resource.meta.client.download_file(bucket, file.get('Key'), dest_pathname)

The function is called this way:

def _start():
    client = boto3.client('s3')
    resource = boto3.resource('s3')
    download_dir(client, resource, 'clientconf/', '/tmp', bucket='my-bucket')

Fetching all the files in one go is a very bad idea; you should fetch them in batches.
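As a rough sketch of what batching might look like, boto3's paginator returns pages of at most 1000 keys, so each page can be treated as one batch (the bucket and prefix below are placeholders):

import boto3

s3 = boto3.client('s3')
paginator = s3.get_paginator('list_objects_v2')
for page in paginator.paginate(Bucket='my_bucket_name', Prefix='some/prefix/'):
    batch = [obj['Key'] for obj in page.get('Contents', [])]
    # download or otherwise process this batch of at most 1000 keys
    # before requesting the next page
    print(len(batch))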

One implementation I use to fetch a particular folder (directory) from S3 is:

from boto3.session import Session


def get_directory(directory_path, download_path, exclude_file_names):
    # prepare session
    session = Session(aws_access_key_id, aws_secret_access_key, region_name)

    # get instances for resource and bucket
    resource = session.resource('s3')
    bucket = resource.Bucket(bucket_name)

    for s3_key in resource.meta.client.list_objects(Bucket=bucket_name, Prefix=directory_path)['Contents']:
        s3_object = s3_key['Key']
        if s3_object not in exclude_file_names:
            bucket.download_file(s3_object, download_path + str(s3_object.split('/')[-1]))

If you still want to get the whole bucket, use the CLI as @John Rotenstein mentioned, like below:

aws s3 cp --recursive s3://bucket_name download_path

import os
import boto3

# initiate s3 resource
s3 = boto3.resource('s3')

# select bucket
my_bucket = s3.Bucket('my_bucket_name')

# download files into the current directory
for s3_object in my_bucket.objects.all():
    # Need to split s3_object.key into path and file name, otherwise it gives a "file not found" error.
    path, filename = os.path.split(s3_object.key)
    my_bucket.download_file(s3_object.key, filename)

Better late than never :) The previous answer with the paginator is really good. However, it is recursive, and you might end up hitting Python's recursion limit. Here's an alternate approach, with a couple of extra checks.

import os
import errno
import boto3




def assert_dir_exists(path):
    """
    Checks if the directory tree in path exists. If not, it creates it.
    :param path: the path to check if it exists
    """
    try:
        os.makedirs(path)
    except OSError as e:
        if e.errno != errno.EEXIST:
            raise


def download_dir(client, bucket, path, target):
    """
    Downloads recursively the given S3 path to the target directory.
    :param client: S3 client to use.
    :param bucket: the name of the bucket to download from
    :param path: The S3 directory to download.
    :param target: the local directory to download the files to.
    """

    # Handle missing / at end of prefix
    if not path.endswith('/'):
        path += '/'

    paginator = client.get_paginator('list_objects_v2')
    for result in paginator.paginate(Bucket=bucket, Prefix=path):
        # Download each file individually
        for key in result['Contents']:
            # Calculate relative path
            rel_path = key['Key'][len(path):]
            # Skip paths ending in /
            if not key['Key'].endswith('/'):
                local_file_path = os.path.join(target, rel_path)
                # Make sure directories exist
                local_file_dir = os.path.dirname(local_file_path)
                assert_dir_exists(local_file_dir)
                client.download_file(bucket, key['Key'], local_file_path)




client = boto3.client('s3')


download_dir(client, 'bucket-name', 'path/to/data', 'downloads')

I have a workaround for this that runs the AWS CLI in the same process.

Install awscli as a Python library:

pip install awscli

Then define this function:

import os

from awscli.clidriver import create_clidriver


def aws_cli(*cmd):
    old_env = dict(os.environ)
    try:
        # Environment
        env = os.environ.copy()
        env['LC_CTYPE'] = u'en_US.UTF'
        os.environ.update(env)

        # Run awscli in the same process
        exit_code = create_clidriver().main(list(cmd))

        # Deal with problems
        if exit_code > 0:
            raise RuntimeError('AWS CLI exited with code {}'.format(exit_code))
    finally:
        os.environ.clear()
        os.environ.update(old_env)

To execute:

aws_cli('s3', 'sync', '/path/to/source', 's3://bucket/destination', '--delete')

import os
import boto3

# Setup (not shown in the original snippet): bucket handle as used in the other answers
s3 = boto3.resource('s3')
my_bucket = s3.Bucket('my_bucket_name')

for objs in my_bucket.objects.all():
    print(objs.key)
    path = '/tmp/' + os.sep.join(objs.key.split(os.sep)[:-1])
    try:
        if not os.path.exists(path):
            os.makedirs(path)
        my_bucket.download_file(objs.key, '/tmp/' + objs.key)
    except FileExistsError as fe:
        print(objs.key + ' exists')

This code will download the content into the /tmp/ directory. You can change the directory if you want.

When working with buckets that have 1000+ objects, it's necessary to implement a solution that uses the NextContinuationToken on sequential sets of, at most, 1000 keys. This solution first compiles a list of objects, then iteratively creates the specified directories and downloads the existing objects.

import boto3
import os

s3_client = boto3.client('s3')


def download_dir(prefix, local, bucket, client=s3_client):
    """
    params:
    - prefix: pattern to match in s3
    - local: local path to folder in which to place files
    - bucket: s3 bucket with target contents
    - client: initialized s3 client object
    """
    keys = []
    dirs = []
    next_token = ''
    base_kwargs = {
        'Bucket': bucket,
        'Prefix': prefix,
    }
    while next_token is not None:
        kwargs = base_kwargs.copy()
        if next_token != '':
            kwargs.update({'ContinuationToken': next_token})
        results = client.list_objects_v2(**kwargs)
        contents = results.get('Contents')
        for i in contents:
            k = i.get('Key')
            if k[-1] != '/':
                keys.append(k)
            else:
                dirs.append(k)
        next_token = results.get('NextContinuationToken')
    for d in dirs:
        dest_pathname = os.path.join(local, d)
        if not os.path.exists(os.path.dirname(dest_pathname)):
            os.makedirs(os.path.dirname(dest_pathname))
    for k in keys:
        dest_pathname = os.path.join(local, k)
        if not os.path.exists(os.path.dirname(dest_pathname)):
            os.makedirs(os.path.dirname(dest_pathname))
        client.download_file(bucket, k, dest_pathname)

If you want to call a bash script using Python, here is a simple method to load files from a folder in an S3 bucket into a local folder (on a Linux machine):

import boto3
import subprocess
import os

###TOEDIT###
my_bucket_name = "your_my_bucket_name"
bucket_folder_name = "your_bucket_folder_name"
local_folder_path = "your_local_folder_path"
###TOEDIT###

# 1. Load the list of files existing in the bucket folder
FILES_NAMES = []
s3 = boto3.resource('s3')
my_bucket = s3.Bucket('{}'.format(my_bucket_name))
for object_summary in my_bucket.objects.filter(Prefix="{}/".format(bucket_folder_name)):
    # print(object_summary.key)
    FILES_NAMES.append(object_summary.key)

# 2. List only new files that do not exist in the local folder (so as not to copy everything!)
new_filenames = list(set(FILES_NAMES) - set(os.listdir(local_folder_path)))

# 3. Time to load files into your destination folder
for new_filename in new_filenames:
    upload_S3files_CMD = """aws s3 cp s3://{}/{}/{} {}""".format(my_bucket_name, bucket_folder_name, new_filename, local_folder_path)

    subprocess_call = subprocess.call([upload_S3files_CMD], shell=True)
    if subprocess_call != 0:
        print("ALERT: loading files not working correctly, please re-check new loaded files")

I had a similar requirement and, with help from reading the above solutions and other websites, I came up with the following script. Just wanted to share it in case it helps anyone.

from boto3.session import Session
import os


def sync_s3_folder(access_key_id, secret_access_key, bucket_name, folder, destination_path):
    session = Session(aws_access_key_id=access_key_id, aws_secret_access_key=secret_access_key)
    s3 = session.resource('s3')
    your_bucket = s3.Bucket(bucket_name)
    for s3_file in your_bucket.objects.all():
        if folder in s3_file.key:
            file = os.path.join(destination_path, s3_file.key.replace('/', '\\'))
            if not os.path.exists(os.path.dirname(file)):
                os.makedirs(os.path.dirname(file))
            your_bucket.download_file(s3_file.key, file)


sync_s3_folder(access_key_id, secret_access_key, bucket_name, folder, destination_path)

Reposting @glefait's answer with an if condition at the end to avoid OS error 20. The first key it gets is the folder name itself, which cannot be written to the destination path.

def download_dir(client, resource, dist, local='/tmp', bucket='your_bucket'):
    paginator = client.get_paginator('list_objects')
    for result in paginator.paginate(Bucket=bucket, Delimiter='/', Prefix=dist):
        if result.get('CommonPrefixes') is not None:
            for subdir in result.get('CommonPrefixes'):
                download_dir(client, resource, subdir.get('Prefix'), local, bucket)
        for file in result.get('Contents', []):
            print("Content: ", result)
            dest_pathname = os.path.join(local, file.get('Key'))
            print("Dest path: ", dest_pathname)
            if not os.path.exists(os.path.dirname(dest_pathname)):
                print("here last if")
                os.makedirs(os.path.dirname(dest_pathname))
            print("else file key: ", file.get('Key'))
            if not file.get('Key') == dist:
                print("Key not equal? ", file.get('Key'))
                resource.meta.client.download_file(bucket, file.get('Key'), dest_pathname)

I have been running into this problem for a while, and across all of the different forums I've been through I haven't seen a full end-to-end snippet of what works. So I went ahead and took all the pieces (added some things of my own) and created a full end-to-end S3 downloader!

Not only will this download files automatically, but if the S3 files are in subdirectories, it will create them on local storage. In my application's case I needed to set permissions and owners as well, so I added that too (it can be commented out if not needed).

This has been tested and works in a Docker environment (K8s), but I have added the environment variables in the script in case you want to test/run it locally.

I hope this helps someone out with automating S3 downloads. I also welcome any advice, info, etc. on how this could be better optimized if needed.

#!/usr/bin/python3
import gc
import logging
import os
import signal
import sys
import time
from datetime import datetime

import boto
from boto.exception import S3ResponseError
from pythonjsonlogger import jsonlogger

formatter = jsonlogger.JsonFormatter('%(message)%(levelname)%(name)%(asctime)%(filename)%(lineno)%(funcName)')

json_handler_out = logging.StreamHandler()
json_handler_out.setFormatter(formatter)

# Manual Testing Variables If Needed
# os.environ["DOWNLOAD_LOCATION_PATH"] = "some_path"
# os.environ["BUCKET_NAME"] = "some_bucket"
# os.environ["AWS_ACCESS_KEY"] = "some_access_key"
# os.environ["AWS_SECRET_KEY"] = "some_secret"
# os.environ["LOG_LEVEL_SELECTOR"] = "DEBUG, INFO, or ERROR"

# Setting Log Level Test
logger = logging.getLogger('json')
logger.addHandler(json_handler_out)
logger_levels = {
    'ERROR': logging.ERROR,
    'INFO': logging.INFO,
    'DEBUG': logging.DEBUG
}
logger_level_selector = os.environ["LOG_LEVEL_SELECTOR"]
logger.setLevel(logger_level_selector)

# Getting Date/Time
now = datetime.now()
logger.info("Current date and time : ")
logger.info(now.strftime("%Y-%m-%d %H:%M:%S"))

# Establishing S3 Variables and Download Location
download_location_path = os.environ["DOWNLOAD_LOCATION_PATH"]
bucket_name = os.environ["BUCKET_NAME"]
aws_access_key_id = os.environ["AWS_ACCESS_KEY"]
aws_access_secret_key = os.environ["AWS_SECRET_KEY"]
logger.debug("Bucket: %s" % bucket_name)
logger.debug("Key: %s" % aws_access_key_id)
logger.debug("Secret: %s" % aws_access_secret_key)
logger.debug("Download location path: %s" % download_location_path)

# Creating Download Directory
if not os.path.exists(download_location_path):
    logger.info("Making download directory")
    os.makedirs(download_location_path)


# Signal Hooks are fun
class GracefulKiller:
    kill_now = False

    def __init__(self):
        signal.signal(signal.SIGINT, self.exit_gracefully)
        signal.signal(signal.SIGTERM, self.exit_gracefully)

    def exit_gracefully(self, signum, frame):
        self.kill_now = True


# Downloading from S3 Bucket
def download_s3_bucket():
    conn = boto.connect_s3(aws_access_key_id, aws_access_secret_key)
    logger.debug("Connection established: ")
    bucket = conn.get_bucket(bucket_name)
    logger.debug("Bucket: %s" % str(bucket))
    bucket_list = bucket.list()
    # logger.info("Number of items to download: {0}".format(len(bucket_list)))

    for s3_item in bucket_list:
        key_string = str(s3_item.key)
        logger.debug("S3 Bucket Item to download: %s" % key_string)
        s3_path = download_location_path + "/" + key_string
        logger.debug("Downloading to: %s" % s3_path)
        local_dir = os.path.dirname(s3_path)

        if not os.path.exists(local_dir):
            logger.info("Local directory doesn't exist, creating it... %s" % local_dir)
            os.makedirs(local_dir)
            logger.info("Updating local directory permissions to %s" % local_dir)
            # Comment or Uncomment Permissions based on Local Usage
            os.chmod(local_dir, 0o775)
            os.chown(local_dir, 60001, 60001)
        logger.debug("Local directory for download: %s" % local_dir)
        try:
            logger.info("Downloading File: %s" % key_string)
            s3_item.get_contents_to_filename(s3_path)
            logger.info("Successfully downloaded File: %s" % s3_path)
            # Updating Permissions
            logger.info("Updating Permissions for %s" % str(s3_path))
            # Comment or Uncomment Permissions based on Local Usage
            os.chmod(s3_path, 0o664)
            os.chown(s3_path, 60001, 60001)
        except (OSError, S3ResponseError) as e:
            logger.error("Fatal error in s3_item.get_contents_to_filename", exc_info=True)
            # logger.error("Exception in file download from S3: {}".format(e))
            continue
        logger.info("Deleting %s from S3 Bucket" % str(s3_item.key))
        s3_item.delete()


def main():
    killer = GracefulKiller()
    while not killer.kill_now:
        logger.info("Checking for new files on S3 to download...")
        download_s3_bucket()
        logger.info("Done checking for new files, will check in 120s...")
        gc.collect()
        sys.stdout.flush()
        time.sleep(120)


if __name__ == '__main__':
    main()

From the AWS S3 Docs (How do I use folders in an S3 bucket?):

In Amazon S3, buckets and objects are the primary resources, and objects are stored in buckets. Amazon S3 has a flat structure instead of a hierarchy like you would see in a file system. However, for the sake of organizational simplicity, the Amazon S3 console supports the folder concept as a means of grouping objects. Amazon S3 does this by using a shared name prefix for objects (that is, objects have names that begin with a common string). Object names are also referred to as key names.

For example, you can create a folder on the console called photos and store an object called myphoto.jpg in it. The object is then stored with the key name photos/myphoto.jpg, where photos/ is the prefix.
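As a quick illustration of the shared-prefix idea, listing with a Delimiter makes the console-style "folders" show up as CommonPrefixes (the bucket and prefix here are just made-up examples):

import boto3

s3 = boto3.client('s3')
resp = s3.list_objects_v2(Bucket='mybucket', Prefix='photos/', Delimiter='/')
for cp in resp.get('CommonPrefixes', []):
    print('folder-like prefix:', cp['Prefix'])  # e.g. photos/2021/
for obj in resp.get('Contents', []):
    print('object key:', obj['Key'])            # e.g. photos/myphoto.jpg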

To download all files from "mybucket" into the current working directory, respecting the bucket's emulated directory structure (creating the folders from the bucket if they don't already exist locally):

import boto3
import os

bucket_name = "mybucket"
s3 = boto3.client("s3")
objects = s3.list_objects(Bucket=bucket_name)["Contents"]
for s3_object in objects:
    s3_key = s3_object["Key"]
    path, filename = os.path.split(s3_key)
    if len(path) != 0 and not os.path.exists(path):
        os.makedirs(path)
    if not s3_key.endswith("/"):
        download_to = path + '/' + filename if path else filename
        s3.download_file(bucket_name, s3_key, download_to)

import boto3, os

s3 = boto3.client('s3')


def download_bucket(bucket):
    paginator = s3.get_paginator('list_objects_v2')
    pages = paginator.paginate(Bucket=bucket)
    for page in pages:
        if 'Contents' in page:
            for obj in page['Contents']:
                os.path.dirname(obj['Key']) and os.makedirs(os.path.dirname(obj['Key']), exist_ok=True)
                try:
                    s3.download_file(bucket, obj['Key'], obj['Key'])
                except NotADirectoryError:
                    pass


# Change bucket_name to the name of the bucket that you want to download
download_bucket(bucket_name)

This should work for any number of objects (also when there are more than 1000 of them). Each paginator page can contain up to 1000 objects. Note the extra argument exist_ok=True in the os.makedirs call, which keeps it from throwing an error when the path already exists.

A lot of the solutions here get quite complicated. If you're looking for something simpler, cloudpathlib wraps this use case in a nice way that will download directories or files.

from cloudpathlib import CloudPath


cp = CloudPath("s3://bucket/product/myproject/2021-02-15/")
cp.download_to("local_folder")


Note: for large folders with many files, awscli on the command line is likely faster.

I have updated Grant's answer to run in parallel; it's much faster, if anyone is interested:

from concurrent import futures
import os
import boto3


def download_dir(prefix, local, bucket):

    client = boto3.client('s3')

    def create_folder_and_download_file(k):
        dest_pathname = os.path.join(local, k)
        if not os.path.exists(os.path.dirname(dest_pathname)):
            os.makedirs(os.path.dirname(dest_pathname))
        print(f'downloading {k} to {dest_pathname}')
        client.download_file(bucket, k, dest_pathname)

    keys = []
    dirs = []
    next_token = ''
    base_kwargs = {
        'Bucket': bucket,
        'Prefix': prefix,
    }
    while next_token is not None:
        kwargs = base_kwargs.copy()
        if next_token != '':
            kwargs.update({'ContinuationToken': next_token})
        results = client.list_objects_v2(**kwargs)
        contents = results.get('Contents')
        for i in contents:
            k = i.get('Key')
            if k[-1] != '/':
                keys.append(k)
            else:
                dirs.append(k)
        next_token = results.get('NextContinuationToken')
    for d in dirs:
        dest_pathname = os.path.join(local, d)
        if not os.path.exists(os.path.dirname(dest_pathname)):
            os.makedirs(os.path.dirname(dest_pathname))
    with futures.ThreadPoolExecutor() as executor:
        futures.wait(
            [executor.submit(create_folder_and_download_file, k) for k in keys],
            return_when=futures.FIRST_EXCEPTION,
        )

Yet another parallel downloader, using asyncio / aioboto:

import os, time
import asyncio
from itertools import chain
import json
from typing import List
from json.decoder import WHITESPACE
import logging
from functools import partial
from pprint import pprint as pp

# Third Party
import asyncpool
import aiobotocore.session
import aiobotocore.config

_NUM_WORKERS = 50

bucket_name = 'test-data'
bucket_prefix = 'etl2/test/20210330/f_api'


async def save_to_file(s3_client, bucket: str, key: str):

    response = await s3_client.get_object(Bucket=bucket, Key=key)
    async with response['Body'] as stream:
        content = await stream.read()

    if 1:
        fn = f'out/downloaded/{bucket_name}/{key}'

        dn = os.path.dirname(fn)
        if not os.path.isdir(dn):
            os.makedirs(dn, exist_ok=True)
        if 1:
            with open(fn, 'wb') as fh:
                fh.write(content)
            print(f'Downloaded to: {fn}')

    return [0]


async def go(bucket: str, prefix: str) -> List[dict]:
    """
    Returns list of dicts of object contents

    :param bucket: s3 bucket
    :param prefix: s3 bucket prefix
    :return: list of download statuses
    """
    logging.basicConfig(level=logging.INFO)
    logger = logging.getLogger()

    session = aiobotocore.session.AioSession()
    config = aiobotocore.config.AioConfig(max_pool_connections=_NUM_WORKERS)
    contents = []
    async with session.create_client('s3', config=config) as client:
        worker_co = partial(save_to_file, client, bucket)
        async with asyncpool.AsyncPool(None, _NUM_WORKERS, 's3_work_queue', logger, worker_co,
                                       return_futures=True, raise_on_join=True, log_every_n=10) as work_pool:
            # list s3 objects using paginator
            paginator = client.get_paginator('list_objects')
            async for result in paginator.paginate(Bucket=bucket, Prefix=prefix):
                for c in result.get('Contents', []):
                    contents.append(await work_pool.push(c['Key'], client))

    # retrieve results from futures
    contents = [c.result() for c in contents]
    return list(chain.from_iterable(contents))


def S3_download_bucket_files():
    s = time.perf_counter()
    _loop = asyncio.get_event_loop()
    _result = _loop.run_until_complete(go(bucket_name, bucket_prefix))
    assert sum(_result) == 0, _result
    print(_result)
    elapsed = time.perf_counter() - s
    print(f"{__file__} executed in {elapsed:0.2f} seconds.")

It will first fetch the list of files from S3 and then download them in parallel, with aioboto and _NUM_WORKERS = 50 reading data concurrently from the network.

There are very minor differences in the way S3 organizes files and the way Windows does. Here is a simple self-documenting example that accounts for those differences.

Also: you can think of an Amazon filename as a normal string. It doesn't really represent a folder. Amazon SIMULATES folders, so if you just stuff a file into the name of a folder that doesn't exist on your system, it cannot figure out where to put it. So you must MAKE a folder on your system for each simulated folder from S3. If you have a folder within a folder, don't use "mkdir(path)"; it won't work. You have to use "makedirs(path)". One more thing! -> PC file paths are weirdly formatted. Amazon uses "/" and the PC uses "\", and it must be the same for the whole filename. Check out the code block below (which also shows authentication).

In my example, I have a folder in my bucket called "iTovenGUIImages/gui_media". I want to put it in a folder on my system that MAY not exist yet. The folder on my system has its own special prefix that can be whatever you want in your system, as long as it's formatted like a folder path.

import boto3
import cred
import os

locale_file_Imagedirectory = r"C:\\Temp\\App Data\\iToven AI\\"  # This is where all GUI files for iToven AI exist on PC


def downloadImageDirectoryS3(remoteDirectoryName, desired_parent_folder):
    my_bucket = 'itovenbucket'
    s3_resource = boto3.resource('s3', aws_access_key_id=cred.AWSAccessKeyId,
                                 aws_secret_access_key=cred.AWSSecretKey)
    bucket = s3_resource.Bucket(my_bucket)
    for obj in bucket.objects.filter(Prefix=remoteDirectoryName):
        pcVersionPrefix = remoteDirectoryName.replace("/", r"\\")
        isolatedFileName = obj.key.replace(remoteDirectoryName, "")
        clientSideFileName = desired_parent_folder + pcVersionPrefix + isolatedFileName
        print(clientSideFileName)  # Client-Side System File Structure
        if not os.path.exists(desired_parent_folder + pcVersionPrefix):  # CREATE DIRECTORIES FOR EACH FOLDER RECURSIVELY
            os.makedirs(desired_parent_folder + pcVersionPrefix)
        if obj.key not in desired_parent_folder + pcVersionPrefix:
            bucket.download_file(obj.key, clientSideFileName)  # save to new path




downloadImageDirectoryS3(r"iTovenGUIImagesPC/gui_media/", locale_file_Imagedirectory)