How to get more than 1000 objects from S3 using list_objects_v2?

I have more than 500,000 objects on S3, and I am trying to get the size of each object. I am using the following Python code to do that:

import boto3


bucket = 'bucket'
prefix = 'prefix'


contents = boto3.client('s3').list_objects_v2(Bucket=bucket, MaxKeys=1000, Prefix=prefix)["Contents"]

for c in contents:
    print(c["Size"])

But it only gives me the sizes of the first 1000 objects. According to the documentation, a single call returns at most 1000. Is there any way to get more?


Use the NextContinuationToken returned in each response as the ContinuationToken parameter for the subsequent call, until the IsTruncated value returned in the response is false.

This can be factored into a neat generator function:

def get_all_s3_objects(s3, **base_kwargs):
    continuation_token = None
    while True:
        list_kwargs = dict(MaxKeys=1000, **base_kwargs)
        if continuation_token:
            list_kwargs['ContinuationToken'] = continuation_token
        response = s3.list_objects_v2(**list_kwargs)
        yield from response.get('Contents', [])
        if not response.get('IsTruncated'):  # At the end of the list?
            break
        continuation_token = response.get('NextContinuationToken')


for file in get_all_s3_objects(boto3.client('s3'), Bucket=bucket, Prefix=prefix):
    print(file['Size'])
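
Since this is a generator, it streams one page of results at a time rather than holding all 500,000 listings in memory at once. As a minimal sketch of totalling the sizes with it (assuming the same bucket and prefix placeholder variables from the question):

import boto3

total_bytes = 0
object_count = 0
for obj in get_all_s3_objects(boto3.client('s3'), Bucket=bucket, Prefix=prefix):
    total_bytes += obj['Size']  # Size is reported in bytes
    object_count += 1

print(f'{object_count} objects, {total_bytes} bytes in total')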

If you don't NEED to use boto3.client, you can use boto3.resource to get a complete list of your files:

s3r = boto3.resource('s3')
bucket = s3r.Bucket('bucket_name')
files_in_bucket = list(bucket.objects.all())

Then, to get the sizes, just:

sizes = [f.size for f in files_in_bucket]

Depending on the size of your bucket this might take a minute.
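
The resource API can also filter server-side by prefix, and it pages through results for you the same way objects.all() does. A minimal sketch, assuming the 'bucket_name' placeholder from above and the prefix from the question:

import boto3

s3r = boto3.resource('s3')
bucket = s3r.Bucket('bucket_name')

# objects.filter() paginates behind the scenes, just like objects.all()
for obj in bucket.objects.filter(Prefix='prefix'):
    print(obj.key, obj.size)  # size in bytes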

The built-in boto3 Paginator class is the easiest way to overcome the 1000-record limit of list_objects_v2. It can be used as follows:

s3 = boto3.client('s3')


paginator = s3.get_paginator('list_objects_v2')
pages = paginator.paginate(Bucket='bucket', Prefix='prefix')


for page in pages:
    for obj in page.get('Contents', []):  # 'Contents' is absent when a page is empty
        print(obj['Size'])

For more details: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3.html#S3.Paginator.ListObjectsV2
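
The page iterator returned by paginate() also has a search() method that takes a JMESPath expression, which collapses the two nested loops above into one. A minimal sketch with the same placeholder bucket and prefix (as I understand it, search() can yield None for a page with no 'Contents' key, hence the guard):

import boto3

s3 = boto3.client('s3')
paginator = s3.get_paginator('list_objects_v2')
pages = paginator.paginate(Bucket='bucket', Prefix='prefix')

# Project the Size of every object across all pages with JMESPath
for size in pages.search('Contents[].Size'):
    if size is not None:  # guard against empty pages
        print(size)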