S3 URLs - get bucket name and path

I have a variable holding an AWS S3 URL:

s3://bucket_name/folder1/folder2/file1.json

I want to get the bucket_name in one variable and the rest, i.e. /folder1/folder2/file1.json, in another. I tried the regular expression below, which gets the bucket_name, but I'm not sure whether there is a better way.

import re

m = re.search(r'(?<=s3:\/\/)[^\/]+', 's3://bucket_name/folder1/folder2/file1.json')
print(m.group(0))

How do I get the rest, i.e. folder1/folder2/file1.json?

I have checked whether boto3 has a feature to extract the bucket_name and key from a URL, but couldn't find it.


Since it's just a normal URL, you can use urlparse to get all the parts of the URL.

>>> from urlparse import urlparse
>>> o = urlparse('s3://bucket_name/folder1/folder2/file1.json', allow_fragments=False)
>>> o
ParseResult(scheme='s3', netloc='bucket_name', path='/folder1/folder2/file1.json', params='', query='', fragment='')
>>> o.netloc
'bucket_name'
>>> o.path
'/folder1/folder2/file1.json'

You may have to remove the leading slash from the key, as another answer suggests.

o.path.lstrip('/')

With Python 3 urlparse moved to urllib.parse so use:

from urllib.parse import urlparse
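Putting the pieces together, a minimal Python 3 helper (the function name here is my own, not part of any library) might look like this:

```python
from urllib.parse import urlparse

def parse_s3_url(url):
    # allow_fragments=False keeps any '#' in the key instead of treating it as a fragment
    parsed = urlparse(url, allow_fragments=False)
    return parsed.netloc, parsed.path.lstrip('/')

bucket, key = parse_s3_url('s3://bucket_name/folder1/folder2/file1.json')
```
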

Here's a class that takes care of all the details.

try:
    from urlparse import urlparse
except ImportError:
    from urllib.parse import urlparse


class S3Url(object):
    """
    >>> s = S3Url("s3://bucket/hello/world")
    >>> s.bucket
    'bucket'
    >>> s.key
    'hello/world'
    >>> s.url
    's3://bucket/hello/world'

    >>> s = S3Url("s3://bucket/hello/world?qwe1=3#ddd")
    >>> s.bucket
    'bucket'
    >>> s.key
    'hello/world?qwe1=3#ddd'
    >>> s.url
    's3://bucket/hello/world?qwe1=3#ddd'

    >>> s = S3Url("s3://bucket/hello/world#foo?bar=2")
    >>> s.key
    'hello/world#foo?bar=2'
    >>> s.url
    's3://bucket/hello/world#foo?bar=2'
    """

    def __init__(self, url):
        self._parsed = urlparse(url, allow_fragments=False)

    @property
    def bucket(self):
        return self._parsed.netloc

    @property
    def key(self):
        if self._parsed.query:
            return self._parsed.path.lstrip('/') + '?' + self._parsed.query
        else:
            return self._parsed.path.lstrip('/')

    @property
    def url(self):
        return self._parsed.geturl()

If you want to do it with regular expressions, you can do the following:

>>> import re
>>> uri = 's3://my-bucket/my-folder/my-object.png'
>>> match = re.match(r's3:\/\/(.+?)\/(.+)', uri)
>>> match.group(1)
'my-bucket'
>>> match.group(2)
'my-folder/my-object.png'

This has the advantage that you can check for the s3 scheme rather than allowing anything there.
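As a sketch of that point, anchoring the pattern with `^s3://` makes anything that isn't an s3 URI fail fast:

```python
import re

S3_RE = re.compile(r'^s3://(.+?)/(.+)$')

def parse_s3_uri(uri):
    match = S3_RE.match(uri)
    if not match:
        # An https:// or plain path falls through to here
        raise ValueError('not an s3:// URI: %s' % uri)
    return match.groups()

bucket, key = parse_s3_uri('s3://my-bucket/my-folder/my-object.png')
```
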

For those who, like me, were trying to use urlparse to extract the key and bucket in order to create an object with boto3, there's one important detail: remove the slash from the beginning of the key.

import boto3
from urlparse import urlparse

o = urlparse('s3://bucket_name/folder1/folder2/file1.json')
bucket = o.netloc
key = o.path
client = boto3.client('s3')
client.put_object(Body='test', Bucket=bucket, Key=key.lstrip('/'))

It took me a while to realize this, because boto3 doesn't throw any exception.
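The pitfall is easy to demonstrate without touching AWS: the parsed path keeps its leading slash, so the key would silently differ from what you expect unless it is stripped:

```python
from urllib.parse import urlparse

o = urlparse('s3://bucket_name/folder1/folder2/file1.json')
# The raw path still carries the leading slash...
assert o.path == '/folder1/folder2/file1.json'
# ...so strip it before passing the key to put_object
assert o.path.lstrip('/') == 'folder1/folder2/file1.json'
```
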

A solution that works without urllib or re (it also handles the leading slash):

def split_s3_path(s3_path):
    path_parts = s3_path.replace("s3://", "").split("/")
    bucket = path_parts.pop(0)
    key = "/".join(path_parts)
    return bucket, key

To run:

bucket, key = split_s3_path("s3://my-bucket/some_folder/another_folder/my_file.txt")

Returns:

bucket: my-bucket
key: some_folder/another_folder/my_file.txt
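One edge case worth noting (shown here as a self-contained sketch that repeats the function): a bucket-only URL yields an empty key rather than raising an error, which callers may want to check for.

```python
def split_s3_path(s3_path):
    path_parts = s3_path.replace("s3://", "").split("/")
    bucket = path_parts.pop(0)
    key = "/".join(path_parts)
    return bucket, key

# A URL with no key part returns an empty string for the key
assert split_s3_path("s3://my-bucket") == ("my-bucket", "")
assert split_s3_path("s3://my-bucket/a/b.txt") == ("my-bucket", "a/b.txt")
```
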

Here it is as a one-liner using regex:

import re


s3_path = "s3://bucket/path/to/key"


bucket, key = re.match(r"s3:\/\/(.+?)\/(.+)", s3_path).groups()

This is a nice project:

s3path is a pathlib extension for the AWS S3 service

>>> from s3path import S3Path
>>> path = S3Path.from_uri('s3://bucket_name/folder1/folder2/file1.json')
>>> path.bucket
'/bucket_name'
>>> path.key
'folder1/folder2/file1.json'
>>> list(path.key.parents)
[S3Path('folder1/folder2'), S3Path('folder1'), S3Path('.')]

Pretty easy to accomplish with a single line of built-in string methods...

s3_filepath = "s3://bucket-name/and/some/key.txt"
bucket, key = s3_filepath.replace("s3://", "").split("/", 1)

I use the following regex:

^(?:[sS]3:\/\/)?([a-zA-Z0-9\._-]+)(?:\/)(.+)$

If it matches, the S3 parts are parsed as follows:

  • match group1 => S3 bucket name
  • match group2 => S3 object name

This pattern handles bucket paths with or without the s3:// URI prefix.

If you want to allow other legal bucket-name characters, modify the [a-zA-Z0-9\._-] part of the pattern to include other characters as needed.
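The same pattern can be sketched in Python form; note the character class is written [sS], since [s|S] would also match a literal '|':

```python
import re

# Python version of the pattern above: optional s3:// prefix in either case
S3_PATTERN = re.compile(r'^(?:[sS]3://)?([a-zA-Z0-9._-]+)/(.+)$')

# Matches with the prefix, with an upper-case prefix, and without it
for uri in ('s3://my-bucket/key.txt', 'S3://my-bucket/key.txt', 'my-bucket/key.txt'):
    bucket, key = S3_PATTERN.match(uri).groups()
    assert (bucket, key) == ('my-bucket', 'key.txt')
```
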

Complete JS example (in TypeScript form):

const S3_URI_PATTERN = '^(?:[sS]3:\\/\\/)?([a-zA-Z0-9\\._-]+)(?:\\/)(.+)$';

export interface S3UriParseResult {
    bucket: string;
    name: string;
}

export class S3Helper {
    /**
     * @param uri an S3 object URI, with or without the s3:// prefix
     */
    static parseUri(uri: string): S3UriParseResult {
        const re = new RegExp(S3_URI_PATTERN);
        const match = re.exec(uri);
        if (!match || match.length !== 3) {
            throw new Error('Invalid S3 object URI');
        }
        return {
            bucket: match[1],
            name: match[2],
        };
    }
}


A more recent option is to use cloudpathlib, which implements pathlib functions for files on cloud services (including S3, Google Cloud Storage and Azure Blob Storage).

In addition to those functions, it's easy to get the bucket and the key for your S3 paths.

from cloudpathlib import S3Path


path = S3Path("s3://bucket_name/folder1/folder2/file1.json")


path.bucket
#> 'bucket_name'


path.key
#> 'folder1/folder2/file1.json'

The simplest thing I do is:

s = 's3://bucket/path1/path2/file.txt'
s1 = s.split('/', 3)
bucket = s1[2]
object_key = s1[3]

This can also be done by slicing off the s3:// prefix:

s3_uri = 's3://bucket_name/folder1/folder2/file1.json'
bucket_name, key = s3_uri[5:].split('/', 1)