How to import a text file on AWS S3 into pandas without writing to disk

I have a text file saved on S3 which is a tab-delimited table. I want to load it into pandas but cannot save it locally first because I am running on a Heroku server. Here is what I have so far.

import io
import boto3
import os
import pandas as pd


os.environ["AWS_ACCESS_KEY_ID"] = "xxxxxxxx"
os.environ["AWS_SECRET_ACCESS_KEY"] = "xxxxxxxx"


s3_client = boto3.client('s3')
response = s3_client.get_object(Bucket="my_bucket",Key="filename.txt")
file = response["Body"]

pd.read_csv(file, header=14, delimiter="\t", low_memory=False)

the error is

OSError: Expected file path name or file-like object, got <class 'bytes'> type

How can I convert the response body into a format pandas will accept?

pd.read_csv(io.StringIO(file), header=14, delimiter="\t", low_memory=False)

returns

TypeError: initial_value must be str or None, not StreamingBody

pd.read_csv(io.BytesIO(file), header=14, delimiter="\t", low_memory=False)

returns

TypeError: 'StreamingBody' does not support the buffer interface

UPDATE - using the following worked:

file = response["Body"].read()

and

pd.read_csv(io.BytesIO(file), header=14, delimiter="\t", low_memory=False)

pandas uses boto for read_csv, so you should be able to:

import boto
data = pd.read_csv('s3://bucket....csv')

If you need boto3 because you are on python3.4+, you can

import boto3
import io
s3 = boto3.client('s3')
obj = s3.get_object(Bucket='bucket', Key='key')
df = pd.read_csv(io.BytesIO(obj['Body'].read()))

Since version 0.20.1 pandas uses s3fs, see the answer below.

With s3fs it can be done as follows:

import s3fs
import pandas as pd

fs = s3fs.S3FileSystem(anon=False)

# CSV
with fs.open('mybucket/path/to/object/foo.csv') as f:
    df = pd.read_csv(f)

# Pickle
with fs.open('mybucket/path/to/object/foo.pkl') as f:
    df = pd.read_pickle(f)

pandas can now handle S3 URLs. You can simply do:

import pandas as pd
import s3fs


df = pd.read_csv('s3://bucket-name/file.csv')

If you don't have it, you need to install s3fs first:
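
pip install s3fs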

Authentication

If your S3 bucket is private and requires authentication, you have two options:

1 - Add access credentials to your ~/.aws/credentials config file:

[default]
aws_access_key_id=AKIAIOSFODNN7EXAMPLE
aws_secret_access_key=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY

Or

2 - Set the following environment variables to their proper values (a minimal sketch follows the list):

  • AWS_ACCESS_KEY_ID
  • AWS_SECRET_ACCESS_KEY
  • AWS_SESSION_TOKEN
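
For example, a minimal sketch of the environment-variable route (the key values below are the same sample placeholders shown above, and the bucket/file names are stand-ins):

import os
import pandas as pd

# Sample placeholder credentials; substitute your own values
os.environ["AWS_ACCESS_KEY_ID"] = "AKIAIOSFODNN7EXAMPLE"
os.environ["AWS_SECRET_ACCESS_KEY"] = "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY"
# os.environ["AWS_SESSION_TOKEN"] = "..."  # only needed for temporary credentials

# s3fs picks the credentials up from the environment automatically
df = pd.read_csv('s3://bucket-name/file.csv')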

Since the files can be too big, it is not wise to load them into the dataframe all at once. Hence, read line by line and save it in a dataframe. Yes, we can also provide the chunk size in read_csv, but then we have to keep track of the number of rows read.
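
For reference, the chunksize route mentioned above could look roughly like this (a sketch assuming s3fs is installed; process is a placeholder for whatever per-chunk work is needed):

import pandas as pd

# Iterate over the file in chunks of 10,000 rows instead of loading everything at once
for chunk in pd.read_csv('s3://my_bucket/filename.txt', delimiter='\t', chunksize=10000):
    process(chunk)  # placeholder per-chunk processing step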

Hence, I came up with this approach instead:

import codecs
from io import StringIO

import pandas as pd

def create_file_object_for_streaming(self):
    print("creating file object for streaming")
    self.file_object = self.bucket.Object(key=self.package_s3_key)
    print("File object is: " + str(self.file_object))
    print("Object file created.")
    return self.file_object

# Decode the streaming body and build a small dataframe per line
for row in codecs.getreader(self.encoding)(self.response[u'Body']).readlines():
    row_string = StringIO(row)
    df = pd.read_csv(row_string, sep=",")

I also delete df once the work is done: del df

An option is to convert the csv to json via df.to_dict() and then store it as a string. Note this is only relevant if the CSV is not a requirement but you just want to quickly put the dataframe in an S3 bucket and retrieve it again.

from boto.s3.connection import S3Connection
import pandas as pd
import yaml


conn = S3Connection()
mybucket = conn.get_bucket('mybucketName')
myKey = mybucket.get_key("myKeyName")


myKey.set_contents_from_string(str(df.to_dict()))

This will convert the df to a dict string and then save it as json in S3. Later you can read it back in the same json format:

df = pd.DataFrame(yaml.load(myKey.get_contents_as_string()))

The other solutions are also good, but this is a little simpler. Yaml may not necessarily be required, but you need something to parse the json string. If the S3 file doesn't necessarily need to be a CSV, this can be a quick fix.
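
Note that the snippet above uses the legacy boto library; a rough boto3 equivalent, under the same assumptions about the bucket and key names, might look like this:

import boto3
import pandas as pd
import yaml

s3 = boto3.client('s3')

# Store the dataframe as a dict string
s3.put_object(Bucket='mybucketName', Key='myKeyName',
              Body=str(df.to_dict()).encode('utf-8'))

# Read it back and rebuild the dataframe
obj = s3.get_object(Bucket='mybucketName', Key='myKeyName')
df = pd.DataFrame(yaml.safe_load(obj['Body'].read().decode('utf-8')))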

This is now supported in the latest pandas:

http://pandas.pydata.org/pandas-docs/stable/io.html#reading-remote-files

For example:

df = pd.read_csv('s3://pandas-test/tips.csv')

For a text file, you can use the code below, with a pipe-delimited file for example:

import pandas as pd
import io
import boto3
s3_client = boto3.client('s3', use_ssl=False)
bucket = #
prefix = #
obj = s3_client.get_object(Bucket=bucket, Key=prefix + filename)
df = pd.read_fwf(io.BytesIO(obj['Body'].read()), encoding='unicode_escape', delimiter='|',
                 error_bad_lines=False, header=None, dtype=str)

For python 3.6+, Amazon now has a really nice library for using pandas with their services, called awswrangler.

import awswrangler as wr
import boto3

# Boto3 session
session = boto3.session.Session(aws_access_key_id='XXXX',
                                aws_secret_access_key='XXXX')

# awswrangler passes forward all pd.read_csv() function args
df = wr.s3.read_csv(path='s3://bucket/path/',
                    boto3_session=session,
                    skiprows=2,
                    sep=';',
                    decimal=',',
                    na_values=['--'])

To install awswrangler: pip install awswrangler

import s3fs
import pandas as pd
s3 = s3fs.S3FileSystem(profile='<profile_name>')
pd.read_csv(s3.open(<s3_path>))