Large, persistent DataFrame in pandas

As a long-time SAS user, I am exploring a switch to Python and pandas.

However, when running some tests today, I was surprised that Python ran out of memory when trying pandas.read_csv() on a 128 MB CSV file. It has about 200,000 rows and 200 columns of mostly numeric data.

With SAS, I can import a CSV file into a SAS dataset, and it can be as large as my hard drive.

Is there something analogous in pandas?

I regularly work with large files and do not have access to a distributed computing network.


In principle it shouldn't run out of memory, but there are currently memory problems with read_csv on large files caused by some complex Python internal issues (this is vague but it's been known for a long time: http://github.com/pydata/pandas/issues/407).

At the moment there isn't a perfect solution (here's a tedious one: you could transcribe the file row-by-row into a pre-allocated NumPy array or memory-mapped file--np.memmap), but it's one I'll be working on in the near future. Another solution is to read the file in smaller pieces (use iterator=True, chunksize=1000), then concatenate them with pd.concat. The problem comes in when you pull the entire text file into memory in one big slurp.
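For the first (tedious) option, here is a minimal sketch of transcribing the file row-by-row into a pre-allocated array. The file name, the shape, and the assumption of purely numeric columns after a header row are all hypothetical; np.memmap could stand in for np.empty if you want the array backed by disk rather than RAM.

import csv
import numpy as np
import pandas as pd

n_rows, n_cols = 200000, 200                        # known (or estimated) shape of the data
arr = np.empty((n_rows, n_cols), dtype=np.float64)  # allocate once up front (or use np.memmap)

with open('my_large.csv', newline='') as f:
    reader = csv.reader(f)
    header = next(reader)                           # first line holds the column names
    for i, row in enumerate(reader):
        arr[i] = [float(x) for x in row]            # parse one row at a time into the array

df = pd.DataFrame(arr, columns=header)              # wrap the filled array in a DataFrame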

Wes is of course right! I'm just chiming in to provide a slightly more complete code example. I had the same issue with a 129 MB file, which was solved by:

import pandas as pd


tp = pd.read_csv('large_dataset.csv', iterator=True, chunksize=1000)  # gives TextFileReader, which is iterable with chunks of 1000 rows.
df = pd.concat(tp, ignore_index=True)  # df is DataFrame. If errors, do `list(tp)` instead of `tp`

This is an older thread, but I just wanted to dump my workaround solution here. I initially tried the chunksize parameter (even with quite small values like 10000), but it didn't help much; I still ran into memory problems (my CSV was ~7.5 GB).

Right now, I just read chunks of the CSV file in a for loop and add them, e.g., to an SQLite database step by step:

import pandas as pd
import sqlite3
from pandas.io import sql
import subprocess


# In and output file paths
in_csv = '../data/my_large.csv'
out_sqlite = '../data/my.sqlite'


table_name = 'my_table' # name for the SQLite database table
chunksize = 100000 # number of lines to process at each iteration


# columns that should be read from the CSV file
columns = ['molecule_id','charge','db','drugsnow','hba','hbd','loc','nrb','smiles']


# Get number of lines in the CSV file
nlines = subprocess.check_output('wc -l %s' % in_csv, shell=True)
nlines = int(nlines.split()[0])


# connect to database
cnx = sqlite3.connect(out_sqlite)


# Iteratively read CSV and dump lines into the SQLite table
for i in range(0, nlines, chunksize):

    df = pd.read_csv(in_csv,
                     header=None,      # no header, define column header manually later
                     nrows=chunksize,  # number of rows to read at each iteration
                     skiprows=i)       # skip rows that were already read

    # assign the column names defined above
    df.columns = columns

    sql.to_sql(df,
               name=table_name,
               con=cnx,
               index=False,                # don't use CSV file index
               index_label='molecule_id',  # use a unique column from DataFrame as index
               if_exists='append')

cnx.close()

You can use PyTables rather than a pandas DataFrame. It is designed for large data sets and the file format is HDF5, so processing time is relatively fast.
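As a rough illustration (not the answerer's exact code), pandas' HDFStore, which is built on PyTables, lets you append CSV chunks to an on-disk HDF5 table and query it back later; the file names, the key, and the column used in the query are assumptions.

import pandas as pd

store = pd.HDFStore('my_large.h5', mode='w')
for chunk in pd.read_csv('my_large.csv', chunksize=100000):
    store.append('data', chunk, data_columns=True)  # appendable, queryable HDF5 table
store.close()

# Later, pull back only the rows you need (assumes a numeric column 'value'):
subset = pd.read_hdf('my_large.h5', 'data', where='value > 0')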

Below is my workflow.

import sqlalchemy as sa
import pandas as pd
import psycopg2


count = 0
con = sa.create_engine('postgresql://postgres:pwd@localhost:00001/r')
#con = sa.create_engine('sqlite:///XXXXX.db') SQLite
chunks = pd.read_csv('..file', chunksize=10000, encoding="ISO-8859-1",
                     sep=',', error_bad_lines=False, index_col=False, dtype='unicode')

Depending on your file size, you may want to tune the chunksize.

for chunk in chunks:
    chunk.to_sql(name='Table', if_exists='append', con=con)
    count += 1
    print(count)

Once all the data is in the database, you can query out just the parts you need.
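For example, a minimal sketch of pulling a subset back into pandas, reusing the engine from above (the WHERE clause and column name are hypothetical):

import pandas as pd

subset = pd.read_sql_query('SELECT * FROM "Table" WHERE some_column > 0', con)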

If you want to load huge CSV files, dask might be a good option. It mimics the pandas API, so it feels quite similar to pandas.

link to dask on github
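A minimal sketch, assuming dask is installed (the file name and column name are hypothetical):

import dask.dataframe as dd

ddf = dd.read_csv('my_large.csv')                   # lazy, out-of-core DataFrame
print(ddf.groupby('some_column').size().compute())  # the work is done chunk by chunk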