MemoryError when using pandas read_csv

I am trying to do something fairly simple: reading a large csv file into a pandas DataFrame.

data = pandas.read_csv(filepath, header=0, sep=DELIMITER, skiprows=2)

The code either fails with a MemoryError, or just never finishes.

Memory usage in Task Manager stopped at 506 MB, and after 5 minutes of no change and no CPU activity in the process I stopped it.

I am using pandas version 0.11.0.

I am aware that there used to be a memory problem with the file parser, but according to http://wesmckinney.com/blog/?p=543 this should have been fixed.

The file I am trying to read is 366 MB. The code above works if I cut the file down to something short (25 MB).

It has also happened that I got a pop-up telling me that it can't write to address 0x1e0baf93...

Stack trace:

Traceback (most recent call last):
  File "F:\QA ALM\Python\new WIM data\new WIM data\new_WIM_data.py", line 25, in <module>
    wimdata = pandas.read_csv(filepath, header = 0, sep = DELIMITER,skiprows = 2)
  File "C:\Program Files\Python\Anaconda\lib\site-packages\pandas\io\parsers.py", line 401, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "C:\Program Files\Python\Anaconda\lib\site-packages\pandas\io\parsers.py", line 216, in _read
    return parser.read()
  File "C:\Program Files\Python\Anaconda\lib\site-packages\pandas\io\parsers.py", line 643, in read
    df = DataFrame(col_dict, columns=columns, index=index)
  File "C:\Program Files\Python\Anaconda\lib\site-packages\pandas\core\frame.py", line 394, in __init__
    mgr = self._init_dict(data, index, columns, dtype=dtype)
  File "C:\Program Files\Python\Anaconda\lib\site-packages\pandas\core\frame.py", line 525, in _init_dict
    dtype=dtype)
  File "C:\Program Files\Python\Anaconda\lib\site-packages\pandas\core\frame.py", line 5338, in _arrays_to_mgr
    return create_block_manager_from_arrays(arrays, arr_names, axes)
  File "C:\Program Files\Python\Anaconda\lib\site-packages\pandas\core\internals.py", line 1820, in create_block_manager_from_arrays
    blocks = form_blocks(arrays, names, axes)
  File "C:\Program Files\Python\Anaconda\lib\site-packages\pandas\core\internals.py", line 1872, in form_blocks
    float_blocks = _multi_blockify(float_items, items)
  File "C:\Program Files\Python\Anaconda\lib\site-packages\pandas\core\internals.py", line 1930, in _multi_blockify
    block_items, values = _stack_arrays(list(tup_block), ref_items, dtype)
  File "C:\Program Files\Python\Anaconda\lib\site-packages\pandas\core\internals.py", line 1962, in _stack_arrays
    stacked = np.empty(shape, dtype=dtype)
MemoryError
Press any key to continue . . .

A bit of background: I am trying to convince people that Python can do the same things as R. To this end, I am trying to replicate an R script that does exactly that:

data <- read.table(paste(INPUTDIR,config[i,]$TOEXTRACT,sep=""), HASHEADER, DELIMITER,skip=2,fill=TRUE)

Not only does R manage to read the file above just fine, it even reads several of these files in a for loop (and then does some processing on the data). If Python has a problem with files of that size, I might be fighting a losing battle...


Although this is a workaround rather than a fix, I'd try converting that CSV to JSON (should be trivial) and using the read_json method instead - I've been writing and reading sizable JSON/dataframes (hundreds of MB) in pandas this way without any problem at all.
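
A minimal sketch of that workaround, assuming Python 3 and a file called data.csv with a header row (the file names and the streaming conversion are my own illustration, and read_json's lines=True option needs a reasonably recent pandas):

import csv
import json

import pandas as pd

# Convert the CSV to line-delimited JSON one row at a time,
# then read it back with read_json.
with open('data.csv', newline='') as src, open('data.json', 'w') as dst:
    for row in csv.DictReader(src):
        dst.write(json.dumps(row) + '\n')

df = pd.read_json('data.json', lines=True)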

There is no error for Pandas 0.12.0 and NumPy 1.8.0.

I have managed to create a big DataFrame, save it to a csv file and then successfully read it back. Please see the example here. The size of the file is 554 MB (it even worked for a 1.1 GB file; that took longer, and to generate the 1.1 GB file use a frequency of 30 seconds). I do have 4 GB of RAM available, though.

My suggestion is to try updating pandas. Another thing that could be useful is running your script from the command line, because for R you are not using Visual Studio (this was already suggested in the comments to your question), so it has more resources available.

I encountered this issue as well when I was running in a virtual machine, or somewhere else where the memory is strictly limited. It has nothing to do with pandas or numpy or csv; it will always happen if you try to use more memory than you are allowed to, and not only in Python.

The only chance you have is what you already tried: chop the big thing down into smaller pieces which fit into memory.

If you ever asked yourself what MapReduce is all about, you just found out by yourself... MapReduce would distribute the chunks over many machines, while you process the chunks on one machine, one after another.

What you found out with the concatenation of the chunk files might indeed be an issue; maybe some copies are needed in that operation... In the end this may save you in your current situation, but if your csv gets a little bit larger you might run into that wall again...

It could also be that pandas is smart enough to only load the individual data chunks into memory when you do something with them, like concatenating them into a big df?

Several things you can try:

  • Don't load all the data at once, but split it into pieces
  • As far as I know, hdf5 is able to do these chunks automatically and only loads the part your program is currently working on (see the sketch after this list)
  • Check whether the types are ok; a string '0.111111' needs more memory than a float
  • What do you actually need? If the address is there as a string, you might not need it for numerical analysis...
  • A database can help with accessing and loading only the parts you actually need (e.g. only the 1% active users)
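
A minimal sketch of the first two bullets, assuming a tab-separated file big.csv, an HDF5 store big.h5 and a chunk size of 100000 rows (all placeholders; appending to an HDF5 store needs PyTables installed):

import pandas as pd

# Read the CSV in chunks and append each chunk to an HDF5 store,
# so only one chunk has to fit in memory at a time.
with pd.HDFStore('big.h5') as store:
    for chunk in pd.read_csv('big.csv', sep='\t', chunksize=100000):
        store.append('data', chunk)

# Later, read back just a slice instead of the whole table.
subset = pd.read_hdf('big.h5', 'data', start=0, stop=100000)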

I use pandas on my Linux box and faced many memory leaks that were only resolved by upgrading pandas to the latest version after cloning it from GitHub.

Windows memory limitation

Memory errors happen a lot with Python when using the 32-bit version on Windows. This is because 32-bit processes only get 2 GB of memory to play with by default.
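
A quick way to check which build you are running (this check is my addition, not part of the original answer):

import struct

# Prints 32 on a 32-bit Python build, 64 on a 64-bit build
print(struct.calcsize("P") * 8)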

Tricks for lowering memory usage

If you are not using 32-bit Python on Windows but are looking to improve your memory efficiency while reading csv files, there is a trick.

The pandas.read_csv function takes an option called dtype. This lets pandas know what types exist inside your csv data.

How this works

By default, pandas will try to guess what dtypes your csv file has. This is a very heavy operation because while it is determining the dtype, it has to keep all raw data as objects (strings) in memory.

Example

Let's say your csv looks like this:

name, age, birthday
Alice, 30, 1985-01-01
Bob, 35, 1980-01-01
Charlie, 25, 1990-01-01

This example is of course no problem to read into memory, but it's just an example.

If pandas were to read the above csv file without any dtype option, the age would be stored as strings in memory until pandas has read enough lines of the csv file to make a qualified guess.

I think the default in pandas is to read 1,000,000 rows before guessing the dtype.

Solution

Specifying dtype={'age': int} as an option to .read_csv() lets pandas know that age should be interpreted as a number. This saves you lots of memory.
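
A self-contained sketch of that call, using an inline copy of the example data above so it runs as-is:

import io

import pandas as pd

csv_data = ("name,age,birthday\n"
            "Alice,30,1985-01-01\n"
            "Bob,35,1980-01-01\n"
            "Charlie,25,1990-01-01\n")

# With dtype, the parser knows up front that 'age' holds integers,
# so it never has to keep the raw strings around while guessing.
df = pd.read_csv(io.StringIO(csv_data), dtype={'age': int})
print(df.dtypes)  # age comes out as an integer dtype, not object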

Problem with corrupt data

However, if your csv file would be corrupted, like this:

name, age, birthday
Alice, 30, 1985-01-01
Bob, 35, 1980-01-01
Charlie, 25, 1990-01-01
Dennis, 40+, None-Ur-Bz

Then specifying dtype={'age':int} will break the .read_csv() command, because it cannot cast "40+" to int. So sanitize your data carefully!
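
If sanitising up front is not possible, one common alternative (my addition, not from the answer above) is to read the dirty column as a string and coerce it afterwards:

import pandas as pd

# Read 'age' as plain strings, then convert: values like "40+" become NaN
# instead of crashing the parse. 'corrupt.csv' is a placeholder file name.
df = pd.read_csv('corrupt.csv', dtype={'age': str})
df['age'] = pd.to_numeric(df['age'], errors='coerce')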

Here you can see how the memory usage of a pandas dataframe is a lot higher when floats are kept as strings:

Try it yourself

import resource  # Unix-only; reports peak memory usage of the process
import numpy as np
import pandas as pd

# Floats stored as strings
df = pd.DataFrame(np.random.choice(['1.0', '0.6666667', '150000.1'], (100000, 10)))
resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
# 224544 (~224 MB)

# The same values stored as actual floats
df = pd.DataFrame(np.random.choice([1.0, 0.6666667, 150000.1], (100000, 10)))
resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
# 79560 (~79 MB)

I had the same memory problem with a simple read of a tab-delimited text file around 1 GB in size (over 5.5 million records), and this solved the memory problem:

df = pd.read_csv(myfile,sep='\t') # didn't work, memory error
df = pd.read_csv(myfile,sep='\t',low_memory=False) # worked fine and in less than 30 seconds

Spyder 3.2.3, Python 2.7.13, 64 bits

I tried chunksize while reading a big CSV file:

reader = pd.read_csv(filePath, chunksize=1000000, low_memory=False, header=0)

The read now returns an iterator of chunks rather than a single DataFrame. We can iterate over the reader and write/append each chunk to a new csv, or perform any other operation:

for chunk in reader:
    print(chunk.columns)
    print("Chunk -> File process")
    with open(destination, 'a') as f:
        chunk.to_csv(f, header=False, sep='\t', index=False)
    print("Chunk appended to the file")

Add these: ratings = pd.read_csv(..., low_memory=False, memory_map=True)

My memory usage with these two: 319.082.496; without these two: 349.110.272.
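
For reference, one way to measure a DataFrame's footprint yourself (my addition; it may not match process-level numbers exactly) is pandas' own accounting:

# Total bytes used by the DataFrame, including object/string columns
print(ratings.memory_usage(deep=True).sum())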