Reading only the first few rows into a pandas DataFrame

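Is there a built-in way to use read_csv to read only the first n lines of a file without knowing the total number of lines ahead of time? I have a large file that takes a long time to read, and occasionally I only want the first, say, 20 lines to get a sample of it (and I would prefer not to load the whole thing and then take the head of it).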

If I knew the total number of lines I could do something like footer_lines = total_lines - n and pass this to the skipfooter keyword arg. My current solution is to manually grab the first n lines with python and StringIO it to pandas:

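import pandas as pd
from io import StringIO  # Python 3 location; Python 2 used "from StringIO import StringIO"
from itertools import islice

n = 20
with open('big_file.csv', 'r') as f:
    # f.readlines(n) treats n as a size hint in bytes, not a line count,
    # so take the first n lines (header included) with islice instead
    head = ''.join(islice(f, n))

df = pd.read_csv(StringIO(head))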

It's not that bad, but is there a more concise, 'pandasic' (?) way to do it with keywords or something?


I think you can use the nrows parameter. From the docs:

nrows : int, default None
    Number of rows of file to read. Useful for reading pieces of large files

This seems to work. Using one of the standard large test files (988504479 bytes, 5344499 lines):

In [1]: import pandas as pd


In [2]: time z = pd.read_csv("P00000001-ALL.csv", nrows=20)
CPU times: user 0.00 s, sys: 0.00 s, total: 0.00 s
Wall time: 0.00 s


In [3]: len(z)
Out[3]: 20


In [4]: time z = pd.read_csv("P00000001-ALL.csv")
CPU times: user 27.63 s, sys: 1.92 s, total: 29.55 s
Wall time: 30.23 s
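
If you want to try this without a large file at hand, here is a minimal sketch of the same nrows call. The in-memory CSV and the names csv_text and sample are made up for illustration; they are not from the test file above.

```python
import io

import pandas as pd

# Hypothetical stand-in for a large file: a header plus 1000 data rows.
csv_text = "id,value\n" + "".join(f"{i},{i * i}\n" for i in range(1000))

# nrows counts data rows only; the header line is parsed in addition to them.
sample = pd.read_csv(io.StringIO(csv_text), nrows=20)

print(len(sample))
print(list(sample.columns))
```

Note that nrows=20 gives 20 data rows plus the header, whereas slicing the first 20 raw lines by hand gives the header and 19 data rows.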

I would use the skiprows argument of read_csv, e.g.:

df = pd.read_csv(filename, skiprows=range(2, 20000), nrows=10000)
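
As far as I understand it, a list-like skiprows is a set of 0-based line numbers, with line 0 being the header, so this combination selects a window of rows rather than the first n. Here is a small sketch on a made-up in-memory CSV (csv_text and window are illustrative names) showing which rows survive:

```python
import io

import pandas as pd

# Hypothetical data: a header line followed by ids 0..99, one per line.
csv_text = "id\n" + "".join(f"{i}\n" for i in range(100))

# Line 0 is the header, so range(1, 11) drops the first 10 data rows
# (ids 0 through 9); nrows=5 then reads the next 5 data rows.
window = pd.read_csv(io.StringIO(csv_text), skiprows=range(1, 11), nrows=5)

print(window["id"].tolist())
```

If all you want is the first n rows, nrows alone is enough; skiprows only helps when the sample should start further into the file.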