Split a large pandas DataFrame

小开

Best answer

Use np.array_split:

Docstring:
Split an array into multiple sub-arrays.


Please refer to the ``split`` documentation.  The only difference
between these functions is that ``array_split`` allows
`indices_or_sections` to be an integer that does *not* equally
divide the axis.

In [1]: import numpy as np
   ...: import pandas as pd


In [2]: df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',
   ...:                           'foo', 'bar', 'foo', 'foo'],
   ...:                    'B' : ['one', 'one', 'two', 'three',
   ...:                           'two', 'two', 'one', 'three'],
   ...:                    'C' : np.random.randn(8), 'D' : np.random.randn(8)})


In [3]: print(df)
     A      B         C         D
0  foo    one -0.174067 -0.608579
1  bar    one -0.860386 -1.210518
2  foo    two  0.614102  1.689837
3  bar  three -0.284792 -1.071160
4  foo    two  0.843610  0.803712
5  bar    two -1.514722  0.870861
6  foo    one  0.131529 -0.968151
7  foo  three -1.002946 -0.257468


In [4]: import numpy as np
In [5]: np.array_split(df, 3)
Out[5]:
[     A    B         C         D
 0  foo  one -0.174067 -0.608579
 1  bar  one -0.860386 -1.210518
 2  foo  two  0.614102  1.689837,
      A      B         C         D
 3  bar  three -0.284792 -1.071160
 4  foo    two  0.843610  0.803712
 5  bar    two -1.514722  0.870861,
      A      B         C         D
 6  foo    one  0.131529 -0.968151
 7  foo  three -1.002946 -0.257468]
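
As a variant of the call above, here is a minimal sketch that splits the row positions with np.array_split and then slices with iloc, so NumPy only ever handles a plain integer array rather than the DataFrame itself. The helper name split_by_position and the small example frame are made up for illustration; they are not part of the original answer.

```python
import numpy as np
import pandas as pd


def split_by_position(df, n_parts):
    """Split df into n_parts nearly equal pieces (hypothetical helper)."""
    # array_split accepts a count that does not divide the length evenly
    position_groups = np.array_split(np.arange(len(df)), n_parts)
    return [df.iloc[positions] for positions in position_groups]


df = pd.DataFrame({"A": list("abcdefgh"), "C": np.arange(8)})
parts = split_by_position(df, 3)
print([len(p) for p in parts])  # [3, 3, 2]
```

Concatenating the pieces with pd.concat(parts) gives back the original rows in their original order.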

小开

Caution:

np.array_split does not work with numpy-1.9.0.

Error:

Dataframe has no 'size' attribute

小开

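I wanted to do the same thing. First I had problems with the split function, then with installing pandas 0.15.2, so I went back to my old version and wrote a small function that works very well. I hope this helps!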

# input - df: a DataFrame, chunk_size: the number of rows per chunk
# output - a list of DataFrames
# purpose - splits the DataFrame into smaller chunks
def split_dataframe(df, chunk_size=10000):
    chunks = list()
    # Ceiling division, so that a length that is an exact multiple of
    # chunk_size does not produce an empty trailing chunk
    num_chunks = -(-len(df) // chunk_size)
    for i in range(num_chunks):
        chunks.append(df[i*chunk_size:(i+1)*chunk_size])
    return chunks

小开

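Note that np.array_split(df, 3) splits the DataFrame into 3 sub-DataFrames, whereas the split_dataframe function defined in @elixir's answer, when called as split_dataframe(df, chunk_size=3), splits the DataFrame every chunk_size rows.

Example:

With np.array_split: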

df = pd.DataFrame([1,2,3,4,5,6,7,8,9,10,11], columns=['TEST'])
df_split = np.array_split(df, 3)

you get 3 sub-DataFrames:

df_split[0] # 1, 2, 3, 4
df_split[1] # 5, 6, 7, 8
df_split[2] # 9, 10, 11

With split_dataframe:

df_split2 = split_dataframe(df, chunk_size=3)

you get 4 sub-DataFrames:

df_split2[0] # 1, 2, 3
df_split2[1] # 4, 5, 6
df_split2[2] # 7, 8, 9
df_split2[3] # 10, 11

I hope I got this right and that it is useful.

小开

You can use groupby, assuming you have an integer enumerated index:

import math
df = pd.DataFrame(dict(sample=np.arange(99)))
rows_per_subframe = math.ceil(len(df) / 4.)


subframes = [i[1] for i in df.groupby(np.arange(len(df))//rows_per_subframe)]

Note: iterating over a groupby yields tuples whose second element is the DataFrame, hence the slightly convoluted extraction.

>>> len(subframes), [len(i) for i in subframes]
(4, [25, 25, 25, 24])
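
The grouping key above is built from row positions (np.arange(len(df))) rather than from the index values, so the same idea can be sketched for a frame whose index is not a clean 0..n-1 range. The names chunk_by_groupby and n_parts below are illustrative only, and the sketch assumes a non-empty DataFrame.

```python
import math

import numpy as np
import pandas as pd


def chunk_by_groupby(df, n_parts):
    """Split df into at most n_parts pieces via a positional groupby key."""
    rows_per_part = math.ceil(len(df) / n_parts)
    key = np.arange(len(df)) // rows_per_part  # 0,0,...,1,1,... by position
    # groupby sorts the keys by default, so the pieces come back in order
    return [part for _, part in df.groupby(key)]


df = pd.DataFrame({"sample": np.arange(10)}, index=list("abcdefghij"))
parts = chunk_by_groupby(df, 4)
print([len(p) for p in parts])  # [3, 3, 3, 1]
```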

小开

I think we can now simply use plain iloc with range for this.

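# Rows per chunk; max() guards against a zero step when df has fewer than 4 rows.
# If the length is not divisible by 4, a fifth, shorter chunk is produced.
chunk_size = max(1, df.shape[0] // 4)
for start in range(0, df.shape[0], chunk_size):
    df_subset = df.iloc[start:start + chunk_size]
    process_data(df_subset)
    ....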
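
One detail of the loop above: rounding the chunk size down can leave a short extra chunk at the end, while rounding it up keeps the number of chunks at four or fewer. A small sketch of the difference, using a made-up 10-row frame and the hypothetical names floor_chunks and ceil_chunks:

```python
import math

import pandas as pd

df = pd.DataFrame({"x": range(10)})
n_parts = 4


def iloc_chunks(frame, chunk_size):
    """Slice frame into consecutive pieces of at most chunk_size rows."""
    return [
        frame.iloc[start:start + chunk_size]
        for start in range(0, len(frame), chunk_size)
    ]


# Rounding down: 10 // 4 == 2 rows per chunk, which gives five chunks
floor_chunks = iloc_chunks(df, len(df) // n_parts)
# Rounding up: ceil(10 / 4) == 3 rows per chunk, which gives four chunks
ceil_chunks = iloc_chunks(df, math.ceil(len(df) / n_parts))

print([len(c) for c in floor_chunks])  # [2, 2, 2, 2, 2]
print([len(c) for c in ceil_chunks])   # [3, 3, 3, 1]
```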

小开

I also ran into np.array_split not working with pandas DataFrames. My solution was to split only the index of the DataFrame and then add a new column holding a "group" label:

indexes = np.array_split(df.index, N, axis=0)
for i, index in enumerate(indexes):
    df.loc[index, 'group'] = i

This makes groupby operations very convenient, for example computing the mean of each group:

df.groupby(by='group').mean()
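
For reference, an end-to-end sketch of this labelling approach on a tiny frame. It assumes a unique index and a numeric column; the column name value and N = 3 are placeholders chosen for the example, and the index is converted with to_numpy() before splitting so that NumPy only sees a plain array.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"value": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]})
N = 3  # number of groups (assumed for this example)

# Label each row with the number of the index chunk it falls into
for i, index in enumerate(np.array_split(df.index.to_numpy(), N)):
    df.loc[index, "group"] = i

means = df.groupby("group")["value"].mean()
print(means.tolist())  # [2.0, 4.5, 6.5]
```

Seven rows split into three groups of sizes 3, 2 and 2, which is where the three means come from.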

小开

You can do this in one line with a list comprehension:

n = 4
chunks = [df[i:i+n] for i in range(0,df.shape[0],n)]

小开

Building on @elixir's answer...
I'd suggest using a generator to avoid holding all the chunks in memory at once:

def chunkit(df, chunk_size=10000):
    num_chunks = len(df) // chunk_size
    if len(df) % chunk_size != 0:
        num_chunks += 1
    for i in range(num_chunks):
        yield df[i*chunk_size:(i + 1) * chunk_size]
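
A short usage sketch for a generator of this kind, where each chunk is consumed as soon as it is produced. The function iter_chunks below is a hypothetical, slightly more compact equivalent that steps through the start offsets directly; the 25-row frame is made up for the example.

```python
import pandas as pd


def iter_chunks(df, chunk_size=10000):
    """Yield successive slices of df with at most chunk_size rows each."""
    for start in range(0, len(df), chunk_size):
        yield df.iloc[start:start + chunk_size]


df = pd.DataFrame({"x": range(25)})

# Each chunk is reduced to a single number before the next one is requested
chunk_sums = [int(chunk["x"].sum()) for chunk in iter_chunks(df, chunk_size=10)]
print(chunk_sums)  # [45, 145, 110]
```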

小开

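I like one-liners, so @LucyDrops's answer works well for me.

However, there is one important thing: add a .copy() if the chunks should be copies of the corresponding parts of the original df: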

chunks = [df[i:i+n].copy() for i in range(0,df.shape[0],n)]

Otherwise there is a good chance of getting the following warning during further processing of the chunks (in a loop, for example):

A value is trying to be set on a copy of a slice from a DataFrame.

(See the pandas documentation for details.)
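
A minimal sketch of what the .copy() buys you, using a made-up six-row frame: each chunk can be modified freely, and the original df is left untouched.

```python
import pandas as pd

df = pd.DataFrame({"x": range(6)})
n = 4

chunks = [df[i:i + n].copy() for i in range(0, df.shape[0], n)]

# Each chunk is an independent DataFrame, so adding a column is safe
for chunk in chunks:
    chunk["double"] = chunk["x"] * 2

print(list(df.columns))              # ['x']
print(chunks[1]["double"].tolist())  # [8, 10]
```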