How to loop over a grouped Pandas DataFrame?

DataFrame:

  c_os_family_ss c_os_major_is l_customer_id_i
0      Windows 7                         90418
1      Windows 7                         90418
2      Windows 7                         90418

Code:

print(df)
for name, group in df.groupby('l_customer_id_i').agg(lambda x: ','.join(x)):
    print(name)
    print(group)

I'm just trying to loop over the aggregated data, but I get this error:

ValueError: too many values to unpack

@EdChum, here is the expected output:

                                                    c_os_family_ss  \
l_customer_id_i
131572           Windows 7,Windows 7,Windows 7,Windows 7,Window...
135467           Windows 7,Windows 7,Windows 7,Windows 7,Window...


c_os_major_is
l_customer_id_i
131572           ,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...
135467           ,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...

The output is not the problem; I want to loop over every group.


df.groupby('l_customer_id_i').agg(lambda x: ','.join(x)) already returns a DataFrame, so you can no longer loop over the groups.

In general:

  • df.groupby(...) returns a GroupBy object (a DataFrameGroupBy or SeriesGroupBy), and with this you can iterate through the groups (as explained in the docs). You can do something like:

    grouped = df.groupby('A')

    for name, group in grouped:
        ...
    
  • When you apply a function to the groupby, as in your example df.groupby(...).agg(...) (this could also be transform, apply, mean, ...), the results of applying the function to each group are combined into a single DataFrame (the apply and combine steps of the 'split-apply-combine' paradigm of groupby). The result is therefore always a DataFrame again (or a Series, depending on the applied function).
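To make the difference between the two cases concrete, here is a minimal sketch on a small made-up frame (the values and customer ids are illustrative assumptions, not the asker's data). Iterating the GroupBy object yields (name, group) pairs, whereas iterating the aggregated DataFrame yields its column labels, which is most likely why unpacking into name, group fails in the question.

```python
import pandas as pd

# Illustrative data only; the values are assumptions, not the asker's frame.
df = pd.DataFrame({
    "c_os_family_ss": ["Windows 7", "Windows 7", "Linux"],
    "c_os_major_is": ["", "", ""],
    "l_customer_id_i": [90418, 90418, 131572],
})

grouped = df.groupby("l_customer_id_i")

# Iterating the GroupBy object yields (group key, sub-DataFrame) pairs.
group_sizes = {}
for name, group in grouped:
    group_sizes[name] = len(group)
print(group_sizes)

# After aggregation the result is a plain DataFrame ...
aggregated = grouped.agg(lambda x: ",".join(x))

# ... and iterating a DataFrame yields its column labels (strings),
# so `for name, group in aggregated` tries to unpack a string.
labels = list(aggregated)
print(labels)
```

So the loop has to run over the GroupBy object itself, before any aggregation is applied.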

If you have already created the DataFrame, you can loop over its index values instead.

df = df.groupby('l_customer_id_i').agg(lambda x: ','.join(x))
for name in df.index:
    print(name)
    print(df.loc[name])
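A closely related option, sketched here on made-up data (the values are illustrative assumptions), is DataFrame.iterrows(), which yields the index label and the row together as (label, Series) pairs, so the separate df.loc lookup is not needed.

```python
import pandas as pd

# Illustrative data only; the values are assumptions, not the asker's frame.
df = pd.DataFrame({
    "c_os_family_ss": ["Windows 7", "Windows 7", "Linux"],
    "c_os_major_is": ["", "", ""],
    "l_customer_id_i": [90418, 90418, 131572],
})

aggregated = df.groupby("l_customer_id_i").agg(lambda x: ",".join(x))

# Each iteration gives the index label (the customer id) and that row as a Series.
joined_families = {}
for name, row in aggregated.iterrows():
    joined_families[name] = row["c_os_family_ss"]
    print(name, row["c_os_family_ss"])
```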

Here is an example of iterating over a pd.DataFrame grouped by the column atable. In this example, the "create" statements for an SQL database are generated inside the for loop:

import pandas as pd

df1 = pd.DataFrame({
    'atable':     ['Users', 'Users', 'Domains', 'Domains', 'Locks'],
    'column':     ['col_1', 'col_2', 'col_a', 'col_b', 'col'],
    'column_type':['varchar', 'varchar', 'int', 'varchar', 'varchar'],
    'is_null':    ['No', 'No', 'Yes', 'No', 'Yes'],
})

df1_grouped = df1.groupby('atable')

# iterate over each group
for group_name, df_group in df1_grouped:
    print('\nCREATE TABLE {}('.format(group_name))

    for row_index, row in df_group.iterrows():
        col = row['column']
        column_type = row['column_type']
        is_null = 'NOT NULL' if row['is_null'] == 'No' else ''
        print('\t{} {} {},'.format(col, column_type, is_null))

    print(");")
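One detail of the loop above: it prints a comma after every column, including the last one, which most SQL dialects reject. A possible variant, offered only as a sketch, collects the column definitions per group and joins them with commas. The helper name build_create_statements is hypothetical and not part of the original answer.

```python
import pandas as pd


def build_create_statements(schema):
    """Return {table name: CREATE TABLE statement}. Hypothetical helper."""
    statements = {}
    # Iterating the GroupBy object yields (table name, sub-DataFrame) pairs.
    for table_name, table_df in schema.groupby("atable"):
        column_defs = []
        for _, row in table_df.iterrows():
            null_clause = " NOT NULL" if row["is_null"] == "No" else ""
            column_defs.append(
                "\t{} {}{}".format(row["column"], row["column_type"], null_clause)
            )
        # Joining puts commas only between definitions, not after the last one.
        statements[table_name] = "CREATE TABLE {}(\n{}\n);".format(
            table_name, ",\n".join(column_defs)
        )
    return statements


df1 = pd.DataFrame({
    "atable": ["Users", "Users", "Domains", "Domains", "Locks"],
    "column": ["col_1", "col_2", "col_a", "col_b", "col"],
    "column_type": ["varchar", "varchar", "int", "varchar", "varchar"],
    "is_null": ["No", "No", "Yes", "No", "Yes"],
})

statements = build_create_statements(df1)
for statement in statements.values():
    print(statement)
```

Returning the statements as strings rather than printing inside the loop also makes them easier to pass to a database cursor later.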