Converting a Pandas GroupBy output from Series to DataFrame

I'm starting with input data like this:

df1 = pandas.DataFrame( {
"Name" : ["Alice", "Bob", "Mallory", "Mallory", "Bob" , "Mallory"] ,
"City" : ["Seattle", "Seattle", "Portland", "Seattle", "Seattle", "Portland"] } )

When printed, it looks like this:

   City     Name
0   Seattle    Alice
1   Seattle      Bob
2  Portland  Mallory
3   Seattle  Mallory
4   Seattle      Bob
5  Portland  Mallory

Grouping is simple enough:

g1 = df1.groupby( [ "Name", "City"] ).count()

and printing yields a GroupBy object:

                  City  Name
Name    City
Alice   Seattle      1     1
Bob     Seattle      2     2
Mallory Portland     2     2
        Seattle      1     1

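But what I eventually want is another DataFrame object that contains all the rows in the GroupBy object. In other words, I want to get the following result: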

                  City  Name
Name    City
Alice   Seattle      1     1
Bob     Seattle      2     2
Mallory Portland     2     2
Mallory Seattle      1     1

I can't quite see how to accomplish this from the pandas documentation. Any hints would be welcome.


g1 here is a DataFrame. It has a hierarchical index, though:

In [19]: type(g1)
Out[19]: pandas.core.frame.DataFrame


In [20]: g1.index
Out[20]:
MultiIndex([('Alice', 'Seattle'), ('Bob', 'Seattle'), ('Mallory', 'Portland'),
('Mallory', 'Seattle')], dtype=object)

Perhaps you want something like this?

In [21]: g1.add_suffix('_Count').reset_index()
Out[21]:
Name      City  City_Count  Name_Count
0    Alice   Seattle           1           1
1      Bob   Seattle           2           2
2  Mallory  Portland           2           2
3  Mallory   Seattle           1           1

Or something like:

In [36]: DataFrame({'count' : df1.groupby( [ "Name", "City"] ).size()}).reset_index()
Out[36]:
Name      City  count
0    Alice   Seattle      1
1      Bob   Seattle      2
2  Mallory  Portland      2
3  Mallory   Seattle      1

I want to slightly change the answer given by Wes, because version 0.16.2 requires as_index=False. If you don't set it, you get an empty dataframe.

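From the pandas documentation:

Aggregation functions will not return the groups that you are aggregating over if they are named columns, when as_index=True, the default. The grouped columns will be the indices of the returned object.

Passing as_index=False will return the groups that you are aggregating over, if they are named columns.

Aggregating functions are ones that reduce the dimension of the returned objects, for example: mean, sum, size, count, std, var, sem, describe, first, last, nth, min, max. This is what happens when you do, for example, DataFrame.sum() and get back a Series.

nth can act as a reducer or a filter.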

import pandas as pd


df1 = pd.DataFrame({"Name":["Alice", "Bob", "Mallory", "Mallory", "Bob" , "Mallory"],
"City":["Seattle","Seattle","Portland","Seattle","Seattle","Portland"]})
print(df1)
#
#       City     Name
#0   Seattle    Alice
#1   Seattle      Bob
#2  Portland  Mallory
#3   Seattle  Mallory
#4   Seattle      Bob
#5  Portland  Mallory
#
g1 = df1.groupby(["Name", "City"], as_index=False).count()
print(g1)
#
#                  City  Name
#Name    City
#Alice   Seattle      1     1
#Bob     Seattle      2     2
#Mallory Portland     2     2
#        Seattle      1     1
#
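
To make the quoted as_index behaviour easier to see, here is a minimal sketch on a frame that has one column left over to aggregate. The "Visits" column and its values are made up for illustration and are not part of the question's data.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Mallory", "Mallory", "Bob", "Mallory"],
    "City": ["Seattle", "Seattle", "Portland", "Seattle", "Seattle", "Portland"],
    "Visits": [1, 2, 3, 4, 5, 6],  # hypothetical value column
})

# Default (as_index=True): the group keys become a MultiIndex.
indexed = df.groupby(["Name", "City"]).sum()

# as_index=False: the group keys come back as ordinary columns.
flat = df.groupby(["Name", "City"], as_index=False).sum()

print(indexed.index.names)
print(list(flat.columns))
```

Calling indexed.reset_index() should give the same flat layout as the as_index=False call.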

EDIT:

In version 0.17.1 and later you can select a subset of columns for count, and for size you can use reset_index with the name parameter:

print(df1.groupby(["Name", "City"], as_index=False).count())
#IndexError: list index out of range


print(df1.groupby(["Name", "City"]).count())
#Empty DataFrame
#Columns: []
#Index: [(Alice, Seattle), (Bob, Seattle), (Mallory, Portland), (Mallory, Seattle)]


print(df1.groupby(["Name", "City"])[['Name','City']].count())
#                  Name  City
#Name    City
#Alice   Seattle      1     1
#Bob     Seattle      2     2
#Mallory Portland     2     2
#        Seattle      1     1


print(df1.groupby(["Name", "City"]).size().reset_index(name='count'))
#      Name      City  count
#0    Alice   Seattle      1
#1      Bob   Seattle      2
#2  Mallory  Portland      2
#3  Mallory   Seattle      1

The difference between count and size is that size counts NaN values, while count does not.
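
A small sketch of that difference, using a hypothetical "Score" column that contains missing values (the column and its values are assumptions, not data from the question):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Bob", "Mallory"],
    "Score": [1.0, np.nan, 3.0, np.nan],  # hypothetical column with NaNs
})

grouped = df.groupby("Name")["Score"]
sizes = grouped.size()    # rows per group, NaN included
counts = grouped.count()  # non-NaN values per group

# Put both side by side in one flat DataFrame.
comparison = pd.DataFrame({"size": sizes, "count": counts}).reset_index()
print(comparison)
```

Here Bob has two rows but only one non-missing score, and Mallory has one row and no non-missing scores, so the two columns disagree for those groups.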

I found this worked for me.

import numpy as np
import pandas as pd


df1 = pd.DataFrame({
"Name" : ["Alice", "Bob", "Mallory", "Mallory", "Bob" , "Mallory"] ,
"City" : ["Seattle", "Seattle", "Portland", "Seattle", "Seattle", "Portland"]})


df1['City_count'] = 1
df1['Name_count'] = 1


df1.groupby(['Name', 'City'], as_index=False).count()

Simply, this should do the task:

import pandas as pd


grouped_df = df1.groupby( [ "Name", "City"] )


pd.DataFrame(grouped_df.size().reset_index(name = "Group_Count"))
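Here, grouped_df.size() pulls up the count for each unique group, and reset_index() turns the group keys back into columns and gives the count column the name you pass in. Finally, the pandas DataFrame() function is called to create a DataFrame object.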
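
As a side note, a quick check suggests the outer pd.DataFrame(...) call is optional, because reset_index() on a Series already returns a DataFrame. A minimal sketch using the question's data:

```python
import pandas as pd

df1 = pd.DataFrame({
    "Name": ["Alice", "Bob", "Mallory", "Mallory", "Bob", "Mallory"],
    "City": ["Seattle", "Seattle", "Portland", "Seattle", "Seattle", "Portland"],
})

sizes = df1.groupby(["Name", "City"]).size()      # a Series with a MultiIndex
result = sizes.reset_index(name="Group_Count")    # already a DataFrame

print(type(sizes).__name__)
print(type(result).__name__)
print(result)
```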

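Maybe I misunderstand the question, but if you want to convert the groupby result back to a dataframe you can use .to_frame(). I wanted to reset the index when I did this, so I included that part as well.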

Example code, unrelated to the question's data:

df = df['TIME'].groupby(df['Name']).min()
df = df.to_frame()
df = df.reset_index()  # "Name" is the only index level; "TIME" is already a column
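
The same three steps as a self-contained sketch. The sample values are invented for illustration; only the column names "Name" and "TIME" come from the snippet above.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Alice", "Bob"],
    "TIME": [5, 3, 2, 8],  # hypothetical values
})

earliest = df.groupby("Name")["TIME"].min()  # Series indexed by Name
as_frame = earliest.to_frame()               # one-column DataFrame, index still Name
flat = as_frame.reset_index()                # Name becomes an ordinary column

print(flat)
```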

I aggregated the data by Qty and stored the result in a dataframe:

almo_grp_data = pd.DataFrame({'Qty_cnt' :
almo_slt_models_data.groupby( ['orderDate','Item','State Abv']
)['Qty'].sum()}).reset_index()

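These solutions only partially worked for me because I was doing multiple aggregations. Here is a sample of my grouped output that I wanted to convert to a dataframe: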

[image: sample grouped output]

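Since I wanted more than just the count provided by reset_index(), I wrote a manual method for converting the grouped output above into a dataframe. I understand this is not the most pythonic/pandas way of doing it, as it is quite verbose and explicit, but it was all I needed. Basically, use the reset_index() method explained above to start a "scaffolding" dataframe, then loop through the group pairings in the grouped dataframe, retrieve the indices, perform your calculations against the ungrouped dataframe, and set the values in the new aggregated dataframe.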

df_grouped = df[['Salary Basis', 'Job Title', 'Hourly Rate', 'Male Count', 'Female Count']]
df_grouped = df_grouped.groupby(['Salary Basis', 'Job Title'], as_index=False)


# Grouped gives us the indices we want for each grouping
# We cannot convert a groupedby object back to a dataframe, so we need to do it manually
# Create a new dataframe to work against
df_aggregated = df_grouped.size().to_frame('Total Count').reset_index()
df_aggregated['Male Count'] = 0
df_aggregated['Female Count'] = 0
df_aggregated['Job Rate'] = 0


def manualAggregations(indices_array):
    temp_df = df.iloc[indices_array]
    return {
        'Male Count': temp_df['Male Count'].sum(),
        'Female Count': temp_df['Female Count'].sum(),
        'Job Rate': temp_df['Hourly Rate'].max()
    }


for name, group in df_grouped:
    ix = df_grouped.indices[name]
    calcDict = manualAggregations(ix)

    for key in calcDict:
        # Salary Basis, Job Title
        columns = list(name)
        df_aggregated.loc[(df_aggregated['Salary Basis'] == columns[0]) &
                          (df_aggregated['Job Title'] == columns[1]), key] = calcDict[key]

If a dictionary isn't your thing, the calculations can be applied inline in the for loop:

    df_aggregated.loc[(df_aggregated['Salary Basis'] == columns[0]) &
                      (df_aggregated['Job Title'] == columns[1]), 'Male Count'] = df['Male Count'].iloc[ix].sum()
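
For comparison, newer pandas releases (0.25 and later, as far as I know) support named aggregation through .agg(), which may produce the same kind of table without the explicit loop. This is a sketch of an alternative technique, not the method used in this answer. The column names are taken from the code above; the sample rows are invented.

```python
import pandas as pd

df = pd.DataFrame({
    "Salary Basis": ["Hourly", "Hourly", "Salary", "Hourly"],
    "Job Title": ["Clerk", "Clerk", "Manager", "Driver"],
    "Hourly Rate": [15.0, 17.5, 40.0, 20.0],   # hypothetical values
    "Male Count": [1, 0, 1, 1],
    "Female Count": [0, 1, 0, 0],
})

# Output names contain spaces, so they are passed as an unpacked dict
# of output_name -> (input_column, aggregation).
df_aggregated = (
    df.groupby(["Salary Basis", "Job Title"])
      .agg(**{
          "Total Count": ("Hourly Rate", "size"),
          "Male Count": ("Male Count", "sum"),
          "Female Count": ("Female Count", "sum"),
          "Job Rate": ("Hourly Rate", "max"),
      })
      .reset_index()  # move the group keys from the index into columns
)

print(df_aggregated)
```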

The solution below may be simpler:

df1.reset_index().groupby( [ "Name", "City"],as_index=False ).count()

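The key is to use the reset_index() method: it adds the old index as an ordinary column, which gives count() something to count.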

Use:

import pandas


df1 = pandas.DataFrame( {
"Name" : ["Alice", "Bob", "Mallory", "Mallory", "Bob" , "Mallory"] ,
"City" : ["Seattle", "Seattle", "Portland", "Seattle", "Seattle", "Portland"] } )


g1 = df1.groupby( [ "Name", "City"] ).count().reset_index()

Now you have your new dataframe in g1.

grouped=df.groupby(['Team','Year'])['W'].count().reset_index()


team_wins_df=pd.DataFrame(grouped)
team_wins_df=team_wins_df.rename({'W':'Wins'},axis=1)
team_wins_df['Wins']=team_wins_df['Wins'].astype(np.int32)
team_wins_df.reset_index()
print(team_wins_df)

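This returns the ordinal levels/indices in the same order as a plain groupby() call. It is basically the same as the answer @NehalJWani posted in his comment, but stored in a variable and with reset_index() called on it.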

fare_class = df.groupby(['Satisfaction Rating','Fare Class']).size().to_frame(name = 'Count')
fare_class.reset_index()

This version returns the same data as percentages, which is useful for statistics, and it also uses a lambda function.

fare_class_percent = df.groupby(['Satisfaction Rating', 'Fare Class']).size().to_frame(name = 'Percentage')
fare_class_percent.transform(lambda x: 100 * x/x.sum()).reset_index()


  Satisfaction Rating Fare Class  Percentage
0        Dissatisfied   Business   14.624269
1        Dissatisfied    Economy   36.469048
2           Satisfied   Business    5.460425
3           Satisfied    Economy   33.235294
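
A self-contained sketch of the percentage idea, written without the lambda. The rows below are invented; only the column names follow the snippet above.

```python
import pandas as pd

df = pd.DataFrame({
    "Satisfaction Rating": ["Satisfied", "Satisfied", "Dissatisfied", "Dissatisfied"],
    "Fare Class": ["Business", "Economy", "Economy", "Economy"],
})

# Rows per (rating, fare class) pair, as a Series with a MultiIndex.
counts = df.groupby(["Satisfaction Rating", "Fare Class"]).size()

# Share of all rows, then flatten into a DataFrame with a named column.
percent = (100 * counts / counts.sum()).reset_index(name="Percentage")

print(percent)
```

Dividing by counts.sum() gives each group's share of the whole table; the Percentage column then sums to 100.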

Try setting group_keys=False in the groupby method to prevent the group keys from being added to the index.

Example:

import numpy as np
import pandas as pd


df1 = pd.DataFrame({
"Name" : ["Alice", "Bob", "Mallory", "Mallory", "Bob" , "Mallory"] ,
"City" : ["Seattle", "Seattle", "Portland", "Seattle", "Seattle", "Portland"]})
df1.groupby(["Name"], group_keys=False)
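
As far as I can tell, group_keys only changes what apply() returns; for aggregations such as count() or size(), as_index and reset_index() are the relevant tools. A minimal sketch of the effect on apply(), where taking the first row per group is just an illustrative choice:

```python
import pandas as pd

df1 = pd.DataFrame({
    "Name": ["Alice", "Bob", "Mallory", "Mallory", "Bob", "Mallory"],
    "City": ["Seattle", "Seattle", "Portland", "Seattle", "Seattle", "Portland"],
})

# With group keys: the result index gains a "Name" level.
with_keys = df1.groupby("Name", group_keys=True)[["City"]].apply(lambda g: g.head(1))

# Without group keys: the original row labels are kept as the index.
without_keys = df1.groupby("Name", group_keys=False)[["City"]].apply(lambda g: g.head(1))

print(with_keys)
print(without_keys)
```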