熊猫数据帧总行

我有一个数据框架,比如:

     foo  bar  qux
0    a    1    3.14
1    b    3    2.72
2    c    2    1.62
3    d    9    1.41
4    e    3    0.58

我想在数据框的末尾添加一个“ total”行:

     foo  bar  qux
0    a    1    3.14
1    b    3    2.72
2    c    2    1.62
3    d    9    1.41
4    e    3    0.58
5  total  18   9.47

我尝试过使用 sum命令,但最终得到的是 Series,尽管我可以将它转换回 Dataframe,但它不维护数据类型:

tot_row = pd.DataFrame(df.sum()).T
tot_row['foo'] = 'tot'
tot_row.dtypes:
foo    object
bar    object
qux    object

我希望维护原始数据框架中的数据类型,因为我需要对总行应用其他操作,比如:

baz = 2*tot_row['qux'] + 3*tot_row['bar']
172310 次浏览

Update June 2022

pd.append is now deprecated. You could use pd.concat instead but it's probably easier to use df.loc['Total'] = df.sum(numeric_only=True), as Kevin Zhu commented. Or, better still, don't modify the data frame in place and keep your data separate from your summary statistics!


Append a totals row with

df.append(df.sum(numeric_only=True), ignore_index=True)

The conversion is necessary only if you have a column of strings or objects.

It's a bit of a fragile solution so I'd recommend sticking to operations on the dataframe, though. eg.

baz = 2*df['qux'].sum() + 3*df['bar'].sum()

Use DataFrame.pivot_table with margins=True:

import pandas as pd
data = [('a',1,3.14),('b',3,2.72),('c',2,1.62),('d',9,1.41),('e',3,.58)]
df = pd.DataFrame(data, columns=('foo', 'bar', 'qux'))

Original df:

  foo  bar   qux
0   a    1  3.14
1   b    3  2.72
2   c    2  1.62
3   d    9  1.41
4   e    3  0.58

Since pivot_table requires some sort of grouping (without the index argument, it'll raise a ValueError: No group keys passed!), and your original index is vacuous, we'll use the foo column:

df.pivot_table(index='foo',
margins=True,
margins_name='total',  # defaults to 'All'
aggfunc=sum)

Voilà!

       bar   qux
foo
a        1  3.14
b        3  2.72
c        2  1.62
d        9  1.41
e        3  0.58
total   18  9.47

Alternative way (verified on Pandas 0.18.1):

import numpy as np
total = df.apply(np.sum)
total['foo'] = 'tot'
df.append(pd.DataFrame(total.values, index=total.keys()).T, ignore_index=True)

Result:

   foo   bar   qux
0    a     1  3.14
1    b     3  2.72
2    c     2  1.62
3    d     9  1.41
4    e     3  0.58
5  tot    18  9.47

Following helped for me to add a column total and row total to a dataframe.

Assume dft1 is your original dataframe... now add a column total and row total with the following steps.

from io import StringIO
import pandas as pd


#create dataframe string
dfstr = StringIO(u"""
a;b;c
1;1;1
2;2;2
3;3;3
4;4;4
5;5;5
""")


#create dataframe dft1 from string
dft1 = pd.read_csv(dfstr, sep=";")


## add a column total to dft1
dft1['Total'] = dft1.sum(axis=1)


## add a row total to dft1 with the following steps


sum_row = dft1.sum(axis=0) #get sum_row first
dft1_sum=pd.DataFrame(data=sum_row).T #change it to a dataframe


dft1_sum=dft1_sum.reindex(columns=dft1.columns) #line up the col index to dft1
dft1_sum.index = ['row_total'] #change row index to row_total


dft1.append(dft1_sum) # append the row to dft1
df.loc["Total"] = df.sum()

works for me and I find it easier to remember. Am I missing something? Probably wasn't possible in earlier versions.

I'd actually like to add the total row only temporarily though. Adding it permanently is good for display but makes it a hassle in further calculations.

Just found

df.append(df.sum().rename('Total'))

This prints what I want in a Jupyter notebook and appears to leave the df itself untouched.

This is the way that I do it, by transposing and using the assign method in combination with a lambda function. It makes it simple for me.

df.T.assign(GrandTotal = lambda x: x.sum(axis=1)).T

Building on JMZ answer

df.append(df.sum(numeric_only=True), ignore_index=True)

if you want to continue using your current index you can name the sum series using .rename() as follows:

df.append(df.sum().rename('Total'))

This will add a row at the bottom of the table.

Building on answer from Matthias Kauer.

To add row total:

df.loc["Row_Total"] = df.sum()

To add column total,

df.loc[:,"Column_Total"] = df.sum(axis=1)

New Method

To get both row and column total:

import numpy as np
import pandas as pd




df = pd.DataFrame({'a': [10,20],'b':[100,200],'c': ['a','b']})


df.loc['Column_Total']= df.sum(numeric_only=True, axis=0)
df.loc[:,'Row_Total'] = df.sum(numeric_only=True, axis=1)


print(df)




a      b    c  Row_Total
0             10.0  100.0    a      110.0
1             20.0  200.0    b      220.0
Column_Total  30.0  300.0  NaN      330.0

Actually all proposed solutions render the original DataFrame unusable for any further analysis and can invalidate following computations, which will be easy to overlook and could lead to false results.

This is because you add a row to the data, which Pandas cannot differentiate from an additional row of data.

Example:

import pandas as pd
data = [1, 5, 6, 8, 9]
df = pd.DataFrame(data)
df
df.describe()

yields

0
0 1
1 5
2 6
3 8
4 9
0
count 5
mean 5.8
std 3.11448
min 1
25% 5
50% 6
75% 8
max 9

After

df.loc['Totals']= df.sum(numeric_only=True, axis=0)

the dataframe looks like this

0
0 1
1 5
2 6
3 8
4 9
Totals 29

This looks nice, but the new row is treated as if it was an additional data item, so df.describe will produce false results:

0
count 6
mean 9.66667
std 9.87252
min 1
25% 5.25
50% 7
75% 8.75
max 29

So: Watch out! and apply this only after doing all other analyses of the data or work on a copy of the DataFrame!

When the "totals" need to be added to an index column:

totals = pd.DataFrame(df.sum(numeric_only=True)).transpose().set_index(pd.Index({"totals"}))
df.append(totals)

e.g.

(Pdb) df
count    min bytes    max bytes    mean bytes     std bytes     sum bytes
row_0  837200      67412.0  368733992.0  2.518989e+07  5.122836e+07  2.108898e+13
row_1  299000      85380.0  692782132.0  2.845055e+08  2.026823e+08  8.506713e+13
row_2  837200      67412.0  379484173.0  8.706825e+07  1.071484e+08  7.289354e+13
row_3  239200      85392.0  328063972.0  9.870446e+07  1.016989e+08  2.361011e+13
row_4   59800      67292.0  383487021.0  1.841879e+08  1.567605e+08  1.101444e+13
row_5  717600     112309.0  379483824.0  9.687554e+07  1.103574e+08  6.951789e+13
row_6  119600     664144.0  358486985.0  1.611637e+08  1.171889e+08  1.927518e+13
row_7  478400      67300.0  593141462.0  2.824301e+08  1.446283e+08  1.351146e+14
row_8  358800  215002028.0  327493141.0  2.861329e+08  1.545693e+07  1.026645e+14
row_9  358800  202248016.0  321657935.0  2.684668e+08  1.865470e+07  9.632590e+13


(Pdb) totals = pd.DataFrame(df.sum(numeric_only=True)).transpose()
(Pdb) totals
count    min bytes     max bytes    mean bytes     std bytes     sum bytes
0  4305600.0  418466685.0  4.132815e+09  1.774725e+09  1.025805e+09  6.365722e+14


(Pdb) totals = pd.DataFrame(df.sum(numeric_only=True)).transpose().set_index(pd.Index({"totals"}))
(Pdb) totals
count    min bytes     max bytes    mean bytes     std bytes     sum bytes
totals  4305600.0  418466685.0  4.132815e+09  1.774725e+09  1.025805e+09  6.365722e+14


(Pdb) df.append(totals)
count    min bytes     max bytes    mean bytes     std bytes     sum bytes
row_0    837200.0      67412.0  3.687340e+08  2.518989e+07  5.122836e+07  2.108898e+13
row_1    299000.0      85380.0  6.927821e+08  2.845055e+08  2.026823e+08  8.506713e+13
row_2    837200.0      67412.0  3.794842e+08  8.706825e+07  1.071484e+08  7.289354e+13
row_3    239200.0      85392.0  3.280640e+08  9.870446e+07  1.016989e+08  2.361011e+13
row_4     59800.0      67292.0  3.834870e+08  1.841879e+08  1.567605e+08  1.101444e+13
row_5    717600.0     112309.0  3.794838e+08  9.687554e+07  1.103574e+08  6.951789e+13
row_6    119600.0     664144.0  3.584870e+08  1.611637e+08  1.171889e+08  1.927518e+13
row_7    478400.0      67300.0  5.931415e+08  2.824301e+08  1.446283e+08  1.351146e+14
row_8    358800.0  215002028.0  3.274931e+08  2.861329e+08  1.545693e+07  1.026645e+14
row_9    358800.0  202248016.0  3.216579e+08  2.684668e+08  1.865470e+07  9.632590e+13
totals  4305600.0  418466685.0  4.132815e+09  1.774725e+09  1.025805e+09  6.365722e+14

New method [September 2022]

TL;DR:

Just use

df.style.concat(df.agg(['sum']).style)

for a solution that won't change you dataframe, works even if you have an "sum" in your index, and can be styled!

Explanation

In pandas 1.5.0, a new method named .style.concat() gives you the ability to display several dataframes together. This is a good way to show the total (or any other statistics), because it is not changing the original dataframe, and works even if you have an index named "sum" in your original dataframe.

For example:

import pandas as pd
df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=['A', 'B', 'C'])
df.style.concat(df.agg(['sum']).style)

and it will return a formatted table that is visible in jupyter as this:

dataframe with a total row

Styling

with a little longer code, you can even make the last row look different:

df.style.concat(
df.agg(['sum']).style
.set_properties(**{'background-color': 'yellow'})
)

to get:

dataframe with a yellow total row

see other ways to style (such as bold font, or table lines) in the docs

Since i generally want to do this at the very end as to avoid breaking the integrity of the dataframe (right before printing). I created a summary_rows_cols method which returns a printable dataframe:

def summary_rows_cols(df: pd.DataFrame,
column_sum: bool = False,
column_avg: bool = False,
column_median: bool = False,
row_sum: bool = False,
row_avg: bool = False,
row_median: bool = False
) -> pd.DataFrame:
ret = df.copy()


if column_sum: ret.loc['Sum'] = df.sum(numeric_only=True, axis=0)
if column_avg: ret.loc['Avg'] = df.mean(numeric_only=True, axis=0)
if column_median: ret.loc['Median'] = df.median(numeric_only=True, axis=0)
if row_sum: ret.loc[:, 'Sum'] = df.sum(numeric_only=True, axis=1)
if row_median: ret.loc[:, 'Avg'] = df.mean(numeric_only=True, axis=1)
if row_avg: ret.loc[:, 'Median'] = df.median(numeric_only=True, axis=1)


ret.fillna('-', inplace=True)


return ret

This allows me to enter a generic (numeric) df and get a summarized output such as:

     a   b   c Sum Median
0    1   4   7  12      4
1    2   5   8  15      5
2    3   6   9  18      6
Sum  6  15  24   -      -

from:

data = {
'a': [1, 2, 3],
'b': [4, 5, 6],
'c': [7, 8, 9]
}


df = pd.DataFrame(data)


printable = summary_rows_cols(df, row_sum=True, column_sum=True, row_median=True)