How to get value counts for multiple columns at once in a pandas DataFrame?

Given a pandas DataFrame with several columns of categorical values (0 or 1), is there a convenient way to get the value_counts for every column at the same time?

For example, suppose I generate a DataFrame as follows:

import numpy as np
import pandas as pd
np.random.seed(0)
df = pd.DataFrame(np.random.randint(0, 2, (10, 4)), columns=list('abcd'))

I can get a DataFrame like this:

   a  b  c  d
0  0  1  1  0
1  1  1  1  1
2  1  1  1  0
3  0  1  0  0
4  0  0  0  1
5  0  1  1  0
6  0  1  1  1
7  1  0  1  0
8  1  0  1  1
9  0  1  1  0

How can I conveniently get the value counts for every column and obtain the following?

   a  b  c  d
0  6  3  2  6
1  4  7  8  4

My current solution is:

pieces = []
for col in df.columns:
    tmp_series = df[col].value_counts()
    tmp_series.name = col
    pieces.append(tmp_series)
df_value_counts = pd.concat(pieces, axis=1)

But there must be a simpler way than that, perhaps involving stacking, pivoting, or groupby?


Just call apply and pass pd.Series.value_counts:

In [212]:
df = pd.DataFrame(np.random.randint(0, 2, (10, 4)), columns=list('abcd'))
df.apply(pd.Series.value_counts)
Out[212]:
   a  b  c  d
0  4  6  4  3
1  6  4  6  7
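
If the columns don't all contain the same set of values, apply leaves NaN where a value never occurs and upcasts the counts to float. A minimal sketch to get integer counts back (assuming zero is the right fill for missing values):

# fill the gaps left where a value never occurs in a column, then cast
# back to int (the NaNs are what force the float upcast)
df.apply(pd.Series.value_counts).fillna(0).astype(int)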

There is actually a fairly interesting and advanced way of solving this problem with crosstab and melt.

df = pd.DataFrame({'a': ['table', 'chair', 'chair', 'lamp', 'bed'],
                   'b': ['lamp', 'candle', 'chair', 'lamp', 'bed'],
                   'c': ['mirror', 'mirror', 'mirror', 'mirror', 'mirror']})


df


       a       b       c
0  table    lamp  mirror
1  chair  candle  mirror
2  chair   chair  mirror
3   lamp    lamp  mirror
4    bed     bed  mirror

We can first melt the DataFrame

df1 = df.melt(var_name='columns', value_name='index')
df1


   columns   index
0        a   table
1        a   chair
2        a   chair
3        a    lamp
4        a     bed
5        b    lamp
6        b  candle
7        b   chair
8        b    lamp
9        b     bed
10       c  mirror
11       c  mirror
12       c  mirror
13       c  mirror
14       c  mirror

And then use the crosstab function to count the values for each column. This preserves the data type as ints, which wouldn't be the case for the currently selected answer:

pd.crosstab(index=df1['index'], columns=df1['columns'])


columns  a  b  c
index
bed      1  1  0
candle   0  1  0
chair    2  1  0
lamp     1  2  0
mirror   0  0  5
table    1  0  0

Or in one line. This works because a DataFrame unpacks like a mapping of column name to column Series, so ** expands the melted frame's 'index' and 'columns' columns into crosstab's parameters of the same names (this is advanced):

pd.crosstab(**df.melt(var_name='columns', value_name='index'))

Also, value_counts is available as a top-level function, so you can simplify the currently selected answer to the following (note, though, that pd.value_counts has since been deprecated, as of pandas 2.1, so the method-based form above is safer going forward):

df.apply(pd.value_counts)

You can also try this code, where heart is whatever DataFrame you are inspecting:

for i in heart.columns:
    x = heart[i].value_counts()
    print("Column name is:", i, "and its value counts are:", x)

Ran into this question while looking for a better way of doing what I was doing. It turns out that calling df.apply(pd.value_counts) on a DataFrame whose columns each have many distinct values of their own results in a pretty substantial performance hit.

In this case, it is better to simply iterate over the non-numeric columns in a dictionary comprehension, and leave it as a dictionary:

# compare dtype names as strings; matching raw dtype objects with isin
# can be unreliable
types_to_count = {"object", "category", "string"}
result = {
    col: df[col].value_counts()
    for col in df.columns[df.dtypes.astype(str).isin(types_to_count)]
}

The filtering by types_to_count helps to ensure you don't try to take the value_counts of continuous data.
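
As an alternative sketch, the same dtype filtering can be done with select_dtypes, which matches dtypes natively instead of comparing names (the "string" entry assumes pandas >= 1.0, where StringDtype exists):

# same idea via select_dtypes; the include list mirrors types_to_count
result = {
    col: df[col].value_counts()
    for col in df.select_dtypes(include=["object", "category", "string"]).columns
}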

This is what worked for me:

for column in df.columns:
    print("\n" + column)
    print(df[column].value_counts())


To get the counts only for specific columns:

df[['a', 'b']].apply(pd.Series.value_counts)

where df is the name of your dataframe and 'a' and 'b' are the columns for which you want to count the values.

A solution that selects all categorical columns and builds a dataframe with all the value counts at once:

df = pd.DataFrame({
    'fruits': ['apple', 'mango', 'apple', 'mango', 'mango', 'pear', 'mango'],
    'vegetables': ['cucumber', 'eggplant', 'tomato', 'tomato', 'tomato', 'tomato', 'pumpkin'],
    'sauces': ['chili', 'chili', 'ketchup', 'ketchup', 'chili', '1000 islands', 'chili']})

cat_cols = df.select_dtypes(include=object).columns.tolist()
(pd.DataFrame(
    df[cat_cols]
    .melt(var_name='column', value_name='value')
    .value_counts())
 .rename(columns={0: 'counts'})
 .sort_values(by=['column', 'counts']))


                          counts
column      value
fruits      pear               1
            apple              2
            mango              4
sauces      1000 islands       1
            ketchup            2
            chili              4
vegetables  pumpkin            1
            eggplant           1
            cucumber           1
            tomato             4
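
Note that DataFrame.value_counts names its result differently across versions (unnamed before pandas 2.0, 'count' from 2.0 on), so renaming column 0 is fragile. A version-tolerant sketch using reset_index(name=...), assuming pandas >= 1.1 for DataFrame.value_counts:

# reset_index(name=...) labels the count column explicitly, regardless of
# what the underlying Series is called
(df[cat_cols]
 .melt(var_name='column', value_name='value')
 .value_counts()
 .reset_index(name='counts')
 .sort_values(by=['column', 'counts']))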

Another possible solution:

df = pd.DataFrame(np.random.randint(0, 2, (10, 4)), columns=list('abcd'))
l1 = pd.DataFrame()   # start from an empty frame so concat adds no stray column
for var in df.columns:
    l2 = df[var].value_counts()
    l1 = pd.concat([l1, l2], axis=1)
l1

You can use a lambda function:

df.apply(lambda x: x.value_counts())

Your solution wrapped in one line looks even simpler than using groupby, stacking, etc.:

pd.concat([df[column].value_counts() for column in df], axis = 1)

Sometimes some columns form a hierarchy; in that case I recommend "grouping" them and then making the counts:

# note: "_id" is whatever column you use to make the counts with len()
cat_cols = ['column_1', 'column_2']
df.groupby(cat_cols).agg(count=('_id', lambda x: len(x)))

                      count
column_1   column_2
category_1 Excellent     19
           Good          11
           Bad            1
category_2 Happy         48
           Good mood    158
           Serious       62
           Sad           10
           Depressed      8

Bonus: you can change len(x) to x.nunique() or any other lambda function you want.
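
For plain row counts per group, GroupBy.size is an equivalent that needs no helper column; a minimal sketch:

# size() counts rows per group directly, matching the len(x) aggregation
df.groupby(cat_cols).size().to_frame('count')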

Applying the value_counts function gave me unexpected / not the most readable results. But this approach seems super simple and easy to read:

df[["col1", "col2", "col3"]].value_counts()

Here is an example of the results when the columns hold boolean values. Note that this counts combinations of values across the listed columns, rather than each column independently:

col1   col2   col3
False  False  False    1000
       True   False    1000
True   False  False    1000
              True     1000
       True   False    1000
              True     1000
dtype: int64

You can list the column names (renamed here to cols, since list would shadow the builtin):

cols = ["a", "b", "c", "d"]

then run a for loop using the value_counts() function:

for i in cols:
    print(df[i].value_counts())
    print("\n")

You can also use the method given below:

for column in df.columns:
    print("\n" + column)
    print(df[column].value_counts())

I thought it would be nice if this could be implemented in a way that also works for columns with different sets of values.

This code generates a dataframe with hierarchical columns: the top level holds the column names from the original dataframe, and under each there are two columns, one for the values and one for the counts.

def val_cnts_df(df):
    val_cnts_dict = {}
    max_length = 0
    for col in df:
        val_cnts_dict[col] = df[col].value_counts()
        max_length = max(max_length, len(val_cnts_dict[col]))

    lists = [[col, prefix] for col in val_cnts_dict.keys() for prefix in ['values', 'counts']]
    columns = pd.MultiIndex.from_tuples(lists, names=['column', 'value_counts'])

    val_cnts_df = pd.DataFrame(data=np.zeros((max_length, len(columns))), columns=columns)

    for col in val_cnts_dict:
        # select by position: reset_index() names these columns
        # ('index', col) before pandas 2.0 and (col, 'count') afterwards
        vc = val_cnts_dict[col].reset_index()
        val_cnts_df[col, 'values'] = vc.iloc[:, 0]
        val_cnts_df[col, 'counts'] = vc.iloc[:, 1]

    return val_cnts_df

Example of results:

autos = pd.DataFrame({'brand': ['Audi', 'Audi', 'VW', 'BMW', 'VW', 'VW'],
                      'gearbox': ['automatic', 'automatic', 'manual', 'automatic',
                                  'manual', 'manual'],
                      'doors': [5, 5, 5, 2, 5, 5]})

print(val_cnts_df(autos))
column        brand            gearbox        doors
value_counts  values counts    values counts  values counts
0                 VW     3  automatic    3.0    5.0    5.0
1               Audi     2     manual    3.0    2.0    1.0
2                BMW     1        NaN    NaN    NaN    NaN
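
For comparison, here is a compact standalone sketch of the same idea (val_cnts_df_alt is a hypothetical name): build one two-column block per source column and let concat pad the shorter blocks with NaN. Overwriting the reset_index column labels up front also sidesteps the pandas version differences in how those columns are named:

import pandas as pd

def val_cnts_df_alt(df):
    pieces = []
    for col in df:
        # two columns per source column: its distinct values and their counts
        vc = df[col].value_counts().reset_index()
        vc.columns = pd.MultiIndex.from_product([[col], ['values', 'counts']])
        pieces.append(vc)
    # outer concat aligns on the fresh 0..n-1 row index and pads with NaN
    return pd.concat(pieces, axis=1)

print(val_cnts_df_alt(autos))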