Get a DataFrame row count based on a condition

I want to get the count of DataFrame rows selected by a condition.

print(df[(df.IP == head.idxmax()) & (df.Method == 'HEAD') & (df.Referrer == '"-"')].count())

Output:

IP          57
Time        57
Method      57
Resource    57
Status      57
Bytes       57
Referrer    57
Agent       57
dtype: int64

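The output shows the count for every column of the DataFrame. Instead, I need a single count of the rows for which all of the above conditions hold. How can I do that? If you need more explanation of my DataFrame, please let me know.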


You want the number of rows for which all of the conditions are true, so the len of the filtered frame is the answer, unless I misunderstand the question:

In [17]: df = DataFrame(randn(20,4),columns=list('ABCD'))


In [18]: df[(df['A']>0) & (df['B']>0) & (df['C']>0)]
Out[18]:
           A         B         C         D
12  0.491683  0.137766  0.859753 -1.041487
13  0.376200  0.575667  1.534179  1.247358
14  0.428739  1.539973  1.057848 -1.254489


In [19]: df[(df['A']>0) & (df['B']>0) & (df['C']>0)].count()
Out[19]:
A    3
B    3
C    3
D    3
dtype: int64


In [20]: len(df[(df['A']>0) & (df['B']>0) & (df['C']>0)])
Out[20]: 3
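
The difference between the two calls becomes visible once the frame contains missing values: count() reports the non-null cells of each column, whereas len() reports the number of rows. A minimal sketch, using a small made-up frame (the column names and values are assumptions for illustration only):

```python
import numpy as np
import pandas as pd

# Hypothetical data: column B has one missing value.
df = pd.DataFrame({"A": [1, 2, 3, 4], "B": [10.0, np.nan, 30.0, 40.0]})

# Select the rows where A is greater than 1.
selected = df[df["A"] > 1]

# count() returns one number per column and skips missing values.
per_column = selected.count()

# len() returns a single number: the rows in the selection.
n_rows = len(selected)

print(per_column)
print(n_rows)
```

Here the per-column counts for A and B differ because of the missing value, while len() still gives the number of selected rows.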

For better performance, you do not need to index the DataFrame with your predicate. You can sum the boolean result of the predicate directly, as illustrated below:

In [1]: import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(20,4),columns=list('ABCD'))




In [2]: df.head()
Out[2]:
          A         B         C         D
0 -2.019868  1.227246 -0.489257  0.149053
1  0.223285 -0.087784 -0.053048 -0.108584
2 -0.140556 -0.299735 -1.765956  0.517803
3 -0.589489  0.400487  0.107856  0.194890
4  1.309088 -0.596996 -0.623519  0.020400


In [3]: %time sum((df['A']>0) & (df['B']>0))
CPU times: user 1.11 ms, sys: 53 µs, total: 1.16 ms
Wall time: 1.12 ms
Out[3]: 4


In [4]: %time len(df[(df['A']>0) & (df['B']>0)])
CPU times: user 1.38 ms, sys: 78 µs, total: 1.46 ms
Wall time: 1.42 ms
Out[4]: 4

Keep in mind that this technique only works for counting the rows that satisfy your predicate; it does not give you the rows themselves.
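
As a variation on the same idea, the boolean mask can be summed with the Series method rather than the built-in sum, which avoids iterating over the values in Python and is usually the quicker option on large frames. The sketch below is not taken from the answer above; the fixed seed and the variable names are assumptions chosen for repeatability:

```python
import numpy as np
import pandas as pd

# Fixed seed (an assumption) so the example is repeatable.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((20, 4)), columns=list("ABCD"))

# Boolean Series with one entry per row.
mask = (df["A"] > 0) & (df["B"] > 0)

# Vectorised sum: each True counts as 1.
n_method = int(mask.sum())

# The same count via NumPy on the underlying array.
n_numpy = int(np.count_nonzero(mask.to_numpy()))

# Reference value: build the filtered frame and take its length.
n_len = len(df[mask])

print(n_method, n_numpy, n_len)
```

All three numbers agree; only the last one materialises the filtered DataFrame.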

In pandas, I like to use the shape attribute to get the number of rows.

df[df.A > 0].shape[0]

gives the number of rows matching the condition A > 0, as desired.

You can use the query method and take the shape of the resulting DataFrame. For example, given this DataFrame:

   A  B  C
0  1  1  x
1  2  2  y
2  3  3  z


df.query("A == 2 & B > 1 & C != 'z'").shape[0]

Output:

1

import pandas as pd
data = {'title': ['Manager', 'Technical Analyst', 'Software Engineer', 'Sales Manager'], 'Description': [
'''a man or woman who controls an organization or part of an organization,a person who looks after the business affairs of a singer, actor, etc''',
'''Technical analysts, also known as chartists or technicians, employ technical analysis in their trading and research. Technical analysis looks for price patterns and trends based on historical performance to identify signals based on market sentiment and psychology.''',
'''A software engineer is a person who applies the principles of software engineering to design, develop, maintain, test, and evaluate computer software. The term programmer is sometimes used as a synonym, but may also lack connotations of engineering education or skills.''',
'''A sales manager is someone who leads and supervises sales agents and runs the day-to-day sales operations of a business. They oversee the sales strategy, set sales goals, and track sales performance'''
]}
df = pd.DataFrame(data)
data2 = {'title': ['Manager', 'Technical Analyst', 'Software Engineer', 'Sales Manager'], 'Keywords': [
['organization','business','people','arrange']
,['technicians','analysis','research','business']
,['engineering', 'design', 'develop', 'maintain']
,['supervises', 'agents','business','performance','target']
]}
df2 = pd.DataFrame(data2)
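print(df2)
# One row per (title, keyword) pair
df2 = df2.explode('Keywords')
print(df2)

# Attach each keyword to the description of its title
print("checking df3")
df3 = df.merge(df2, how='left', on='title')
print(df3)

# Flag keywords that occur as a substring of the description
df3['match'] = df3.apply(lambda x: x.Keywords in x.Description, axis=1)
print(df3)

# Keep only the matching rows and count them per description
df4 = df3.loc[df3['match']].groupby(['Description']).count()
print(df4)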
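
If only the number of matches per group is needed, the boolean column can also be summed within each group instead of filtering first. This is a sketch of an alternative, not the approach used above; the small frame below is a hypothetical stand-in for df3 with made-up titles, descriptions and keywords:

```python
import pandas as pd

# Hypothetical stand-in for df3: one row per (title, keyword) pair.
df3 = pd.DataFrame({
    "title": ["Manager", "Manager", "Analyst", "Analyst", "Engineer"],
    "Description": [
        "runs an organization",
        "runs an organization",
        "does research and analysis",
        "does research and analysis",
        "writes software",
    ],
    "Keywords": ["organization", "people", "research", "analysis", "design"],
})

# Flag keywords that occur as a substring of the description.
df3["match"] = [kw in desc for kw, desc in zip(df3["Keywords"], df3["Description"])]

# Summing a boolean column counts the True values in each group.
matches_per_title = df3.groupby("title")["match"].sum()

print(matches_per_title)
```

Unlike filtering before grouping, this keeps groups with no matches in the result, with a count of zero.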