如何通过正则表达式从数据框中选择列

我在巨蟒熊猫中有一个数据框架。数据框架的结构如下:

   a    b    c    d1   d2   d3
10   14   12   44  45    78

我想选择以 d 开头的列在 python 中是否有一个简单的方法来实现这一点。

91655 次浏览

You can use a list comprehension to iterate over all of the column names in your DataFrame df and then only select those that begin with 'd'.

df = pd.DataFrame({'a': {0: 10}, 'b': {0: 14}, 'c': {0: 12},
'd1': {0: 44}, 'd2': {0: 45}, 'd3': {0: 78}})

Use list comprehension to iterate over the columns in the dataframe and return their names (c below is a local variable representing the column name).

>>> [c for c in df]
['a', 'b', 'c', 'd1', 'd2', 'd3']

Then only select those beginning with 'd'.

>>> [c for c in df if c[0] == 'd']  # As an alternative to c[0], use c.startswith(...)
['d1', 'd2', 'd3']

Finally, pass this list of columns to the DataFrame.

df[[c for c in df if c.startswith('d')]]
>>> df
d1  d2  d3
0  44  45  78

===========================================================================

TIMINGS (added Feb 2018 per comments from devinbost claiming that this method is slow...)

First, lets create a dataframe with 30k columns:

n = 10000
cols = ['{0}_{1}'.format(letters, number)
for number in range(n) for letters in ('d', 't', 'didi')]
df = pd.DataFrame(np.random.randn(3, n * 3), columns=cols)
>>> df.shape
(3, 30000)


>>> %timeit df[[c for c in df if c[0] == 'd']]  # Simple list comprehension.
# 10 loops, best of 3: 16.4 ms per loop


>>> %timeit df[[c for c in df if c.startswith('d')]]  # More 'pythonic'?
# 10 loops, best of 3: 29.2 ms per loop


>>> %timeit df.select(lambda col: col.startswith('d'), axis=1)  # Solution of gbrener.
# 10 loops, best of 3: 21.4 ms per loop


>>> %timeit df.filter(regex=("d.*"))  # Accepted solution.
# 10 loops, best of 3: 40 ms per loop

You can use DataFrame.filter this way:

import pandas as pd


df = pd.DataFrame(np.array([[2,4,4],[4,3,3],[5,9,1]]),columns=['d','t','didi'])
>>
d  t  didi
0  2  4     4
1  4  3     3
2  5  9     1


df.filter(regex=("d.*"))


>>
d  didi
0  2     4
1  4     3
2  5     1

The idea is to select columns by regex

Use select:

import pandas as pd


df = pd.DataFrame([[10, 14, 12, 44, 45, 78]], columns=['a', 'b', 'c', 'd1', 'd2', 'd3'])


df.select(lambda col: col.startswith('d'), axis=1)

Result:

   d1  d2  d3
0  44  45  78

This is a nice solution if you're not comfortable with regular expressions.

You can also use

df.filter(regex='^d')

On a larger dataset especially, a vectorized approach is actually MUCH FASTER (by more than two orders of magnitude) and is MUCH more readable. I'm providing a screenshot as proof. (Note: Except for the last few lines I wrote at the bottom to make my point clear with a vectorized approach, the other code was derived from the answer by @Alexander.)

enter image description here

Here's that code for reference:

import pandas as pd
import numpy as np
n = 10000
cols = ['{0}_{1}'.format(letters, number)
for number in range(n) for letters in ('d', 't', 'didi')]
df = pd.DataFrame(np.random.randn(30000, n * 3), columns=cols)


%timeit df[[c for c in df if c[0] == 'd']]


%timeit df[[c for c in df if c.startswith('d')]]


%timeit df.select(lambda col: col.startswith('d'), axis=1)


%timeit df.filter(regex=("d.*"))


%timeit df.filter(like='d')


%timeit df.filter(like='d', axis=1)


%timeit df.filter(regex=("d.*"), axis=1)


%timeit df.columns.map(lambda x: x.startswith("d"))


columnVals = df.columns.map(lambda x: x.startswith("d"))


%timeit df.filter(columnVals, axis=1)

You can use the method startswith with index (columns in this case):

df.loc[:, df.columns.str.startswith('d')]

or match with regex:

df.loc[:, df.columns.str.match('^d')]

Get any substring of column names starting with a [abc] until '_', drop any non-matches (NA), remove duplicates and sort.

df.columns.str.extract(r'([abc].*_)', expand=False).dropna().drop_duplicates().sort_values()