Splitting at underscore in Python and storing the first value

I have a pandas DataFrame df with a column construct_name:

construct_name
aaaa_t1_2
cccc_t4_10
bbbb_g3_3

and so on. I want to split each name at the underscore and store the first element (aaaa, cccc, etc.) in another column called name.

Expected output

construct_name  name
aaaa_t1_2       aaaa
cccc_t4_10      cccc

and so on.

I tried df['construct_name'].map(lambda row: row.split("_")), which gives me a list for each row:

[aaaa,t1,2]
[cccc,t4,10]

and so on

But when I do

df['construct_name'].map(lambda row: row.split("_"))[0] to get the first element of each list, I get an error. Can you suggest a fix? Thanks.


Use the vectorised str.split method, then index the resulting lists with .str[0] to get the first element:

In [228]:
df['first'] = df['construct_name'].str.split('_').str[0]
df
Out[228]:
  construct_name first
0      aaaa_t1_2  aaaa
1     cccc_t4_10  cccc
2      bbbb_g3_3  bbbb

split takes an optional maxsplit argument:

>>> construct_name = 'aaaa_t1_2'
>>> name, rest = construct_name.split('_', 1)
>>> name
'aaaa'
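
The same idea carries over to pandas, where Series.str.split takes an n parameter that plays the role of maxsplit. A minimal sketch, assuming a frame shaped like the one in the question (the sample rows are copied from it):

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({"construct_name": ["aaaa_t1_2", "cccc_t4_10", "bbbb_g3_3"]})

# n=1 stops after the first underscore, so each row becomes [name, rest]
parts = df["construct_name"].str.split("_", n=1)

# .str[0] picks the first element of every list
df["name"] = parts.str[0]
```

Limiting the split to one cut avoids building list elements that are thrown away anyway, which may matter on long strings or large frames.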

Take the first element (with [0]) inside the lambda, right after the split, not after the map:

In [608]: temp['name'] = temp['construct_name'].map(lambda v: v.split('_')[0])

In [609]: temp
Out[609]:
  construct_name  name
0      aaaa_t1_2  aaaa
1     cccc_t4_10  cccc
2      bbbb_g3_3  bbbb
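
To see why the position of [0] matters, here is a small sketch. The question does not show the actual error, so this only illustrates the likely cause: placed after map, [0] indexes the Series itself (by row label) rather than each list.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({"construct_name": ["aaaa_t1_2", "cccc_t4_10", "bbbb_g3_3"]})

# map returns a Series whose values are lists
split_lists = df["construct_name"].map(lambda v: v.split("_"))

# [0] on a Series looks up the row labelled 0, so this is ONE list, not a column
first_row = split_lists[0]

# Indexing inside the lambda is applied to every row instead
names = df["construct_name"].map(lambda v: v.split("_")[0])
```

With the default integer index, first_row is just the split of the first row. With a different index, the same lookup may raise a KeyError.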

Another way to use the vectorised str.split method is to pass expand=True, which returns one column for each of the split parts.

>>> s = pd.Series(['aaaa_t1_2', 'cccc_t4_10', 'bbbb_g3_3'], name='construct_name')

>>> s.str.split('_', expand=True)  # to see what expand=True does
      0   1   2
0  aaaa  t1   2
1  cccc  t4  10
2  bbbb  g3   3


>>> s.str.split('_', expand=True)[0]  # what you want: select the first elements
0    aaaa
1    cccc
2    bbbb

This is especially useful if you want to keep more than one part, for example the first and second values.
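
Keeping two of the parts could look like the sketch below. The column name tag is made up for illustration and does not come from the question.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({"construct_name": ["aaaa_t1_2", "cccc_t4_10", "bbbb_g3_3"]})

# One split, one column per part (columns are labelled 0, 1, 2)
parts = df["construct_name"].str.split("_", expand=True)

df["name"] = parts[0]
df["tag"] = parts[1]  # "tag" is a hypothetical column name
```

The split runs only once and each part is then picked by its column label.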

As for the general behaviour of expand=True, note that if the input strings do not all have the same number of underscores, you can get None values:

>>> s = pd.Series(['aaaa_t1_2', 'cccc_t4', 'bbbb_g33'], name='construct_name')

>>> s.str.split('_', expand=True)
      0    1     2
0  aaaa   t1     2
1  cccc   t4  None
2  bbbb  g33  None
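
If only the first part is needed, ragged input is not a problem, because every non-empty string has a first element. A small sketch using the same series as above. The fillna("") step is just one possible way to handle the gaps, not something the question asks for.

```python
import pandas as pd

# Same ragged series as in the example above
s = pd.Series(["aaaa_t1_2", "cccc_t4", "bbbb_g33"], name="construct_name")

# The first part exists for every row
first = s.str.split("_").str[0]

# The third part is missing where the string has fewer than two underscores
third = s.str.split("_").str.get(2)

# One option: replace the missing values with an empty string
third_filled = third.fillna("")
```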

You can also use str.get(0) on the split result:

df['name'] = df['construct_name'].str.split('_').str.get(0)

or

df['name'] = df['construct_name'].str.split('_').apply(lambda x: x[0])