How to replace values with None in a pandas DataFrame in Python

Is there a way to replace values with None in pandas in Python?

You can use df.replace('pre', 'post') to replace one value with another, but this doesn't work if you want the replacement to be None; if you try to replace a value with None, you get a strange result.

For example:

import pandas as pd

df = pd.DataFrame(['-', 3, 2, 5, 1, -5, -1, '-', 9])
df.replace('-', 0)

This returns the expected result.

However,

df.replace('-', None)

returns the following result:

0
0   - // this isn't replaced
1   3
2   2
3   5
4   1
5  -5
6  -1
7  -1 // this is changed to `-1`...
8   9

Why does it produce such a strange result?

I want to insert this DataFrame into a MySQL database, so I can't have NaN in any of its elements; I want None instead. Of course, you can first convert '-' to NaN and then convert NaN to None, but I'd like to know why the DataFrame behaves in such a terrible way.
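For reference, a minimal sketch of the two-step workaround mentioned above ('-' to NaN first, then NaN to None), using the sample data from this question:

import numpy as np
import pandas as pd

df = pd.DataFrame(['-', 3, 2, 5, 1, -5, -1, '-', 9])

# Step 1: the placeholder becomes NaN
df = df.replace('-', np.nan)

# Step 2: NaN becomes None (object dtype), which MySQL drivers store as NULL
df = df.where(pd.notnull(df), None)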

Tested with pandas 0.12.0 dev on Python 2.7 and OS X 10.8 (the Python version preinstalled on OS X; I installed pandas with the SciPy Superpack script), FYI.


Actually in later versions of pandas this will give a TypeError:

df.replace('-', None)
TypeError: If "to_replace" and "value" are both None then regex must be a mapping

You can do it by passing either a list or a dictionary:

In [11]: df.replace(['-'], [None])  # or df.replace('-', {0: None})
Out[11]:
0
0  None
1     3
2     2
3     5
4     1
5    -5
6    -1
7  None
8     9

But I recommend using NaNs rather than None:

In [12]: df.replace('-', np.nan)
Out[12]:
0
0  NaN
1    3
2    2
3    5
4    1
5   -5
6   -1
7  NaN
8    9

where is probably what you're looking for. Note that where keeps the entries for which the condition is True and replaces the rest, so the condition must select the values you want to keep:

data = data.where(data != '-', None)

From the pandas docs:

where [returns] an object of same shape as self and whose corresponding entries are from self where cond is True and otherwise are from other.

I prefer the solution using replace with a dict because of its simplicity and elegance:

df.replace({'-': None})

You can also have more replacements:

df.replace({'-': None, 'None': None})

And even for larger sets of replacements, it is always obvious and clear what is replaced by what, which is much harder to see with long lists, in my opinion.
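As a quick illustration, applied to a small hypothetical frame that contains both markers:

import pandas as pd

df = pd.DataFrame({'a': ['-', 'None', 3]})
print(df.replace({'-': None, 'None': None}))

      a
0  None
1  None
2     3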

import pandas as pd

df = pd.DataFrame(['-', 3, 2, 5, 1, -5, -1, '-', 9])
df = df.where(df != '-', None)

Setting null values can be done with np.nan:

import numpy as np
df.replace('-', np.nan)

Advantage is that df.last_valid_index() recognizes these as invalid.
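A small illustration of that advantage (hypothetical series, just for the sake of example):

import numpy as np
import pandas as pd

s = pd.Series([1, 2, '-'])
s.last_valid_index()                       # 2: the string '-' still counts as valid
s.replace('-', np.nan).last_valid_index()  # 1: NaN is recognized as missing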

Before proceeding with this post, it is important to understand the difference between NaN and None. One is a float type, the other is an object type. Pandas is better suited to working with scalar types as many methods on these types can be vectorised. Pandas does try to handle None and NaN consistently, but NumPy cannot.
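To make the distinction concrete (a quick sketch):

import numpy as np
import pandas as pd

type(np.nan)                  # <class 'float'>
type(None)                    # <class 'NoneType'>

# pandas converts None to NaN in numeric columns so they stay vectorisable
pd.Series([1, None]).dtype    # float64
pd.Series([1, np.nan]).dtype  # float64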

My suggestion (and Andy's) is to stick with NaN.

But to answer your question...

pandas >= 0.18: Use na_values=['-'] argument with read_csv

If you loaded this data from CSV/Excel, I have good news for you. You can squash this problem at its source during data loading, instead of having to write a fix with code as a subsequent step.

Most of the pd.read_* functions (such as read_csv and read_excel) accept a na_values argument.

file.csv

A,B
-,1
3,-
2,-
5,3
1,-2
-5,4
-1,-1
-,0
9,0

Now, to convert the - characters into NaNs, do,

import pandas as pd
df = pd.read_csv('file.csv', na_values=['-'])
df


A    B
0  NaN  1.0
1  3.0  NaN
2  2.0  NaN
3  5.0  3.0
4  1.0 -2.0
5 -5.0  4.0
6 -1.0 -1.0
7  NaN  0.0
8  9.0  0.0

And similar for other functions/file formats.
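For example, the same keyword works with the Excel reader (the filename here is hypothetical):

df = pd.read_excel('file.xlsx', na_values=['-'])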

P.S.: On v0.24+, you can preserve integer type even if your column has NaNs (yes, talk about having your cake and eating it too). You can specify dtype='Int32':

df = pd.read_csv('file.csv', na_values=['-'], dtype='Int32')
df


A    B
0  NaN    1
1    3  NaN
2    2  NaN
3    5    3
4    1   -2
5   -5    4
6   -1   -1
7  NaN    0
8    9    0


df.dtypes


A    Int32
B    Int32
dtype: object

The dtype is not a conventional int type... but rather, a Nullable Integer Type. There are other options.
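For instance, nullable integers come in several widths, which can be requested the same way (a quick sketch):

import pandas as pd

pd.array([1, 2, None], dtype='Int64')  # also 'Int8', 'Int16', 'Int32'

<IntegerArray>
[1, 2, <NA>]
Length: 3, dtype: Int64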


Handling Numeric Data: pd.to_numeric with errors='coerce'

If you're dealing with numeric data, a faster solution is to use pd.to_numeric with the errors='coerce' argument, which coerces invalid values (values that cannot be cast to numeric) to NaN.

pd.to_numeric(df['A'], errors='coerce')


0    NaN
1    3.0
2    2.0
3    5.0
4    1.0
5   -5.0
6   -1.0
7    NaN
8    9.0
Name: A, dtype: float64

To retain (nullable) integer dtype, use

pd.to_numeric(df['A'], errors='coerce').astype('Int32')


0    NaN
1      3
2      2
3      5
4      1
5     -5
6     -1
7    NaN
8      9
Name: A, dtype: Int32

To coerce multiple columns, use apply:

df[['A', 'B']].apply(pd.to_numeric, errors='coerce').astype('Int32')


A    B
0  NaN    1
1    3  NaN
2    2  NaN
3    5    3
4    1   -2
5   -5    4
6   -1   -1
7  NaN    0
8    9    0

...and assign the result back after.
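A minimal sketch of that assignment:

df[['A', 'B']] = df[['A', 'B']].apply(pd.to_numeric, errors='coerce').astype('Int32')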

More information can be found in this answer.

Using replace and assigning the result to a new df:

import pandas as pd
df = pd.DataFrame(['-',3,2,5,1,-5,-1,'-',9])
dfnew = df.replace('-', 0)
print(dfnew)




   0
0  0
1  3
2  2
3  5
4  1
5 -5
6 -1
7  0
8  9
import numpy as np

df.replace('-', np.nan).astype("object")

This ensures that you can use isnull() later on your DataFrame.
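Applied to the question's data, a quick sanity check (sketch):

import numpy as np
import pandas as pd

df = pd.DataFrame(['-', 3, 2, 5, 1, -5, -1, '-', 9])
df.replace('-', np.nan).astype("object").isnull()

       0
0   True
1  False
2  False
3  False
4  False
5  False
6  False
7   True
8  False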

With Pandas version ≥1.0.0, I would use DataFrame.replace or Series.replace:

df.replace(old_val, pd.NA, inplace=True)

This is better for two reasons:

  1. It uses pd.NA instead of None or np.nan.
  2. It optionally works in-place which could be more memory efficient depending upon the internal implementation.
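Applied to the question's data (a minimal sketch, with old_val being '-'):

import pandas as pd

df = pd.DataFrame(['-', 3, 2, 5, 1, -5, -1, '-', 9])
df.replace('-', pd.NA, inplace=True)
df.isna().sum()  # 2: the two positions that held '-'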

Alternatively you can also use mask:

df.mask(df == '-', None)
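For context, mask is the complement of where: it replaces the entries where the condition is True, so the condition does not need to be inverted here. A quick equivalence sketch:

df.mask(df == '-', None)   # replaces where the condition is True
df.where(df != '-', None)  # equivalent, with the complementary condition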