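Interpolate NaN values in a numpy array

Is there a quick way of replacing all NaN values in a numpy array with (say) the linearly interpolated values?

For example,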

[1 1 1 nan nan 2 2 nan 0]

would be converted into

[1 1 1 1.3 1.6 2 2  1  0]

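It might be easier to change how the data is being generated in the first place, but if not:

bad_indexes = np.isnan(data)

Create a boolean array indicating where the NaNs are.

good_indexes = np.logical_not(bad_indexes)

Create a boolean array indicating where the good values are.

good_data = data[good_indexes]

A restricted version of the original data, excluding the NaNs.

interpolated = np.interp(bad_indexes.nonzero()[0], good_indexes.nonzero()[0], good_data)

Run all the bad indexes through interpolation. (nonzero() returns a tuple of index arrays, hence the [0].)

data[bad_indexes] = interpolated

Replace the original data with the interpolated values.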
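
One detail the steps above leave implicit is what happens to NaNs at the very start or end of the array. As far as I know, np.interp does not extrapolate by default; it returns the first or last known value for points outside the known range. A minimal sketch, assuming data is a 1-D float array (the sample values are made up for illustration):

```python
import numpy as np

# Made-up sample: NaNs at both edges and one in the interior.
data = np.array([np.nan, 1.0, np.nan, 3.0, np.nan])

bad = np.isnan(data)   # True where the value is NaN
good = ~bad            # True where the value is usable

filled = data.copy()   # keep the original untouched
filled[bad] = np.interp(bad.nonzero()[0], good.nonzero()[0], data[good])

# The interior NaN becomes 2.0; the edge NaNs take the nearest
# valid values (1.0 on the left, 3.0 on the right).
print(filled)
```

If the edges should follow the trend of the data instead, a different approach is needed.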

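I came up with this code:

import numpy as np
nan = np.nan

A = np.array([1, nan, nan, 2, 2, nan, 0])

ok = ~np.isnan(A)                      # mask of the valid (non-NaN) entries
xp = ok.ravel().nonzero()[0]           # indices of the valid entries
fp = A[ok]                             # the valid values themselves
x  = np.isnan(A).ravel().nonzero()[0]  # indices of the NaNs

A[np.isnan(A)] = np.interp(x, xp, fp)

print(A)

It prints: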

 [ 1.          1.33333333  1.66666667  2.          2.          1.          0.        ]

Let's first define a simple helper function to make it more straightforward to handle indices and logical indices of NaNs:

import numpy as np

def nan_helper(y):
    """Helper to handle indices and logical indices of NaNs.

    Input:
        - y, 1d numpy array with possible NaNs
    Output:
        - nans, logical indices of NaNs
        - index, a function, with signature indices= index(logical_indices),
          to convert logical indices of NaNs to 'equivalent' indices
    Example:
        >>> # linear interpolation of NaNs
        >>> nans, x= nan_helper(y)
        >>> y[nans]= np.interp(x(nans), x(~nans), y[~nans])
    """

    return np.isnan(y), lambda z: z.nonzero()[0]

Now nan_helper(.) can be used like this:

>>> y= np.array([1, 1, 1, np.nan, np.nan, 2, 2, np.nan, 0])
>>>
>>> nans, x= nan_helper(y)
>>> y[nans]= np.interp(x(nans), x(~nans), y[~nans])
>>>
>>> print(y.round(2))
[ 1.    1.    1.    1.33  1.67  2.    2.    1.    0.  ]

---
Although it may seem a bit overkill at first to define a separate function just to do something like this:

>>> nans, x= np.isnan(y), lambda z: z.nonzero()[0]

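it will eventually pay dividends.

So, whenever you are working with NaN-related data, just encapsulate all the (new NaN-related) functionality you need in some specific helper function(s). Your code base will be more coherent and readable, because it follows easily understandable idioms.

Interpolation is indeed a nice context in which to see how NaN handling is done, but similar techniques can be used in various other contexts as well.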
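
As an illustration of "other contexts", the same helper could be reused to locate NaNs and replace them with the mean of the valid values. This is only a sketch of the idiom: the mean-fill use case and the sample array are my own assumptions, not part of the answer above.

```python
import numpy as np

def nan_helper(y):
    """Return the NaN mask and a function mapping a mask to integer indices."""
    return np.isnan(y), lambda z: z.nonzero()[0]

# Made-up sample data for illustration.
y = np.array([1.0, np.nan, 3.0, np.nan, 5.0])

nans, x = nan_helper(y)
nan_positions = x(nans)          # integer positions of the NaNs

filled = y.copy()
filled[nans] = y[~nans].mean()   # replace NaNs with the mean of the valid values

print(nan_positions)
print(filled)
```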

Just use numpy logical indexing and the np.where statement to apply a 1D interpolation.

import numpy as np
from scipy import interpolate

def fill_nan(A):
    '''
    interpolate to fill nan values
    '''
    inds = np.arange(A.shape[0])
    good = np.where(np.isfinite(A))
    f = interpolate.interp1d(inds[good], A[good], bounds_error=False)
    B = np.where(np.isfinite(A), A, f(inds))
    return B

Or, building on Winston's answer:

import numpy as np
nan = np.nan

def pad(data):
    bad_indexes = np.isnan(data)
    good_indexes = np.logical_not(bad_indexes)
    good_data = data[good_indexes]
    interpolated = np.interp(bad_indexes.nonzero()[0], good_indexes.nonzero()[0], good_data)
    data[bad_indexes] = interpolated
    return data

A = np.array([[1, 20, 300],
              [nan, nan, nan],
              [3, 40, 500]])

# Apply pad() to each column (axis 0) separately
A = np.apply_along_axis(pad, 0, A)
print(A)

Result:

[[   1.   20.  300.]
[   2.   30.  400.]
[   3.   40.  500.]]

For two-dimensional data, SciPy's griddata works fairly well for me:

>>> import numpy as np
>>> from scipy.interpolate import griddata
>>>
>>> # SETUP
>>> a = np.arange(25).reshape((5, 5)).astype(float)
>>> a
array([[  0.,   1.,   2.,   3.,   4.],
       [  5.,   6.,   7.,   8.,   9.],
       [ 10.,  11.,  12.,  13.,  14.],
       [ 15.,  16.,  17.,  18.,  19.],
       [ 20.,  21.,  22.,  23.,  24.]])
>>> a[np.random.randint(2, size=(5, 5)).astype(bool)] = np.nan
>>> a
array([[ nan,  nan,  nan,   3.,   4.],
       [ nan,   6.,   7.,  nan,  nan],
       [ 10.,  nan,  nan,  13.,  nan],
       [ 15.,  16.,  17.,  nan,  19.],
       [ nan,  nan,  22.,  23.,  nan]])
>>>
>>> # THE INTERPOLATION
>>> x, y = np.indices(a.shape)
>>> interp = np.array(a)
>>> interp[np.isnan(interp)] = griddata(
...     (x[~np.isnan(a)], y[~np.isnan(a)]), # points we know
...     a[~np.isnan(a)],                    # values we know
...     (x[np.isnan(a)], y[np.isnan(a)]))   # points to interpolate
>>> interp
array([[ nan,  nan,  nan,   3.,   4.],
       [ nan,   6.,   7.,   8.,   9.],
       [ 10.,  11.,  12.,  13.,  14.],
       [ 15.,  16.,  17.,  18.,  19.],
       [ nan,  nan,  22.,  23.,  nan]])

I am using it on 3D images, operating on 2D slices (4000 slices of 350x350). The whole operation still takes about an hour :/

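I needed an approach that would also fill in NaNs at the start and end of the data, which the main answer doesn't appear to do.

The function I came up with uses a linear regression to fill in the NaNs. This solved my problem: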

import numpy as np

def linearly_interpolate_nans(y):
    # Fit a linear regression to the non-nan y values

    # Create X matrix for linreg with an intercept and an index
    X = np.vstack((np.ones(len(y)), np.arange(len(y))))

    # Get the non-NaN values of X and y
    X_fit = X[:, ~np.isnan(y)]
    y_fit = y[~np.isnan(y)].reshape(-1, 1)

    # Estimate the coefficients of the linear regression
    # (rcond=None selects the current default and avoids a FutureWarning on older NumPy)
    beta = np.linalg.lstsq(X_fit.T, y_fit, rcond=None)[0]

    # Fill in all the nan values using the predicted coefficients
    y.flat[np.isnan(y)] = np.dot(X[:, np.isnan(y)].T, beta)
    return y

Here's an example:

# Make an array according to some linear function
y = np.arange(12) * 1.5 + 10.

# First and last value are NaN
y[0] = np.nan
y[-1] = np.nan

# 30% of other values are NaN
for i in range(len(y)):
    if np.random.rand() > 0.7:
        y[i] = np.nan

# NaN's are filled in!
print(y)
print(linearly_interpolate_nans(y))
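
Note that a regression replaces each NaN with a point on one global best-fit line, which is not the same as interpolating between neighbouring values. For comparison, the same idea can be sketched with np.polyfit. The function name fill_nans_with_line_fit is hypothetical, and unlike the function above this version returns a copy rather than modifying its input:

```python
import numpy as np

def fill_nans_with_line_fit(y):
    """Replace NaNs with values from a least-squares straight line
    fitted to the non-NaN samples (index on the x axis)."""
    y = np.asarray(y, dtype=float)
    idx = np.arange(len(y))
    good = ~np.isnan(y)

    # Degree-1 polynomial fit: returns (slope, intercept)
    slope, intercept = np.polyfit(idx[good], y[good], deg=1)

    out = y.copy()
    out[~good] = slope * idx[~good] + intercept
    return out

# Exactly linear data with NaNs at both ends and in the middle
y = np.arange(12) * 1.5 + 10.0
y[[0, 4, 11]] = np.nan
filled = fill_nans_with_line_fit(y)
print(filled)
```

On exactly linear data the fit recovers the missing values; on noisy or curved data the filled points lie on the fitted line and may not join up with their neighbours.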

Building on the answer by Bryan Woods, I modified his code to also convert arrays consisting only of NaN to arrays of zeros:

import numpy as np
from scipy.interpolate import interp1d

def fill_nan(A):
    '''
    interpolate to fill nan values
    '''
    inds = np.arange(A.shape[0])
    good = np.where(np.isfinite(A))
    if len(good[0]) == 0:
        # No finite values at all: return zeros instead of interpolating
        return np.nan_to_num(A)
    f = interp1d(inds[good], A[good], bounds_error=False)
    B = np.where(np.isfinite(A), A, f(inds))
    return B

A simple addition; I hope it will be of use to someone.

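A slightly optimized version based on Bryan Woods's response. It handles the starting and ending values of the source data correctly, and is about 25-30% faster than the original version. You can also use different kinds of interpolation (see the scipy.interpolate.interp1d documentation for details).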

import numpy as np
from scipy.interpolate import interp1d

def fill_nans_scipy1(padata, pkind='linear'):
    """
    Interpolates data to fill nan values

    Parameters:
        padata : nd array
            source data with np.nan values

    Returns:
        nd array
            resulting data with interpolated values instead of nans
    """
    aindexes = np.arange(padata.shape[0])
    agood_indexes, = np.where(np.isfinite(padata))
    f = interp1d(agood_indexes,
                 padata[agood_indexes],
                 bounds_error=False,
                 copy=False,
                 fill_value="extrapolate",
                 kind=pkind)
    return f(aindexes)

In [17]: adata = np.array([1, 2, np.nan, 4])
In [18]: adata
Out[18]: array([ 1.,  2., nan,  4.])
In [19]: fill_nans_scipy1(adata)
Out[19]: array([1., 2., 3., 4.])
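
If pulling in SciPy is not an option, linear extrapolation at the edges can be approximated with NumPy alone. The following is a rough sketch, not the code above: the function name fill_nans_linear_extrapolate is hypothetical, it only covers the linear case, and it assumes at least two finite values. Interior NaNs go through np.interp; edge NaNs continue the straight line through the two nearest finite samples.

```python
import numpy as np

def fill_nans_linear_extrapolate(y):
    """Linear interpolation inside, linear extrapolation at the edges."""
    y = np.asarray(y, dtype=float)
    idx = np.arange(len(y))
    good = np.flatnonzero(~np.isnan(y))   # indices of the finite samples
    if good.size < 2:
        raise ValueError("need at least two finite values")

    # Interior NaNs are interpolated; edge NaNs are clamped for now
    out = np.interp(idx, good, y[good])

    # Left edge: extend the line through the first two finite samples
    left = idx < good[0]
    slope_l = (y[good[1]] - y[good[0]]) / (good[1] - good[0])
    out[left] = y[good[0]] + slope_l * (idx[left] - good[0])

    # Right edge: extend the line through the last two finite samples
    right = idx > good[-1]
    slope_r = (y[good[-1]] - y[good[-2]]) / (good[-1] - good[-2])
    out[right] = y[good[-1]] + slope_r * (idx[right] - good[-1])
    return out

filled = fill_nans_linear_extrapolate([np.nan, 2.0, np.nan, 4.0, 5.0, np.nan])
print(filled)  # [1. 2. 3. 4. 5. 6.]
```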

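Interpolate and extrapolate with pad keywords

The following solution interpolates NaN values in an array with np.interp wherever finite values are present on both sides. NaN values at the borders are handled by np.pad, with modes such as constant or reflect.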


import numpy as np
import matplotlib.pyplot as plt

def extrainterpolate_nans_1d(
        arr, kws_pad=({'mode': 'edge'}, {'mode': 'edge'})
):
    """Interpolates and extrapolates nan values.

    Interpolation is linear, compare np.interp(..).
    Extrapolation works with pad keywords, compare np.pad(..).

    Parameters
    ----------
    arr : np.ndarray, shape (N,)
        Array to replace nans in.
    kws_pad : dict or (dict, dict)
        kwargs for np.pad on left and right side

    Returns
    -------
    np.ndarray, shape (N,)
        Copy of arr with nans replaced.

    See Also
    --------
    https://numpy.org/doc/stable/reference/generated/numpy.interp.html
    https://numpy.org/doc/stable/reference/generated/numpy.pad.html
    https://stackoverflow.com/a/43821453/7128154
    """
    assert arr.ndim == 1
    if isinstance(kws_pad, dict):
        kws_pad_left = kws_pad
        kws_pad_right = kws_pad
    else:
        assert len(kws_pad) == 2
        assert isinstance(kws_pad[0], dict)
        assert isinstance(kws_pad[1], dict)
        kws_pad_left = kws_pad[0]
        kws_pad_right = kws_pad[1]

    arr_ip = arr.copy()

    # interpolation
    inds = np.arange(len(arr_ip))
    nan_msk = np.isnan(arr_ip)
    arr_ip[nan_msk] = np.interp(inds[nan_msk], inds[~nan_msk], arr[~nan_msk])

    # determine pad range: i0 = index of the first finite value,
    # i1 = one past the index of the last finite value
    i0 = next(
        (ids for ids, val in np.ndenumerate(arr) if not np.isnan(val)), 0)[0]
    i1 = next(
        (ids for ids, val in np.ndenumerate(arr[::-1]) if not np.isnan(val)), 0)[0]
    i1 = len(arr) - i1
    # print('pad in range [0:{:}] and [{:}:{:}]'.format(i0, i1, len(arr)))

    # pad
    arr_pad = np.pad(
        arr_ip[i0:], pad_width=[(i0, 0)], **kws_pad_left)
    arr_pad = np.pad(
        arr_pad[:i1], pad_width=[(0, len(arr) - i1)], **kws_pad_right)

    return arr_pad

# setup data
ys = np.arange(30, dtype=float)**2/20
ys[:5] = np.nan
ys[20:] = 20
ys[28:] = np.nan
ys[[7, 13, 14, 18, 22]] = np.nan

ys_ie0 = extrainterpolate_nans_1d(ys)
kws_pad_sym = {'mode': 'symmetric'}
kws_pad_const7 = {'mode': 'constant', 'constant_values':7.}
ys_ie1 = extrainterpolate_nans_1d(ys, kws_pad=(kws_pad_sym, kws_pad_const7))
ys_ie2 = extrainterpolate_nans_1d(ys, kws_pad=(kws_pad_const7, kws_pad_sym))

fig, ax = plt.subplots()

ax.scatter(np.arange(len(ys)), ys, s=15**2, label='ys')
ax.scatter(np.arange(len(ys)), ys_ie0, s=8**2, label='ys_ie0, left_pad edge, right_pad edge')
ax.scatter(np.arange(len(ys)), ys_ie1, s=6**2, label='ys_ie1, left_pad symmetric, right_pad 7')
ax.scatter(np.arange(len(ys)), ys_ie2, s=4**2, label='ys_ie2, left_pad 7, right_pad symmetric')
ax.legend()
plt.show()
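
To make the effect of the pad keywords easier to see without a plot, here is a tiny standalone sketch of the three np.pad modes used above, applied to a made-up three-element array (the variable names are only for illustration):

```python
import numpy as np

core = np.array([1.0, 2.0, 3.0])   # stands in for the finite part of the data

# 'edge' repeats the border value
left_edge = np.pad(core, (2, 0), mode='edge')

# 'symmetric' mirrors the data, including the border value itself
left_sym = np.pad(core, (2, 0), mode='symmetric')

# 'constant' fills with a fixed value
right_const = np.pad(core, (0, 2), mode='constant', constant_values=7.0)

print(left_edge)    # [1. 1. 1. 2. 3.]
print(left_sym)     # [2. 1. 1. 2. 3.]
print(right_const)  # [1. 2. 3. 7. 7.]
```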

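As suggested by an earlier comment, the best way to do this is to use a peer-reviewed implementation. The pandas library has an interpolation method for 1D data, which interpolates np.nan values in a Series or DataFrame:

pandas.Series.interpolate

The documentation is quite concise and recommended reading! My implementation: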

import pandas as pd

magnitudes_series = pd.Series(magnitudes)    # Convert np.array to pd.Series
magnitudes_series.interpolate(
    # I used "akima" because the second derivative of my data has frequent drops to 0
    method=interpolation_method,

    # Interpolate from both sides of the sequence, up to you (made sense for my data)
    limit_direction="both",

    # Interpolate only np.nan sequences that have number sequences at the ends of the respective np.nan sequences
    limit_area="inside",

    inplace=True,
)

# I chose to remove np.nan at the tails of data sequence
magnitudes_series.dropna(inplace=True)

result_in_numpy_array = magnitudes_series.values

Importing scipy looks like overkill to me. Here is a simple way that uses only numpy and keeps the same conventions as np.interp:

import numpy as np

def interp_nans(x:[float],left=None, right=None, period=None)->[float]:
    """
    e.g. [1 1 1 nan nan 2 2 nan 0] -> [1 1 1 1.3 1.6 2 2  1  0]
    """
    # xp: indices of the finite values, fp: the finite values themselves
    xp = [i for i, yi in enumerate(x) if np.isfinite(yi)]
    fp = [yi for i, yi in enumerate(x) if np.isfinite(yi)]
    return list(np.interp(x=list(range(len(x))), xp=xp, fp=fp,left=left,right=right,period=period))
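
The left, right and period arguments are passed straight through to np.interp. As a small illustration of what period changes, here is a sketch that calls np.interp directly on a made-up array with NaNs at both ends. It assumes the samples form one full cycle, so the period equals the array length:

```python
import numpy as np

y = np.array([np.nan, 1.0, 2.0, 3.0, np.nan])
idx = np.arange(len(y))
good = np.isfinite(y)

# Default: edge NaNs are clamped to the nearest finite value
clamped = np.interp(idx, idx[good], y[good])

# period=len(y): the data is treated as cyclic, so edge NaNs are
# interpolated between the last and the first finite value
wrapped = np.interp(idx, idx[good], y[good], period=len(y))

print(clamped)
print(wrapped)
```

Here clamped starts at 1.0 and ends at 3.0, while wrapped bridges from 3.0 (index 3) around to 1.0 (index 1 of the next cycle), giving roughly 2.33 at index 4 and 1.67 at index 0.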

I use interpolation to replace all NaN values.

import numpy as np
nan = np.nan

A = np.array([1, nan, nan, 2, 2, nan, 0])
np.interp(np.arange(len(A)),
          np.arange(len(A))[~np.isnan(A)],
          A[~np.isnan(A)])

Output:

array([1. , 1.33333333, 1.66666667, 2. , 2. , 1. , 0. ])