Most efficient way to forward-fill NaN values in a numpy array

Example problem

As a simple example, consider the numpy array arr defined below:

import numpy as np
arr = np.array([[5, np.nan, np.nan, 7, 2],
                [3, np.nan, 1, 8, np.nan],
                [4, 9, 6, np.nan, np.nan]])

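where arr looks like this in console output:

array([[  5.,  nan,  nan,   7.,   2.],
       [  3.,  nan,   1.,   8.,  nan],
       [  4.,   9.,   6.,  nan,  nan]])

Now I would like to "forward-fill" the nan values in arr row by row. By that I mean replacing each nan value with the nearest valid value to its left. The desired result looks like this:

array([[  5.,   5.,   5.,  7.,  2.],
       [  3.,   3.,   1.,  8.,  8.],
       [  4.,   9.,   6.,  6.,  6.]])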

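What I've tried

I've tried using for-loops:

for row_idx in range(arr.shape[0]):
    # Start at column 1: column 0 has no left neighbour, and index -1 would wrap around
    for col_idx in range(1, arr.shape[1]):
        if np.isnan(arr[row_idx, col_idx]):
            arr[row_idx, col_idx] = arr[row_idx, col_idx - 1]

I've also tried using a pandas DataFrame as an intermediate step (since pandas DataFrames have a very neat built-in method for forward-filling):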

import pandas as pd
df = pd.DataFrame(arr)
df = df.ffill(axis=1)  # forward-fill along each row
arr = df.to_numpy()    # as_matrix() was removed in pandas 1.0

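Both of the above strategies produce the desired result, but I keep wondering: wouldn't a strategy that uses only numpy vectorized operations be the most efficient one?


Summary

Is there another, more efficient way to "forward-fill" nan values in numpy arrays (e.g. by using numpy vectorized operations)?


Update: Solutions comparison

I've tried to time all the solutions proposed so far. This was my setup script: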

import numba as nb
import numpy as np
import pandas as pd


def random_array():
    choices = [1, 2, 3, 4, 5, 6, 7, 8, 9, np.nan]
    out = np.random.choice(choices, size=(1000, 10))
    return out


def loops_fill(arr):
    out = arr.copy()
    for row_idx in range(out.shape[0]):
        for col_idx in range(1, out.shape[1]):
            if np.isnan(out[row_idx, col_idx]):
                out[row_idx, col_idx] = out[row_idx, col_idx - 1]
    return out


@nb.jit
def numba_loops_fill(arr):
    '''Numba decorator solution provided by shx2.'''
    out = arr.copy()
    for row_idx in range(out.shape[0]):
        for col_idx in range(1, out.shape[1]):
            if np.isnan(out[row_idx, col_idx]):
                out[row_idx, col_idx] = out[row_idx, col_idx - 1]
    return out


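def pandas_fill(arr):
    df = pd.DataFrame(arr)
    # ffill() and to_numpy() stand in for fillna(method='ffill') and as_matrix(),
    # which newer pandas versions deprecate or no longer provide
    out = df.ffill(axis=1).to_numpy()
    return out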


def numpy_fill(arr):
    '''Solution provided by Divakar.'''
    mask = np.isnan(arr)
    idx = np.where(~mask, np.arange(mask.shape[1]), 0)
    np.maximum.accumulate(idx, axis=1, out=idx)
    out = arr[np.arange(idx.shape[0])[:, None], idx]
    return out

followed by this console input:

%timeit -n 1000 loops_fill(random_array())
%timeit -n 1000 numba_loops_fill(random_array())
%timeit -n 1000 pandas_fill(random_array())
%timeit -n 1000 numpy_fill(random_array())

resulting in this console output:

1000 loops, best of 3: 9.64 ms per loop
1000 loops, best of 3: 377 µs per loop
1000 loops, best of 3: 455 µs per loop
1000 loops, best of 3: 351 µs per loop
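Note that each timed statement above also includes the cost of random_array(). A minimal sketch of timing two of the functions against one fixed input, using the standard timeit module, could look like this; the seed, the repeat counts and the name data are arbitrary assumptions, not part of the original setup.

```python
import timeit

import numpy as np

# Fixed input so that array generation is not part of the measurement.
rng = np.random.default_rng(0)
choices = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, np.nan])
data = rng.choice(choices, size=(1000, 10))


def loops_fill(arr):
    out = arr.copy()
    for row_idx in range(out.shape[0]):
        for col_idx in range(1, out.shape[1]):
            if np.isnan(out[row_idx, col_idx]):
                out[row_idx, col_idx] = out[row_idx, col_idx - 1]
    return out


def numpy_fill(arr):
    mask = np.isnan(arr)
    idx = np.where(~mask, np.arange(mask.shape[1]), 0)
    np.maximum.accumulate(idx, axis=1, out=idx)
    return arr[np.arange(idx.shape[0])[:, None], idx]


# Both functions should agree before their timings are compared.
assert np.array_equal(loops_fill(data), numpy_fill(data), equal_nan=True)

for fn in (loops_fill, numpy_fill):
    # Best of 3 repeats, 10 calls per repeat.
    seconds = min(timeit.repeat(lambda: fn(data), number=10, repeat=3))
    print(f"{fn.__name__}: {seconds / 10 * 1e6:.0f} us per call")
```

The absolute numbers will differ from machine to machine; only the relative ordering is meaningful.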

Here's one approach -

mask = np.isnan(arr)
idx = np.where(~mask,np.arange(mask.shape[1]),0)
np.maximum.accumulate(idx,axis=1, out=idx)
out = arr[np.arange(idx.shape[0])[:,None], idx]

If you don't want to create another array and just fill the NaNs in arr itself, replace the last step with this -

arr[mask] = arr[np.nonzero(mask)[0], idx[mask]]

Sample input, output -

In [179]: arr
Out[179]:
array([[  5.,  nan,  nan,   7.,   2.,   6.,   5.],
       [  3.,  nan,   1.,   8.,  nan,   5.,  nan],
       [  4.,   9.,   6.,  nan,  nan,  nan,   7.]])


In [180]: out
Out[180]:
array([[ 5.,  5.,  5.,  7.,  2.,  6.,  5.],
       [ 3.,  3.,  1.,  8.,  8.,  5.,  5.],
       [ 4.,  9.,  6.,  6.,  6.,  6.,  7.]])
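To see why this approach works, it can help to look at the intermediate index arrays. The sketch below traces them on the 3x5 array from the question; the names idx_raw and idx_filled are introduced here purely for illustration.

```python
import numpy as np

arr = np.array([[5, np.nan, np.nan, 7, 2],
                [3, np.nan, 1, 8, np.nan],
                [4, 9, 6, np.nan, np.nan]])

mask = np.isnan(arr)

# Column index where the value is valid, 0 where it is NaN.
idx_raw = np.where(~mask, np.arange(mask.shape[1]), 0)
# [[0 0 0 3 4]
#  [0 0 2 3 0]
#  [0 1 2 0 0]]

# A running maximum carries the index of the last valid column forward.
idx_filled = np.maximum.accumulate(idx_raw, axis=1)
# [[0 0 0 3 4]
#  [0 0 2 3 3]
#  [0 1 2 2 2]]

# Gather, for every position, the value at the last valid column of its row.
out = arr[np.arange(arr.shape[0])[:, None], idx_filled]
print(out)
```

Because NaN positions are mapped to index 0, a row that starts with NaN keeps its leading NaN values: they simply re-read column 0.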

Use Numba. This should give a significant speedup:

import numba
@numba.jit
def loops_fill(arr):
    ...

For those who are interested in dealing with the leading np.nan values that remain after forward-filling, the following works:

mask = np.isnan(arr)
first_valid_idx = (~mask).argmax(axis=1)  # index of the first non-NaN value in each row
# Repeat each row's first valid value over its leading NaN positions
arr = np.array([np.hstack([
          [arr[i, first_valid]] * first_valid,
          arr[i, first_valid:]])
      for i, first_valid in enumerate(first_valid_idx)])
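A vectorized alternative for the same leading-NaN problem is to forward-fill first and then backward-fill whatever is left. This is only a sketch under the assumption of a 2-D float array; ffill_then_bfill is a hypothetical helper name, and rows that contain nothing but NaN stay NaN.

```python
import numpy as np


def ffill(arr):
    # Row-wise forward fill via the running maximum of valid column indices.
    mask = np.isnan(arr)
    idx = np.where(~mask, np.arange(mask.shape[1]), 0)
    np.maximum.accumulate(idx, axis=1, out=idx)
    return arr[np.arange(idx.shape[0])[:, None], idx]


def ffill_then_bfill(arr):
    forward = ffill(arr)
    # Reversing the columns turns a backward fill into a forward fill.
    return ffill(forward[:, ::-1])[:, ::-1]


arr = np.array([[np.nan, np.nan, 4, np.nan, 6],
                [np.nan, 5, np.nan, np.nan, 1]])
filled = ffill_then_bfill(arr)
print(filled)
# [[4. 4. 4. 4. 6.]
#  [5. 5. 5. 5. 1.]]
```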

Update: As pointed out by financial_physician in the comments, my initially proposed solution can simply be exchanged with ffill on the reversed array and then reversing the result. There is no relevant performance loss. My initial solution seems to be 2% or 3% faster according to %timeit. I updated the code example below but left my initial text as it was.


For those who came here looking for a backward-fill of NaN values, I modified the solution provided by Divakar above to do exactly that. The trick is to do the accumulation on the reversed array, using the minimum instead of the maximum.

Here is the code:


# ffill along axis 1, as provided in the answer by Divakar
def ffill(arr):
    mask = np.isnan(arr)
    idx = np.where(~mask, np.arange(mask.shape[1]), 0)
    np.maximum.accumulate(idx, axis=1, out=idx)
    out = arr[np.arange(idx.shape[0])[:, None], idx]
    return out


# Simple solution for bfill provided by financial_physician in comment below
def bfill(arr):
    return ffill(arr[:, ::-1])[:, ::-1]


# My outdated modification of Divakar's answer to do a backward-fill
def bfill_old(arr):
    mask = np.isnan(arr)
    idx = np.where(~mask, np.arange(mask.shape[1]), mask.shape[1] - 1)
    idx = np.minimum.accumulate(idx[:, ::-1], axis=1)[:, ::-1]
    out = arr[np.arange(idx.shape[0])[:, None], idx]
    return out




# Test both functions
arr = np.array([[5, np.nan, np.nan, 7, 2],
                [3, np.nan, 1, 8, np.nan],
                [4, 9, 6, np.nan, np.nan]])
print('Array:')
print(arr)


print('\nffill')
print(ffill(arr))


print('\nbfill')
print(bfill(arr))


Output:

Array:
[[ 5. nan nan  7.  2.]
[ 3. nan  1.  8. nan]
[ 4.  9.  6. nan nan]]


ffill
[[5. 5. 5. 7. 2.]
[3. 3. 1. 8. 8.]
[4. 9. 6. 6. 6.]]


bfill
[[ 5.  7.  7.  7.  2.]
[ 3.  1.  1.  8. nan]
[ 4.  9.  6. nan nan]]

Edit: Update according to comment of MS_

I liked Divakar's answer on pure numpy. Here's a generalized function for n-dimensional arrays:

def np_ffill(arr, axis):
    idx_shape = tuple([slice(None)] + [np.newaxis] * (len(arr.shape) - axis - 1))
    idx = np.where(~np.isnan(arr), np.arange(arr.shape[axis])[idx_shape], 0)
    np.maximum.accumulate(idx, axis=axis, out=idx)
    slc = [np.arange(k)[tuple([slice(None) if dim==i else np.newaxis
        for dim in range(len(arr.shape))])]
        for i, k in enumerate(arr.shape)]
    slc[axis] = idx
    return arr[tuple(slc)]

AFAIK pandas can only work with two dimensions, despite having a MultiIndex to make up for it. The only way to accomplish this with pandas would be to flatten a DataFrame, unstack the desired level, restack, and finally reshape to the original shape. This unstacking/restacking/reshaping, with the pandas sorting involved, is just unnecessary overhead to achieve the same result.

Testing:

def random_array(shape):
    choices = [1, 2, 3, 4, np.nan]
    out = np.random.choice(choices, size=shape)
    return out


ra = random_array((2, 4, 8))
print('arr')
print(ra)
print('\nffull')
print(np_ffill(ra, 1))
raise SystemExit

Output:

arr
[[[ 3. nan  4.  1.  4.  2.  2.  3.]
[ 2. nan  1.  3. nan  4.  4.  3.]
[ 3.  2. nan  4. nan nan  3.  4.]
[ 2.  2.  2. nan  1.  1. nan  2.]]


[[ 2.  3.  2. nan  3.  3.  3.  3.]
[ 3.  3.  1.  4.  1.  4.  1. nan]
[ 4.  2. nan  4.  4.  3. nan  4.]
[ 2.  4.  2.  1.  4.  1.  3. nan]]]


ffull
[[[ 3. nan  4.  1.  4.  2.  2.  3.]
[ 2. nan  1.  3.  4.  4.  4.  3.]
[ 3.  2.  1.  4.  4.  4.  3.  4.]
[ 2.  2.  2.  4.  1.  1.  3.  2.]]


[[ 2.  3.  2. nan  3.  3.  3.  3.]
[ 3.  3.  1.  4.  1.  4.  1.  3.]
[ 4.  2.  1.  4.  4.  3.  1.  4.]
[ 2.  4.  2.  1.  4.  1.  3.  4.]]]

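I used np.nan_to_num. Note that this replaces every NaN with one constant (here, the mean of the valid values) rather than forward-filling. Example:

# The nan= keyword needs NumPy 1.17 or later; np.nanmean ignores the NaN entries
data = np.nan_to_num(data, nan=np.nanmean(data))

Reference: NumPy documentation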

I like Divakar's answer, but it doesn't work for an edge case where a row starts with np.nan, like the arr below

arr = np.array([[9, np.nan, 4, np.nan, 6, 6, 7, 2, 3, np.nan],
[ np.nan, 5, 5, 6, 5, 3, 2, 1, np.nan, 10]])

The output using Divakar's code would be:

[[ 9.  9.  4.  4.  6.  6.  7.  2.  3.  3.]
 [nan  5.  5.  6.  5.  3.  2.  1.  1. 10.]]

Divakar's code can be simplified a bit, and the simplified version solves this issue at the same time:

arr[np.isnan(arr)] = arr[np.nonzero(np.isnan(arr))[0], np.nonzero(np.isnan(arr))[1]-1]


In case of several np.nans in a row (either in the beginning or in the middle), just repeat this operation several times. For instance, if the array has 5 consecutive np.nans, the following code will "forward fill" all of them with the number before these np.nans:

for i in range(0, 5):
    value[np.isnan(value)] = value[np.nonzero(np.isnan(value))[0], np.nonzero(np.isnan(value))[1]-1]
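The repeat count does not have to be guessed in advance. Below is a hedged sketch of the same shift-by-one idea that stops once no NaN with a left neighbour remains; ffill_by_shifting is a hypothetical name. Unlike the one-liner above, where column index -1 wraps around to the last column, this version leaves leading NaN values untouched.

```python
import numpy as np


def ffill_by_shifting(arr):
    out = arr.copy()
    # A run of NaNs is at most ncols - 1 long, so that many passes suffice.
    for _ in range(out.shape[1] - 1):
        # Only NaNs in columns >= 1 have a left neighbour to copy from.
        rows, cols = np.nonzero(np.isnan(out[:, 1:]))
        if rows.size == 0:
            break
        # cols is relative to out[:, 1:], so the target column is cols + 1
        # and its left neighbour is cols.
        out[rows, cols + 1] = out[rows, cols]
    return out


arr = np.array([[5, np.nan, np.nan, 7, 2],
                [3, np.nan, 1, 8, np.nan],
                [np.nan, np.nan, 1, np.nan, np.nan]])
filled = ffill_by_shifting(arr)
print(filled)
```

Each pass shortens every NaN run by one element, so this is slower than the index-based approach when runs are long.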

Unless I'm missing something, the solutions do not work on every example:

arr = np.array([[ 3.],
                [ 8.],
                [np.nan],
                [ 7.],
                [np.nan],
                [ 1.],
                [np.nan],
                [ 3.],
                [ 8.],
                [ 8.]])
print("A:::: \n", arr)


print("numpy_fill::: \n ",  numpy_fill(arr))
print("loop_fill",  loops_fill(arr))


A::::
[[ 3.]
[ 8.]
[nan]
[ 7.]
[nan]
[ 1.]
[nan]
[ 3.]
[ 8.]
[ 8.]]
numpy_fill:::
[[ 3.]
[ 8.]
[nan]
[ 7.]
[nan]
[ 1.]
[nan]
[ 3.]
[ 8.]
[ 8.]]
loop_fill [[ 3.]
[ 8.]
[nan]
[ 7.]
[nan]
[ 1.]
[nan]
[ 3.]
[ 8.]
[ 8.]]


Comments ??
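A likely explanation for the output above: arr has shape (10, 1), so every row holds a single value and a row-wise fill (axis 1) has nothing to carry forward. Filling down the column means working along axis 0, which can be sketched by transposing before and after the call. The variable names col, unchanged and filled are illustrative only.

```python
import numpy as np


def numpy_fill(arr):
    # Row-wise (axis 1) forward fill, as in the question's setup script.
    mask = np.isnan(arr)
    idx = np.where(~mask, np.arange(mask.shape[1]), 0)
    np.maximum.accumulate(idx, axis=1, out=idx)
    return arr[np.arange(idx.shape[0])[:, None], idx]


col = np.array([[3.], [8.], [np.nan], [7.], [np.nan],
                [1.], [np.nan], [3.], [8.], [8.]])

unchanged = numpy_fill(col)   # each row has one element, so nothing is filled
filled = numpy_fill(col.T).T  # fill along the column instead

print(filled.ravel())
# [3. 8. 8. 7. 7. 1. 1. 3. 8. 8.]
```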

If you're willing to use pandas/xarray: let axis be the direction you wish to ffill/bfill over, as shown below,

import xarray as xr

xr.DataArray(arr).ffill(f'dim_{axis}').values
xr.DataArray(arr).bfill(f'dim_{axis}').values

More information:
http://xarray.pydata.org/en/stable/generated/xarray.DataArray.ffill.html
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.ffill.html

The push function from bottleneck is a good option for forward filling. It is used internally by packages such as xarray, it should be faster than the other alternatives, and the package also comes with a set of benchmarks.

Example:

import numpy as np
from bottleneck import push

a = np.array(
    [
        [1, np.nan, 3],
        [np.nan, 3, 2],
        [2, np.nan, np.nan]
    ]
)
push(a, axis=0)
# array([[ 1., nan,  3.],
#        [ 1.,  3.,  2.],
#        [ 2.,  3.,  2.]])

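Use the bottleneck module. It is an optional dependency of pandas, so it may already be present in your environment; if it is not, it has to be installed separately (for example with pip install bottleneck).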

Below code should give you desired result.

import bottleneck as bn
bn.push(arr,axis=1)

One-liner (note that this replaces each NaN with 0 rather than forward-filling):

result = np.where(np.isnan(arr), 0, arr)

As a function that forces a float dtype (I needed this because my array had dtype=object):

def fillna(arr):
    arr = np.array(arr, dtype=float)
    out = np.where(np.isnan(arr), 0, arr)
    return out

arr = np.array([[5, np.nan, np.nan, 7, 2],
                [3, np.nan, 1, 8, np.nan],
                [4, 9, 6, np.nan, np.nan]])

result = fillna(arr)
print(result)
# result
# array([[5., 0., 0., 7., 2.],
#        [3., 0., 1., 8., 0.],
#        [4., 9., 6., 0., 0.]])


A minor improvement on RichieV's generalized pure numpy solution, adding axis selection and 'backward' support:

def _np_fill_(arr, axis=-1, fill_dir='f'):
    """Base function for np_fill, np_ffill, np_bfill."""
    if axis < 0:
        axis = len(arr.shape) + axis

    if fill_dir.lower() in ['b', 'backward']:
        # Reverse along `axis`, forward-fill along that same axis, then reverse back
        dir_change = tuple([*[slice(None)]*axis, slice(None, None, -1)])
        return np_ffill(arr[dir_change], axis=axis)[dir_change]
    elif fill_dir.lower() not in ['f', 'forward']:
        raise KeyError(f"fill_dir must be one of: 'b', 'backward', 'f', 'forward'. Got: {fill_dir}")

    idx_shape = tuple([slice(None)] + [np.newaxis] * (len(arr.shape) - axis - 1))
    idx = np.where(~np.isnan(arr), np.arange(arr.shape[axis])[idx_shape], 0)
    np.maximum.accumulate(idx, axis=axis, out=idx)
    slc = [np.arange(k)[tuple([slice(None) if dim==i else np.newaxis
        for dim in range(len(arr.shape))])]
        for i, k in enumerate(arr.shape)]
    slc[axis] = idx
    return arr[tuple(slc)]


def np_fill(arr, axis=-1, fill_dir='f'):
    """General fill function which supports multiple filling steps. I.e.:
    fill_dir=['f', 'b'] or fill_dir=['b', 'f']"""
    if isinstance(fill_dir, (tuple, list, np.ndarray)):
        for i in fill_dir:
            arr = _np_fill_(arr, axis=axis, fill_dir=i)
    else:
        arr = _np_fill_(arr, axis=axis, fill_dir=fill_dir)
    return arr


def np_ffill(arr, axis=-1):
    return np_fill(arr, axis=axis, fill_dir='forward')


def np_bfill(arr, axis=-1):
    return np_fill(arr, axis=axis, fill_dir='backward')
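As a possible simplification of the same idea, np.take_along_axis (available since NumPy 1.15) can do the final gather without building the index tuple by hand. This is a sketch rather than a drop-in replacement for the functions above; the names ffill_along and bfill_along are hypothetical.

```python
import numpy as np


def ffill_along(arr, axis=-1):
    n = arr.shape[axis]
    # Positions 0..n-1 shaped so that they broadcast along `axis` only.
    shape = [1] * arr.ndim
    shape[axis] = n
    positions = np.arange(n).reshape(shape)
    # Position where valid, 0 where NaN; the running max carries it forward.
    idx = np.where(np.isnan(arr), 0, positions)
    idx = np.maximum.accumulate(idx, axis=axis)
    return np.take_along_axis(arr, idx, axis=axis)


def bfill_along(arr, axis=-1):
    # A backward fill is a forward fill on the array flipped along `axis`.
    flipped = np.flip(arr, axis=axis)
    return np.flip(ffill_along(flipped, axis=axis), axis=axis)


x = np.array([[1, np.nan],
              [np.nan, 4],
              [np.nan, np.nan]])

down = ffill_along(x, axis=0)
both = bfill_along(down, axis=0)
print(both)
# [[1. 4.]
#  [1. 4.]
#  [1. 4.]]
```

Chaining the two calls mirrors fill_dir=['f', 'b'] above: the forward pass fills gaps and trailing NaN values, and the backward pass then fills any leading NaN values.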