Find nearest value in numpy array

How do I find the nearest value in a numpy array? Example:

np.find_nearest(array, value)
import numpy as np
def find_nearest(array, value):
    array = np.asarray(array)
    idx = (np.abs(array - value)).argmin()
    return array[idx]

Example usage:

array = np.random.random(10)
print(array)
# [ 0.21069679  0.61290182  0.63425412  0.84635244  0.91599191  0.00213826
#   0.17104965  0.56874386  0.57319379  0.28719469]


print(find_nearest(array, value=0.5))
# 0.568743859261

With a slight modification, the answer above works with arrays of arbitrary dimension (1d, 2d, 3d, ...):

def find_nearest(a, a0):
    "Element in nd array `a` closest to the scalar value `a0`"
    idx = np.abs(a - a0).argmin()
    return a.flat[idx]

Or, written as a single line:

a.flat[np.abs(a - a0).argmin()]
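
As a small illustration of the one-liner above, the sketch below applies it to a hand-made 2-D array and also recovers the (row, col) position with np.unravel_index. The sample values in `a` and `a0` are assumptions chosen only for this example.

```python
import numpy as np

# Sample data (assumed for illustration only)
a = np.array([[1.0, 4.0, 9.0],
              [16.0, 25.0, 36.0]])
a0 = 20.0

# argmin() without an axis works on the flattened array
flat_idx = np.abs(a - a0).argmin()

# a.flat indexes the array as if it were 1-D
nearest = a.flat[flat_idx]

# Convert the flat index back to a (row, col) pair
row, col = np.unravel_index(flat_idx, a.shape)

print(nearest, row, col)
```

Because argmin() returns the first minimum it encounters, ties are resolved in favor of the element that comes first in flattened (row-major) order.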

Here is a version that handles a non-scalar "values" array:

import numpy as np


def find_nearest(array, values):
    indices = np.abs(np.subtract.outer(array, values)).argmin(0)
    return array[indices]

Or a version that returns a numeric type (e.g. int, float) if the input is scalar:

def find_nearest(array, values):
    values = np.atleast_1d(values)
    indices = np.abs(np.subtract.outer(array, values)).argmin(0)
    out = array[indices]
    return out if len(out) > 1 else out[0]
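
To make the role of np.subtract.outer in the two versions above more concrete, here is a minimal sketch with made-up sample data (the arrays `array` and `values` are assumptions for illustration). It shows the shape of the intermediate difference matrix and the indices that argmin(0) picks.

```python
import numpy as np

# Sample data (assumed for illustration only)
array = np.array([10, 20, 30, 40])
values = np.array([12, 33, 27])

# diff[i, j] = array[i] - values[j], so the shape is (len(array), len(values))
diff = np.subtract.outer(array, values)

# For each column (each query value), find the row with the smallest |difference|
indices = np.abs(diff).argmin(0)
nearest = array[indices]

print(diff.shape)   # (4, 3)
print(indices)      # [0 2 2]
print(nearest)      # [10 30 30]
```

The intermediate matrix holds len(array) * len(values) elements, so this approach is likely to become memory-hungry when both inputs are large.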

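Here is an extension to find the nearest vector in an array of vectors.

import numpy as np


def find_nearest_vector(array, value):
    # Euclidean distance from every row of `array` to `value`
    idx = np.linalg.norm(np.asarray(array) - np.asarray(value), axis=1).argmin()
    return array[idx]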


A = np.random.random((10,2))*100
""" A = array([[ 34.19762933,  43.14534123],
[ 48.79558706,  47.79243283],
[ 38.42774411,  84.87155478],
[ 63.64371943,  50.7722317 ],
[ 73.56362857,  27.87895698],
[ 96.67790593,  77.76150486],
[ 68.86202147,  21.38735169],
[  5.21796467,  59.17051276],
[ 82.92389467,  99.90387851],
[  6.76626539,  30.50661753]])"""
pt = [6, 30]
print(find_nearest_vector(A, pt))
# array([  6.76626539,  30.50661753])
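
If several query points need to be matched at once, one possible sketch is to let broadcasting build the full distance matrix. The name `find_nearest_vectors` and the sample points are hypothetical, introduced only for this example; it is not part of the answer above.

```python
import numpy as np

def find_nearest_vectors(points, queries):
    """Hypothetical helper: for each row of `queries`, return the index of
    the closest row of `points` (Euclidean distance)."""
    points = np.asarray(points, dtype=float)
    queries = np.asarray(queries, dtype=float)
    # Shape (n_queries, n_points, dim) via broadcasting, reduced over dim
    dists = np.linalg.norm(points[None, :, :] - queries[:, None, :], axis=2)
    return dists.argmin(axis=1)

# Sample data (assumed for illustration only)
A = np.array([[0, 0], [10, 0], [0, 10], [10, 10]])
Q = np.array([[1, 1], [9, 8]])

idx = find_nearest_vectors(A, Q)
print(idx)      # [0 3]
print(A[idx])   # the matching rows of A
```

The intermediate array has n_queries * n_points * dim elements, so for large inputs a spatial index is probably the better choice.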

If you don't want to use numpy, this will do it:

def find_nearest(array, value):
    n = [abs(i-value) for i in array]
    idx = n.index(min(n))
    return array[idx]
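
A slightly shorter pure-Python variant of the same idea is to pass a key function to the built-in min(). The name `find_nearest_py` and the sample list are assumptions for this sketch.

```python
def find_nearest_py(seq, value):
    """Hypothetical helper: element of `seq` with the smallest |x - value|.
    min() returns the first such element if there is a tie."""
    return min(seq, key=lambda x: abs(x - value))

# Sample data (assumed for illustration only)
result = find_nearest_py([1, 4, 9, 16], 11)
print(result)  # 9
```

This still scans the whole sequence once, so it remains O(n) like the list-based version.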

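If your array is sorted and is very large, this is a much faster solution:

import math
import numpy as np

def find_nearest(array, value):
    idx = np.searchsorted(array, value, side="left")
    if idx > 0 and (idx == len(array) or math.fabs(value - array[idx-1]) < math.fabs(value - array[idx])):
        return array[idx-1]
    else:
        return array[idx]

This scales to very large arrays. If you can't assume that the array is already sorted, you can easily modify the above to sort inside the method. It's overkill for small arrays, but once they get large this is much faster.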

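For large arrays, the (excellent) answer given by @Demitri is far faster than the answer currently marked as best. I've adapted his exact algorithm in the following two ways:

  1. The function below works whether or not the input array is sorted.

  2. The function below returns the index of the input array corresponding to the closest value, which is somewhat more general.

Note that the function below also handles a specific edge case that would lead to a bug in the original function written by @Demitri. Otherwise, my algorithm is identical to his.

def find_idx_nearest_val(array, value):
    idx_sorted = np.argsort(array)
    sorted_array = np.array(array[idx_sorted])
    idx = np.searchsorted(sorted_array, value, side="left")
    if idx >= len(array):
        idx_nearest = idx_sorted[len(array)-1]
    elif idx == 0:
        idx_nearest = idx_sorted[0]
    else:
        if abs(value - sorted_array[idx-1]) < abs(value - sorted_array[idx]):
            idx_nearest = idx_sorted[idx-1]
        else:
            idx_nearest = idx_sorted[idx]
    return idx_nearest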

Here is a version with scipy for @Ari Onasafari, answering "find the nearest vector in an array of vectors":

In [1]: from scipy import spatial


In [2]: import numpy as np


In [3]: A = np.random.random((10,2))*100


In [4]: A
Out[4]:
array([[ 68.83402637,  38.07632221],
[ 76.84704074,  24.9395109 ],
[ 16.26715795,  98.52763827],
[ 70.99411985,  67.31740151],
[ 71.72452181,  24.13516764],
[ 17.22707611,  20.65425362],
[ 43.85122458,  21.50624882],
[ 76.71987125,  44.95031274],
[ 63.77341073,  78.87417774],
[  8.45828909,  30.18426696]])


In [5]: pt = [6, 30]  # <-- the point to find


In [6]: A[spatial.KDTree(A).query(pt)[1]] # <-- the nearest point
Out[6]: array([  8.45828909,  30.18426696])


#how it works!
In [7]: distance,index = spatial.KDTree(A).query(pt)


In [8]: distance # <-- The distances to the nearest neighbors
Out[8]: 2.4651855048258393


In [9]: index # <-- The locations of the neighbors
Out[9]: 9


#then
In [10]: A[index]
Out[10]: array([  8.45828909,  30.18426696])

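Summary of answer: If you have a sorted array, then the bisection code (given below) performs the fastest: ~100-1000 times faster for large arrays and ~2-100 times faster for small arrays. It does not require numpy either. If you have an unsorted array, then if the array is large you should consider first using an O(n log n) sort and then bisection, and if the array is small then method 2 seems to be the fastest.

First you should clarify what you mean by nearest value. Often one wants the interval in an abscissa, e.g. array=[0,0.7,2.1], value=1.95, and the answer would be idx=1. This is the case that I suspect you need (otherwise the following can be modified very easily with a follow-up conditional statement once you find the interval). I will note that the optimal way to perform this is with bisection (which I will provide first; note that it does not require numpy at all and is faster than using numpy functions because they perform redundant operations). Then I will provide a timing comparison against the other methods presented here by other users.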
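
The difference between "interval index" and "nearest element index" in the example above can be sketched with the standard-library bisect module. This is only an illustration using the numbers from the paragraph; it is not the author's bisection routine, and it does not reproduce that routine's out-of-range return values.

```python
import bisect

array = [0, 0.7, 2.1]
value = 1.95

# Index j of the interval such that array[j] <= value < array[j+1]
interval_idx = bisect.bisect_right(array, value) - 1

# Index of the element with the smallest absolute difference
nearest_idx = min(range(len(array)), key=lambda i: abs(array[i] - value))

print(interval_idx)  # 1, because 0.7 <= 1.95 < 2.1
print(nearest_idx)   # 2, because 2.1 is only 0.15 away
```

So for the same input the two interpretations give different answers, which is why it matters to decide up front which one is wanted.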

Bisection:

def bisection(array,value):
    '''Given an ``array`` , and given a ``value`` , returns an index j such that ``value`` is between array[j]
    and array[j+1]. ``array`` must be monotonic increasing. j=-1 or j=len(array) is returned
    to indicate that ``value`` is out of range below and above respectively.'''
    n = len(array)
    if (value < array[0]):
        return -1
    elif (value > array[n-1]):
        return n
    jl = 0  # Initialize lower
    ju = n-1  # and upper limits.
    while (ju-jl > 1):  # If we are not yet done,
        jm=(ju+jl) >> 1  # compute a midpoint with a bitshift
        if (value >= array[jm]):
            jl=jm  # and replace either the lower limit
        else:
            ju=jm  # or the upper limit, as appropriate.
        # Repeat until the test condition is satisfied.
    if (value == array[0]):  # edge cases at bottom
        return 0
    elif (value == array[n-1]):  # and top
        return n-1
    else:
        return jl

Now I'll define the code from the other answers; each of them returns an index:

import math
import numpy as np


def find_nearest1(array,value):
    idx,val = min(enumerate(array), key=lambda x: abs(x[1]-value))
    return idx


def find_nearest2(array, values):
    indices = np.abs(np.subtract.outer(array, values)).argmin(0)
    return indices


def find_nearest3(array, values):
    values = np.atleast_1d(values)
    indices = np.abs(np.int64(np.subtract.outer(array, values))).argmin(0)
    out = array[indices]
    return indices


def find_nearest4(array,value):
    idx = (np.abs(array-value)).argmin()
    return idx


def find_nearest5(array, value):
    idx_sorted = np.argsort(array)
    sorted_array = np.array(array[idx_sorted])
    idx = np.searchsorted(sorted_array, value, side="left")
    if idx >= len(array):
        idx_nearest = idx_sorted[len(array)-1]
    elif idx == 0:
        idx_nearest = idx_sorted[0]
    else:
        if abs(value - sorted_array[idx-1]) < abs(value - sorted_array[idx]):
            idx_nearest = idx_sorted[idx-1]
        else:
            idx_nearest = idx_sorted[idx]
    return idx_nearest


def find_nearest6(array,value):
    xi = np.argmin(np.abs(np.ceil(array[None].T - value)),axis=0)
    return xi

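Now I'll time the codes. Note that methods 1, 2, 4 and 5 don't correctly give the interval. Methods 1, 2 and 4 round to the nearest point in the array (e.g. >=1.5 -> 2), and method 5 always rounds up (e.g. 1.45 -> 2). Only methods 3 and 6, and of course bisection, give the interval properly.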

array = np.arange(100000)
val = array[50000]+0.55
print( bisection(array,val))
%timeit bisection(array,val)
print( find_nearest1(array,val))
%timeit find_nearest1(array,val)
print( find_nearest2(array,val))
%timeit find_nearest2(array,val)
print( find_nearest3(array,val))
%timeit find_nearest3(array,val)
print( find_nearest4(array,val))
%timeit find_nearest4(array,val)
print( find_nearest5(array,val))
%timeit find_nearest5(array,val)
print( find_nearest6(array,val))
%timeit find_nearest6(array,val)


(50000, 50000)
100000 loops, best of 3: 4.4 µs per loop
50001
1 loop, best of 3: 180 ms per loop
50001
1000 loops, best of 3: 267 µs per loop
[50000]
1000 loops, best of 3: 390 µs per loop
50001
1000 loops, best of 3: 259 µs per loop
50001
1000 loops, best of 3: 1.21 ms per loop
[50000]
1000 loops, best of 3: 746 µs per loop

For a large array bisection takes 4us, compared with 180us for the next best and 1.21ms for the longest (~100 - 1000 times faster). For smaller arrays it is ~2-100 times faster.

I think the most pythonic way would be:

num = 65 # Input number
array = np.random.random((10))*100 # Given array
nearest_idx = np.where(abs(array-num)==abs(array-num).min())[0] # If you want the index of the element of array (array) nearest to the given number (num)
nearest_val = array[abs(array-num)==abs(array-num).min()] # If you directly want the element of array (array) nearest to the given number (num)

This is the basic code. You can wrap it in a function if you want.

Here is a fast vectorized version of @Dimitri's solution for when you have many values to search for (values can be a multi-dimensional array):

# `array` should be sorted (np.searchsorted requires it); `values` may be in any order
def get_closest(array, values):
    # make sure array is a numpy array
    array = np.array(array)

    # get insert positions
    idxs = np.searchsorted(array, values, side="left")

    # find indexes where previous index is closer
    prev_idx_is_less = ((idxs == len(array))|(np.fabs(values - array[np.maximum(idxs-1, 0)]) < np.fabs(values - array[np.minimum(idxs, len(array)-1)])))
    idxs[prev_idx_is_less] -= 1

    return array[idxs]

Benchmarks

More than 100 times faster than using a for loop with @Demitri's solution:

>>> %timeit ar=get_closest(np.linspace(1, 1000, 100), np.random.randint(0, 1050, (1000, 1000)))
139 ms ± 4.04 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


>>> %timeit ar=[find_nearest(np.linspace(1, 1000, 100), value) for value in np.random.randint(0, 1050, 1000*1000)]
took 21.4 seconds

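All the answers are useful for gathering the information needed to write efficient code. However, I have written a small Python script to optimize for various cases. The best case is when the provided array is sorted. If you search for the index of the point nearest to a single specified value, the bisect module is the most time efficient. When you search for the indices corresponding to an array of values, numpy searchsorted is the most efficient.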

import numpy as np
import bisect
xarr = np.random.rand(int(1e7))


srt_ind = xarr.argsort()
xar = xarr.copy()[srt_ind]
xlist = xar.tolist()
bisect.bisect_left(xlist, 0.3)

In [63]: %time bisect.bisect_left(xlist, 0.3)
CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 22.2 µs

np.searchsorted(xar, 0.3, side="left")

In [64]: %time np.searchsorted(xar, 0.3, side="left")
CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 98.9 µs

randpts = np.random.rand(1000)
np.searchsorted(xar, randpts, side="left")

%time np.searchsorted(xar, randpts, side="left")
CPU times: user 4 ms, sys: 0 ns, total: 4 ms
Wall time: 1.2 ms

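If we follow the multiplicative rule (1000 separate calls at ~98.9 µs each), numpy should take ~100 ms, so the single vectorized call is ~83 times faster.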
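
The arithmetic behind that estimate can be spelled out as follows. The sketch only reuses the wall times quoted above; the variable names are made up for the example.

```python
# Wall times taken from the %time output quoted above
single_query_us = 98.9   # one scalar np.searchsorted call, in microseconds
n_queries = 1000         # number of points in randpts
batch_ms = 1.2           # one vectorized call for all 1000 points, in milliseconds

# Cost of 1000 separate scalar calls, if the time simply multiplied
estimated_loop_ms = single_query_us * n_queries / 1000.0

# Speed-up using the rounded ~100 ms figure, as in the text
speedup_rounded = 100.0 / batch_ms

# Speed-up using the unrounded estimate
speedup_exact = estimated_loop_ms / batch_ms

print(estimated_loop_ms, speedup_rounded, speedup_exact)
```

With the rounded figure the ratio comes out at about 83, and with the unrounded 98.9 ms it is slightly lower, at about 82.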

import numpy as np
def find_nearest(array, value):
    array = np.array(array)
    z=np.abs(array-value)
    y= np.where(z == z.min())
    m=np.array(y)
    x=m[0,0]
    y=m[1,0]
    near_value=array[x,y]

    return near_value


array =np.array([[60,200,30],[3,30,50],[20,1,-50],[20,-500,11]])
print(array)
value = 0
print(find_nearest(array, value))

This is a vectorized version of unutbu's answer:

import numpy as np
import matplotlib.pyplot as plt


def find_nearest(array, values):
    array = np.asarray(array)

    # the last dim must be 1 to broadcast in (array - values) below.
    values = np.expand_dims(values, axis=-1)

    indices = np.abs(array - values).argmin(axis=-1)

    return array[indices]


image = plt.imread('example_3_band_image.jpg')

print(image.shape) # should be (nrows, ncols, 3)

quantiles = np.linspace(0, 255, num=2 ** 2, dtype=np.uint8)

# cast to a signed type first: subtracting two uint8 arrays would wrap around
quantiled_image = find_nearest(quantiles, image.astype(np.int16))

print(quantiled_image.shape) # should be (nrows, ncols, 3)

Probably helpful for ndarrays:

def find_nearest(X, value):
    return X[np.unravel_index(np.argmin(np.abs(X - value)), X.shape)]

For a 2d array, to determine the i, j position of the nearest element:

import numpy as np
def find_nearest(a, a0):
    idx = (np.abs(a - a0)).argmin()
    w = a.shape[1]
    i = idx // w
    j = idx - i * w
    return a[i,j], i, j

Here is a version that works with 2D arrays, using scipy's cdist function if the user has it, and a simpler distance calculation if they don't.

By default, the output is the index that is closest to the value you input, but you can change that with the output keyword to be one of 'index', 'value', or 'both', where 'value' outputs array[index] and 'both' outputs index, array[index].

For very large arrays, you may need to use kind='euclidean', as the default scipy cdist function may run out of memory.

This is maybe not the absolute fastest solution, but it's quite close.

def find_nearest_2d(array, value, kind='cdist', output='index'):
    # 'array' must be a 2D array
    # 'value' must be a 1D array with 2 elements
    # 'kind' defines what method to use to calculate the distances. Can choose one
    #    of 'cdist' (default) or 'euclidean'. Choose 'euclidean' for very large
    #    arrays. Otherwise, cdist is much faster.
    # 'output' defines what the output should be. Can be 'index' (default) to return
    #    the index of the array that is closest to the value, 'value' to return the
    #    value that is closest, or 'both' to return index,value
    import numpy as np
    if kind == 'cdist':
        try:
            from scipy.spatial.distance import cdist
        except ImportError:
            print("Warning (find_nearest_2d): Could not import cdist. Reverting to simpler distance calculation")
            kind = 'euclidean'
    # Check whether the value is already in the array (whole rows must match,
    # not just a single coordinate)
    index = np.where((np.asarray(array) == np.asarray(value)).all(axis=1))[0]
    if index.size == 0:
        if kind == 'cdist':
            index = np.argmin(cdist([value],array)[0])
        elif kind == 'euclidean':
            index = np.argmin(np.sum((np.array(array)-np.array(value))**2.,axis=1))
        else:
            raise ValueError("Keyword 'kind' must be one of 'cdist' or 'euclidean'")
    if output == 'index':
        return index
    elif output == 'value':
        return array[index]
    elif output == 'both':
        return index,array[index]
    else:
        raise ValueError("Keyword 'output' must be one of 'index', 'value', or 'both'")

For those searching for multiple nearest values, here is a modification of the accepted answer:

import numpy as np
def find_nearest(array, value, k):
    array = np.asarray(array)
    idx = np.argsort(abs(array - value))[:k]
    return array[idx]

See: https://stackoverflow.com/a/66937734/11671779
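
When k is much smaller than the array, a full argsort does more work than needed. One possible alternative, sketched below, swaps in np.argpartition to select the k smallest distances first and then sorts only those. The name `find_k_nearest` and the sample data are assumptions for this example, and it expects 1 <= k <= len(array).

```python
import numpy as np

def find_k_nearest(array, value, k):
    """Hypothetical helper: the k elements of a 1-D `array` closest to
    `value`, ordered from nearest to farthest. Requires 1 <= k <= len(array)."""
    array = np.asarray(array)
    dist = np.abs(array - value)
    # Indices of the k smallest distances, in no particular order
    part = np.argpartition(dist, k - 1)[:k]
    # Sort just those k candidates by distance
    order = part[np.argsort(dist[part])]
    return array[order]

# Sample data (assumed for illustration only)
result = find_k_nearest([1, 5, 9, 12, 20], 10, 2)
print(result)  # [ 9 12]
```

Whether this is actually faster than the plain argsort version depends on the array size and on k, so it is worth timing on real data.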
This one handles any number of queries using numpy searchsorted, so once the input arrays are sorted it is just as fast. It also works on regular grids in 2d, 3d, ...:
#!/usr/bin/env python3
# keywords: nearest-neighbor regular-grid python numpy searchsorted Voronoi


import numpy as np


#...............................................................................
class Near_rgrid( object ):
    """ nearest neighbors on a Manhattan aka regular grid
    1d:
    near = Near_rgrid( x: sorted 1d array )
    nearix = near.query( q: 1d ) -> indices of the points x_i nearest each q_i
        x[nearix[0]] is the nearest to q[0]
        x[nearix[1]] is the nearest to q[1] ...
        nearpoints = x[nearix] is near q
    If A is an array of e.g. colors at x[0] x[1] ...,
    A[nearix] are the values near q[0] q[1] ...
    Query points < x[0] snap to x[0], similarly > x[-1].

    2d: on a Manhattan aka regular grid,
        streets running east-west at y_i, avenues north-south at x_j,
    near = Near_rgrid( y, x: sorted 1d arrays, e.g. latitude longitude )
    I, J = near.query( q: nq × 2 array, columns qy qx )
    -> nq × 2 indices of the gridpoints y_i x_j nearest each query point
        gridpoints = np.column_stack(( y[I], x[J] ))  # e.g. street corners
        diff = gridpoints - querypoints
        distances = norm( diff, axis=1, ord= )
    Values at an array A defined at the gridpoints y_i x_j nearest q: A[I,J]

    3d: Near_rgrid( z, y, x: 1d axis arrays ) .query( q: nq × 3 array )

    See Howitworks below, and the plot Voronoi-random-regular-grid.
    """

    def __init__( self, *axes: "1d arrays" ):
        axarrays = []
        for ax in axes:
            axarray = np.asarray( ax ).squeeze()
            assert axarray.ndim == 1, "each axis should be 1d, not %s " % (
                    str( axarray.shape ))
            axarrays += [axarray]
        self.midpoints = [_midpoints( ax ) for ax in axarrays]
        self.axes = axarrays
        self.ndim = len(axes)

    def query( self, queries: "nq × dim points" ) -> "nq × dim indices":
        """ -> the indices of the nearest points in the grid """
        queries = np.asarray( queries ).squeeze()  # or list x y z ?
        if self.ndim == 1:
            assert queries.ndim <= 1, queries.shape
            return np.searchsorted( self.midpoints[0], queries )  # scalar, 0d ?
        queries = np.atleast_2d( queries )
        assert queries.shape[1] == self.ndim, [
                queries.shape, self.ndim]
        return [np.searchsorted( mid, q )  # parallel: k axes, k processors
                for mid, q in zip( self.midpoints, queries.T )]

    def snaptogrid( self, queries: "nq × dim points" ):
        """ -> the nearest points in the grid, 2d [[y_j x_i] ...] """
        ix = self.query( queries )
        if self.ndim == 1:
            return self.axes[0][ix]
        else:
            axix = [ax[j] for ax, j in zip( self.axes, ix )]
            return np.array( axix )


def _midpoints( points: "array-like 1d, *must be sorted*" ) -> "1d":
    points = np.asarray( points ).squeeze()
    assert points.ndim == 1, points.shape
    diffs = np.diff( points )
    assert np.nanmin( diffs ) > 0, "the input array must be sorted, not %s " % (
            points.round( 2 ))
    return (points[:-1] + points[1:]) / 2  # floats


#...............................................................................
Howitworks = \
"""
How Near_rgrid works in 1d:
Consider the midpoints halfway between fenceposts | | |
The interval [left midpoint .. | .. right midpoint] is what's nearest each post --


|   |       |                     |   points
| . |   .   |          .          |   midpoints
^^^^^^               .            nearest points[1]
^^^^^^^^^^^^^^^             nearest points[2]  etc.


2d:
I, J = Near_rgrid( y, x ).query( q )
I = nearest in `x`
J = nearest in `y` independently / in parallel.
The points nearest [yi xj] in a regular grid (its Voronoi cell)
form a rectangle [left mid x .. right mid x] × [left mid y .. right mid y]
(in any norm ?)
See the plot Voronoi-random-regular-grid.


Notes
-----
If a query point is exactly halfway between two data points,
e.g. on a grid of ints, the lines (x + 1/2) U (y + 1/2),
which "nearest" you get is implementation-dependent, unpredictable.


"""


Murky = \
""" NaNs in points, in queries ?
"""


__version__ = "2021-10-25 oct  denis-bz-py"
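
The 1d midpoint idea described in Howitworks can be tried in isolation with a few lines of numpy. The sketch below does not use the Near_rgrid class; the grid `x` and the queries `q` are sample values assumed for illustration.

```python
import numpy as np

# Sorted grid points (assumed sample data)
x = np.array([0.0, 1.0, 4.0, 10.0])

# Midpoints between neighboring grid points: [0.5, 2.5, 7.0]
midpoints = (x[:-1] + x[1:]) / 2

# Query points, including some outside the grid range
q = np.array([-3.0, 0.4, 2.0, 3.0, 8.0, 99.0])

# The position of q among the midpoints is the index of its nearest grid point
nearix = np.searchsorted(midpoints, q)
nearest = x[nearix]

print(nearix)   # [0 0 1 2 3 3]
print(nearest)  # [ 0.  0.  1.  4. 10. 10.]
```

Queries below x[0] or above x[-1] snap to the first or last grid point, which matches the behavior described in the class docstring.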