如何在numpy数组中找到最近的值?例子:
np.find_nearest(array, value)
如何在numpy数组中找到最近的值?例子:
np.find_nearest(array, value)
当前回答
下面是一个使用2D数组的版本,如果用户拥有scipy的cdist函数,则使用它,如果用户没有,则使用更简单的距离计算。
默认情况下,输出是最接近输入值的索引,但您可以使用output关键字将其更改为'index', 'value'或'both'之一,其中'value'输出数组[index], 'both'输出索引,数组[index]。
对于非常大的数组,您可能需要使用kind='euclidean',因为默认的scipy cdist函数可能会耗尽内存。
这可能不是绝对最快的解决方案,但已经很接近了。
def find_nearest_2d(array, value, kind='cdist', output='index'):
# 'array' must be a 2D array
# 'value' must be a 1D array with 2 elements
# 'kind' defines what method to use to calculate the distances. Can choose one
# of 'cdist' (default) or 'euclidean'. Choose 'euclidean' for very large
# arrays. Otherwise, cdist is much faster.
# 'output' defines what the output should be. Can be 'index' (default) to return
# the index of the array that is closest to the value, 'value' to return the
# value that is closest, or 'both' to return index,value
import numpy as np
if kind == 'cdist':
try: from scipy.spatial.distance import cdist
except ImportError:
print("Warning (find_nearest_2d): Could not import cdist. Reverting to simpler distance calculation")
kind = 'euclidean'
index = np.where(array == value)[0] # Make sure the value isn't in the array
if index.size == 0:
if kind == 'cdist': index = np.argmin(cdist([value],array)[0])
elif kind == 'euclidean': index = np.argmin(np.sum((np.array(array)-np.array(value))**2.,axis=1))
else: raise ValueError("Keyword 'kind' must be one of 'cdist' or 'euclidean'")
if output == 'index': return index
elif output == 'value': return array[index]
elif output == 'both': return index,array[index]
else: raise ValueError("Keyword 'output' must be one of 'index', 'value', or 'both'")
其他回答
所有的答案都有助于收集信息来编写高效的代码。但是,我已经编写了一个小的Python脚本来针对各种情况进行优化。如果提供的数组已排序,则将是最佳情况。如果搜索一个指定值的最近点的索引,那么对半模块是最省时的。当一个索引对应一个数组时,numpy searchsorted是最有效的。
import numpy as np
import bisect
xarr = np.random.rand(int(1e7))
srt_ind = xarr.argsort()
xar = xarr.copy()[srt_ind]
xlist = xar.tolist()
bisect.bisect_left(xlist, 0.3)
In[63]: %时间平分。bisect_left (xlist, 0.3) CPU次数:user 0ns, sys: 0ns, total: 0ns 壁时间:22.2µs
np.searchsorted(xar, 0.3, side="left")
In [64]: %time np。Searchsorted (xar, 0.3, side="left") CPU次数:user 0ns, sys: 0ns, total: 0ns 壁时间:98.9µs
randpts = np.random.rand(1000)
np.searchsorted(xar, randpts, side="left")
%的时间np。Searchsorted (xar, randpts, side="left") CPU次数:用户4ms, sys: 0ns, total: 4ms 壁时间:1.2 ms
如果我们遵循乘法规则,那么numpy应该花费~100 ms,这意味着快了~83倍。
这是在向量数组中找到最近向量的扩展。
import numpy as np
def find_nearest_vector(array, value):
idx = np.array([np.linalg.norm(x+y) for (x,y) in array-value]).argmin()
return array[idx]
A = np.random.random((10,2))*100
""" A = array([[ 34.19762933, 43.14534123],
[ 48.79558706, 47.79243283],
[ 38.42774411, 84.87155478],
[ 63.64371943, 50.7722317 ],
[ 73.56362857, 27.87895698],
[ 96.67790593, 77.76150486],
[ 68.86202147, 21.38735169],
[ 5.21796467, 59.17051276],
[ 82.92389467, 99.90387851],
[ 6.76626539, 30.50661753]])"""
pt = [6, 30]
print find_nearest_vector(A,pt)
# array([ 6.76626539, 30.50661753])
下面是一个处理非标量“values”数组的版本:
import numpy as np
def find_nearest(array, values):
indices = np.abs(np.subtract.outer(array, values)).argmin(0)
return array[indices]
如果输入是标量,则返回数字类型(例如int, float)的版本:
def find_nearest(array, values):
values = np.atleast_1d(values)
indices = np.abs(np.subtract.outer(array, values)).argmin(0)
out = array[indices]
return out if len(out) > 1 else out[0]
答案总结:如果有一个排序的数组,那么平分代码(如下所示)执行最快。大型数组快100-1000倍,小型数组快2-100倍。它也不需要numpy。 如果你有一个未排序的数组,那么如果数组很大,应该首先考虑使用O(n logn)排序,然后平分,如果数组很小,那么方法2似乎是最快的。
First you should clarify what you mean by nearest value. Often one wants the interval in an abscissa, e.g. array=[0,0.7,2.1], value=1.95, answer would be idx=1. This is the case that I suspect you need (otherwise the following can be modified very easily with a followup conditional statement once you find the interval). I will note that the optimal way to perform this is with bisection (which I will provide first - note it does not require numpy at all and is faster than using numpy functions because they perform redundant operations). Then I will provide a timing comparison against the others presented here by other users.
二等分的一半:
def bisection(array,value):
'''Given an ``array`` , and given a ``value`` , returns an index j such that ``value`` is between array[j]
and array[j+1]. ``array`` must be monotonic increasing. j=-1 or j=len(array) is returned
to indicate that ``value`` is out of range below and above respectively.'''
n = len(array)
if (value < array[0]):
return -1
elif (value > array[n-1]):
return n
jl = 0# Initialize lower
ju = n-1# and upper limits.
while (ju-jl > 1):# If we are not yet done,
jm=(ju+jl) >> 1# compute a midpoint with a bitshift
if (value >= array[jm]):
jl=jm# and replace either the lower limit
else:
ju=jm# or the upper limit, as appropriate.
# Repeat until the test condition is satisfied.
if (value == array[0]):# edge cases at bottom
return 0
elif (value == array[n-1]):# and top
return n-1
else:
return jl
现在我将从其他答案定义代码,它们都返回一个索引:
import math
import numpy as np
def find_nearest1(array,value):
idx,val = min(enumerate(array), key=lambda x: abs(x[1]-value))
return idx
def find_nearest2(array, values):
indices = np.abs(np.subtract.outer(array, values)).argmin(0)
return indices
def find_nearest3(array, values):
values = np.atleast_1d(values)
indices = np.abs(np.int64(np.subtract.outer(array, values))).argmin(0)
out = array[indices]
return indices
def find_nearest4(array,value):
idx = (np.abs(array-value)).argmin()
return idx
def find_nearest5(array, value):
idx_sorted = np.argsort(array)
sorted_array = np.array(array[idx_sorted])
idx = np.searchsorted(sorted_array, value, side="left")
if idx >= len(array):
idx_nearest = idx_sorted[len(array)-1]
elif idx == 0:
idx_nearest = idx_sorted[0]
else:
if abs(value - sorted_array[idx-1]) < abs(value - sorted_array[idx]):
idx_nearest = idx_sorted[idx-1]
else:
idx_nearest = idx_sorted[idx]
return idx_nearest
def find_nearest6(array,value):
xi = np.argmin(np.abs(np.ceil(array[None].T - value)),axis=0)
return xi
现在我来计时代码: 注意方法1、2、4、5没有正确给出间隔。方法1、2、4四舍五入到数组中最近的点(例如>=1.5 -> 2),方法5总是四舍五入(例如1.45 -> 2)。只有方法3和6,当然还有对半给出了正确的间隔。
array = np.arange(100000)
val = array[50000]+0.55
print( bisection(array,val))
%timeit bisection(array,val)
print( find_nearest1(array,val))
%timeit find_nearest1(array,val)
print( find_nearest2(array,val))
%timeit find_nearest2(array,val)
print( find_nearest3(array,val))
%timeit find_nearest3(array,val)
print( find_nearest4(array,val))
%timeit find_nearest4(array,val)
print( find_nearest5(array,val))
%timeit find_nearest5(array,val)
print( find_nearest6(array,val))
%timeit find_nearest6(array,val)
(50000, 50000)
100000 loops, best of 3: 4.4 µs per loop
50001
1 loop, best of 3: 180 ms per loop
50001
1000 loops, best of 3: 267 µs per loop
[50000]
1000 loops, best of 3: 390 µs per loop
50001
1000 loops, best of 3: 259 µs per loop
50001
1000 loops, best of 3: 1.21 ms per loop
[50000]
1000 loops, best of 3: 746 µs per loop
对于一个大数组的对半分割是4us,而次之的是180us,最长的是1.21ms(快100 - 1000倍)。对于较小的数组,它要快2-100倍。
也许对ndarray有帮助:
def find_nearest(X, value):
return X[np.unravel_index(np.argmin(np.abs(X - value)), X.shape)]