Translate every element in a numpy array according to a key

I am trying to translate every element of a numpy.array according to a given key:

For example:

a = np.array([[1,2,3],
              [3,2,4]])


my_dict = {1:23, 2:34, 3:36, 4:45}

I would like to get:

array([[ 23.,  34.,  36.],
       [ 36.,  34.,  45.]])

I can see how to do it with a loop:

def loop_translate(a, my_dict):
    new_a = np.empty(a.shape)
    for i, row in enumerate(a):
        new_a[i, :] = map(my_dict.get, row)
    return new_a

Is there a more efficient and/or pure numpy way to do this?

Edit:

I timed it, and the np.vectorize method proposed by DSM is considerably faster for larger arrays:

In [13]: def loop_translate(a, my_dict):
....:     new_a = np.empty(a.shape)
....:     for i,row in enumerate(a):
....:         new_a[i,:] = map(my_dict.get, row)
....:     return new_a
....:


In [14]: def vec_translate(a, my_dict):
....:     return np.vectorize(my_dict.__getitem__)(a)
....:


In [15]: a = np.random.randint(1,5, (4,5))


In [16]: a
Out[16]:
array([[2, 4, 3, 1, 1],
       [2, 4, 3, 2, 4],
       [4, 2, 1, 3, 1],
       [2, 4, 3, 4, 1]])


In [17]: %timeit loop_translate(a, my_dict)
10000 loops, best of 3: 77.9 us per loop


In [18]: %timeit vec_translate(a, my_dict)
10000 loops, best of 3: 70.5 us per loop


In [19]: a = np.random.randint(1, 5, (500,500))


In [20]: %timeit loop_translate(a, my_dict)
1 loops, best of 3: 298 ms per loop


In [21]: %timeit vec_translate(a, my_dict)
10 loops, best of 3: 37.6 ms per loop



I don't know about efficient, but you could use np.vectorize on the .get method of dictionaries:

>>> a = np.array([[1,2,3],
...               [3,2,4]])
>>> my_dict = {1:23, 2:34, 3:36, 4:45}
>>> np.vectorize(my_dict.get)(a)
array([[23, 34, 36],
       [36, 34, 45]])

I think it'd be better to iterate over the dictionary, and set values in all the rows and columns "at once":

>>> a = np.array([[1,2,3],[3,2,1]])
>>> a
array([[1, 2, 3],
       [3, 2, 1]])
>>> d = {1 : 11, 2 : 22, 3 : 33}
>>> for k,v in d.iteritems():
...     a[a == k] = v
...
>>> a
array([[11, 22, 33],
       [33, 22, 11]])

Edit:

While it may not be as sexy as DSM's (really good) answer using numpy.vectorize, my tests of all the proposed methods show that this approach (using @jamylak's suggestion) is actually a bit faster:

from __future__ import division
import numpy as np
a = np.random.randint(1, 5, (500,500))
d = {1 : 11, 2 : 22, 3 : 33, 4 : 44}


def unique_translate(a, d):
    # translate only the unique values, then fan the result back out
    # to the full shape via the inverse index from np.unique
    u, inv = np.unique(a, return_inverse=True)
    return np.array([d[x] for x in u])[inv].reshape(a.shape)


def vec_translate(a, d):
    return np.vectorize(d.__getitem__)(a)


def loop_translate(a, d):
    n = np.ndarray(a.shape)
    for k in d:
        n[a == k] = d[k]
    return n


def orig_translate(a, d):
    new_a = np.empty(a.shape)
    for i, row in enumerate(a):
        new_a[i, :] = map(d.get, row)
    return new_a




if __name__ == '__main__':
    import timeit
    n_exec = 100
    print 'orig'
    print timeit.timeit("orig_translate(a,d)",
                        setup="from __main__ import np,a,d,orig_translate",
                        number=n_exec) / n_exec
    print 'unique'
    print timeit.timeit("unique_translate(a,d)",
                        setup="from __main__ import np,a,d,unique_translate",
                        number=n_exec) / n_exec
    print 'vec'
    print timeit.timeit("vec_translate(a,d)",
                        setup="from __main__ import np,a,d,vec_translate",
                        number=n_exec) / n_exec
    print 'loop'
    print timeit.timeit("loop_translate(a,d)",
                        setup="from __main__ import np,a,d,loop_translate",
                        number=n_exec) / n_exec

Outputs:

orig
0.222067718506
unique
0.0472617006302
vec
0.0357889199257
loop
0.0285375618935

Here's another approach, using numpy.unique:

>>> a = np.array([[1,2,3],[3,2,1]])
>>> a
array([[1, 2, 3],
       [3, 2, 1]])
>>> d = {1 : 11, 2 : 22, 3 : 33}
>>> u,inv = np.unique(a,return_inverse = True)
>>> np.array([d[x] for x in u])[inv].reshape(a.shape)
array([[11, 22, 33],
       [33, 22, 11]])

This approach is much faster than the np.vectorize approach when the number of unique elements in the array is small. Explanation: the slow, pure-Python loop is only used to convert the unique elements; the mapping itself is then done by numpy's heavily optimized indexing (implemented in C). Hence, if the number of unique elements is comparable to the overall size of the array, there will be no speedup. On the other hand, if there are only a few unique elements, you can observe a speedup of up to 100x.
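To make that dependence concrete, here is a minimal timing sketch of my own (the array shape, value range and the times-ten mapping are arbitrary choices, not taken from the answers above), comparing the two functions with few versus many distinct values:

import timeit
import numpy as np


def vec_translate(a, d):
    return np.vectorize(d.__getitem__)(a)


def unique_translate(a, d):
    u, inv = np.unique(a, return_inverse=True)
    return np.array([d[x] for x in u])[inv].reshape(a.shape)


for n_unique in (4, 10000):  # few vs. many distinct values in the array
    a = np.random.randint(0, n_unique, (500, 500))
    d = dict((k, k * 10) for k in range(n_unique))
    for f in (vec_translate, unique_translate):
        t = timeit.timeit(lambda: f(a, d), number=10) / 10
        print('%d unique values, %s: %.4f s' % (n_unique, f.__name__, t))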

If you don't really have to use dictionary as substitution table, simple solution would be (for your example):

import numpy as np

a = np.array([[1, 2, 3],
              [3, 2, 4]])                  # your array
my_dict = np.array([0, 23, 34, 36, 45])    # your dictionary as array


def Sub(myarr, table):
    # plain integer indexing: every element of myarr is used as an
    # index into the lookup table
    return table[myarr]


values = Sub(a, my_dict)

Of course this will only work if the indexes of my_dict cover all possible values of a; in other words, only for an a containing non-negative integers.

The numpy_indexed package (disclaimer: I am its author) provides an elegant and efficient vectorized solution to this type of problem:

import numpy_indexed as npi
remapped_a = npi.remap(a, list(my_dict.keys()), list(my_dict.values()))

The method implemented is similar to the approach mentioned by John Vinyard, but even more general. For instance, the items of the array do not need to be ints, but can be any type, even nd-subarrays themselves.

If you set the optional 'missing' kwarg to 'raise' (default is 'ignore'), performance will be slightly better, and you will get a KeyError if not all elements of 'a' are present in the keys.
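For example, a short sketch assuming the a and my_dict from the question (the missing kwarg is the one described above):

import numpy as np
import numpy_indexed as npi

a = np.array([[1, 2, 3],
              [3, 2, 4]])
my_dict = {1: 23, 2: 34, 3: 36, 4: 45}

# missing='raise' turns unmapped elements into a KeyError instead of
# silently ignoring them (the stated default is 'ignore')
remapped_a = npi.remap(a, list(my_dict.keys()), list(my_dict.values()),
                       missing='raise')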

Assuming your dict keys are positive integers, without huge gaps (similar to a range from 0 to N), you would be better off converting your translation dict to an array such that my_array[i] = my_dict[i], and using numpy indexing to do the translation.

Code using this approach:

def direct_translate(a, d):
    # build a lookup array indexed by key, then translate via fancy indexing
    src, values = list(d.keys()), list(d.values())
    d_array = np.arange(a.max() + 1)
    d_array[src] = values
    return d_array[a]

Testing with random arrays:

N = 10000
shape = (5000, 5000)
a = np.random.randint(N, size=shape)
my_dict = dict(zip(np.arange(N), np.random.randint(N, size=N)))

For these sizes I get around 140 ms for this approach. The np.vectorize(dict.get) approach takes around 5.8 s and unique_translate around 8 s.

Possible generalizations:

  • If you have negative values to translate, you could shift the values in a and in the keys of the dictionary by a constant to map them back to positive integers:

def direct_translate(a, d):  # handles negative source keys
    # shift both the array and the keys by the minimum so that all
    # lookup indices become non-negative
    min_a = a.min()
    src, values = np.array(list(d.keys())) - min_a, list(d.values())
    d_array = np.arange(a.max() - min_a + 1)
    d_array[src] = values
    return d_array[a - min_a]
  • If the source keys have huge gaps, the initial array creation would waste memory. I would resort to cython to speed up that function; a pure-numpy alternative is sketched below.
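As a pure-numpy alternative for the sparse-key case, one could sort the keys and look elements up with np.searchsorted, so that memory stays proportional to the number of keys rather than to their range. This is my own sketch, not part of the answer above, and it assumes every element of a actually occurs among the dictionary keys:

import numpy as np


def searchsorted_translate(a, d):
    # sorted key array and matching value array; memory is O(len(d))
    keys = np.array(sorted(d))
    values = np.array([d[k] for k in keys])
    # position of every element of `a` within the sorted keys
    idx = np.searchsorted(keys, a)
    return values[idx]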