Fastest way to grow a numpy numeric array

Requirements:

  • I need to grow an arbitrarily sized array from data.
  • I can guess the size (roughly 100-200), but there is no guarantee the array will fit every time.
  • Once it has grown to its final size, I need to perform numeric computations on it, so I would prefer to end up with a 2-D numpy array.
  • Speed is critical. As an example, for one of 300 files, the update() method is called 45 million times (taking about 150 s) and the finalize() method is called 500k times (taking about 106 s), for a total of roughly 250 s.

Here is my code:

def __init__(self):
    self.data = []

def update(self, row):
    self.data.append(row)

def finalize(self):
    dx = np.array(self.data)

Other approaches I have tried include the code below, but it turned out to be slower.

class A:
    def __init__(self):
        self.data = np.array([])

    def update(self, row):
        self.data = np.append(self.data, row)

    def finalize(self):
        dx = np.reshape(self.data, newshape=(self.data.shape[0]/5, 5))

Here is a sketch of how this gets used:

for i in range(500000):
    ax = A()
    for j in range(200):
        ax.update([1,2,3,4,5])
    ax.finalize()
    # some processing on ax

I tried a few different things, with timing.

import numpy as np
  1. The method you mention as slow: (32.094 seconds)

    class A:
        def __init__(self):
            self.data = np.array([])

        def update(self, row):
            self.data = np.append(self.data, row)

        def finalize(self):
            return np.reshape(self.data, newshape=(self.data.shape[0]/5, 5))
    
  2. Regular ol Python list: (0.308 seconds)

    class B:
        def __init__(self):
            self.data = []

        def update(self, row):
            for r in row:
                self.data.append(r)

        def finalize(self):
            return np.reshape(self.data, newshape=(len(self.data)/5, 5))
    
  3. Trying to implement an arraylist in numpy: (0.362 seconds)

    class C:
        def __init__(self):
            self.data = np.zeros((100,))
            self.capacity = 100
            self.size = 0

        def update(self, row):
            for r in row:
                self.add(r)

        def add(self, x):
            if self.size == self.capacity:
                self.capacity *= 4
                newdata = np.zeros((self.capacity,))
                newdata[:self.size] = self.data
                self.data = newdata
            self.data[self.size] = x
            self.size += 1

        def finalize(self):
            data = self.data[:self.size]
            return np.reshape(data, newshape=(len(data)/5, 5))
    

And this is how I timed it:

x = C()
for i in xrange(100000):
    x.update([i])

So it looks like regular old Python lists are pretty good ;)

np.append() copies all the data in the array every time it is called, whereas a list grows its capacity by a factor (about 1.125), so appends are amortized. A list is fast, but its memory usage is larger than an array's. You can use the array module of the Python standard library if you care about memory.
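If memory is the concern, a minimal sketch of the same grow-then-finalize pattern on top of the standard library's array module could look like this (the class name D and the fixed row width of 5 are my own choices for illustration, not part of the answers above):

    from array import array
    import numpy as np

    class D:
        def __init__(self):
            # contiguous buffer of C doubles; grows with amortized over-allocation
            self.data = array('d')

        def update(self, row):
            # extend copies the raw values in; no per-element Python object is kept
            self.data.extend(row)

        def finalize(self):
            # view the buffer only once growth is done, then reshape into rows of 5
            return np.frombuffer(self.data, dtype=np.float64).reshape(-1, 5)

Whether this beats the plain list in speed depends on your Python/numpy versions, so treat it as a memory-oriented variant rather than a guaranteed speedup.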

Here is a discussion about this topic:

How to create a dynamic array

Using the class declarations in Owen's post, here is a revised timing that also includes the effect of finalize().

In short, I find class C to provide an implementation that is over 60x faster than the method in the original post. (apologies for the wall of text)

The file I used:

#!/usr/bin/python
import cProfile
import numpy as np


# ... class declarations here ...


def test_class(f):
    x = f()
    for i in xrange(100000):
        x.update([i])
    for i in xrange(1000):
        x.finalize()


for x in 'ABC':
    cProfile.run('test_class(%s)' % x)

Now, the resulting timings:

A:

     903005 function calls in 16.049 seconds


Ordered by: standard name


ncalls  tottime  percall  cumtime  percall filename:lineno(function)
1    0.000    0.000   16.049   16.049 <string>:1(<module>)
100000    0.139    0.000    1.888    0.000 fromnumeric.py:1043(ravel)
1000    0.001    0.000    0.003    0.000 fromnumeric.py:107(reshape)
100000    0.322    0.000   14.424    0.000 function_base.py:3466(append)
100000    0.102    0.000    1.623    0.000 numeric.py:216(asarray)
100000    0.121    0.000    0.298    0.000 numeric.py:286(asanyarray)
1000    0.002    0.000    0.004    0.000 test.py:12(finalize)
1    0.146    0.146   16.049   16.049 test.py:50(test_class)
1    0.000    0.000    0.000    0.000 test.py:6(__init__)
100000    1.475    0.000   15.899    0.000 test.py:9(update)
1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
100000    0.126    0.000    0.126    0.000 {method 'ravel' of 'numpy.ndarray' objects}
1000    0.002    0.000    0.002    0.000 {method 'reshape' of 'numpy.ndarray' objects}
200001    1.698    0.000    1.698    0.000 {numpy.core.multiarray.array}
100000   11.915    0.000   11.915    0.000 {numpy.core.multiarray.concatenate}

B:

     208004 function calls in 16.885 seconds


Ordered by: standard name


ncalls  tottime  percall  cumtime  percall filename:lineno(function)
1    0.001    0.001   16.885   16.885 <string>:1(<module>)
1000    0.025    0.000   16.508    0.017 fromnumeric.py:107(reshape)
1000    0.013    0.000   16.483    0.016 fromnumeric.py:32(_wrapit)
1000    0.007    0.000   16.445    0.016 numeric.py:216(asarray)
1    0.000    0.000    0.000    0.000 test.py:16(__init__)
100000    0.068    0.000    0.080    0.000 test.py:19(update)
1000    0.012    0.000   16.520    0.017 test.py:23(finalize)
1    0.284    0.284   16.883   16.883 test.py:50(test_class)
1000    0.005    0.000    0.005    0.000 {getattr}
1000    0.001    0.000    0.001    0.000 {len}
100000    0.012    0.000    0.012    0.000 {method 'append' of 'list' objects}
1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
1000    0.020    0.000    0.020    0.000 {method 'reshape' of 'numpy.ndarray' objects}
1000   16.438    0.016   16.438    0.016 {numpy.core.multiarray.array}

C:

     204010 function calls in 0.244 seconds


Ordered by: standard name


ncalls  tottime  percall  cumtime  percall filename:lineno(function)
1    0.000    0.000    0.244    0.244 <string>:1(<module>)
1000    0.001    0.000    0.003    0.000 fromnumeric.py:107(reshape)
1    0.000    0.000    0.000    0.000 test.py:27(__init__)
100000    0.082    0.000    0.170    0.000 test.py:32(update)
100000    0.087    0.000    0.088    0.000 test.py:36(add)
1000    0.002    0.000    0.005    0.000 test.py:46(finalize)
1    0.068    0.068    0.243    0.243 test.py:50(test_class)
1000    0.000    0.000    0.000    0.000 {len}
1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
1000    0.002    0.000    0.002    0.000 {method 'reshape' of 'numpy.ndarray' objects}
6    0.001    0.000    0.001    0.000 {numpy.core.multiarray.zeros}

Class A is destroyed by the updates, class B is destroyed by the finalizes. Class C is robust in the face of both of them.

There is a big performance difference in the function that you use for finalization. Consider the following code:

from time import time
import numpy as np

N = 100000
nruns = 5

a = []
for i in range(N):
    a.append(np.zeros(1000))

print "start"

b = []
for i in range(nruns):
    s = time()
    c = np.vstack(a)
    b.append(time() - s)
print "Timing version vstack ", np.mean(b)

b = []
for i in range(nruns):
    s = time()
    c1 = np.reshape(a, (N, 1000))
    b.append(time() - s)
print "Timing version reshape ", np.mean(b)

b = []
for i in range(nruns):
    s = time()
    c2 = np.concatenate(a, axis=0).reshape(-1, 1000)
    b.append(time() - s)
print "Timing version concatenate ", np.mean(b)

print c.shape, c2.shape
assert (c == c2).all()
assert (c == c1).all()

Using concatenate is roughly twice as fast as the vstack version and more than 10 times faster than the reshape version:

Timing version vstack  1.5774928093
Timing version reshape  9.67419199944
Timing version concatenate  0.669512557983

If you want to improve performance with list operations, have a look at the blist library. It is an optimized implementation of the Python list and other structures.

I haven't benchmarked it yet, but the results on their page seem promising.
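For reference, a sketch of how it would slot into class B above, assuming blist's drop-in replacement for the built-in list (untested here, as noted):

    from blist import blist   # third-party package
    import numpy as np

    class E:
        def __init__(self):
            # blist is intended to be API-compatible with the built-in list
            self.data = blist()

        def update(self, row):
            self.data.extend(row)

        def finalize(self):
            return np.array(self.data).reshape(-1, 5)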

Multi-dimensional numpy arrays

Adding to Owen's and Prashant Kumar's posts: a version that uses multi-dimensional numpy arrays (i.e. a shape) speeds up the numpy solutions, especially if you need to access (finalize()) the data often.

Version                 Prashant Kumar    row_length=1    row_length=5
Class A - np.append     2.873 s           2.776 s         0.682 s
Class B - python list   6.693 s           80.868 s        22.012 s
Class C - arraylist     0.095 s           0.180 s         0.043 s

The column "Prashant Kumar" is his example executed on my machine, included for comparison. With row_length=5 it matches the example from the original question. The dramatic increase for the python list comes from {built-in method numpy.array}: numpy needs much more time to convert a multi-dimensional list of lists to an array than to convert a 1-D list and reshape it, even when both contain the same number of entries, e.g. np.array([[1,2,3]*5]) vs. np.array([1]*15).reshape((-1,3)).
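To see that conversion cost in isolation, here is a small self-contained timing snippet; the row count and the number of repetitions are arbitrary choices of mine:

    import timeit
    import numpy as np

    n_rows = 20000
    nested = [[1, 2, 3, 4, 5] for _ in range(n_rows)]   # list of 5-element rows
    flat = [v for row in nested for v in row]           # same values as one flat list

    t_nested = timeit.timeit(lambda: np.array(nested), number=50)
    t_flat = timeit.timeit(lambda: np.array(flat).reshape(-1, 5), number=50)
    print("list of lists      :", t_nested)
    print("flat list + reshape:", t_flat)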

And here is the benchmark code itself:

import cProfile
import numpy as np


class A:
    def __init__(self, shape=(0,), dtype=float):
        """First item of shape is ignored, the rest defines the shape"""
        self.data = np.array([], dtype=dtype).reshape((0, *shape[1:]))

    def update(self, row):
        # append along axis 0 so the 2-D shape is preserved
        self.data = np.append(self.data, [row], axis=0)

    def finalize(self):
        return self.data


class B:
    def __init__(self, shape=(0,), dtype=float):
        """First item of shape is ignored, the rest defines the shape"""
        self.shape = shape
        self.dtype = dtype
        self.data = []

    def update(self, row):
        self.data.append(row)

    def finalize(self):
        return np.array(self.data, dtype=self.dtype).reshape((-1, *self.shape[1:]))


class C:
    def __init__(self, shape=(0,), dtype=float):
        """First item of shape is ignored, the rest defines the shape"""
        self.shape = shape
        self.data = np.zeros((100, *shape[1:]), dtype=dtype)
        self.capacity = 100
        self.size = 0

    def update(self, x):
        if self.size == self.capacity:
            self.capacity *= 4
            newdata = np.zeros((self.capacity, *self.data.shape[1:]))
            newdata[:self.size] = self.data
            self.data = newdata
        self.data[self.size] = x
        self.size += 1

    def finalize(self):
        return self.data[:self.size]


def test_class(f):
    row_length = 5
    x = f(shape=(0, row_length))
    for i in range(int(100000 / row_length)):
        x.update([i] * row_length)
    for i in range(1000):
        x.finalize()


for x in 'ABC':
    cProfile.run('test_class(%s)' % x)


And here is another option to add to the post above from Luca Fiaschi:

b = []
for i in range(nruns):
    s = time.time()
    c1 = np.array(a, dtype=int).reshape((N, 1000))
    b.append(time.time() - s)

print("Timing version array.reshape ", np.mean(b))

gives for me:

Timing version vstack         0.6863266944885253
Timing version reshape        0.505419111251831
Timing version array.reshape  0.5052066326141358
Timing version concatenate    0.5339600563049316