Why can arrayfun be significantly slower than an explicit loop in MATLAB?

Consider the following simple speed test for arrayfun:

T = 4000;
N = 500;
x = randn(T, N);
Func1 = @(a) (3*a^2 + 2*a - 1);


tic
Soln1 = ones(T, N);
for t = 1:T
    for n = 1:N
        Soln1(t, n) = Func1(x(t, n));
    end
end
toc


tic
Soln2 = arrayfun(Func1, x);
toc

On my machine (Matlab 2011b on Linux Mint 12), the output of this test is:

Elapsed time is 1.020689 seconds.
Elapsed time is 9.248388 seconds.

What?!? Soln2, despite being the (admittedly) cleaner-looking solution, is an order of magnitude slower. What is going on here?

Furthermore, I performed a similar test for cellfun and found it to be about 3 times slower than an explicit loop. Again, this result was the opposite of what I expected.

My question is: why are arrayfun and cellfun so much slower? And, given this, are there any good reasons to use them (other than making the code look nice)?

Note: I'm talking about the standard version of arrayfun here, NOT the GPU version from the Parallel Computing Toolbox.

EDIT: Just to be clear, I'm aware that Func1 above can be vectorized, as Oli pointed out. I chose it because it provides a simple speed test for the purposes of the actual question.
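(For reference, the vectorized form being referred to is presumably a single elementwise expression along these lines:)

SolnOli = 3*x.^2 + 2*x - 1;   % vectorized over the whole matrix at once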

EDIT: Following grungetta's suggestion, I re-ran the test with feature accel off. The results are:

Elapsed time is 28.183422 seconds.
Elapsed time is 23.525251 seconds.

In other words, it would appear that a big part of the difference is that the JIT accelerator does a much better job of speeding up the explicit for loop than it does arrayfun. This seems odd to me, since arrayfun actually provides more information: its use indicates that the order of the calls to Func1 does not matter. Also, I noted that whether the JIT accelerator is switched on or off, my system only uses one CPU...


You can get an idea by running other versions of your code. Consider explicitly writing out the computation, instead of using a function in your loop:

tic
Soln3 = ones(T, N);
for t = 1:T
    for n = 1:N
        Soln3(t, n) = 3*x(t, n)^2 + 2*x(t, n) - 1;
    end
end
toc

Time to compute on my computer:

Soln1  1.158446 seconds.
Soln2  10.392475 seconds.
Soln3  0.239023 seconds.
Oli    0.010672 seconds.

Now, while the fully 'vectorized' solution is clearly the fastest, you can see that defining a function to be called for every entry of x adds a huge overhead. Just explicitly writing out the computation gave us a factor of 5 speedup. I guess this shows that MATLAB's JIT compiler does not inline anonymous functions. According to the answer by gnovice there, it is actually better to write a normal function rather than an anonymous one. Try it.
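As a rough illustration of that suggestion, the body could go into an ordinary function file and be called from the same double loop; func1_named is a hypothetical name, not something from the original post:

% func1_named.m -- an ordinary (non-anonymous) function in its own file
function y = func1_named(a)
y = 3*a^2 + 2*a - 1;
end

tic
Soln1b = ones(T, N);
for t = 1:T
    for n = 1:N
        Soln1b(t, n) = func1_named(x(t, n));
    end
end
toc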

Next step - remove (vectorize) the inner loop:

tic
Soln4 = ones(T, N);
for t = 1:T
    Soln4(t, :) = 3*x(t, :).^2 + 2*x(t, :) - 1;
end
toc


Soln4  0.053926 seconds.

Another factor of 5 speedup: there is something to those statements saying you should avoid loops in MATLAB... Or is there really? Have a look at this, then:

tic
Soln5 = ones(T, N);
for n = 1:N
    Soln5(:, n) = 3*x(:, n).^2 + 2*x(:, n) - 1;
end
toc


Soln5   0.013875 seconds.

Much closer to the 'fully' vectorized version. MATLAB stores matrices column-wise (column-major order), so you should always (when possible) structure your computations to be vectorized column-wise.

We can go back to Soln3 now. The loop order there is 'row-wise'. Let's change it:

tic
Soln6 = ones(T, N);
for n = 1:N
    for t = 1:T
        Soln6(t, n) = 3*x(t, n)^2 + 2*x(t, n) - 1;
    end
end
toc


Soln6  0.201661 seconds.

Better, but still quite bad. Single loop: good. Double loop: bad. I guess MATLAB has done some decent work on improving loop performance, but the loop overhead is still there. If you had some heavier work inside the loop, you would not notice it. But since this computation is memory-bandwidth bound, you do see the loop overhead, and you see the overhead of calling Func1 even more clearly.

So what's up with arrayfun? There is no function inlining there either, so a lot of overhead. But why is it so much worse than a doubly nested loop? Actually, the topic of using cellfun/arrayfun has been extensively discussed many times (e.g. here, here, here and here). These functions are simply slow; you cannot use them for such fine-grained computations. You can use them for code brevity and for fancy conversions between cells and arrays, but the function needs to be heavier than what you wrote:

tic
Soln7 = arrayfun(@(a)(3*x(:,a).^2 + 2*x(:,a) - 1), 1:N, 'UniformOutput', false);
toc


Soln7  0.016786 seconds.

Note that Soln7 is a cell array now... sometimes that is useful. The performance is quite good now, and if you need a cell array as output, you do not have to convert your matrix after having used the fully vectorized solution.
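If you do end up needing the matrix form instead, the conversion between cell and matrix is a one-liner (a minimal sketch; cell2mat concatenates the N column cells back into a T-by-N matrix, and num2cell goes the other way):

Soln7mat = cell2mat(Soln7);               % 1-by-N cell of T-by-1 columns -> T-by-N matrix
SolnCell = num2cell(3*x.^2 + 2*x - 1, 1); % split a matrix into one cell per column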

So why is arrayfun slower than a simple loop structure? Unfortunately, it is impossible for us to say for sure, since no source code is available. One can only guess that since arrayfun is a general-purpose function that handles all kinds of data structures and arguments, it is not necessarily very fast in the simple cases that can be expressed directly as loop nests. Where the overhead comes from, we cannot know. Could the overhead be avoided by a better implementation? Maybe not. Unfortunately, the only thing we can do is study the performance to identify the cases in which it works well, and those where it doesn't.

Update: Since the execution time of this test is short, to get reliable results I have now added a loop around the tests:

for i = 1:1000
    % compute
end
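For concreteness, this is roughly how the Soln5 timing above would be wrapped (my own reconstruction of the harness, not the exact code used):

tic
for i = 1:1000
    Soln5 = ones(T, N);
    for n = 1:N
        Soln5(:, n) = 3*x(:, n).^2 + 2*x(:, n) - 1;
    end
end
toc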

Some timings are given below:

Soln5   8.192912 seconds.
Soln7  13.419675 seconds.
Oli     8.089113 seconds.

You can see that arrayfun is still bad, but at least not three orders of magnitude worse than the vectorized solution. On the other hand, a single loop with column-wise computations is as fast as the fully vectorized version... That was all done on a single CPU. The results for Soln5 and Soln7 do not change if I switch to 2 cores: in Soln5 I would have to use parfor to get it parallelized. Forget about the speedup... Soln7 does not run in parallel because arrayfun does not run in parallel. Oli's vectorized version, on the other hand:

Oli  5.508085 seconds.

That's because

x = randn(T, N);

is not of gpuArray type.

All you need to do is

x = randn(T, N, 'gpuArray');
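A minimal sketch of a full GPU run, assuming the Parallel Computing Toolbox and a supported GPU (note that the original question explicitly asked about the non-GPU arrayfun):

xg = randn(T, N, 'gpuArray');   % create the data directly on the GPU
tic
SolnGPU = 3*xg.^2 + 2*xg - 1;   % elementwise expression executes on the GPU
wait(gpuDevice);                % make sure the GPU has finished before stopping the timer
toc
SolnHost = gather(SolnGPU);     % copy the result back to host memory if needed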