用于分析 Haskell 程序性能的工具

在解决一些学习 Haskell 的 Euler 项目问题时(因此目前我是一个完全的初学者) ,我学习了 问题12。我写了这个(天真的)解决方案:

--Get Number of Divisors of n
numDivs :: Integer -> Integer
numDivs n = toInteger $ length [ x | x<-[2.. ((n `quot` 2)+1)], n `rem` x == 0] + 2


--Generate a List of Triangular Values
triaList :: [Integer]
triaList =  [foldr (+) 0 [1..n] | n <- [1..]]


--The same recursive
triaList2 = go 0 1
where go cs n = (cs+n):go (cs+n) (n+1)


--Finds the first triangular Value with more than n Divisors
sol :: Integer -> Integer
sol n = head $ filter (\x -> numDivs(x)>n) triaList2

这个 n=500 (sol 500)的解决方案非常慢(现在运行超过2小时) ,所以我想知道如何找出这个解决方案为什么这么慢。有没有什么命令可以告诉我大部分的计算时间都花在了哪里,这样我就可以知道我的 haskell 程序哪部分是慢的?比如一个简单的侧写师。

为了说清楚,我不是要求 为了一个更快的解决方案,而是要求 一条路找到这个解决方案。如果你没有 Haskell 的知识,你会怎么开始?

我尝试编写两个 triaList函数,但没有办法测试哪一个更快,所以这是我的问题开始。

谢谢

17313 次浏览

Haskell related note: triaList2 is of course faster than triaList because the latter performs a lot of unnecessary computations. It will take quadratic time to compute n first elements of triaList, but linear for triaList2. There is another elegant (and efficient) way to define an infinite lazy list of triangle numbers:

triaList = 1 : zipWith (+) triaList [2..]

Math related note: there is no need to check all divisors up to n / 2, it's enough to check up to sqrt(n).

You can run your program with flags to enable time profiling. Something like this:

./program +RTS -P -sprogram.stats -RTS

That should run the program and produce a file called program.stats which will have how much time was spent in each function. You can find more information about profiling with GHC in the GHC user guide. For benchmarking, there is the Criterion library. I've found this blog post has a useful introduction.

how to find out why this solution is so slow. Are there any commands that tell me where most of the computation-time is spend so I know which part of my haskell-program is slow?

Precisely! GHC provides many excellent tools, including:

A tutorial on using time and space profiling is part of Real World Haskell.

GC Statistics

Firstly, ensure you're compiling with ghc -O2. And you might make sure it is a modern GHC (e.g. GHC 6.12.x)

The first thing we can do is check that garbage collection isn't the problem. Run your program with +RTS -s

$ time ./A +RTS -s
./A +RTS -s
749700
9,961,432,992 bytes allocated in the heap
2,463,072 bytes copied during GC
29,200 bytes maximum residency (1 sample(s))
187,336 bytes maximum slop
**2 MB** total memory in use (0 MB lost due to fragmentation)


Generation 0: 19002 collections,     0 parallel,  0.11s,  0.15s elapsed
Generation 1:     1 collections,     0 parallel,  0.00s,  0.00s elapsed


INIT  time    0.00s  (  0.00s elapsed)
MUT   time   13.15s  ( 13.32s elapsed)
GC    time    0.11s  (  0.15s elapsed)
RP    time    0.00s  (  0.00s elapsed)
PROF  time    0.00s  (  0.00s elapsed)
EXIT  time    0.00s  (  0.00s elapsed)
Total time   13.26s  ( 13.47s elapsed)


%GC time       **0.8%**  (1.1% elapsed)


Alloc rate    757,764,753 bytes per MUT second


Productivity  99.2% of total user, 97.6% of total elapsed


./A +RTS -s  13.26s user 0.05s system 98% cpu 13.479 total

Which already gives us a lot of information: you only have a 2M heap, and GC takes up 0.8% of time. So no need to worry that allocation is the problem.

Time Profiles

Getting a time profile for your program is straight forward: compile with -prof -auto-all

 $ ghc -O2 --make A.hs -prof -auto-all
[1 of 1] Compiling Main             ( A.hs, A.o )
Linking A ...

And, for N=200:

$ time ./A +RTS -p
749700
./A +RTS -p  13.23s user 0.06s system 98% cpu 13.547 total

which creates a file, A.prof, containing:

    Sun Jul 18 10:08 2010 Time and Allocation Profiling Report  (Final)


A +RTS -p -RTS


total time  =     13.18 secs   (659 ticks @ 20 ms)
total alloc = 4,904,116,696 bytes  (excludes profiling overheads)


COST CENTRE          MODULE         %time %alloc


numDivs            Main         100.0  100.0

Indicating that all your time is spent in numDivs, and it is also the source of all your allocations.

Heap Profiles

You can also get a break down of those allocations, by running with +RTS -p -hy, which creates A.hp, which you can view by converting it to a postscript file (hp2ps -c A.hp), generating:

alt text

which tells us there's nothing wrong with your memory use: it is allocating in constant space.

So your problem is algorithmic complexity of numDivs:

toInteger $ length [ x | x<-[2.. ((n `quot` 2)+1)], n `rem` x == 0] + 2

Fix that, which is 100% of your running time, and everything else is easy.

Optimizations

This expression is a good candidate for the stream fusion optimization, so I'll rewrite it to use Data.Vector, like so:

numDivs n = fromIntegral $
2 + (U.length $
U.filter (\x -> fromIntegral n `rem` x == 0) $
(U.enumFromN 2 ((fromIntegral n `div` 2) + 1) :: U.Vector Int))

Which should fuse into a single loop with no unnecessary heap allocations. That is, it will have better complexity (by constant factors) than the list version. You can use the ghc-core tool (for advanced users) to inspect the intermediate code after optimization.

Testing this, ghc -O2 --make Z.hs

$ time ./Z
749700
./Z  3.73s user 0.01s system 99% cpu 3.753 total

So it reduced running time for N=150 by 3.5x, without changing the algorithm itself.

Conclusion

Your problem is numDivs. It is 100% of your running time, and has terrible complexity. Think about numDivs, and how, for example, for each N you are generating [2 .. n div 2 + 1] N times. Try memoizing that, since the values don't change.

To measure which of your functions is faster, consider using criterion, which will provide statistically robust information about sub-microsecond improvements in running time.


Addenda

Since numDivs is 100% of your running time, touching other parts of the program won't make much difference, however, for pedagogical purposes, we can also rewrite those using stream fusion.

We can also rewrite trialList, and rely on fusion to turn it into the loop you write by hand in trialList2, which is a "prefix scan" function (aka scanl):

triaList = U.scanl (+) 0 (U.enumFrom 1 top)
where
top = 10^6

Similarly for sol:

sol :: Int -> Int
sol n = U.head $ U.filter (\x -> numDivs x > n) triaList

With the same overall running time, but a bit cleaner code.

Dons' answer is great without being a spoiler by giving a direct solution to the problem.
Here I want to suggest a little tool that I wrote recently. It saves you the time to write SCC annotations by hand when you want a more detailed profile than the default ghc -prof -auto-all. Besides that it's colorful!

Here's an example with the code you gave(*), green is OK, red is slow: alt text

All the time goes in creating the list of divisors. This suggests a few things you can do:
1. Make the filtering n rem x == 0 faster, but since it's a built-in function probably it's already fast.
2. Create a shorter list. You've already done something in that direction by checking only up to n quot 2.
3. Throw away the list generation completely and use some math to get a faster solution. This is the usual way for project Euler problems.

(*) I got this by putting your code in a file called eu13.hs, adding a main function main = print $ sol 90. Then running visual-prof -px eu13.hs eu13 and the result is in eu13.hs.html.