为什么使用 -O 时 Haskell 代码运行得更慢?

这段 Haskell 代码使用 -O运行 很多比较慢,但是 -O应该是 不危险。有人能告诉我发生了什么吗?如果有必要的话,这是一个解决 这个问题的尝试,它使用了二进制搜索和持久段树:

import Control.Monad
import Data.Array


data Node =
Leaf   Int           -- value
| Branch Int Node Node -- sum, left child, right child
type NodeArray = Array Int Node


-- create an empty node with range [l, r)
create :: Int -> Int -> Node
create l r
| l + 1 == r = Leaf 0
| otherwise  = Branch 0 (create l m) (create m r)
where m = (l + r) `div` 2


-- Get the sum in range [0, r). The range of the node is [nl, nr)
sumof :: Node -> Int -> Int -> Int -> Int
sumof (Leaf val) r nl nr
| nr <= r   = val
| otherwise = 0
sumof (Branch sum lc rc) r nl nr
| nr <= r   = sum
| r  > nl   = (sumof lc r nl m) + (sumof rc r m nr)
| otherwise = 0
where m = (nl + nr) `div` 2


-- Increase the value at x by 1. The range of the node is [nl, nr)
increase :: Node -> Int -> Int -> Int -> Node
increase (Leaf val) x nl nr = Leaf (val + 1)
increase (Branch sum lc rc) x nl nr
| x < m     = Branch (sum + 1) (increase lc x nl m) rc
| otherwise = Branch (sum + 1) lc (increase rc x m nr)
where m = (nl + nr) `div` 2


-- signature said it all
tonodes :: Int -> [Int] -> [Node]
tonodes n = reverse . tonodes' . reverse
where
tonodes' :: [Int] -> [Node]
tonodes' (h:t) = increase h' h 0 n : s' where s'@(h':_) = tonodes' t
tonodes' _ = [create 0 n]


-- find the minimum m in [l, r] such that (predicate m) is True
binarysearch :: (Int -> Bool) -> Int -> Int -> Int
binarysearch predicate l r
| l == r      = r
| predicate m = binarysearch predicate l m
| otherwise   = binarysearch predicate (m+1) r
where m = (l + r) `div` 2


-- main, literally
main :: IO ()
main = do
[n, m] <- fmap (map read . words) getLine
nodes <- fmap (listArray (0, n) . tonodes n . map (subtract 1) . map read . words) getLine
replicateM_ m $ query n nodes
where
query :: Int -> NodeArray -> IO ()
query n nodes = do
[p, k] <- fmap (map read . words) getLine
print $ binarysearch (ok nodes n p k) 0 n
where
ok :: NodeArray -> Int -> Int -> Int -> Int -> Bool
ok nodes n p k s = (sumof (nodes ! min (p + s + 1) n) s 0 n) - (sumof (nodes ! max (p - s) 0) s 0 n) >= k

(这与 代码检查的代码完全相同,但这个问题解决了另一个问题。)

这是我在 C + + 中的输入生成器:

#include <cstdio>
#include <cstdlib>
using namespace std;
int main (int argc, char * argv[]) {
srand(1827);
int n = 100000;
if(argc > 1)
sscanf(argv[1], "%d", &n);
printf("%d %d\n", n, n);
for(int i = 0; i < n; i++)
printf("%d%c", rand() % n + 1, i == n - 1 ? '\n' : ' ');
for(int i = 0; i < n; i++) {
int p = rand() % n;
int k = rand() % n + 1;
printf("%d %d\n", p, k);
}
}

如果您没有可用的 C + + 编译器,请使用 这是 ./gen.exe 1000的结果

这是我电脑上的执行结果:

$ ghc --version
The Glorious Glasgow Haskell Compilation System, version 7.8.3
$ ghc -fforce-recomp 1827.hs
[1 of 1] Compiling Main             ( 1827.hs, 1827.o )
Linking 1827.exe ...
$ time ./gen.exe 1000 | ./1827.exe > /dev/null
real    0m0.088s
user    0m0.015s
sys     0m0.015s
$ ghc -fforce-recomp -O 1827.hs
[1 of 1] Compiling Main             ( 1827.hs, 1827.o )
Linking 1827.exe ...
$ time ./gen.exe 1000 | ./1827.exe > /dev/null
real    0m2.969s
user    0m0.000s
sys     0m0.045s

这是堆概要的总结:

$ ghc -fforce-recomp -rtsopts ./1827.hs
[1 of 1] Compiling Main             ( 1827.hs, 1827.o )
Linking 1827.exe ...
$ ./gen.exe 1000 | ./1827.exe +RTS -s > /dev/null
70,207,096 bytes allocated in the heap
2,112,416 bytes copied during GC
613,368 bytes maximum residency (3 sample(s))
28,816 bytes maximum slop
3 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed)  Avg pause  Max pause
Gen  0       132 colls,     0 par    0.00s    0.00s     0.0000s    0.0004s
Gen  1         3 colls,     0 par    0.00s    0.00s     0.0006s    0.0010s
INIT    time    0.00s  (  0.00s elapsed)
MUT     time    0.03s  (  0.03s elapsed)
GC      time    0.00s  (  0.01s elapsed)
EXIT    time    0.00s  (  0.00s elapsed)
Total   time    0.03s  (  0.04s elapsed)
%GC     time       0.0%  (14.7% elapsed)
Alloc rate    2,250,213,011 bytes per MUT second
Productivity 100.0% of total user, 83.1% of total elapsed
$ ghc -fforce-recomp -O -rtsopts ./1827.hs
[1 of 1] Compiling Main             ( 1827.hs, 1827.o )
Linking 1827.exe ...
$ ./gen.exe 1000 | ./1827.exe +RTS -s > /dev/null
6,009,233,608 bytes allocated in the heap
622,682,200 bytes copied during GC
443,240 bytes maximum residency (505 sample(s))
48,256 bytes maximum slop
3 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed)  Avg pause  Max pause
Gen  0     10945 colls,     0 par    0.72s    0.63s     0.0001s    0.0004s
Gen  1       505 colls,     0 par    0.16s    0.13s     0.0003s    0.0005s
INIT    time    0.00s  (  0.00s elapsed)
MUT     time    2.00s  (  2.13s elapsed)
GC      time    0.87s  (  0.76s elapsed)
EXIT    time    0.00s  (  0.00s elapsed)
Total   time    2.89s  (  2.90s elapsed)
%GC     time      30.3%  (26.4% elapsed)
Alloc rate    3,009,412,603 bytes per MUT second
Productivity  69.7% of total user, 69.4% of total elapsed
3432 次浏览

What happened to your code with -O

Let me zoom in your main function, and rewrite it slightly:

main :: IO ()
main = do
[n, m] <- fmap (map read . words) getLine
line <- getLine
let nodes = listArray (0, n) . tonodes n . map (subtract 1) . map read . words $ line
replicateM_ m $ query n nodes

Clearly, the intention here is that the NodeArray is created once, and then used in every of the m invocations of query.

Unfortunately, GHC transforms this code to, effectively,

main = do
[n, m] <- fmap (map read . words) getLine
line <- getLine
replicateM_ m $ do
let nodes = listArray (0, n) . tonodes n . map (subtract 1) . map read . words $ line
query n nodes

and you can immediately see the problem here.

What is the state hack, and why does it destroy my programs performance

The reason is the state hack, which says (roughly): “When something is of type IO a, assume it is called only once.”. The official documentation is not much more elaborate:

-fno-state-hack

Turn off the "state hack" whereby any lambda with a State# token as argument is considered to be single-entry, hence it is considered OK to inline things inside it. This can improve performance of IO and ST monad code, but it runs the risk of reducing sharing.

Roughly, the idea is as follows: If you define a function with an IO type and a where clause, e.g.

foo x = do
putStrLn y
putStrLn y
where y = ...x...

Something of type IO a can be viewed as something of type RealWord -> (a, RealWorld). In that view, the above becomes (roughly)

foo x =
let y = ...x... in
\world1 ->
let (world2, ()) = putStrLn y world1
let (world3, ()) = putStrLn y world2
in  (world3, ())

A call to foo would (typically) look like this foo argument world. But the definition of foo only takes one argument, and the other one is only consumed later by a local lambda expression! That is going to be a very slow call to foo. It would be much faster if the code would look like this:

foo x world1 =
let y = ...x... in
let (world2, ()) = putStrLn y world1
let (world3, ()) = putStrLn y world2
in  (world3, ())

This is called eta-expansion and done on various grounds (e.g. by analyzing the function’s definition, by checking how it is being called, and – in this case – type directed heuristics).

Unfortunately, this degrades performance if the call to foo is actually of the form let fooArgument = foo argument, i.e. with an argument, but no world passed (yet). In the original code, if fooArgument is then used several times, y will still be calculated only once, and shared. In the modified code, y will be re-calculated every time – precisely what has happened to your nodes.

Can things be fixed?

Possibly. See #9388 for an attempt at doing so. The problem with fixing it is that it will cost performance in a lot of cases where the transformation happens to ok, even though the compiler cannot possibly know that for sure. And there are probably cases where it is technically not ok, i.e. sharing is lost, but it is still beneficial because the speedups from the faster calling outweigh the extra cost of the recalculation. So it is not clear where to go from here.