What's so bad about Lazy I/O?

I've generally heard that production code should avoid using lazy I/O. My question is, why? Is it ever OK to use lazy I/O outside of just playing around? And what makes the alternatives (e.g. enumerators) better?


Lazy IO has the problem that releasing whatever resource you have acquired is somewhat unpredictable, as it depends on how your program consumes the data -- its "demand pattern". Once your program drops the last reference to the resource, the GC will eventually run and release that resource.

Lazy streams are a very convenient style to program in. This is why shell pipes are so fun and popular.

However, if resources are constrained (as in high-performance scenarios, or production environments that expect to scale to the limits of the machine) relying on the GC to clean up can be an insufficient guarantee.

Sometimes you have to release resources eagerly, in order to improve scalability.

So what are the alternatives to lazy IO that don't mean giving up on incremental processing (which in turn would consume too many resources)? Well, we have foldl based processing, aka iteratees or enumerators, introduced by Oleg Kiselyov in the late 2000s, and since popularized by a number of networking-based projects.

Instead of processing data as lazy streams, or in one huge batch, we instead abstract over chunk-based strict processing, with guaranteed finalization of the resource once the last chunk is read. That's the essence of iteratee-based programming, and one that offers very nice resource constraints.
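
To make that concrete, here is a minimal sketch of chunk-based strict processing in plain IO (foldChunks, countBytes, and the 4096-byte chunk size are my own illustrative assumptions, not part of any iteratee library):

import qualified Data.ByteString as B
import System.IO

-- Fold strictly over fixed-size chunks; withFile guarantees the handle
-- is closed as soon as the fold finishes (or an exception is thrown).
foldChunks :: (a -> B.ByteString -> a) -> a -> FilePath -> IO a
foldChunks f z path = withFile path ReadMode (go z)
  where
    go acc h = do
      chunk <- B.hGetSome h 4096          -- read at most 4096 bytes, strictly
      if B.null chunk
        then return acc                   -- EOF: withFile closes the handle here
        else let acc' = f acc chunk
             in acc' `seq` go acc' h      -- force the accumulator each step

-- Usage: count bytes without ever holding the whole file in memory.
countBytes :: FilePath -> IO Int
countBytes = foldChunks (\n c -> n + B.length c) 0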

The downside of iteratee-based IO is that it has a somewhat awkward programming model (roughly analogous to event-based programming, versus nice thread-based control). It is definitely an advanced technique, in any programming language. And for the vast majority of programming problems, lazy IO is entirely satisfactory. However, if you will be opening many files, or talking on many sockets, or otherwise using many simultaneous resources, an iteratee (or enumerator) approach might make sense.

I use lazy I/O in production code all the time. It's only a problem in certain circumstances, as Don mentioned. But for just reading a few files it works fine.

Dons has provided a very good answer, but he's left out what is (for me) one of the most compelling features of iteratees: they make it easier to reason about space management because old data must be explicitly retained. Consider:

average :: [Float] -> Float
average xs = sum xs / fromIntegral (length xs)

This is a well-known space leak, because the entire list xs must be retained in memory to calculate both sum and length. It's possible to make an efficient consumer by creating a fold:

average2 :: [Float] -> Float
average2 xs = uncurry (/) $ foldl (\(sumT, n) x -> (sumT+x, n+1)) (0,0) xs
-- N.B. this will build up thunks as written; use a strict pair and foldl'
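
For reference, the strict version that the comment suggests might look like this (a minimal sketch using BangPatterns; average2' is my own name for it):

{-# LANGUAGE BangPatterns #-}
import Data.List (foldl')

average2' :: [Float] -> Float
average2' xs = s / n
  where
    (s, n) = foldl' step (0, 0) xs
    step (!sumT, !len) x = (sumT + x, len + 1)  -- force both components each step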

But it's somewhat inconvenient to have to do this for every stream processor. There are some generalizations (Conal Elliott - Beautiful Fold Zipping), but they don't seem to have caught on. However, iteratees can get you a similar level of expression.

aveIter = uncurry (/) <$> I.zip I.sum I.length
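
(Here I is presumably a qualified import of an iteratee library, e.g. import qualified Data.Iteratee as I.)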

This isn't as efficient as a fold because the list is still iterated over multiple times, however it's collected in chunks so old data can be efficiently garbage collected. In order to break that property, it's necessary to explicitly retain the entire input, such as with stream2list:

badAveIter = (\xs -> sum xs / fromIntegral (length xs)) <$> I.stream2list

Iteratees as a programming model are still a work in progress; however, the situation is much better than it was even a year ago. We're learning which combinators are useful (e.g. zip, breakE, enumWith) and which are less so, with the result that the built-in iteratees and combinators keep gaining expressivity.

That said, Dons is correct that they're an advanced technique; I certainly wouldn't use them for every I/O problem.

Another problem with lazy IO that hasn't been mentioned so far is that it has surprising behaviour. In a normal Haskell program, it can sometimes be difficult to predict when each part of your program is evaluated, but fortunately due to purity it really doesn't matter unless you have performance problems. When lazy IO is introduced, the evaluation order of your code actually has an effect on its meaning, so changes that you're used to thinking of as harmless can cause you genuine problems.

As an example, here's a question about code that looks reasonable but is made more confusing by deferred IO: withFile vs. openFile

These problems aren't invariably fatal, but it's another thing to think about, and a sufficiently severe headache that I personally avoid lazy IO unless there's a real problem with doing all the work upfront.

Update: Recently on haskell-cafe, Oleg Kiselyov showed that unsafeInterleaveST (which is used for implementing lazy IO within the ST monad) is very unsafe - it breaks equational reasoning. He shows that it allows one to construct bad_ctx :: ((Bool,Bool) -> Bool) -> Bool such that

> bad_ctx (\(x,y) -> x == y)
True
> bad_ctx (\(x,y) -> y == x)
False

even though == is commutative.
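
For the curious, here is a sketch of how such a bad_ctx can be constructed; this is my reconstruction using the lazy ST monad, not necessarily the code from Oleg's post:

import Control.Monad.ST.Lazy (runST)
import Control.Monad.ST.Lazy.Unsafe (unsafeInterleaveST)
import Data.STRef.Lazy (newSTRef, readSTRef, writeSTRef)

-- In the lazy ST monad, actions run only when their results are
-- demanded, so whichever component the consumer forces first
-- determines what the other one sees.
bad_ctx :: ((Bool, Bool) -> Bool) -> Bool
bad_ctx body = runST $ do
  r <- newSTRef False
  x <- unsafeInterleaveST (writeSTRef r True >> return True)  -- deferred write
  y <- readSTRef r                                            -- deferred read
  return (body (x, y))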


Another problem with lazy IO: The actual IO operation can be deferred until it's too late, for example after the file is closed. Quoting from Haskell Wiki - Problems with lazy IO:

For example, a common beginner mistake is to close a file before one has finished reading it:

wrong = do
    fileData <- withFile "test.txt" ReadMode hGetContents
    putStr fileData

The problem is that withFile closes the handle before fileData is forced. The correct way is to pass all the code to withFile:

right = withFile "test.txt" ReadMode $ \handle -> do
    fileData <- hGetContents handle
    putStr fileData

Here, the data is consumed before withFile finishes.

This is often unexpected and an easy-to-make error.
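
If you do want to keep the lazy hGetContents, one workaround is to force the entire contents before withFile returns (a sketch assuming the deepseq package; alsoRight is a hypothetical name):

import System.IO
import Control.DeepSeq (force)
import Control.Exception (evaluate)

alsoRight :: IO ()
alsoRight = do
  fileData <- withFile "test.txt" ReadMode $ \handle -> do
    contents <- hGetContents handle
    evaluate (force contents)   -- fully evaluate while the handle is still open
  putStr fileData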


See also: Three examples of problems with Lazy I/O.

What's so bad about lazy I/O is that you, the programmer, have to micro-manage certain resources instead of the implementation. For example, which of the following is "different"?

  • freeSTRef :: STRef s a -> ST s ()
  • closeIORef :: IORef a -> IO ()
  • endMVar :: MVar a -> IO ()
  • discardTVar :: TVar a -> STM ()
  • hClose :: Handle -> IO ()
  • finalizeForeignPtr :: ForeignPtr a -> IO ()

...out of all these resource-disposal definitions, only the last two - hClose and finalizeForeignPtr - actually exist. As for the rest, whatever service they could provide in the language is much more reliably performed by the implementation!

So if the releasing of resources like file handles and foreign references were also left to the implementation, lazy I/O would probably be no worse than lazy evaluation.