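Text vs. String

While the general opinion of the Haskell community seems to be that it is always better to use Text instead of String, the fact that the APIs of most maintained libraries are still String-oriented confuses me greatly. On the other hand, there is a notable project that considers String a mistake altogether and provides a Prelude in which all String-oriented functions are replaced with their Text counterparts.

So, apart from backwards compatibility, compatibility with the standard Prelude, and "switch-making inertia", are there any reasons for people to keep writing String-oriented APIs? Are there any other drawbacks to Text compared with String?

I'm particularly interested in this because I'm designing a library and trying to decide which type to use to express error messages.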


My unqualified guess is that most library writers don't want to add more dependencies than necessary. Since String is part of literally every Haskell distribution (it's part of the language standard!), a library is a lot easier to adopt if it uses String and doesn't require its users to sort out Text distributions from Hackage.

It's one of those "design mistakes" that you just have to live with unless you can convince most of the community to switch overnight. Just look at how long it has taken to get Applicative to be a superclass of Monad (a relatively minor but much wanted change) and imagine how long it would take to replace all the String things with Text.


To answer your more specific question: I would go with String unless you get noticeable performance benefits by using Text. Error messages are usually rather small one-off things so it shouldn't be a big problem to use String.

On the other hand, if you are the kind of ideological purist that eschews pragmatism for idealism, go with Text.


* I put "design mistakes" in scare quotes because representing strings as lists of chars is a neat property that makes them easy to reason about and easy to integrate with existing list-operating functions.
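As a small illustration of that footnote, here is a sketch of ordinary list functions from base applied directly to a String. The function names (`shout`, `trim`, `countChar`) are made up for this example and do not come from any library.

```haskell
import Data.Char (isSpace, toUpper)
import Data.List (dropWhileEnd)

-- String is a type synonym for [Char], so plain list functions work on it.

-- Upper-case every character with the ordinary list 'map'.
shout :: String -> String
shout = map toUpper

-- Strip leading and trailing whitespace using list functions only.
trim :: String -> String
trim = dropWhileEnd isSpace . dropWhile isSpace

-- Count occurrences of a character with 'filter' and 'length'.
countChar :: Char -> String -> Int
countChar c = length . filter (== c)
```

With Text you would reach for the equivalents exported by Data.Text instead; they exist for most of these, but they are a separate set of names to import.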

If your API is targeted at processing large amounts of character oriented data and/or various encodings, then your API should use Text.

If your API is primarily for dealing with small one-off strings, then using the built-in String type should be fine.

Using String for large amounts of text will make applications using your API consume significantly more memory. Using it with foreign encodings could seriously complicate usage depending on how your API works.

String is quite expensive (at least 5N words, where N is the number of Chars in the String). A word has the same number of bits as the processor architecture (e.g. 32 bits or 64 bits): http://blog.johantibell.com/2011/06/memory-footprints-of-some-common-data.html
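To put that figure in perspective, the arithmetic can be sketched as below. This only restates the 5N-words lower bound quoted above; the function names are hypothetical and the result is a rough estimate, not a measurement.

```haskell
-- Lower-bound estimate of the heap words used by a String of n characters,
-- using the 5N-words figure quoted above.
stringWords :: Int -> Int
stringWords n = 5 * n

-- Convert that estimate to bytes for a given word size in bytes
-- (4 on a 32-bit machine, 8 on a 64-bit machine).
stringBytes :: Int -> Int -> Int
stringBytes wordSize n = wordSize * stringWords n
```

By this estimate, `stringBytes 8 1000000` is 40,000,000, so a one-million-character String takes on the order of 40 MB on a 64-bit machine, while the same ASCII text stored as raw bytes would take about 1 MB.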

There are at least three reasons to use [Char] in small projects.

  1. [Char] does not rely on any arcane stuff, such as foreign pointers, raw memory, or raw arrays, that may work differently on different platforms or even be unavailable altogether.

  2. [Char] is the lingua franca in Haskell. There are at least three 'efficient' ways to handle Unicode data in Haskell: utf8-bytestring, Data.Text.Text and Data.Vector.Unboxed.Vector Char, each of which requires dealing with an extra package.

  3. By using [Char] one gains access to all the power of the [] monad, including many list-specific functions (alternative string packages do try to help with this, but still).
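The third point can be sketched with a few lines that use only the Prelude. The function names are invented for the example; the point is just that do-notation, list comprehensions and `>>=` all work on a String because it is a list.

```haskell
-- All two-letter combinations drawn from two strings, via the list monad.
pairs :: String -> String -> [String]
pairs xs ys = do
  x <- xs
  y <- ys
  return [x, y]

-- A list comprehension over a String: keep only the vowels.
vowels :: String -> String
vowels s = [c | c <- s, c `elem` "aeiou"]

-- (>>=) on a list behaves like concatMap: double every character.
doubleChars :: String -> String
doubleChars s = s >>= \c -> [c, c]
```

For instance, `pairs "ab" "xy"` evaluates to `["ax","ay","bx","by"]`.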

Personally, I consider UTF-16-based Data.Text one of the most questionable decisions of the Haskell community, since UTF-16 combines the flaws of both UTF-8 and UTF-32 while having none of their benefits.

I do not think there is a single technical reason for String to remain, and I can see several for it to go.

Overall, I would first argue that in the Text/String case there is only one best solution:

  • String performance is bad; everyone agrees on that.

  • Text is not difficult to use. All the functions commonly used on String are available on Text, plus some more that are useful in the context of strings (substitution, padding, encoding).

  • Having two solutions creates unnecessary complexity unless all base functions are made polymorphic. Proof: there are SO questions on the subject of automatic conversions. So this is a problem.

So one solution is less complex than two, and the shortcomings of String will make it disappear eventually. The sooner the better!
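To make the conversion point concrete, here is a minimal sketch of what bridging the two types looks like. `legacyGreeting` is a made-up stand-in for a String-oriented library function; `T.pack` and `T.unpack` are the conversion functions from the text package, and `OverloadedStrings` is the GHC extension that lets a string literal be used as Text.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import qualified Data.Text as T

-- A hypothetical String-oriented function, standing in for a library API.
legacyGreeting :: String -> String
legacyGreeting name = "Hello, " ++ name

-- Wrapping it for Text callers means converting at both ends,
-- and each conversion walks the whole string.
greeting :: T.Text -> T.Text
greeting = T.pack . legacyGreeting . T.unpack

-- With OverloadedStrings a literal can be used directly as Text.
defaultName :: T.Text
defaultName = "world"
```

Every boundary between a String API and a Text API needs a wrapper of this shape, which is the complexity the last bullet is complaining about.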

I wonder if Data.Text is always more efficient than Data.String?

"cons", for instance, is O(1) for Strings and O(n) for Text. Append is O(n) for Strings and O(n+m) for strict Texts. Likewise,

    let foo = "foo" ++ bigchunk
        bar = "bar" ++ bigchunk

is more space efficient for Strings than for strict Texts.
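A slightly fuller sketch of the same comparison is below. The names and the size of `bigChunk` are arbitrary choices for the example; the comments describe how list append and strict Text append behave, under the assumption that strict Text is stored as one contiguous array.

```haskell
import qualified Data.Text as T

-- An arbitrary large string used as the shared suffix.
bigChunk :: String
bigChunk = replicate 100000 'x'

-- (++) copies only its first argument, so both results end in the
-- very same bigChunk cells; only the three-character prefixes are new.
fooS, barS :: String
fooS = "foo" ++ bigChunk
barS = "bar" ++ bigChunk

bigChunkT :: T.Text
bigChunkT = T.pack bigChunk

-- Strict Text is contiguous, so each append copies bigChunkT
-- into a fresh buffer; nothing is shared between fooT and barT.
fooT, barT :: T.Text
fooT = T.append (T.pack "foo") bigChunkT
barT = T.append (T.pack "bar") bigChunkT

-- cons: (:) allocates one cell on a String, T.cons copies the whole Text.
consS :: String
consS = '>' : bigChunk

consT :: T.Text
consT = T.cons '>' bigChunkT
```

Both versions produce the same characters; the difference is only in how much memory is copied and shared.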

Another issue, unrelated to efficiency, is pattern matching (perspicuous code) and laziness (predictably per-character in Strings, somewhat implementation-dependent in lazy Text).

Texts are obviously good for static character sequences and for in-place modification. For other forms of structural editing, Data.String might have advantages.
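A rough sketch of the pattern-matching and laziness points follows. The function names are invented for the example; `T.isPrefixOf` and `T.uncons` are functions from the text package.

```haskell
import qualified Data.Text as T

-- Pattern matching directly on the list constructors of a String.
isCommentS :: String -> Bool
isCommentS ('-':'-':_) = True
isCommentS _           = False

-- The Text version goes through a library function instead.
isCommentT :: T.Text -> Bool
isCommentT = T.isPrefixOf (T.pack "--")

-- Taking a Text apart one character at a time uses T.uncons.
firstCharT :: T.Text -> Maybe Char
firstCharT t = case T.uncons t of
  Just (c, _) -> Just c
  Nothing     -> Nothing

-- Per-character laziness: only three cells of the infinite String
-- are ever forced, so this terminates.
firstThree :: String
firstThree = take 3 (cycle "ab")
```

The String version reads like the shape of the data, which is what "perspicuous" is getting at; the Text version is no longer, but it depends on knowing the library.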