Legality of COW std::string implementation in C++11

It had been my understanding that copy-on-write is not a viable way to implement a conforming std::string in C++11, but when it came up in discussion recently I found myself unable to directly support that statement.

Am I correct that C++11 does not admit COW based implementations of std::string?

If so, is this restriction explicitly stated somewhere in the new standard (where)?

Or is this restriction implied, in the sense that it is the combined effect of the new requirements on std::string that precludes a COW based implementation of std::string. In this case, I'd be interested in a chapter and verse style derivation of 'C++11 effectively prohibits COW based std::string implementations'.

34161 次浏览

Since it is now guaranteed that strings are stored contiguously and you are now allowed to take a pointer to the internal storage of a string, (i.e. &str[0] works like it would for an array), it's not possible to make a useful COW implementation. You would have to make a copy for way too many things. Even just using operator[] or begin() on a non-const string would require a copy.

It is, CoW is an acceptable mechanism for making faster strings... but...

it makes multithreading code slower (all that locking to check if you're the only one writing kills performance when using a lot of strings). This was the main reason CoW was killed off years ago.

The other reasons are that the [] operator will return you the string data, without any protection for you to overwrite a string someone else expects to be unchanging. The same applies to c_str() and data().

Quick google says that the multithreading is basically the reason it was effectively disallowed (not explicitly).

The proposal says :

Proposal

We propose to make all iterator and element access operations safely concurrently executable.

We are increasing the stability of operations even in sequential code.

This change effectively disallows copy-on-write implementations.

followed by

The largest potential loss in performance due to a switch away from copy-on-write implementations is the increased consumption of memory for applications with very large read-mostly strings. However, we believe that for those applications ropes are a better technical solution, and recommend a rope proposal be considered for inclusion in Library TR2.

Ropes are part of STLPort and SGIs STL.

From 21.4.2 basic_string constructors and assignment operators [string.cons]

basic_string(const basic_string<charT,traits,Allocator>& str);

[...]

2 Effects: Constructs an object of class basic_string as indicated in Table 64. [...]

Table 64 helpfully documents that after construction of an object via this (copy) constructor, this->data() has as value:

points at the first element of an allocated copy of the array whose first element is pointed at by str.data()

There are similar requirements for other similar constructors.

It's not allowed, because as per the standard 21.4.1 p6, invalidation of iterators/references is only allowed for

— as an argument to any standard library function taking a reference to non-const basic_string as an argument.

— Calling non-const member functions, except operator[], at, front, back, begin, rbegin, end, and rend.

For a COW string, calling non-const operator[] would require making a copy (and invalidating references), which is disallowed by the paragraph above. Hence, it's no longer legal to have a COW string in C++11.

The answers by Dave S and gbjbaanb are correct. (And Luc Danton's is correct too, although it's more a side-effect of forbidding COW strings rather than the original rule that forbids it.)

But to clear up some confusion, I'm going to add some further exposition. Various comments link to a comment of mine on the GCC bugzilla which gives the following example:

std::string s("str");
const char* p = s.data();
{
std::string s2(s);
(void) s[0];
}
std::cout << *p << '\n';  // p is dangling

The point of that example is to demonstrate why GCC's reference counted (COW) string is not valid in C++11. The C++11 standard requires this code to work correctly. Nothing in the code permits the p to be invalidated in C++11.

Using GCC's old reference-counted std::string implementation, that code has undefined behaviour, because p p0 invalidated, becoming a dangling pointer. (What happens is that when s2 is constructed it shares the data with s, but obtaining a non-const reference via s[0] requires the data to be unshared, so s does a "copy on write" because the reference s[0] could potentially be used to write into s, then s2 goes out of scope, destroying the array pointed to by p).

The C++03 standard explicitly permits that behaviour in 21.3 [lib.basic.string] p5 where it says that subsequent to a call to data() the first call to operator[]() may invalidate pointers, references and iterators. So GCC's COW string was a valid C++03 implementation.

The C++11 standard no longer permits that behaviour, because no call to operator[]() may invalidate pointers, references or iterators, irrespective of whether they follow a call to data().

So the example above must work in C++11, but does not work with libstdc++'s kind of COW string, therefore that kind of COW string is not permitted in C++11.

Is COW basic_string prohibited in C++11 and later?

Regarding

Am I correct that C++11 does not admit COW based implementations of std::string?

Yes.

Regarding

If so, is this restriction explicitly stated somewhere in the new standard (where)?

Almost directly, by requirements of constant complexity for a number of operations that would require O(n) physical copying of the string data in a COW implementation.

For example, for the member functions

auto operator[](size_type pos) const -> const_reference;
auto operator[](size_type pos) -> reference;

… which in a COW implementation would ¹both trigger string data copying to un-share the string value, the C++11 standard requires

C++11 §21.4.5/4:

Complexity: constant time.

… which rules out such data copying, and hence, COW.

C++03 supported COW implementations by not having these constant complexity requirements, and by, under certain restrictive conditions, allowing calls to operator[](), at(), begin(), rbegin(), end(), or rend() to invalidate references, pointers and iterators referring to the string items, i.e. to possibly incur a COW data copying. This support was removed in C++11.


Is COW also prohibited via the C++11 invalidation rules?

In another answer which at the time of writing is selected as solution, and which is heavily upvoted and therefore apparently believed, it's asserted that

For a COW string, calling non-const operator[] would require making a copy (and invalidating references), which is disallowed by the [quoted] paragraph above [C++11 §21.4.1/6]. Hence, it's no longer legal to have a COW string in C++11.

That assertion is incorrect and misleading in two main ways:

  • It incorrectly indicates that only the non-const item accessors need to trigger a COW data copying.
    But also the const item accessors need to trigger data copying, because they allow client code to form references or pointers that (in C++11) it's not permitted to invalidate later via the operations that can trigger COW data copying.
  • It incorrectly assumes that COW data copying can cause reference invalidation.
    But in a correct implementation COW data copying, un-sharing the string value, is done at a point before there are any references that can be invalidated.

To see how a correct C++11 COW implementation of basic_string would work, when the O(1) requirements that make this invalid are ignored, think of an implementation where a string can switch between ownership policies. A string instance starts out with policy Sharable. With this policy active there can be no external item references. The instance can transition to Unique policy, and it must do so when an item reference is potentially created such as with a call to .c_str() (at least if that produces a pointer to the internal buffer). In the general case of multiple instances sharing ownership of the value, this entails copying the string data. After that transition to Unique policy the instance can only transition back to Sharable by an operation that invalidates all references, such as assignment.

So, while that answer's conclusion, that COW strings are ruled out, is correct, the reasoning offered is incorrect and strongly misleading.

I suspect the cause of this misunderstanding is a non-normative note in C++11's annex C:

C++11 §C.2.11 [diff.cpp03.strings], about §21.3:

Change: basic_string requirements no longer allow reference-counted strings
Rationale: Invalidation is subtly different with reference-counted strings. This change regularizes behavor (sic) for this International Standard.
Effect on original feature: Valid C ++ 2003 code may execute differently in this International Standard

Here the rationale explains the primary why one decided to remove the C++03 special COW support. This rationale, the why, is not how the standard effectively disallows COW implementation. The standard disallows COW via the O(1) requirements.

In short, the C++11 invalidation rules don't rule out a COW implementation of std::basic_string. But they do rule out a reasonably efficient unrestricted C++03-style COW implementation like the one in at least one of g++'s standard library implementations. The special C++03 COW support allowed practical efficiency, in particular using const item accessors, at the cost of subtle, complex rules for invalidation:

C++03 §21.3/5 which includes “first call” COW support:

References, pointers, and iterators referring to the elements of a basic_string sequence may be invalidated by the following uses of that basic_string object:
— As an argument to non-member functions swap() (21.3.7.8), operator>>() (21.3.7.9), and getline() (21.3.7.9).
— As an argument to basic_string::swap().
— Calling data() and c_str() member functions.
— Calling non-const member functions, except operator[](), at(), begin(), rbegin(), end(), and rend().
— Subsequent to any of the above uses except the forms of insert() and erase() which return iterators, the first call to non-const member functions operator[](), at(), begin(), rbegin(), end(), or rend().

These rules are so complex and subtle that I doubt many programmers, if any, could give a precise summary. I could not.


What if O(1) requirements are disregarded?

If the C++11 constant time requirements on e.g. operator[] are disregarded, then COW for basic_string could be technically feasible, but difficult to implement.

Operations which could access the contents of a string without incurring COW data copying include:

  • Concatenation via +.
  • Output via <<.
  • Using a basic_string as argument to standard library functions.

The latter because the standard library is permitted to rely on implementation specific knowledge and constructs.

Additionally an implementation could offer various non-standard functions for accessing string contents without triggering COW data copying.

A main complicating factor is that in C++11 basic_string item access must trigger data copying (un-sharing the string data) but is required to not throw, e.g. C++11 §21.4.5/3 “Throws: Nothing.”. And so it can't use ordinary dynamic allocation to create a new buffer for COW data copying. One way around this is to use a special heap where memory can be reserved without being actually allocated, and then reserve the requisite amount for each logical reference to a string value. Reserving and un-reserving in such a heap can be constant time, O(1), and allocating the amount that one has already reserved, can be noexcept. In order to comply with the standard's requirements, with this approach it seems there would need to be one such special reservation-based heap per distinct allocator.


Notes:
¹ The const item accessor triggers a COW data copying because it allows the client code to obtain a reference or pointer to the data, which it's not permitted to invalidate by a later data copying triggered by e.g. the non-const item accessor.

I was always wondering about immutable cows: once cow is created I could be changed only through assignment from another cow, hence it will be compliant with the standard.

I had time to try it today for a simple comparison test: a map of size N keyed by string/cow with every node holding a set of all strings in the map (we have NxN number of objects).

With strings sized ~300 bytes and N=2000 cows are slightly faster and use almost order of magnitude less memory. See below, sizes are in kbs, run b is with cows.

~/icow$ ./tst 2000
preparation a
run
done a: time-delta=6 mem-delta=1563276
preparation b
run
done a: time-delta=3 mem-delta=186384