Is it legal for source code containing undefined behavior to crash the compiler?

Suppose I compile some badly written C++ source code that invokes undefined behavior, and therefore (as they say) "anything can happen".

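From the perspective of what the C++ language specification deems acceptable in a "conformant" compiler, does "anything" in this scenario include the compiler crashing (or stealing my passwords, or otherwise misbehaving or erroring out at compile time), or is the scope of undefined behavior limited to what can happen when the resulting executable runs?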


The normative definition of undefined behavior is as follows:

[defns.undefined]

behavior for which this International Standard imposes no requirements

[ Note: Undefined behavior may be expected when this International Standard omits any explicit definition of behavior or when a program uses an erroneous construct or erroneous data. Permissible undefined behavior ranges from ignoring the situation completely with unpredictable results, to behaving during translation or program execution in a documented manner characteristic of the environment (with or without the issuance of a diagnostic message), to terminating a translation or execution (with the issuance of a diagnostic message). Many erroneous program constructs do not engender undefined behavior; they are required to be diagnosed. Evaluation of a constant expression never exhibits behavior explicitly specified as undefined.  — end note ]

While the note itself is not normative, it does describe a range of behaviors implementations are known to exhibit. So crashing the compiler (which is translation terminating abruptly) is legitimate according to that note. But really, as the normative text says, the standard doesn't place any bounds on either execution or translation. If an implementation steals your passwords, it's not a violation of any contract laid forth in the standard.

What does "legal" mean here? Anything that doesn't contradict the C standard or C++ standard is legal, according to these standards. If you execute a statement i = i++; and as a result dinosaurs take over the world, that doesn't contradict the standards. It does however contradict the laws of physics, so it's not going to happen :-)

If undefined behaviour crashes your compiler, that doesn't violate the C or C++ standard. It does however mean that the quality of the compiler could (and probably should) be improved.

In previous versions of the C standard, there were statements that were either errors or not, depending on undefined behaviour:

char* p = 1 / 0;

Assigning a constant 0 to a char* is allowed. Assigning a non-zero constant is not. Since the value of 1 / 0 is undefined behaviour, it is undefined behaviour whether the compiler should or should not accept this statement. (Nowadays, 1 / 0 no longer meets the definition of "integer constant expression".)

Most kinds of UB that we usually worry about, like NULL-deref or divide by zero, are runtime UB. Compiling a function that would cause runtime UB if executed must not cause the compiler to crash. Unless maybe it can prove that the function (and that path through the function) definitely will be executed by the program.

(2nd thoughts: maybe I haven't considered template / constexpr required evaluation at compile time. Possibly UB during that is allowed to cause arbitrary weirdness during translation even if the resulting function is never called.)
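As a minimal sketch of the constexpr case (the helper name divide is made up for illustration, it's not from any of the answers): when a constant expression is required, the evaluation is not allowed to exhibit core-language UB, so I'd expect a conforming compiler to reject the bad call with a diagnostic rather than do something arbitrary.

```cpp
// Hypothetical helper, used only to illustrate compile-time evaluation.
constexpr int divide(int a, int b) {
    return a / b;  // UB at run time if b == 0
}

// Fine: the evaluation is well-defined, so this is a constant expression.
static_assert(divide(6, 3) == 2, "evaluated during translation");

// Uncommenting the next line should make the program ill-formed: dividing by
// zero is UB, so divide(1, 0) is not a constant expression, and a constexpr
// variable requires one.
// constexpr int bad = divide(1, 0);
```

The same function called with run-time arguments is ordinary run-time code again, and divide(1, 0) there is plain runtime UB.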

The "behaving during translation" part of the ISO C++ quote in @StoryTeller's answer is similar to language used in the ISO C standard. C doesn't include templates or constexpr mandatory evaluation at compile time.

But fun fact: ISO C says in a note that if translation is terminated, it must be with a diagnostic message. Or "behaving during translation ... in a documented manner". I don't think "ignoring the situation completely" could be read as including stopping translation.


Old answer, written before I learned about translation-time UB. It's true for runtime-UB, though, and thus potentially still useful.


There's no such thing as UB that happens at compile time. It can be visible to the compiler along a certain path of execution, but in C++ terms it hasn't happened until execution reaches that path of execution through a function.

Defects in a program that make it impossible to even compile aren't UB, they're syntax errors. Such a program is "not well-formed" in C++ terminology (if I have my standardese correct). A program can be well-formed but contain UB. See also: "Difference between Undefined Behavior and Ill-formed, no diagnostic message required".

Unless I'm misunderstanding something, ISO C++ requires this program to compile and execute correctly, because execution never reaches the divide by zero. (In practice (Godbolt), good compilers just make working executables. gcc/clang warn about x / 0 but not this, even when optimizing. But anyway, we're trying to tell how low ISO C++ allows quality of implementation to be. So checking gcc/clang is hardly a useful test other than to confirm I wrote the program correctly.)

int cause_UB() {
    int x=0;
    return 1 / x;      // UB if ever reached.
    // Note I'm avoiding  x/0  in case that counts as translation time UB.
    // UB still obvious when optimizing across statements, though.
}

int main(){
    if (0)
        cause_UB();
}

A use-case for this might involve the C preprocessor, or constexpr variables and branching on those variables, which leads to nonsense in some paths that are never reached for those choices of constants.
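A sketch of that use-case, assuming a made-up configuration constant kUseBrokenPath and made-up function names (none of these are from a real code base): the broken path is visible to the compiler but never executed for this choice of constant, so as far as I can tell the rest must still work.

```cpp
// Hypothetical build-time configuration constant.
constexpr bool kUseBrokenPath = false;

int broken_path() {
    int zero = 0;
    return 1 / zero;  // UB if ever executed
}

int compute(int v) {
    if (kUseBrokenPath)
        return broken_path();  // never reached for this choice of constant
    return v * 2;              // the only path execution actually takes
}
```

With kUseBrokenPath set to false, compute(21) has to return 42. Flipping the constant to true makes every call to compute() hit the division by zero.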

Paths of execution that cause compile-time-visible UB can be assumed to be never taken, e.g. a compiler for x86 could emit a ud2 (cause illegal instruction exception) as the definition for cause_UB(). Or within a function, if one side of an if() leads to provable UB, the branch can be removed.

But the compiler still has to compile everything else in a sane and correct way. All paths that don't encounter (or can't be proved to encounter) UB must still be compiled to asm that executes as-if the C++ abstract machine was running it.


You could argue that unconditional compile-time-visible UB in main is an exception to this rule, as is any other case where the compiler can prove that execution starting at main does in fact reach guaranteed UB.

I'd still argue that legal compiler behaviours include producing a grenade that explodes if run. Or more plausibly, a definition of main that consists of a single illegal instruction. I'd argue that if you never run the program, there hasn't been any UB yet. The compiler itself isn't allowed to explode, IMO.


Functions containing possible or provable UB inside branches

UB along any given path of execution reaches backwards in time to "contaminate" all previous code. But in practice compilers can only take advantage of that rule when they can actually prove that paths of execution lead to compile-time-visible UB. e.g.

int minefield(int x) {
    if (x == 3) {
        *(char*)nullptr = x/0;
    }

    return x * 5;
}

The compiler has to make asm that works for all x other than 3, as long as x * 5 fits in an int: outside the range INT_MIN / 5 to INT_MAX / 5 the multiplication itself is signed-overflow UB. If this function is never called with x==3, the program of course contains no UB and must work as written.
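To make the set of "safe" inputs concrete, here's a small sketch with a hypothetical helper, minefield_is_defined (my name, not part of the function above), that says whether a call to minefield(x) would stay within defined behavior:

```cpp
#include <climits>

// Hypothetical helper: true when minefield(x) would not hit any UB.
bool minefield_is_defined(int x) {
    if (x == 3)
        return false;  // null deref and x/0 inside the branch
    // x * 5 fits in an int exactly when x lies in [INT_MIN / 5, INT_MAX / 5],
    // because integer division truncates toward zero.
    return x >= INT_MIN / 5 && x <= INT_MAX / 5;
}
```

For every x where this returns true, the compiled minefield() has to return x * 5. For everything else the standard imposes no requirements.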

We might as well have written if(x == 3) __builtin_unreachable(); in GNU C to tell the compiler that x is definitely not 3.
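Spelled out as a sketch (the function name is made up, and __builtin_unreachable() is a GNU extension, so this assumes GCC or Clang):

```cpp
// Requires GCC or Clang: __builtin_unreachable() is a GNU extension.
int times_five_not_three(int x) {
    if (x == 3)
        __builtin_unreachable();  // reaching this is UB, so x != 3 may be assumed
    return x * 5;
}
```

Calling it with anything other than 3 (and small enough not to overflow) just multiplies by 5. Calling it with 3 is UB, exactly like the minefield version.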

In practice there's "minefield" code all over the place in normal programs. e.g. any division by an integer promises the compiler that it's non-zero. Any pointer deref promises the compiler that it's non-NULL.
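For contrast, here's a rough sketch of what those implicit promises look like when written out as explicit checks. The helper names checked_div and checked_read are hypothetical, and this is only one way to do it:

```cpp
#include <climits>

// Hypothetical helper: divides only when a / b is well-defined.
bool checked_div(int a, int b, int& result) {
    if (b == 0)
        return false;  // division by zero would be UB
    if (a == INT_MIN && b == -1)
        return false;  // the quotient would overflow int, also UB
    result = a / b;
    return true;
}

// Hypothetical helper: dereferences only when the pointer is non-null.
bool checked_read(const char* p, char& result) {
    if (p == nullptr)
        return false;  // null dereference would be UB
    result = *p;
    return true;
}
```

Plain a / b and *p skip these checks, which is exactly the promise the compiler is allowed to rely on.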

The Standard would impose no requirements upon an implementation's behavior if it encounters #include "'foo'". If a compiler writer judges that it would be useful to process include directives of that form (containing the apostrophes within the file name) by running the indicated program with its output directed to a temporary file and then behaving as a #include of that file, then an attempt to process a program containing the above line could run program foo, with whatever consequences result.

Thus, there is in general no limit as to what might happen as a consequence of trying to translate a C program, even if one makes no effort to run it.