This source code is switching on a string in C. How does it do that?

I was reading through some emulator code and noticed something truly odd:

switch (reg){
case 'eax':
/* and so on*/
}

How is this possible? I thought you could only switch on integral types. Is there some macro trickery going on?


(Only you can answer the "macro trickery" part - unless you paste up more code. But there's not much here for macros to work on - formally you are not allowed to redefine keywords; the behaviour on doing that is undefined.)

To make the program more readable, the witty developer is exploiting implementation-defined behaviour. 'eax' is not a string but a multi-character constant. Note very carefully the single quotation marks around eax. Most likely your compiler gives you an int that is unique to that combination of characters. (Quite often each character occupies 8 bits of a 32-bit int.) And everyone knows you can switch on an int!
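A minimal sketch of the idea, completing the fragment from the question. The function names `reg_name` and `eax_value` are made up for illustration and are not taken from the emulator. It assumes a compiler that gives distinct values to 'eax', 'ebx' and 'ecx'; GCC, for example, documents that it builds the value one character at a time, which gives 0x656178 for 'eax' with 8-bit ASCII characters. Expect a "multi-character character constant" warning.

```c
#include <string.h>

/* Hypothetical helper: map a register tag to a printable name.
   Each case label is a multi-character constant of type int whose
   value is implementation-defined. */
const char *reg_name(int reg)
{
    switch (reg) {
    case 'eax': return "eax";
    case 'ebx': return "ebx";
    case 'ecx': return "ecx";
    default:    return "unknown";
    }
}

/* Returns whatever int this compiler assigns to 'eax'. */
int eax_value(void)
{
    return 'eax';
}
```

Printing `eax_value()` with `printf("%#x\n", ...)` is a quick way to see what your own compiler does with the constant.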

Finally, a standard reference:

The C99 standard says:

6.4.4.4p10: "The value of an integer character constant containing more than one character (e.g., 'ab'), or containing a character or escape sequence that does not map to a single-byte execution character, is implementation-defined."

According to the C Standard (6.8.4.2 The switch statement)

3 The expression of each case label shall be an integer constant expression...

and (6.6 Constant expressions)

6 An integer constant expression shall have integer type and shall only have operands that are integer constants, enumeration constants, character constants, sizeof expressions whose results are integer constants, and floating constants that are the immediate operands of casts. Cast operators in an integer constant expression shall only convert arithmetic types to integer types, except as part of an operand to the sizeof operator.

Now what is 'eax'?

The C Standard (6.4.4.4 Character constants)

2 An integer character constant is a sequence of one or more multibyte characters enclosed in single-quotes, as in 'x'...

So 'eax' is an integer character constant, and according to paragraph 10 of the same section

10 ...The value of an integer character constant containing more than one character (e.g., 'ab'), or containing a character or escape sequence that does not map to a single-byte execution character, is implementation-defined.

So according to the first mentioned quote it can be an operand of an integer constant expression that may be used as a case label.

Note that a character constant (enclosed in single quotes) has type int and is not the same as a string literal (a sequence of characters enclosed in double quotes), which has a character array type.
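The difference in type shows up directly in `sizeof`. This is a small sketch in C with made-up function names; in C++ a single-character constant such as 'x' has type char instead, so the first result would differ there.

```c
#include <stddef.h>

/* In C, a character constant has type int. */
size_t size_of_char_constant(void)
{
    return sizeof('x');
}

/* A string literal is a char array: 'e', 'a', 'x' and the terminating '\0'. */
size_t size_of_string_literal(void)
{
    return sizeof("eax");
}
```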

As others have said, this is an int constant and its actual value is implementation-defined.

I assume the rest of the code looks something like

if (SOMETHING)
reg='eax';
...
switch (reg){
case 'eax':
/* and so on*/
}

You can be sure that 'eax' in the first part has the same value as 'eax' in the second part, so it all works out, right? ... wrong.

In a comment @Davislor lists some possible values for 'eax':

... 0x65, 0x656178, 0x65617800, 0x786165, 0x6165, or something else

Notice the first potential value? That is just 'e', ignoring the other two characters. The problem is the program probably uses 'eax', 'ebx', and so on. If all these constants have the same value as 'e' you end up with

switch (reg){
case 'e':
...
case 'e':
...
...
}

This doesn't look too good, does it?
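To see where values like the ones listed above could come from, here is a rough sketch of two packing orders a compiler might choose. The helper names are hypothetical, the bytes 0x65, 0x61, 0x78 are the ASCII codes of e, a, x, and both helpers assume at most four characters. No particular compiler is claimed to work exactly this way.

```c
#include <stddef.h>
#include <stdint.h>

/* Earlier characters end up in the more significant bytes. */
uint32_t pack_first_char_high(const unsigned char *s, size_t n)
{
    uint32_t v = 0;
    for (size_t i = 0; i < n; i++)
        v = (v << 8) | s[i];
    return v;
}

/* Earlier characters end up in the less significant bytes. */
uint32_t pack_first_char_low(const unsigned char *s, size_t n)
{
    uint32_t v = 0;
    for (size_t i = 0; i < n; i++)
        v |= (uint32_t)s[i] << (8 * i);
    return v;
}
```

With the three bytes of "eax", the first helper yields 0x656178 and the second 0x786165. Keeping only the first character gives 0x65, shifting the first result left by one more byte gives 0x65617800, and truncating the second result to 16 bits gives 0x6165.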

The good part about "implementation-defined" is that the programmer can check the documentation of their compiler and see if it does something sensible with these constants. If it does, home free.

The bad part is that some other poor fellow can take the code and try to compile it using some other compiler. Instant compile error. The program is not portable.

As @zwol pointed out in the comments, the situation is not quite as bad as I thought: in the bad case the code doesn't compile. This will at least give you an exact file name and line number for the problem. Still, you will not have a working program.
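If portability matters, one possible alternative (not something the emulator code is known to do) is to pack the characters explicitly. The macro `REG3` and the function `reg_name_portable` below are hypothetical names. The result is still an integer constant expression, so it remains legal as a case label, and the packing order no longer depends on the compiler.

```c
#include <string.h>

/* Hypothetical helper: pack three character codes into one int,
   first character in the most significant of the three bytes. */
#define REG3(a, b, c)                        \
    (((int)(unsigned char)(a) << 16) |       \
     ((int)(unsigned char)(b) << 8)  |       \
      (int)(unsigned char)(c))

const char *reg_name_portable(int reg)
{
    switch (reg) {
    case REG3('e', 'a', 'x'): return "eax";
    case REG3('e', 'b', 'x'): return "ebx";
    case REG3('e', 'c', 'x'): return "ecx";
    default:                  return "unknown";
    }
}
```

The call sites become slightly noisier, `reg = REG3('e', 'a', 'x');` instead of `reg = 'eax';`, which is the readability the original author was presumably trying to keep.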

The code fragment uses a historical oddity called a multi-character character constant, also referred to as a multi-char.

'eax' is an integer constant whose value is implementation-defined.

Here is an interesting page on multi-chars and how they can be used but should not:

http://www.zipcon.net/~swhite/docs/computers/languages/c_multi-char_const.html


Looking further back in the rearview mirror, here is how the original C manual by Dennis Ritchie from the good old days ( https://www.bell-labs.com/usr/dmr/www/cman.pdf ) specified character constants.

2.3.2 Character constants

A character constant is 1 or 2 characters enclosed in single quotes ‘‘ ' ’’. Within a character constant a single quote must be preceded by a back-slash ‘‘\’’. Certain non-graphic characters, and ‘‘\’’ itself, may be escaped according to the following table:

    BS   \b
    NL   \n
    CR   \r
    HT   \t
    ddd  \ddd
    \    \\

The escape ‘‘\ddd’’ consists of the backslash followed by 1, 2, or 3 octal digits which are taken to specify the value of the desired character. A special case of this construction is ‘‘\0’’ (not followed by a digit) which indicates a null character.

Character constants behave exactly like integers (not, in particular, like objects of character type). In conformity with the addressing structure of the PDP-11, a character constant of length 1 has the code for the given character in the low-order byte and 0 in the high-order byte; a character constant of length 2 has the code for the first character in the low byte and that for the second character in the high-order byte. Character constants with more than one character are inherently machine-dependent and should be avoided.

The last phrase is all you need to remember about this curious construction: Character constants with more than one character are inherently machine-dependent and should be avoided.
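For the curious, the PDP-11 layout described in the manual excerpt can be written out as a short sketch. The function name is hypothetical, and 0x61 and 0x62 in the note below are the ASCII codes of a and b.

```c
/* Value of a one- or two-character constant under the rule quoted from
   the manual: first character in the low-order byte, second character
   (or 0 if there is none) in the high-order byte. */
unsigned pdp11_char_constant(unsigned char first, unsigned char second)
{
    return (unsigned)first | ((unsigned)second << 8);
}
```

Under this rule 'ab' would have the value 0x6261, and 'a' alone would be 0x61.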