Understanding the behavior of C's preprocessor when a macro indirectly expands itself

While I was working on a big project full of macro tricks and wizardry, I stumbled upon a bug in which a macro was not expanding properly. The resulting output was "EXPAND(0)", but EXPAND was defined as "#define EXPAND(X) X", so clearly the output should have been "0".

"No problem", I thought to myself. "It's probably some silly mistake, there are some nasty macros here, after all, plenty of places to go wrong". As I thought that, I isolated the misbehaving macros into their own project, about 200 lines, and started working on a MWE to pinpoint the problem. 200 lines became 150, which in turn became 100, then 20, 10... To my absolute shock, this was my final MWE:

#define EXPAND(X) X
#define PARENTHESIS() ()
#define TEST() EXPAND(0)
   

EXPAND(TEST PARENTHESIS()) // EXPAND(0)

4 lines.

To add insult to injury, almost any modification to the macros will make them work correctly:

#define EXPAND(X) X
#define PARENTHESIS() ()
#define TEST() EXPAND(0)


// Manually replaced PARENTHESIS()
EXPAND(TEST ()) // 0
#define EXPAND(X) X
#define PARENTHESIS() ()
#define TEST() EXPAND(0)


// Manually replaced TEST()
EXPAND(EXPAND(0)) // 0
// Set EXPAND to 0 instead of X
#define EXPAND(X) 0
#define PARENTHESIS() ()
#define TEST() EXPAND(0)


EXPAND(TEST PARENTHESIS()) // 0

But most importantly, and most oddly, the code below fails in the exact same way:

#define EXPAND(X) X
#define PARENTHESIS() ()
#define TEST() EXPAND(0)
   

EXPAND(EXPAND(EXPAND(EXPAND(TEST PARENTHESIS())))) // EXPAND(0)

This means the preprocessor is perfectly capable of expanding EXPAND, but for some reason, it absolutely refuses to expand it again in the last step.

Now, how I'm going to solve this problem in my actual program is neither here nor there. Although a solution would be nice (i.e. a way to expand the token EXPAND(TEST PARENTHESIS()) to 0), the thing I'm most interested in is: why? Why did the C preprocessor come to the conclusion that "EXPAND(0)" was the correct expansion in the first case, but not in the other ones?

Although it's easy to find resources on what the C preprocessor does (and some magic that you can do with it), I've yet to find one that explains how it does it, and I want to take this opportunity to understand better how the preprocessor does its job and what rules it uses when expanding macros.

So in light of that: What is the reasoning behind the preprocessor's decision to expand the final macro to "EXPAND(0)" instead of "0"?


Edit: After reading Chris Dodd's very detailed, logical and well-put answer, I did what anybody would do in the same situation... try to come up with a counterexample :)

What I concocted was this different 4-liner:

#define EXPAND(X) X
#define GLUE(X,Y) X Y
#define MACRO() GLUE(A,B)


EXPAND(GLUE(MACRO, ())) // GLUE(A,B)

Now, knowing the fact that the C preprocessor is not Turing complete, there is no way the above will ever expand to A B. If that were the case, GLUE would expand MACRO and MACRO would expand GLUE. That would lead to the possibility of unlimited recursion, probably implying Turing Completeness for the Cpp. So sadly for the preprocessor wizards out there, the above macro not expanding is a guarantee.

It failing is not really the problem, the real problem is: Where? Where did the preprocessor decide to stop the expansion?

Analyzing the steps:

  • step 1 sees the macro EXPAND and scans in argument list GLUE(MACRO, ()) for X
  • step 2 recognizes GLUE(MACRO, ()) as a macro:
    • step 1 (nested) gets MACRO and () as arguments
    • step 2 scans them but finds no macro
    • step 3 inserts into the macro body yielding: MACRO ()
    • step 4 suppresses GLUE and scans MACRO () for macros, finding MACRO
      • step 1 (nested) gets an empty token sequence for the argument
      • step 2 scans that empty sequence and does nothing
      • step 3 inserts into the macro body GLUE(A,B)
      • step 4 scans GLUE(A,B) for macros, finding GLUE. It is suppressed, however, so it leaves as is.
  • so the final value for X after step 2 is GLUE(A,B) (notice that since we are not in step 4 of GLUE, in theory, it is not suppressed anymore)
  • step 3 inserts that into the body, giving GLUE(A,B)
  • step 4 suppresses EXPAND and scans GLUE(A,B) for more macros, finding GLUE (uuh)
    • step 1 gets A and B for the arguments (oh no)
    • step 2 does nothing with them
    • step 3 substitutes into the body giving A B (well...)
    • step 4 scans A B for macros, but finds nothing
  • the final result is then A B

Which would be our dream. Sadly, the macro expands to GLUE(A,B).

So our question is: Why?

2834 次浏览

After reading Chris Dodd's masterful answer and spending some time thinking, I think I've wrapped my head around the problem.

If you are using the C preprocessor like a normal and sane person, there's really no problem to avoid here. Just don't make macros that mention each other and you'll be fine. If you're dabbling with the dark arts however, you'll find that it's suprisingly easy to stumble upon the problem above. So here, I'm going to explain how to avoid it.

Note that I'm not going to explain why these happen here (Chris' answer already explains the why very well), but I'm going to lay out where they happen and how to solve them.


The problem may happen when your macro "call" has a postponed/indirect expansion. What I mean by "indirect" is an expansion that cannot be made right then and there, but only after some concatenation or substitution/replacement. To understand it, let's look at a safe example first:

#define EXPAND(MACRO) MACRO
#define MY_MACRO(A, B) A##B


EXPAND(MY_MACRO(1,2)) // 12

Here, MY_MACRO(1,2) is an immediate reference to a macro and has the proper arguments to be expanded as is. Therefore, the preprocessor will expand it immediately, before any replacement in EXPAND.

Now let's compare it to these examples:

#define EXPAND(MACRO, PAREN_ARGS) MACRO PAREN_ARGS
#define MY_MACRO(A, B) A##B


EXPAND(MY_MACRO, (1,2)) // 12
#define EXPAND(MACRO) MACRO (1,2)
#define MY_MACRO(A, B) A##B


EXPAND(MY_MACRO) // 12

Note that the arguments inside EXPAND ("MY_MACRO, (1,2)" and "MY_MACRO") do not "look like a macro". Even though we want them to completely expand by the end, they are not correctly formatted to be expanded right then and there in the argument. Although these work fine now, because there's no mutual reference, their expansion will be postponed to after the replacement in EXPAND, and that may pose a problem.

Whenever the macro in your argument cannot be expanded as it is, it can't contain the "callee" at any point in its expansion.

We can show this by adding mutual reference to the examples above:

#define EXPAND(MACRO, PAREN_ARGS) MACRO PAREN_ARGS
// Now MY_MACRO references EXPAND
#define MY_MACRO(A, B) EXPAND(A,B)
// Fails to expand
EXPAND(MY_MACRO, (1,2)) // EXPAND(1,2)
^       ^
|       |
|    This guy is postponed...
|
...so it cannot eventually expand to this guy.
#define EXPAND(MACRO) MACRO (1,2)
// Now MY_MACRO references EXPAND
#define MY_MACRO(A, B) EXPAND(A)
// Fails to expand
EXPAND(MY_MACRO) // EXPAND(1)
^       ^
|       |
|    This guy is postponed...
|
...so it cannot eventually expand to this guy.

Our example without the indirect expansion on the other hand, is completely safe and evaluates as expected:

#define EXPAND(MACRO) MACRO
// Now MY_MACRO references EXPAND
#define MY_MACRO(A, B) EXPAND(A)
// Succeeds
EXPAND(MY_MACRO(1,2)) // 1
^       ^
|       |
|    This guy can expand right here...
|
...so it can freely expand to this guy.

Since MY_MACRO(1,2) is an actual macro, it can be evaluated right then and there, and no problems occur.

If my judgement is correct, you don't have to worry about any previous "callees", only the one that directly has the "postponed macro" as an argument. Neither do you have to worry about the other arguments, as the preprecessor will also try to completely expand them before replacement as well. Furthermore, this does not pose a problem if the complete expansion of the macro is not required.

However, if you want the macro in your argument to expand completely and its expansion is going to be indirect, double check it.

Macro expansion is a complex process that is really only understandable by understanding the steps that occur.

  1. When a macro with arguments is recognized (macro name token followed by ( token), the following tokens up to the matching ) are scanned and split (on , tokens). No macro expansion happens while this is happening (so the ,s and ) must be present in the input stream directly and cannot be in other macros).

  2. Each macro argument whose name appears in the macro body not preceeded by # or ## or followed by ## is "prescanned" for macros to expand -- any macros entirely within the argument will be recursively expanded before substituting into the macro body.

  3. The resulting macro argument token streams are substituted into the body of the macro. Arguments involved in # or ## operations are modified (stringized or pasted) and substituted based on the original parser tokens from step 1 (step 2 does not occur for these).

  4. The resulting macro body token stream is scanned again for macros to expand, but ignoring the macro currently being expanded1. At this point further tokens in the input (after what was scanned and parsed in step 1) may be included as part of any macros recognized.

The important thing is that there are TWO DIFFERENT recursive expansions that occur (step 2 and step 4 above) and ONLY the one in step 4 ignores recursive macro expansions of the same macro. The recursive expansion in step 2 DOES NOT ignore the current macro, so can expand it recursively.

So for your example above, lets see what happens. For the input

EXPAND(TEST PARENTHESIS())
  • step 1 sees the macro EXPAND and scans in argument list TEST PARENTHESIS() for X
  • step 2 does not recognize TEST as a macro (no following (), but does recognize PARENTHESIS:
    • step 1 (nested) gets an empty token sequence for the argument
    • step 2 scans that empty sequence and does nothing
    • step 3 inserts into the macro body () yielding just that: ()
    • step 4 scans () for macros and doesn't find any
  • so the final value for X after step 2 is TEST ()
  • step 3 inserts that into the body, giving TEST ()
  • step 4 suppresses EXPAND and scans the result of step 3 for more macros, finding TEST
    • step 1 gets an empty sequence for the argument
    • step 2 does nothing
    • step 3 substitutes into the body giving EXPAND(0)
    • step 4 recursive expands that, suppressing TEST. At this point, both EXPAND and TEST are suppressed (due to being in the step 4 expansion), so nothing happens

Your other example EXPAND(TEST()) is different

  • step 1 EXPAND is recognized as a macro, and TEST() is parsed as the argument X
  • step 2, this stream is recursively parsed. Note that since this is step 2, EXPAND is NOT SUPPRESSED
    • step 1 TEST is recognized as a macro with an empty sequence argument
    • step 2 -- nothing (no macros in an empty token sequence)
    • step 3, substituted into the body giving EXPAND(0)
    • step 4, TEST is suppressed and the result recursively expanded
      • step 1, EXPAND is recognized as a macro (remember, at this point only TEST is suppressed by step 4 recursion -- EXPAND is in the step 2 recursion so is not suppressed) with 0 as its argument
      • step 2, 0 is scanned and nothing happens to it
      • step 3, substitute into the body giving 0
      • step 4, 0 is scanned again for macros (and again nothing happens)
  • step 3, the 0 is substituted as the argument X into the body of the first EXPAND
  • step 4, 0 is scanned again for macros (and again nothing happens)

so the final result here is 0


1Exactly how this happens appears to vary between implementations. Some implementations will merely not expand the macro at this point, but might expand it later if a subsequent rescan for another macro sees it again. Other implementations will 'mark' the token internally so that they don't expand the macro even with subsequent rescans. Either method can be seen as conforming to the spec, which is unclear on this detail

For the purposes of this situation, there are three relevant steps in macro replacement:

  1. Perform macro replacement on the arguments.
  2. Replace the macro with its definition, with parameters replaced with arguments.
  3. Rescan the result for further replacement while suppressing the replaced macro name.

In EXPAND(TEST PARENTHESIS()):

  • Step 1, macro replacement is performed on the argument to EXPAND, TEST PARENTHESIS():
    • TEST is not followed by parentheses, so it is not interpreted as a macro invocation.
    • PARENTHESIS() is a macro invocation, so the three steps are performed: The arguments are empty, so there is no processing for them. Then PARENTHESIS() is replaced by (). Then () is rescanned and no macros are found.
    • Step 1 is done, and we have EXPAND(TEST ()). (TEST () is not rescanned because it was not the result of any macro replacement.)
  • Step 2, EXPAND(TEST ()) is replaced with TEST ().
  • Step 3, TEST () is rescanned while suppressing EXPAND:
    • Step 1, the arguments are empty, so there is no processing for them.
    • Step 2, TEST () is replaced by EXPAND(0).
    • Step 3, EXPAND(0) is rescanned, but EXPAND is suppressed.

In EXPAND(TEST ()):

  • Step 1, macro replacement is performed on the argument to EXPAND:
    • Step 1, the arguments to TEST are empty, so there is no processing.
    • Step 2, TEST () is replaced by EXPAND(0).
    • Step 3, this replacement is rescanned, and EXPAND(0) is replaced by 0.
  • Step 2, EXPAND(TEST ()) has become EXPAND(0), and EXPAND(0) is replaced by 0.
  • Step 3, 0 is rescanned for further macros, but there are none.

The other examples in the question follow similarly. It comes down to:

  • In TEST PARENTHESIS(), the lack of parentheses after TEST results in it not being expanded while processing arguments to an enclosing macro invocation.
  • Parentheses are put after it when PARENTHESIS is expanded, but this is after TEST was scanned, and it is not rescanned during processing of the argument.
  • After the enclosing macro is replaced, TEST is rescanned and is replaced then, but, at this time, the enclosing macro’s name is suppressed.