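Why does the first element outside of a defined array default to zero?

I'm preparing for the final exam of an introductory C++ course. Our professor gave us this practice problem:

Explain why the code produces the following output: 120 200 16 0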

using namespace std;
int main()
{
int x[] = {120, 200, 16};
for (int i = 0; i < 4; i++)
cout << x[i] << " ";
}

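The sample answer for the problem was:

The cout statement is simply cycling through the array elements whose subscript is being defined by the increment of the for loop. The element size is not defined by the array initialization. The for loop defines the size of the array, which happens to exceed the number of initialized elements, thereby defaulting to zero for the last element. The first for loop prints element 0 (120), the second prints element 1 (200), the third loop prints element 2 (16) and the fourth loop prints the default array value of zero since nothing is initialized for element 3. At this point i now exceeds the condition and the for loop is terminated.

I'm a bit confused as to why that last element outside of the array always "defaults" to zero. Just to experiment, I pasted the code from the problem into my IDE, but changed the for loop to for (int i = 0; i < 8; i++). The output then changed to 120 200 16 0 4196320 0 547306487 32655. Why is there no error when trying to access elements outside of the array's defined size? Does the program just output whatever "leftover" data was last saved to that memory address?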


It does not default to zero. The sample answer is wrong. Undefined behaviour is undefined; the value may be 0, it may be 100. Accessing it may cause a seg fault, or cause your computer to be formatted.

As to why it's not an error, it's because C++ is not required to do bounds checking on arrays. You could use a vector and use the at function, which throws exceptions if you go outside the bounds, but arrays do not.

I'm a bit confused as to why that last element outside of the array always "defaults" to zero.

In this declaration

int x[] = {120, 200, 16};

the array x has exactly three elements. So accessing memory outside the bounds of the array invokes undefined behavior.

That is, this loop

for (int i = 0; i < 4; i++)
cout << x[i] << " ";

invokes undefined behavior. The memory after the last element of the array can contain anything.

On the other hand, if the array were declared as

int x[4] = {120, 200, 16};

that is, with four elements, then the last element of the array, which does not have an explicit initializer, will indeed be initialized to zero.

It's causing undefined behaviour; this is the only valid answer. The compiler expects your array x to contain exactly three elements. What you see in the output when reading a fourth integer is unknown, and on some systems/processors the read may cause a hardware interrupt because the memory is not addressable (the system doesn't know how to access physical memory at such an address). The compiler might reserve memory for x on the stack, or might use registers (as it's very small). The fact that you get 0 is actually accidental. With AddressSanitizer in clang (the -fsanitize=address option) you can see this:

https://coliru.stacked-crooked.com/a/993d45532bdd4fc2

the short output is:

==9469==ERROR: AddressSanitizer: stack-buffer-overflow

You can investigate it even further, on compiler explorer, with un-optimized GCC: https://godbolt.org/z/8T74cr83z (includes asm and program output)
In that version, the output is 120 200 16 3 because GCC put i on the stack after the array.

You will see that gcc generates the following assembly for your array:

mov     DWORD PTR [rbp-16], 120    # array initializer
mov     DWORD PTR [rbp-12], 200
mov     DWORD PTR [rbp-8], 16
mov     DWORD PTR [rbp-4], 0       # i initializer

So there does appear to be a fourth element with value 0. But it's actually the initializer for i, and i has a different value by the time it's read in the loop. Compilers don't invent extra array elements; at best there will just be unused stack space after them.

Note the optimization level of this example: it's -O0, meaning minimal optimization for consistent debugging. That's why i is kept in memory instead of a call-preserved register. Start adding optimizations, let's say -O1, and you will get:

mov     DWORD PTR [rsp+4], 120
mov     DWORD PTR [rsp+8], 200
mov     DWORD PTR [rsp+12], 16

More optimizations may optimize your array away entirely, for example by unrolling the loop and just using immediate operands to set up calls to cout.operator<<. At that point the undefined behaviour would be fully visible to the compiler and it would have to come up with something to do. (Registers for the array elements would be plausible in other cases, if the array values were only ever accessed by a constant (after optimization) index.)

Correcting the answer

No, it doesn't default to 0. It's undefined behaviour. It just happened to be 0 under these conditions, at this optimization level, with this compiler. Trying to access uninitialized or unallocated memory is undefined behaviour.

Because it's literally "undefined" and the standard has nothing else to say about this, your assembly output is not going to be consistent. The compiler might store the array in an SIMD register, who knows what the output will be?

Quote from the sample answer:

and the fourth loop prints the default array value of zero since nothing is initialized for element 3

That's the most wrong statement ever. I guess there's a typo in the code and they wanted to make it

int x[4] = {120, 200, 16};

and mistakenly turned x[4] into just x[]. If not, and it was intentional, I don't know what to say. They're wrong.

Why isn't it an error?

It's not an error because that's how the stack works. Your application doesn't need to allocate memory on the stack to use it; it's already yours. You may do whatever you wish with your stack. When you declare a variable like this:

int a;

all you're doing is telling the compiler, "I want 4 bytes of my stack to be for a, please don't use that memory for anything else." at compile time. Look at this code:

#include <stdio.h>


int main() {
int a;
}

Assembly:

.file   "temp.c"
.text
.globl  main
.type   main, @function
main:
.LFB0:
.cfi_startproc
endbr64
pushq   %rbp
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
movq    %rsp, %rbp
.cfi_def_cfa_register 6 /* Init stack and stuff */
movl    $0, %eax
popq    %rbp
.cfi_def_cfa 7, 8
ret /* Pop the stack and return? Yes. It generated literally no code.
All this just makes a stack, pops it and returns. Nothing. */
.cfi_endproc /* Stuff after this is system info, and other stuff
we're not interested. */
.LFE0:
.size   main, .-main
.ident  "GCC: (Ubuntu 11.1.0-1ubuntu1~20.04) 11.1.0"
.section    .note.GNU-stack,"",@progbits
.section    .note.gnu.property,"a"
.align 8
.long   1f - 0f
.long   4f - 1f
.long   5
0:
.string "GNU"
1:
.align 8
.long   0xc0000002
.long   3f - 2f
2:
.long   0x3
3:
.align 8
4:

Read the comments in the code for explanation.

So, you can see that int a; does nothing. And if I turn on optimisations, the compiler won't even bother setting up a stack frame and doing all that stuff, and will instead return directly. int a; is just a compile-time instruction to the compiler that says:

a is a variable that is a signed int. It needs 4 bytes, please continue declarations after skipping these 4 bytes (and alignment).

Variables in high-level languages (those on the stack) only exist to make the "distribution" of the stack more systematic and readable. The declaration of a variable is not a run-time process. It just teaches the compiler how to distribute the stack among the variables and prepare the program accordingly. When executing, the program sets up its stack frame (that's a run-time process), but which variables get which part of it is already hardcoded. For example, variable a might get the 4 bytes starting at -4(%rbp) while b gets the 4 bytes starting at -8(%rbp). These offsets are determined at compile time. Names of variables also don't exist at run time; they're just a way to teach the compiler how to prepare the program to use its stack.

You, as the user, can use the stack as freely as you like, but you shouldn't. You should always declare the variable or the array to let the compiler know.

Bounds checking

In languages like Go, even though your stack is yours, the compiler will insert extra checks to make sure you're not using undeclared memory by accident. This is not done in C and C++ for performance reasons, and the lack of checks is why the dreaded undefined behaviour and segmentation faults occur more frequently there.

Heap and data section

The heap is where large data gets stored. No variables are stored here, only data, and one or more of your variables will contain pointers to that data. If you use memory that you haven't allocated (allocation is done at run time), you may get a segmentation fault, though even that is not guaranteed.

The data section is another place where things can be stored. Variables can be stored here. It's stored alongside your code, so exceeding an allocation there is quite dangerous, as you may accidentally modify something else that belongs to the program. Because it's stored with your code, it's also laid out at compile time. I don't actually know much about memory safety in the data section. Apparently, you can exceed an allocation without the OS complaining, but I know no more than that, as I'm no system hacker and have had no reason to experiment with it. Basically, I have no idea what happens when you exceed an allocation in the data section. I hope someone will comment (or answer) about it.

All assembly shown above in this answer was generated from C code by GCC 11.1 on an Ubuntu machine. It's C and not C++ to improve readability.

The element size is not defined by the array initialization. The for loop defines the size of the array, which happens to exceed the number of initialized elements, thereby defaulting to zero for the last element.

This is flat-out incorrect. From section 11.6.1p5 of the C++17 standard:

An array of unknown bound initialized with a brace-enclosed initializer-list containing n initializer-clauses, where n shall be greater than zero, is defined as having n elements (11.3.4). [ Example:

int x[] = { 1, 3, 5 };

declares and initializes x as a one-dimensional array that has three elements since no size was specified and there are three initializers. — end example ]

So for an array without an explicit size, the initializer defines the size of the array. The for loop reads past the end of the array, and doing so triggers undefined behavior.

The fact that 0 is printing for the non-existent 4th element is just a manifestation of undefined behavior. There's no guarantee that that value will be printed. In fact, when I run this program I get 3 for the last value when I compile with -O0 and 0 when compiling with -O1.