GDB corrupted stack frame - How to debug?

小开

Look at your other registers to see whether one of them holds a cached copy of the stack pointer. From there, you might be able to recover a stack. Also, if this is an embedded target, the stack is quite often placed at a fixed, known address, and you can sometimes use that to get a decent stack as well. This all assumes that when you jumped to hyperspace, your program didn't puke all over memory along the way...

小开

Assuming that the stack pointer is valid...

It may be impossible to know exactly where the SEGV occurs from the backtrace -- I think the first two stack frames are completely overwritten. 0xbffff284 seems like a valid address, but the next two aren't. For a closer look at the stack, you can try the following:

gdb$ x/32ga $rsp

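or a variant (replace the 32 with another count). That prints 32 words of giant (g, 8-byte) size starting at the stack pointer, formatted as addresses (a). On a 32-bit target, which is what an address like 0xbffff284 suggests, use x/32wa $esp instead (w is a 4-byte word). Type 'help x' for more info on the format letters.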
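To make the g versus w distinction concrete, here is a minimal sketch of how the same raw stack bytes turn into different words depending on the size letter. It assumes a little-endian target (true for x86); the function name read_words, the base address and the sample bytes are all made up for illustration.

```python
import struct

def read_words(memory: bytes, base_addr: int, count: int, word_size: int = 8):
    """Split raw bytes into (address, value) pairs.

    word_size=8 corresponds to gdb's 'g' size letter, word_size=4 to 'w'.
    Assumes a little-endian target such as x86.
    """
    fmt = {4: "<I", 8: "<Q"}[word_size]
    words = []
    for i in range(count):
        offset = i * word_size
        chunk = memory[offset:offset + word_size]
        if len(chunk) < word_size:
            break  # ran out of captured bytes
        (value,) = struct.unpack(fmt, chunk)
        words.append((base_addr + offset, value))
    return words

# Hypothetical 16 bytes captured from the top of a stack.
raw = bytes.fromhex("2d34000000000000" "84f2ffbf00000000")

for addr, value in read_words(raw, 0x7ffc0000, 2, word_size=8):
    print(f"{addr:#x}: {value:#018x}")
```

Reading the same bytes with word_size=4 gives four words instead of two, which is why picking the wrong size letter for your target makes the dump look like nonsense.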

Instrumenting your code with some sentinel printf calls may not be a bad idea in this case.

小开

Best answer

Those bogus addresses (0x00000002 and the like) are actually PC values, not SP values. Now, when you get this kind of SEGV, with a bogus (very small) PC address, 99% of the time it's due to calling through a bogus function pointer. Note that virtual calls in C++ are implemented via function pointers, so any problem with a virtual call can manifest in the same way.

An indirect call instruction just pushes the PC after the call onto the stack and then sets the PC to the target value (bogus in this case), so if this is what happened, you can easily undo it by manually popping the PC off the stack. In 32-bit x86 code you just do:

(gdb) set $pc = *(void **)$esp
(gdb) set $esp = $esp + 4

With 64-bit x86 code you need

(gdb) set $pc = *(void **)$rsp
(gdb) set $rsp = $rsp + 8

Then, you should be able to do a bt and figure out where the code really is.
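If it helps to see why those two gdb commands work, here is a toy model of the mechanism, not real debugger code. The class ToyCpu, the addresses and the instruction length are all hypothetical; the only thing taken from the answer is the push-then-jump behaviour of the call and the pop that reverses it.

```python
WORD = 8  # bytes per stack slot on 64-bit x86 (it would be 4 on 32-bit)

class ToyCpu:
    """Toy model with just a program counter, a stack pointer and a stack."""

    def __init__(self, pc: int, sp: int):
        self.pc = pc
        self.sp = sp
        self.stack = {}  # address -> word stored at that address

    def indirect_call(self, target: int, insn_len: int) -> None:
        # Push the address of the instruction after the call, then jump.
        return_addr = self.pc + insn_len
        self.sp -= WORD
        self.stack[self.sp] = return_addr
        self.pc = target

    def undo_call(self) -> None:
        # Same effect as: set $pc = *(void **)$rsp ; set $rsp = $rsp + 8
        self.pc = self.stack[self.sp]
        self.sp += WORD

cpu = ToyCpu(pc=0x401136, sp=0x7FFC1000)       # made-up addresses
cpu.indirect_call(target=0x2, insn_len=2)      # call through a bogus pointer
print(f"after call: pc={cpu.pc:#x} sp={cpu.sp:#x}")
cpu.undo_call()
print(f"after undo: pc={cpu.pc:#x} sp={cpu.sp:#x}")
```

After the undo, the PC points at the instruction right after the bad call and the stack pointer is back where the caller left it, which is the state a backtrace needs to find the caller's frame.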

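The other 1% of the time, the error will be due to overwriting the stack, usually by overflowing an array stored on the stack. In this case, you might be able to get more clarity on the situation by using a tool like valgrind. Be aware that valgrind's Memcheck does not bounds-check arrays on the stack, so a compiler-based checker such as AddressSanitizer (-fsanitize=address in GCC and Clang) is often the better fit for this particular kind of bug.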

小开

If the situation is fairly simple, Chris Dodd's answer is the best one. It does look like it jumped through a NULL pointer.

However, it is possible the program shot itself in the foot, knee, neck, and eye before crashing—overwrote the stack, messed up the frame pointer, and other evils. If so, then unraveling the hash is not likely to show you potatoes and meat.

A more efficient approach is to run the program under the debugger and step over functions until the program crashes. Once a crashing function is identified, start again, step into that function, and determine which of the functions it calls causes the crash. Repeat until you find the single offending line of code. 75% of the time, the fix will then be obvious.

In the other 25% of situations, the so-called offending line of code is a red herring. It will be reacting to (invalid) conditions set up many lines before, maybe thousands of lines before. If that is the case, the best course depends on many factors, mostly your understanding of the code and your experience with it:

Perhaps setting a debugger watchpoint or inserting diagnostic printfs on critical variables will lead to the necessary Aha!
Maybe changing test conditions with different inputs will provide more insight than debugging.
Maybe a second pair of eyes will force you to check your assumptions or gather overlooked evidence.
Sometimes, all it takes is going to dinner and thinking about the gathered evidence.

Good luck!

小开

If it's a stack overwrite, the values may well correspond to something recognisable from the program.

For example, I just found myself looking at the stack

(gdb) bt
#0  0x0000000000000000 in ?? ()
#1  0x000000000000342d in ?? ()
#2  0x0000000000000000 in ?? ()

and 0x342d is 13357, which turned out to be a node-id when I grepped the application logs for it. That immediately helped narrow down candidate sites where the stack overwrite might have occurred.
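A quick way to run that kind of check on every suspicious frame value is to print each one as decimal and as ASCII bytes and see if anything rings a bell. This is just a sketch; describe_word is a made-up helper, and it assumes a little-endian target such as x86.

```python
def describe_word(value: int, word_size: int = 8) -> dict:
    """Show one stack word as hex, decimal and ASCII (little-endian byte order)."""
    raw = value.to_bytes(word_size, "little")
    # Printable ASCII stays as-is; everything else becomes '.'
    text = "".join(chr(b) if 32 <= b < 127 else "." for b in raw)
    return {"hex": hex(value), "decimal": value, "ascii": text}

# The three frame values from the backtrace above.
for frame_value in (0x0, 0x342D, 0x0):
    info = describe_word(frame_value)
    print(f'{info["hex"]:>8}  decimal={info["decimal"]:<8}  ascii={info["ascii"]}')
```

The decimal column is what gave the node-id away here; the ASCII column is handy when the stack was overwritten by string or packet data.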

小开

Funny... we had the exact same thing going on with a driver in an old C app here. The top two addresses in the stack trace were actually data bytes being read in off the port. I only noticed because one of them looked familiar.