How to fix GCC compilation errors when compiling > 2 GB of code?

I have a huge number of functions totaling about 2.8 GB of object code (unfortunately there's no way around it, scientific computing ...).

When I try to link them I get the (expected) relocation truncated to fit: R_X86_64_32S errors, which I hoped to circumvent by specifying the compiler flag -mcmodel=medium. In addition, all libraries that I control and link against are compiled with the -fpic flag.

Nevertheless, the errors persist, and I assume that some of the libraries I link against were not compiled with PIC.

The errors look like this:

/usr/lib/gcc/x86_64-redhat-linux/4.1.2/../../../../lib64/crt1.o: In function `_start':
(.text+0x12): relocation truncated to fit: R_X86_64_32S against symbol `__libc_csu_fini'     defined in .text section in /usr/lib64/libc_nonshared.a(elf-init.oS)
/usr/lib/gcc/x86_64-redhat-linux/4.1.2/../../../../lib64/crt1.o: In function `_start':
(.text+0x19): relocation truncated to fit: R_X86_64_32S against symbol `__libc_csu_init'    defined in .text section in /usr/lib64/libc_nonshared.a(elf-init.oS)
/usr/lib/gcc/x86_64-redhat-linux/4.1.2/../../../../lib64/crt1.o: In function `_start':
(.text+0x20): undefined reference to `main'
/usr/lib/gcc/x86_64-redhat-linux/4.1.2/../../../../lib64/crti.o: In function    `call_gmon_start':
(.text+0x7): relocation truncated to fit: R_X86_64_GOTPCREL against undefined symbol      `__gmon_start__'
/usr/lib/gcc/x86_64-redhat-linux/4.1.2/crtbegin.o: In function `__do_global_dtors_aux':
crtstuff.c:(.text+0xb): relocation truncated to fit: R_X86_64_PC32 against `.bss'
crtstuff.c:(.text+0x13): relocation truncated to fit: R_X86_64_32 against symbol `__DTOR_END__' defined in .dtors section in /usr/lib/gcc/x86_64-redhat-linux/4.1.2/crtend.o
crtstuff.c:(.text+0x19): relocation truncated to fit: R_X86_64_32S against `.dtors'
crtstuff.c:(.text+0x28): relocation truncated to fit: R_X86_64_PC32 against `.bss'
crtstuff.c:(.text+0x38): relocation truncated to fit: R_X86_64_PC32 against `.bss'
crtstuff.c:(.text+0x3f): relocation truncated to fit: R_X86_64_32S against `.dtors'
crtstuff.c:(.text+0x46): relocation truncated to fit: R_X86_64_PC32 against `.bss'
crtstuff.c:(.text+0x51): additional relocation overflows omitted from the output
collect2: ld returned 1 exit status
make: *** [testsme] Error 1

And the system libraries I link against:

-lgfortran -lm -lrt -lpthread

Where can I look for clues about the problem?

EDIT:

First of all, thank you for the discussion.

To clarify a bit: I have hundreds of functions (each about 1 MB in size, placed in separate object files) that look like this:

double func1(std::tr1::unordered_map<int, double> & csc,
std::vector<EvaluationNode::Ptr> & ti,
ProcessVars & s)
{
double sum, prefactor, expr;


prefactor = +s.ds8*s.ds10*ti[0]->value();
expr =       ( - 5/243.*(s.x14*s.x15*csc[49300] + 9/10.*s.x14*s.x15*csc[49301] +
1/10.*s.x14*s.x15*csc[49302] - 3/5.*s.x14*s.x15*csc[49303] -
27/10.*s.x14*s.x15*csc[49304] + 12/5.*s.x14*s.x15*csc[49305] -
3/10.*s.x14*s.x15*csc[49306] - 4/5.*s.x14*s.x15*csc[49307] +
21/10.*s.x14*s.x15*csc[49308] + 1/10.*s.x14*s.x15*csc[49309] -
s.x14*s.x15*csc[51370] - 9/10.*s.x14*s.x15*csc[51371] -
1/10.*s.x14*s.x15*csc[51372] + 3/5.*s.x14*s.x15*csc[51373] +
27/10.*s.x14*s.x15*csc[51374] - 12/5.*s.x14*s.x15*csc[51375] +
3/10.*s.x14*s.x15*csc[51376] + 4/5.*s.x14*s.x15*csc[51377] -
21/10.*s.x14*s.x15*csc[51378] - 1/10.*s.x14*s.x15*csc[51379] -
2*s.x14*s.x15*csc[55100] - 9/5.*s.x14*s.x15*csc[55101] -
1/5.*s.x14*s.x15*csc[55102] + 6/5.*s.x14*s.x15*csc[55103] +
27/5.*s.x14*s.x15*csc[55104] - 24/5.*s.x14*s.x15*csc[55105] +
3/5.*s.x14*s.x15*csc[55106] + 8/5.*s.x14*s.x15*csc[55107] -
21/5.*s.x14*s.x15*csc[55108] - 1/5.*s.x14*s.x15*csc[55109] -
2*s.x14*s.x15*csc[55170] - 9/5.*s.x14*s.x15*csc[55171] -
1/5.*s.x14*s.x15*csc[55172] + 6/5.*s.x14*s.x15*csc[55173] +
27/5.*s.x14*s.x15*csc[55174] - 24/5.*s.x14*s.x15*csc[55175] +
// ...
;


sum += prefactor*expr;
// ...
return sum;
}

The object s is relatively small and keeps the needed constants x14, x15, ..., ds0, ..., etc., while ti just returns a double from an external library. As you can see, csc[] is a precomputed map of values, which is also evaluated in separate object files (again, hundreds of them, about 1 MB each) of the following form:

void cscs132(std::tr1::unordered_map<int,double> & csc, ProcessVars & s)
{
{
double csc19295 =       + s.ds0*s.ds1*s.ds2 * ( -
32*s.x12pow2*s.x15*s.x34*s.mbpow2*s.mWpowinv2 -
32*s.x12pow2*s.x15*s.x35*s.mbpow2*s.mWpowinv2 -
32*s.x12pow2*s.x15*s.x35*s.x45*s.mWpowinv2 -
32*s.x12pow2*s.x25*s.x34*s.mbpow2*s.mWpowinv2 -
32*s.x12pow2*s.x25*s.x35*s.mbpow2*s.mWpowinv2 -
32*s.x12pow2*s.x25*s.x35*s.x45*s.mWpowinv2 +
32*s.x12pow2*s.x34*s.mbpow4*s.mWpowinv2 +
32*s.x12pow2*s.x34*s.x35*s.mbpow2*s.mWpowinv2 +
32*s.x12pow2*s.x34*s.x45*s.mbpow2*s.mWpowinv2 +
32*s.x12pow2*s.x35*s.mbpow4*s.mWpowinv2 +
32*s.x12pow2*s.x35pow2*s.mbpow2*s.mWpowinv2 +
32*s.x12pow2*s.x35pow2*s.x45*s.mWpowinv2 +
64*s.x12pow2*s.x35*s.x45*s.mbpow2*s.mWpowinv2 +
32*s.x12pow2*s.x35*s.x45pow2*s.mWpowinv2 -
64*s.x12*s.p1p3*s.x15*s.mbpow4*s.mWpowinv2 +
64*s.x12*s.p1p3*s.x15pow2*s.mbpow2*s.mWpowinv2 +
96*s.x12*s.p1p3*s.x15*s.x25*s.mbpow2*s.mWpowinv2 -
64*s.x12*s.p1p3*s.x15*s.x35*s.mbpow2*s.mWpowinv2 -
64*s.x12*s.p1p3*s.x15*s.x45*s.mbpow2*s.mWpowinv2 -
32*s.x12*s.p1p3*s.x25*s.mbpow4*s.mWpowinv2 +
32*s.x12*s.p1p3*s.x25pow2*s.mbpow2*s.mWpowinv2 -
32*s.x12*s.p1p3*s.x25*s.x35*s.mbpow2*s.mWpowinv2 -
32*s.x12*s.p1p3*s.x25*s.x45*s.mbpow2*s.mWpowinv2 -
32*s.x12*s.p1p3*s.x45*s.mbpow2 +
64*s.x12*s.x14*s.x15pow2*s.x35*s.mWpowinv2 +
96*s.x12*s.x14*s.x15*s.x25*s.x35*s.mWpowinv2 +
32*s.x12*s.x14*s.x15*s.x34*s.mbpow2*s.mWpowinv2 -
32*s.x12*s.x14*s.x15*s.x35*s.mbpow2*s.mWpowinv2 -
64*s.x12*s.x14*s.x15*s.x35pow2*s.mWpowinv2 -
32*s.x12*s.x14*s.x15*s.x35*s.x45*s.mWpowinv2 +
32*s.x12*s.x14*s.x25pow2*s.x35*s.mWpowinv2 +
32*s.x12*s.x14*s.x25*s.x34*s.mbpow2*s.mWpowinv2 -
32*s.x12*s.x14*s.x25*s.x35pow2*s.mWpowinv2 -
// ...
    

csc.insert(cscMap::value_type(192953, csc19295));
}


{
double csc19296 =      // ... ;


csc.insert(cscMap::value_type(192956, csc19296));
}


// ...
}

That's about it. The last step then just consists of calling all those func[i] and summing up the results.

Regarding the remark that this is a rather special and unusual case: yes, it is. This is what people have to cope with when trying to do high-precision calculations in particle physics.

EDIT 2:

I should also add that x12, x13, etc. are not really constants. They are set to specific values, all those functions are run and the result is returned, then a new set of x12, x13, ... is chosen to produce the next value. And this has to be done 10^5 to 10^6 times ...

EDIT 3:

Thank you for the suggestions and discussion so far ... I will try to roll the code generation back up into loops somehow; to be honest I'm not sure how exactly, but this seems like the best bet.

By the way, I'm not trying to hide behind "this is scientific computing -- there is no way to optimize it".
It's just that the basis of this code comes from a "black box" that I have no real access to; moreover, the whole thing worked fine for simple examples, and I'm mainly overwhelmed by what happens with real-world input ...

EDIT 4:

So far, I have managed to reduce the code size of the csc definitions by about a quarter by simplifying the expressions in a computer algebra system (Mathematica). I also now see some ways to reduce it by another order of magnitude or so by applying a few other tricks before generating the code (which would bring this part down to about 100 MB), and I hope this idea works out.

Now to your answers:

I'm trying to roll the code in the funcs back up into loops again; the CAS doesn't help much here, but I already have some ideas. For instance, sorting the expressions by the variables (x12, x13, ...), parsing the cscs with Python, and generating tables that relate them to one another. Then I could at least generate those parts as loops. Since this seems to be the best solution so far, I'm marking it as the best answer.

However, I would also like to credit VJo. GCC 4.6 does indeed work much better, produces smaller code, and is faster. Using the large model works with the code as-is. So technically this is the correct answer, but changing the whole approach is the better way to go.

Thank you all for your suggestions and help. If anyone is interested, I will post the final outcome once I'm done.

REMARKS:

Just some remarks on the other answers: the code I'm trying to run does not originate from an expansion of simple functions/algorithms and pointless unnecessary unrolling. What actually happens is that the things we start from are pretty complicated mathematical objects, and bringing them into a numerically computable form generates these expressions. The problem really lies in the underlying physical theory. The complexity of the intermediate expressions is well known to grow factorially, but when all of this is combined into something physically measurable -- an observable -- it boils down to only a handful of very small functions that form the basis of the expressions. (In this respect there is definitely something "wrong" with the general, and only available, approach, known as "perturbation theory".) We are trying to take this approach to another level, which is no longer feasible analytically and where the basis of needed functions is not known. So we try to brute-force it like this. Not the best way, but hopefully one that helps us understand the physics at hand in the end.

LAST EDIT:

Thanks to all your suggestions, I have managed to reduce the code size considerably, using Mathematica and a modification of the code generator for the funcs, roughly along the lines of the answer above :)

I simplified the csc functions with Mathematica, getting them down to 92 MB. This is the irreducible part. The first attempts took forever, but after some optimizations this now runs through in about 10 minutes on a single CPU.

The effect on the funcs was dramatic: their entire code size is down to roughly 9 MB, so the total code is now in the 100 MB range. It now makes sense to turn on optimizations, and execution is quite fast.

Thanks again to everyone for the suggestions; I have learned a lot.


The x86-64 ABI used by Linux defines a "large model" specifically to avoid such size limitations; it includes 64-bit relocation types for the GOT and PLT. (See the table in section 4.4.2, and the instruction sequences in 3.5.5 which show how they are used.)

Since your functions are occupying 2.8 GB, you are out of luck, because gcc doesn't support large models. What you can do is reorganize your code in such a way that allows you to split it into shared libraries which you would dynamically link.

If that is not possible, then, as someone suggested, instead of putting your data into code (compiling and linking it), since it is huge, you can load it at run time (either as a normal file, or you can mmap it).
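A minimal sketch of the mmap route (the file name "csc_table.bin" and its layout, a flat array of doubles, are assumptions for illustration only):

// Minimal sketch: map a precomputed table of doubles instead of compiling it in.
// The file name and layout are illustrative assumptions, not part of the question.
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <cstdio>
#include <cstddef>

int main()
{
    int fd = open("csc_table.bin", O_RDONLY);
    if (fd < 0) { std::perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) < 0) { std::perror("fstat"); return 1; }

    void *addr = mmap(0, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (addr == MAP_FAILED) { std::perror("mmap"); return 1; }

    const double *values = static_cast<const double *>(addr);
    std::size_t n = st.st_size / sizeof(double);

    double sum = 0.0;
    for (std::size_t i = 0; i < n; ++i)   // use the table as if it had been compiled in
        sum += values[i];
    std::printf("%lu values, sum = %g\n", static_cast<unsigned long>(n), sum);

    munmap(addr, st.st_size);
    close(fd);
    return 0;
}

The advantage of mmap over reading the file is that the kernel pages the data in on demand and can share it between processes, which matters when the table is this large.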

EDIT

It seems the large model is supported by gcc 4.6 (see this page). You can try that, but the advice above about reorganizing your code still applies.

If I read your errors correctly, what pushes you over the limit is the initialized data section (if it were the code, you would have far more errors, IMHO). Do you have big arrays of global data? If so, I'd restructure the program so that they are allocated dynamically. If the data is initialized, I'd read it from a configuration file.

BTW seeing this:

(.text+0x20): undefined reference to `main'

I think you have another problem.

The error occurs because you have too much CODE, not data! This is indicated, for example, by __libc_csu_fini (which is a function) being referenced from _start and the relocation being truncated to fit. This means that _start (the program's true entry point) is trying to call that function via a SIGNED 32-bit offset, which only has a range of ±2 GB. Since the total amount of your object code is ~2.8 GB, the facts check out.

If you could redesign your data structures, much of your code could be "compressed" by rewriting the huge expressions as simple loops.

Also, you could compute csc[] in a different program, store the results in a file, and just load them when necessary.
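A minimal sketch of that idea (the file name and record layout are mine, purely for illustration; cscMap is the std::tr1::unordered_map<int, double> typedef that already appears in your code): the program that computes csc dumps (index, value) pairs to a binary file, and the main program reads them back into the map.

// Sketch: dump csc as raw (index, value) records in one program, reload in another.
// The file format is an illustrative assumption, not part of the original code.
#include <tr1/unordered_map>
#include <cstdio>

typedef std::tr1::unordered_map<int, double> cscMap;

// In the program that computes csc:
void saveCsc(const cscMap &csc, const char *path)
{
    std::FILE *f = std::fopen(path, "wb");
    if (!f) return;
    for (cscMap::const_iterator it = csc.begin(); it != csc.end(); ++it)
    {
        int    key   = it->first;
        double value = it->second;
        std::fwrite(&key,   sizeof key,   1, f);
        std::fwrite(&value, sizeof value, 1, f);
    }
    std::fclose(f);
}

// In the program that evaluates the funcs:
void loadCsc(cscMap &csc, const char *path)
{
    std::FILE *f = std::fopen(path, "rb");
    if (!f) return;
    int    key;
    double value;
    while (std::fread(&key,   sizeof key,   1, f) == 1 &&
           std::fread(&value, sizeof value, 1, f) == 1)
    {
        csc[key] = value;
    }
    std::fclose(f);
}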

With a program of that size, cache misses for code are very likely to exceed the costs of looping at runtime. I would recommend that you go back to your code generator and have it generate some compact representation of what it wants evaluated (i.e., one likely to fit in the D-cache), then execute that with an interpreter in your program. You could also see if you can factor out smaller kernels that still have a significant number of operations, then use those as 'instructions' in the interpreted code.
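For example (the names and tables below are my own invention, just to illustrate the kernel idea): a large fraction of the terms shown in the question have the shape coefficient * s.a * s.b * csc[index], so one small reusable kernel plus a table of (coefficient, index) pairs could replace an arbitrarily long run of such terms, turning code into data.

// Illustrative kernel: evaluates  a * b * sum_i( w[i] * csc[idx[i]] ),
// which covers runs of terms like  9/10.*s.x14*s.x15*csc[49301].
// The struct and function names are invented; the generator would emit the tables.
#include <tr1/unordered_map>
#include <cstddef>

struct WeightedRef {
    double coeff;   // e.g.  9/10.
    int    index;   // e.g.  49301
};

double weightedSumKernel(double a, double b,
                         std::tr1::unordered_map<int, double> &csc,
                         const WeightedRef *terms, std::size_t n)
{
    double acc = 0.0;
    for (std::size_t i = 0; i < n; ++i)
        acc += terms[i].coeff * csc[terms[i].index];
    return a * b * acc;   // the common factor (e.g. s.x14*s.x15) pulled out of the run
}

The table of WeightedRef entries is data rather than instructions, so it lives in the data segment or in a file instead of the instruction stream, which is exactly what keeps the code size down.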

So, you already have a program that produces this text:

prefactor = +s.ds8*s.ds10*ti[0]->value();
expr = ( - 5/243.*(s.x14*s.x15*csc[49300] + 9/10.*s.x14*s.x15*csc[49301] +
1/10.*s.x14*s.x15*csc[49302] - 3/5.*s.x14*s.x15*csc[49303] -...

and

double csc19295 =       + s.ds0*s.ds1*s.ds2 * ( -
32*s.x12pow2*s.x15*s.x34*s.mbpow2*s.mWpowinv2 -
32*s.x12pow2*s.x15*s.x35*s.mbpow2*s.mWpowinv2 -
32*s.x12pow2*s.x15*s.x35*s.x45*s.mWpowinv2 -...

right?

If all your functions have a similar "format" (multiply n numbers m times and add the results - or something similar) then I think you can do this:

  • change the generator program to output offsets instead of strings (i.e. instead of the string "s.ds0" it will produce offsetof(ProcessVars, ds0))
  • create an array of such offsets
  • write an evaluator which accepts the array above and the base addresses of the structure pointers and produces a result

The array + evaluator will represent the same logic as one of your functions, but only the evaluator will be code. The array is "data" and can either be generated at runtime or saved on disk and read in chunks or from a memory-mapped file.

For your particular example in func1, imagine how you would rewrite the function via an evaluator if you had access to the base addresses of s and csc and also a vector-like representation of the constants and of the offsets you need to add to the base addresses to get to x14, ds8 and csc[51370].
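A rough sketch of such an evaluator, under the assumption that the relevant members of ProcessVars are plain doubles (all names below are invented for illustration, ProcessVars is a cut-down stand-in, and the real generator would have to emit the Term tables):

// Sketch of the offset-table idea: each term of a generated expression becomes a
// data record (coefficient, two member offsets into ProcessVars, one csc index),
// and a single loop evaluates any number of such records.
#include <tr1/unordered_map>
#include <cstddef>
#include <vector>

struct ProcessVars { double x14, x15, ds8, ds10; /* ... many more members ... */ };

struct Term {
    double      coeff;     // e.g.  -5/243. * 9/10.
    std::size_t off1;      // e.g.  offsetof(ProcessVars, x14)
    std::size_t off2;      // e.g.  offsetof(ProcessVars, x15)
    int         cscIndex;  // e.g.  49301
};

inline double memberAt(const ProcessVars &s, std::size_t off)
{
    return *reinterpret_cast<const double *>(
               reinterpret_cast<const char *>(&s) + off);
}

// One generated func collapses to a table of Terms plus this single loop:
double evaluate(const std::vector<Term> &terms,
                std::tr1::unordered_map<int, double> &csc,
                const ProcessVars &s)
{
    double sum = 0.0;
    for (std::size_t i = 0; i < terms.size(); ++i)
    {
        const Term &t = terms[i];
        sum += t.coeff * memberAt(s, t.off1) * memberAt(s, t.off2) * csc[t.cscIndex];
    }
    return sum;
}

// The generator would then emit entries such as:
//   Term t = { 9/10., offsetof(ProcessVars, x14), offsetof(ProcessVars, x15), 49301 };

Whether the Term tables live in static arrays, on disk, or in a memory-mapped file is then a separate choice, as described above.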

You need to create a new form of "data" that will describe how to process the actual data you pass to your huge number of functions.

A couple of suggestions: optimize for size (-Os). Turn your inline function calls into normal function calls. Enable string pooling.

Try splitting things into different DLLs (shared objects, .so for Linux, .dylib for Mac OS X). Make sure they can be unloaded. Then implement something to load things on demand and free them when not needed.
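A minimal sketch of the load-on-demand part using dlopen (the library name libfuncs_batch1.so and the exported symbol func_batch1 are made up; each generated shared object would export one batch of funcs, and the binary needs to be linked with -ldl):

// Sketch: load one chunk of generated code, call into it, then unload it
// to free the address space for the next chunk. All names are illustrative.
// (The generated .so would need to export the symbol with extern "C",
//  or you would have to dlsym the mangled name.)
#include <dlfcn.h>
#include <cstdio>

typedef double (*batch_fn)();   // the real signature would match the generated funcs

int main()
{
    void *handle = dlopen("./libfuncs_batch1.so", RTLD_NOW | RTLD_LOCAL);
    if (!handle) { std::fprintf(stderr, "dlopen: %s\n", dlerror()); return 1; }

    batch_fn f = reinterpret_cast<batch_fn>(dlsym(handle, "func_batch1"));
    if (!f) { std::fprintf(stderr, "dlsym: %s\n", dlerror()); return 1; }

    double partial = f();            // evaluate this batch
    std::printf("partial = %g\n", partial);

    dlclose(handle);                 // unload before moving on to the next batch
    return 0;
}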

If not, split your code into different executables and use something to communicate between them (pipes, sockets, even writing to / reading from files). Clumsy, but what options do you have?

A totally alternative approach: use a dynamic language with a JIT. Off the top of my head: use LuaJIT, and rewrite (regenerate?) a lot of these expressions in Lua, or in other such languages and runtimes that allow code to be garbage collected.

LuaJIT is quite efficient, sometimes beating C/C++ for certain things, and often very close (it can sometimes be slow because its garbage collection is still a weak spot). Check for yourself:

http://luajit.org/performance_x86.html

Download the scimark2.lua file from there and compare it with the "C" version (google it) - the results are often very close.

Those expressions look a lot like an alternating series to me. I don't know what the rest of the code looks like, but it doesn't seem like it'd be that hard to derive the generating expression. It'd probably be worth it at execution time too, especially if you have 2.8 GB of 2 KB unrolled code.

I think everybody agrees there should be a different way to do what you want to do. Compiling hundreds of megabytes (gigabytes?) of code, linking it into a multi-gigabyte executable and running it just sounds very inefficient.

If I understand your problem correctly, you use some sort of code generator, G, to generate a bunch of functions func1...N which take a bunch of maps csc1...M as input. What you want to do is to calculate csc1...M and run a loop 1,000,000 times for different inputs, each time finding s = func1 + func2 + ... + funcN. You didn't specify how func1...N are related to csc1...M, though.

If all that is true, it seems that you should be able to turn the problem on its head in a different way which can potentially be much more manageable and even possibly faster (i.e. letting your machine's cache actually function).

Besides the practical problem of the object file sizes, your current program will not be efficient, since it does not localize access to the data (too many huge maps) and has no localized code execution (too many very long functions).

How about breaking your program into three phases: phase 1, build csc1...M and store them. Phase 2, build one func at a time, run it 1,000,000 times with each input and store the results. Phase 3, find the sum of the stored func1...N outcomes for each of the 1,000,000 runs. The good part about this solution is that it can easily be made parallel across several independent machines.
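A sketch of what phases 2 and 3 could look like (everything named here, funcK, loadInputPoint, the file paths, is invented for illustration; assume phase 1 already wrote the input points and csc maps to disk):

// Phase 2: this binary is linked against exactly ONE generated func; it evaluates
// that func for every stored input point and appends the results to its own file.
// Phase 3: read the per-func result files and add them up point by point.
#include <cstdio>
#include <cstddef>
#include <vector>

double funcK();   // the single generated func linked into this phase-2 binary

void phase2(const char *outPath, long nPoints)
{
    std::FILE *out = std::fopen(outPath, "wb");
    if (!out) return;
    for (long i = 0; i < nPoints; ++i)
    {
        // loadInputPoint(i);        // restore s / csc / ti for point i (phase-1 output)
        double r = funcK();
        std::fwrite(&r, sizeof r, 1, out);
    }
    std::fclose(out);
}

std::vector<double> phase3(const std::vector<const char *> &resultFiles, long nPoints)
{
    std::vector<double> total(nPoints, 0.0);
    for (std::size_t f = 0; f < resultFiles.size(); ++f)
    {
        std::FILE *in = std::fopen(resultFiles[f], "rb");
        if (!in) continue;
        double r;
        for (long i = 0; i < nPoints && std::fread(&r, sizeof r, 1, in) == 1; ++i)
            total[i] += r;
        std::fclose(in);
    }
    return total;
}

Each phase-2 binary stays small (one func plus the evaluator scaffolding), and the per-func runs can be farmed out to separate machines.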

Edit: @bbtrb, could you make one func and one csc available somewhere? They seem to be highly regular and compressible. For instance, func1 seems to be just a sum of expressions, each consisting of one coefficient, two indexes into the variables in s, and one index into csc. So it can be reduced to a nice loop. If you make complete examples available, I'm sure ways can be found to compress them into loops rather than long expressions.

This looks like the result of code generation gone wrong, perhaps by symbolic algebra and/or manual unrolling. Symbolic manipulations are well known to grow exponentially in the depth of the expression tree or computational graph. It is likely that automatic differentiation can be used here, which would make the code size quite small and also speed up execution dramatically.

The linker is attempting to generate 32-bit relocation offsets within a binary that has somehow exceeded these limitations. Try to reduce the main program's address space requirements.

Can you split some/most of the object code into one or more libraries (also compiled with -fpic / -fPIC)? Then generate a non-static binary that links against these libs. The libraries will live in discrete memory blocks and your relocation offsets will be dynamic/absolute (64-bit) rather than relative (32-bit).

It looks to me like the code is doing numerical integration using some kind of adaptive depth method. Unfortunately, the code generator (or rather the author of the code generator) is so stupid as to generate one function per patch rather than one per type of patch. As such, it has produced too much code to be compiled, and even if it could be compiled its execution would be painful because nothing is ever shared anywhere. (Can you imagine the pain resulting from having to load each page of object code from disk because nothing is ever shared, so it's always a candidate for the OS to evict? To say nothing of instruction caches, which are going to be useless.)

The fix is to stop unrolling everything; for this sort of code, you want to maximize sharing as the overhead of extra instructions to access data in more complex patterns will be absorbed by the cost of dealing with the (presumably) large underlying dataset anyway. It's also possible that the code generator will even do this by default, and that the scientist saw some options for unrolling (with the note that these sometimes improve speed) and turned them all on at once and is now insisting that this resulting mess be accepted by the computer, rather than accepting the machine's real restrictions and using the numerically correct version that is generated by default. But if the code generator won't do it, get one that will (or hack the existing code).

The bottom line: compiling and linking 2.8GB of code doesn't work and shouldn't be forced to work. Find another way.