Context switches are much slower in newer Linux kernels

We are looking to upgrade the operating system on our servers from Ubuntu 10.04 LTS to Ubuntu 12.04 LTS. Unfortunately, the latency to run a thread that has become runnable seems to have increased significantly from the 2.6 kernel to the 3.2 kernel. In fact, the latency numbers we are getting are hard to believe.

Let me be more specific about the test. We have a program that runs two threads. The first thread gets the current time (in ticks, using RDTSC) and then signals a condition variable once a second. The second thread waits on the condition variable and wakes up when it is signaled. It then gets the current time (in ticks, using RDTSC). The difference between the time in the second thread and the time in the first thread is computed and displayed on the console. After this the second thread waits on the condition variable once more, and is signaled again by the first thread after roughly another second has passed.

So, in a nutshell, we get a measurement of thread-to-thread communication latency via a condition variable once a second.

On kernel 2.6.32, this latency is somewhere in the 2.8-3.5 us range, which is reasonable. On kernel 3.2.0, this latency has increased to 40-100 us. I have ruled out any hardware differences between the two hosts. They run on identical hardware (dual-socket X5687 {Westmere-EP} processors running at 3.6 GHz with hyperthreading, SpeedStep, and all C states turned off). The test application changes the affinity of the threads so that they run on separate physical cores of the same socket (i.e., the first thread runs on Core 0 and the second thread runs on Core 1), so there is no bouncing of threads between cores and no bouncing/communication between sockets.

The only difference between the two hosts is that one is running Ubuntu 10.04 LTS with kernel 2.6.32-28 (the fast context-switch box) and the other is running the latest Ubuntu 12.04 LTS with kernel 3.2.0-23 (the slow context-switch box). All BIOS settings and hardware are identical.

Were there any changes in the kernel that could account for this ridiculous slowdown in how long it takes for a thread to be scheduled to run?

Update: If you would like to run the test on your own host and Linux build, I have posted the code to pastebin for your perusal. Compile with:

g++ -O3 -o test_latency test_latency.cpp -lpthread

Run with (assuming you have at least a dual-core box):

./test_latency 0 1 # Thread 1 on Core 0 and Thread 2 on Core 1
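
For reference, the structure of the test is roughly the following. This is a minimal sketch with hypothetical names, not the actual code on pastebin; it assumes x86-64, g++ (which defines _GNU_SOURCE, needed for the affinity calls), and pthreads. Ticks can be converted to wall time using the TSC frequency (~3.6 GHz here).

// Minimal sketch of the latency test's structure (hypothetical names).
#include <pthread.h>
#include <sched.h>
#include <x86intrin.h>   // __rdtsc()
#include <unistd.h>
#include <cstdint>
#include <cstdio>
#include <cstdlib>

static pthread_mutex_t mtx = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cv  = PTHREAD_COND_INITIALIZER;
static uint64_t signal_tsc = 0;     // tick count taken just before signaling
static bool     signaled   = false;

static void pin_to_core(int core)   // keep each thread on its own physical core
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

static void* signaler(void* arg)    // first thread: signals once a second
{
    pin_to_core(*(int*)arg);
    for (;;) {
        sleep(1);
        pthread_mutex_lock(&mtx);
        signal_tsc = __rdtsc();     // timestamp just before the wake-up
        signaled = true;
        pthread_cond_signal(&cv);
        pthread_mutex_unlock(&mtx);
    }
    return NULL;
}

static void* waiter(void* arg)      // second thread: waits and timestamps the wake-up
{
    pin_to_core(*(int*)arg);
    for (;;) {
        pthread_mutex_lock(&mtx);
        while (!signaled)
            pthread_cond_wait(&cv, &mtx);
        uint64_t wake_tsc = __rdtsc();
        uint64_t sent_tsc = signal_tsc;
        signaled = false;
        pthread_mutex_unlock(&mtx);
        printf("wake-up latency: %llu ticks\n",
               (unsigned long long)(wake_tsc - sent_tsc));
    }
    return NULL;
}

int main(int argc, char** argv)
{
    int core1 = argc > 1 ? atoi(argv[1]) : 0;
    int core2 = argc > 2 ? atoi(argv[2]) : 1;
    pthread_t t1, t2;
    pthread_create(&t1, NULL, signaler, &core1);
    pthread_create(&t2, NULL, waiter,   &core2);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}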

Update 2: After much searching of kernel parameters, posts about kernel changes, and personal research, I have figured out what the problem is and have posted the solution as an answer to this question.


Perhaps what got slower is futex, the building block for condition variables. This will shed some light:

strace -r ./test_latency 0 1 &> test_latency_strace & sleep 8 && killall test_latency

then

for i in futex nanosleep rt_sig;do echo $i;grep $i test_latency_strace | sort -rn;done

which will show the microseconds taken for the interesting system calls, sorted by time.

On kernel 2.6.32

$ for i in futex nanosleep rt_sig;do echo $i;grep $i test_latency_strace | sort -rn;done
futex
1.000140 futex(0x601ac4, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x601ac0, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
1.000129 futex(0x601ac4, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x601ac0, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
1.000124 futex(0x601ac4, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x601ac0, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
1.000119 futex(0x601ac4, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x601ac0, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
1.000106 futex(0x601ac4, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x601ac0, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
1.000103 futex(0x601ac4, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x601ac0, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
1.000102 futex(0x601ac4, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x601ac0, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
0.000125 futex(0x7f98ce4c0b88, FUTEX_WAKE_PRIVATE, 2147483647) = 0
0.000042 futex(0x601b00, FUTEX_WAKE_PRIVATE, 1) = 1
0.000038 futex(0x601b00, FUTEX_WAKE_PRIVATE, 1) = 1
0.000037 futex(0x601b00, FUTEX_WAKE_PRIVATE, 1) = 1
0.000030 futex(0x601b00, FUTEX_WAKE_PRIVATE, 1) = 1
0.000029 futex(0x601b00, FUTEX_WAKE_PRIVATE, 1) = 0
0.000028 futex(0x601b00, FUTEX_WAKE_PRIVATE, 1) = 1
0.000027 futex(0x601b00, FUTEX_WAKE_PRIVATE, 1) = 1
0.000018 futex(0x7fff82f0ec3c, FUTEX_WAKE_PRIVATE, 1) = 0
nanosleep
0.000027 nanosleep({1, 0}, {1, 0}) = 0
0.000019 nanosleep({1, 0}, {1, 0}) = 0
0.000019 nanosleep({1, 0}, {1, 0}) = 0
0.000018 nanosleep({1, 0}, {1, 0}) = 0
0.000018 nanosleep({1, 0}, {1, 0}) = 0
0.000018 nanosleep({1, 0}, {1, 0}) = 0
0.000018 nanosleep({1, 0}, 0x7fff82f0eb40) = ? ERESTART_RESTARTBLOCK (To be restarted)
0.000017 nanosleep({1, 0}, {1, 0}) = 0
rt_sig
0.000045 rt_sigaction(SIGCHLD, NULL, {SIG_DFL, [], 0}, 8) = 0
0.000040 rt_sigaction(SIGCHLD, NULL, {SIG_DFL, [], 0}, 8) = 0
0.000038 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
0.000035 rt_sigaction(SIGCHLD, NULL, {SIG_DFL, [], 0}, 8) = 0
0.000034 rt_sigaction(SIGCHLD, NULL, {SIG_DFL, [], 0}, 8) = 0
0.000033 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
0.000032 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
0.000032 rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
0.000031 rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
0.000031 rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
0.000028 rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
0.000028 rt_sigaction(SIGRT_1, {0x37f8c052b0, [], SA_RESTORER|SA_RESTART|SA_SIGINFO, 0x37f8c0e4c0}, NULL, 8) = 0
0.000027 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
0.000027 rt_sigaction(SIGRTMIN, {0x37f8c05370, [], SA_RESTORER|SA_SIGINFO, 0x37f8c0e4c0}, NULL, 8) = 0
0.000027 rt_sigaction(SIGCHLD, NULL, {SIG_DFL, [], 0}, 8) = 0
0.000025 rt_sigaction(SIGCHLD, NULL, {SIG_DFL, [], 0}, 8) = 0
0.000025 rt_sigaction(SIGCHLD, NULL, {SIG_DFL, [], 0}, 8) = 0
0.000023 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
0.000023 rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
0.000022 rt_sigprocmask(SIG_UNBLOCK, [RTMIN RT_1], NULL, 8) = 0
0.000022 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
0.000021 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
0.000021 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
0.000021 rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
0.000021 rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
0.000021 rt_sigaction(SIGCHLD, NULL, {SIG_DFL, [], 0}, 8) = 0
0.000019 rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0

On kernel 3.1.9

$ for i in futex nanosleep rt_sig;do echo $i;grep $i test_latency_strace | sort -rn;done
futex
1.000129 futex(0x601764, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x601760, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
1.000126 futex(0x601764, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x601760, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
1.000122 futex(0x601764, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x601760, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
1.000115 futex(0x601764, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x601760, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
1.000114 futex(0x601764, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x601760, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
1.000112 futex(0x601764, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x601760, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
1.000109 futex(0x601764, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x601760, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
0.000139 futex(0x3f8b8f2fb0, FUTEX_WAKE_PRIVATE, 2147483647) = 0
0.000043 futex(0x601720, FUTEX_WAKE_PRIVATE, 1) = 1
0.000041 futex(0x601720, FUTEX_WAKE_PRIVATE, 1) = 1
0.000037 futex(0x601720, FUTEX_WAKE_PRIVATE, 1) = 1
0.000036 futex(0x601720, FUTEX_WAKE_PRIVATE, 1) = 1
0.000034 futex(0x601720, FUTEX_WAKE_PRIVATE, 1) = 1
0.000034 futex(0x601720, FUTEX_WAKE_PRIVATE, 1) = 1
nanosleep
0.000025 nanosleep({1, 0}, 0x7fff70091d00) = 0
0.000022 nanosleep({1, 0}, {0, 3925413}) = ? ERESTART_RESTARTBLOCK (Interrupted by signal)
0.000021 nanosleep({1, 0}, 0x7fff70091d00) = 0
0.000017 nanosleep({1, 0}, 0x7fff70091d00) = 0
0.000017 nanosleep({1, 0}, 0x7fff70091d00) = 0
0.000017 nanosleep({1, 0}, 0x7fff70091d00) = 0
0.000017 nanosleep({1, 0}, 0x7fff70091d00) = 0
0.000017 nanosleep({1, 0}, 0x7fff70091d00) = 0
rt_sig
0.000045 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
0.000044 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
0.000043 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
0.000040 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
0.000038 rt_sigaction(SIGCHLD, NULL, {SIG_DFL, [], 0}, 8) = 0
0.000037 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
0.000036 rt_sigaction(SIGCHLD, NULL, {SIG_DFL, [], 0}, 8) = 0
0.000036 rt_sigaction(SIGCHLD, NULL, {SIG_DFL, [], 0}, 8) = 0
0.000035 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
0.000035 rt_sigaction(SIGCHLD, NULL, {SIG_DFL, [], 0}, 8) = 0
0.000035 rt_sigaction(SIGCHLD, NULL, {SIG_DFL, [], 0}, 8) = 0
0.000035 rt_sigaction(SIGCHLD, NULL, {SIG_DFL, [], 0}, 8) = 0
0.000034 rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
0.000031 rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
0.000027 rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
0.000027 rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
0.000027 rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
0.000027 rt_sigaction(SIGRT_1, {0x3f892067b0, [], SA_RESTORER|SA_RESTART|SA_SIGINFO, 0x3f8920f500}, NULL, 8) = 0
0.000026 rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
0.000026 rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
0.000025 rt_sigaction(SIGCHLD, NULL, {SIG_DFL, [], 0}, 8) = 0
0.000024 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
0.000023 rt_sigprocmask(SIG_UNBLOCK, [RTMIN RT_1], NULL, 8) = 0
0.000023 rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
0.000022 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
0.000021 rt_sigaction(SIGCHLD, NULL, {SIG_DFL, [], 0}, 8) = 0
0.000019 rt_sigaction(SIGRTMIN, {0x3f89206720, [], SA_RESTORER|SA_SIGINFO, 0x3f8920f500}, NULL, 8) = 0

I found this 5-year-old bug report, which contains a "ping pong" performance test that compares

  1. single-threaded libpthread mutex
  2. libpthread condition variable
  3. plain old Unix signals

I had to add

#include <stdint.h>

in order to compile, which I did with this command

g++ -O3 -o condvar-perf condvar-perf.cpp -lpthread -lrt
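
For context, a condition-variable ping-pong of the kind the bug report measures looks roughly like the sketch below (hypothetical names, not the bug report's actual code). Each iteration forces a sleep and a wake-up on both sides, which is consistent with the roughly 4 context switches per iteration reported in the results that follow.

// Rough sketch of a condvar "ping pong": two threads bounce a token
// back and forth, forcing a wake-up in each direction per iteration.
#include <pthread.h>
#include <cstdio>

static pthread_mutex_t mtx  = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
static int turn = 0;                       // whose turn it is: 0 or 1
static const long ITERATIONS = 1000000;

static void* player(void* arg)
{
    int me = *(int*)arg;                   // 0 or 1
    for (long i = 0; i < ITERATIONS; ++i) {
        pthread_mutex_lock(&mtx);
        while (turn != me)                 // sleep until it is our turn
            pthread_cond_wait(&cond, &mtx);
        turn = 1 - me;                     // hand the token to the other thread
        pthread_cond_signal(&cond);
        pthread_mutex_unlock(&mtx);
    }
    return NULL;
}

int main()
{
    int ids[2] = {0, 1};
    pthread_t t[2];
    pthread_create(&t[0], NULL, player, &ids[0]);
    pthread_create(&t[1], NULL, player, &ids[1]);
    pthread_join(t[0], NULL);
    pthread_join(t[1], NULL);
    printf("done after %ld round trips\n", ITERATIONS);
    return 0;
}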

On kernel 2.6.32

$ ./condvar-perf 1000000
NPTL
mutex                 elapsed:    29085 us; per iteration:   29 ns / 9.4e-05 context switches.
c.v. ping-pong test   elapsed:  4771993 us; per iteration: 4771 ns / 4.03 context switches.
signal ping-pong test elapsed:  8685423 us; per iteration: 8685 ns / 4.05 context switches.

On kernel 3.1.9

$ ./condvar-perf 1000000
NPTL
mutex                 elapsed:    26811 us; per iteration:   26 ns / 8e-06 context switches.
c.v. ping-pong test   elapsed: 10930794 us; per iteration: 10930 ns / 4.01 context switches.
signal ping-pong test elapsed: 10949670 us; per iteration: 10949 ns / 4.01 context switches.

I conclude that between kernel 2.6.32 and 3.1.9, context switching has indeed slowed down, though not as much as you observe on kernel 3.2. I realize this doesn't yet answer your question; I'll keep digging.

Edit: I've found that changing the real-time priority of the process (both threads) improves the performance on 3.1.9 to match 2.6.32. However, setting the same priority on 2.6.32 makes it slow down... go figure - I'll look into it more.
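
For reference, the programmatic equivalent of the chrt -f 1 runs shown below would look something like this. It's a hedged sketch: it requires root (or CAP_SYS_NICE), and on Linux/NPTL any threads created afterwards inherit the policy by default.

// Give the whole process SCHED_FIFO priority 1, like "chrt -f 1 ./condvar-perf".
#include <sched.h>
#include <cstdio>

int main()
{
    struct sched_param sp;
    sp.sched_priority = 1;                       // same priority chrt -f 1 uses
    if (sched_setscheduler(0, SCHED_FIFO, &sp) != 0) {   // 0 = calling process
        perror("sched_setscheduler");
        return 1;
    }
    printf("now running under SCHED_FIFO, priority %d\n", sp.sched_priority);
    // ... create the test threads / run the benchmark from here ...
    return 0;
}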

Here's my results now:

On kernel 2.6.32

$ ./condvar-perf 1000000
NPTL
mutex                 elapsed:    29629 us; per iteration:   29 ns / 0.000418 context switches.
c.v. ping-pong test   elapsed:  6225637 us; per iteration: 6225 ns / 4.1 context switches.
signal ping-pong test elapsed:  5602248 us; per iteration: 5602 ns / 4.09 context switches.
$ chrt -f 1 ./condvar-perf 1000000
NPTL
mutex                 elapsed:    29049 us; per iteration:   29 ns / 0.000407 context switches.
c.v. ping-pong test   elapsed: 16131360 us; per iteration: 16131 ns / 4.29 context switches.
signal ping-pong test elapsed: 11817819 us; per iteration: 11817 ns / 4.16 context switches.
$

On kernel 3.1.9

$ ./condvar-perf 1000000
NPTL
mutex                 elapsed:    26830 us; per iteration:   26 ns / 5.7e-05 context switches.
c.v. ping-pong test   elapsed: 12812788 us; per iteration: 12812 ns / 4.01 context switches.
signal ping-pong test elapsed: 13126865 us; per iteration: 13126 ns / 4.01 context switches.
$ chrt -f 1 ./condvar-perf 1000000
NPTL
mutex                 elapsed:    27025 us; per iteration:   27 ns / 3.7e-05 context switches.
c.v. ping-pong test   elapsed:  5099885 us; per iteration: 5099 ns / 4 context switches.
signal ping-pong test elapsed:  5508227 us; per iteration: 5508 ns / 4 context switches.
$

The solution to the poor thread wake-up performance in recent kernels has to do with the switch to the intel_idle cpuidle driver from acpi_idle, the driver used in older kernels. Sadly, the intel_idle driver ignores the user's BIOS configuration for the C-states and dances to its own tune. In other words, even if you completely disable all C-states in your PC's (or server's) BIOS, this driver will still force them on during periods of brief inactivity, which occur almost constantly unless an all-core-consuming synthetic benchmark (e.g., stress) is running. You can monitor C-state transitions, along with other useful information related to processor frequencies, using the wonderful Google i7z tool on most compatible hardware.

To see which cpuidle driver is currently active in your setup, just cat the current_driver file in the cpuidle section of /sys/devices/system/cpu as follows:

cat /sys/devices/system/cpu/cpuidle/current_driver

If you want your modern Linux OS to have the lowest context-switch latency possible, you can disable all of these power-saving features with kernel boot parameters. On Ubuntu 12.04, you can do this by adding them to the GRUB_CMDLINE_LINUX_DEFAULT entry in /etc/default/grub and then running update-grub. The boot parameters to add are:

intel_idle.max_cstate=0 processor.max_cstate=0 idle=poll
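
For reference, on a stock Ubuntu 12.04 install the resulting line in /etc/default/grub would look something like the following (this assumes the default "quiet splash" entry is still present); after editing, run update-grub and reboot:

GRUB_CMDLINE_LINUX_DEFAULT="quiet splash intel_idle.max_cstate=0 processor.max_cstate=0 idle=poll"
sudo update-grub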

Here are the gory details about what the three boot options do:

Setting intel_idle.max_cstate to zero will either revert your cpuidle driver to acpi_idle (at least per the documentation of the option), or disable it completely. On my box it is completely disabled (i.e., displaying the current_driver file in /sys/devices/system/cpu/cpuidle produces an output of none). In this case the second boot option, processor.max_cstate=0 is unnecessary. However, the documentation states that setting max_cstate to zero for the intel_idle driver should revert the OS to the acpi_idle driver. Therefore, I put in the second boot option just in case.

The processor.max_cstate option sets the maximum C state for the acpi_idle driver to zero, hopefully disabling it as well. I do not have a system that I can test this on, because intel_idle.max_cstate=0 completely knocks out the cpuidle driver on all of the hardware available to me. However, if your installation does revert from intel_idle to acpi_idle with just the first boot option, please let me know in the comments whether the second option, processor.max_cstate, did what it is documented to do, so that I can update this answer.

Finally, the last of the three parameters, idle=poll, is a real power hog. It will disable C1/C1E, which removes the final remaining bit of latency at the expense of much more power consumption, so use it only when it's really necessary. For most applications this will be overkill, since the C1* latency is not all that large. Using my test application running on the hardware described in the original question, the latency went from 9 us to 3 us. This is certainly a significant reduction for highly latency-sensitive applications (e.g., financial trading, high-precision telemetry/tracking, high-frequency data acquisition, etc.), but it may not be worth the incurred electrical power hit for the vast majority of desktop apps. The only way to know for sure is to profile your application's improvement in performance vs. the actual increase in power consumption/heat of your hardware and weigh the tradeoffs.

Update:

After additional testing with various idle=* parameters, I have discovered that setting idle to mwait if supported by your hardware is a much better idea. It seems that the use of the MWAIT/MONITOR instructions allows the CPU to enter C1E without any noticeable latency being added to the thread wake up time. With idle=mwait, you will get cooler CPU temperatures (as compared to idle=poll), less power use and still retain the excellent low latencies of a polling idle loop. Therefore, my updated recommended set of boot parameters for low CPU thread wake up latency based on these findings is:

intel_idle.max_cstate=0 processor.max_cstate=0 idle=mwait

The use of idle=mwait instead of idle=poll may also help with the initiation of Turbo Boost (by helping the CPU stay below its TDP [Thermal Design Power]) and hyperthreading (for which MWAIT is the ideal mechanism for not consuming an entire physical core while at the same time avoiding the higher C states). This has yet to be proven in testing, however, which I will continue to do.

Update 2:

The mwait idle option has been removed from newer 3.x kernels (thanks to user ck_ for the update). That leaves us with two options:

idle=halt - Should work as well as mwait, but test to be sure that this is the case with your hardware. The HLT instruction is almost equivalent to an MWAIT with state hint 0. The problem lies in the fact that an interrupt is required to get out of a HLT state, while a memory write (or interrupt) can be used to get out of the MWAIT state. Depending on what the Linux Kernel uses in its idle loop, this can make MWAIT potentially more efficient. So, as I said test/profile and see if it meets your latency needs...

and

idle=poll - The highest performance option, at the expense of power and heat.

You might also see processors clocking down on more recent processors and Linux kernels due to the pstate driver, which is separate from the C-states. So, to disable this as well, you need the following kernel parameter:

intel_pstate=disable
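
If you are unsure whether the pstate driver is in play on your box, you can check which cpufreq scaling driver is currently active (intel_pstate will show up here when it is loaded):

cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_driver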