An Intel CPU Bug That Keeps Machines from Coming Back Up After a Reboot


I ran into this bug last November and should have written it up long ago, because the damage it caused went far beyond anything I had imagined.

The symptom was that some machines failed to come back up after a reboot, and /var/log/messages contained entries like these:

Nov 15 03:46:09 kernel: INFO: task sh:7684 blocked for more than 120 seconds.

Nov 15 03:46:09 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

Nov 15 03:46:11 kernel: Call Trace:

Nov 15 03:46:11 kernel: [] ? ext4_file_open+0x0/0x130 [ext4]

Nov 15 03:46:11 kernel: [] schedule_timeout+0x215/0x2e0

Nov 15 03:46:12 kernel: [] ? nameidata_to_filp+0x54/0x70

Nov 15 03:46:12 kernel: [] ? cpumask_next_and+0x29/0x50

Nov 15 03:46:12 kernel: [] wait_for_common+0x123/0x180

Nov 15 03:46:12 kernel: [] ? default_wake_function+0x0/0x20

Nov 15 03:46:13 kernel: [] wait_for_completion+0x1d/0x20

Nov 15 03:46:13 kernel: [] sched_exec+0xdc/0xe0

Nov 15 03:46:13 kernel: [] do_execve+0xe0/0x2c0

Nov 15 03:46:13 kernel: [] sys_execve+0x4a/0x80

Nov 15 03:46:13 kernel: [] stub_execve+0x6a/0xc0
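To look for the same symptom on other machines, the hung-task warnings can be pulled out of a syslog file. Here is a minimal sketch, assuming the message format matches the lines quoted above; the function name find_hung_tasks is hypothetical, not part of any tool.

```python
import re

# Matches the hung-task warning as it appears in the log excerpt above.
HUNG_TASK = re.compile(
    r"kernel: INFO: task (?P<name>.+):(?P<pid>\d+) "
    r"blocked for more than (?P<secs>\d+) seconds"
)


def find_hung_tasks(lines):
    """Yield (task_name, pid, seconds) for each hung-task warning in lines."""
    for line in lines:
        match = HUNG_TASK.search(line)
        if match:
            yield (
                match.group("name"),
                int(match.group("pid")),
                int(match.group("secs")),
            )
```

Passing an open file object for /var/log/messages as lines should work, since the function only iterates over it line by line.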

A quick search showed that this is a known bug. See erratum BT81 in http://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/xeon-e5-family-spec-update.pdf, which I quote here:

BT81. TSC is Not Affected by Warm Reset

Problem: The TSC (Time Stamp Counter MSR 10H) should be cleared on reset. Due to this erratum the TSC is not affected by warm reset.

Implication: The TSC is not cleared by a warm reset. The TSC is cleared by power-on reset as expected. Intel has not observed any functional failures due to this erratum.

Intel states this confidently, as if nothing could go wrong.

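In practice, though, if the following three conditions all hold:

1. The operating system is Red Hat Enterprise Linux 6.1 through 6.4. (6.5 and later are not affected.)

2. The CPU belongs to the Intel® Xeon® E5, Intel® Xeon® E5 v2, or Intel® Xeon® E7 v2 family.

3. The machine has gone roughly 200 days or more without being power-cycled. (This means no hard reset; typing reboot remotely from within Linux does not count.)

then the operating system will fail to reboot. The stopgap fix is to send someone to the data center to cut the power and start the machine again.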
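The three conditions can be sketched as a small check. This is only an illustration under stated assumptions: all function names are hypothetical, the CPU match is a heuristic based on the model-name strings I assume /proc/cpuinfo reports, and the 200-day threshold is the approximate figure given above, not an exact limit. The time since the last power cycle has to be supplied by the caller, because the uptime Linux reports restarts at every warm reboot; since the TSC is not cleared by a warm reset, a TSC reading divided by the TSC frequency is one way to estimate it.

```python
import re

SECONDS_PER_DAY = 86400
THRESHOLD_DAYS = 200  # approximate figure from the post, not an exact limit


def rhel_release_affected(release):
    """True for RHEL 6.1 through 6.4, given the text of /etc/redhat-release."""
    match = re.search(r"release (\d+)\.(\d+)", release)
    if not match:
        return False
    major, minor = int(match.group(1)), int(match.group(2))
    return major == 6 and 1 <= minor <= 4


def cpu_affected(model_name):
    """Heuristic: Xeon E5 (no suffix), E5 v2 or E7 v2 in a cpuinfo model name."""
    match = re.search(r"\b(E5|E7)-\s*\d{4}\w*(?:\s+v(\d+))?", model_name)
    if not match:
        return False
    family = match.group(1)
    version = int(match.group(2) or 1)  # assumption: no "vN" suffix means v1
    return (family == "E5" and version in (1, 2)) or (
        family == "E7" and version == 2
    )


def days_since_power_on(tsc_ticks, tsc_hz):
    """Convert a TSC reading to days, given the TSC frequency in Hz."""
    return tsc_ticks / tsc_hz / SECONDS_PER_DAY


def at_risk(release, model_name, days_powered):
    """True when all three conditions from the list above hold."""
    return (
        rhel_release_affected(release)
        and cpu_affected(model_name)
        and days_powered >= THRESHOLD_DAYS
    )
```

For example, at_risk("Red Hat Enterprise Linux Server release 6.3 (Santiago)", "Intel(R) Xeon(R) CPU E5-2620 v2 @ 2.10GHz", 230) evaluates to True, while the same machine on release 6.5 evaluates to False.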

For details, see Red Hat's statement: https://access.redhat.com/solutions/433883

If you check these conditions and find you are affected, upgrade your kernel right away.
