A Bizarre Problem with the AMD ES CPU in the MoreFine S500+

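I bought a few cheap MoreFine S500+ units on Taobao to run virtualization at home. The CPU is an AMD Ryzen 9 5900HX ES (100-000000300-30_Y). Two of the machines behaved very strangely: whenever I ran apt upgrade on PVE, the machine would hang and reboot, every single time. The rest of the time there were no problems at all. The crash left no logs and no core dump, which made it truly baffling. At first I suspected the RAM, but swapping the memory changed nothing, and resetting the BIOS did not help either.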

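By chance, I noticed errors in dmesg: the kernel could not keep the TSC as its clocksource, marking it unstable and falling back to hpet, and the CPU core named in the error was different on every reboot: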

root@pve-2:~# dmesg | grep -i -e tsc -e clocksource
[    0.000000] tsc: Fast TSC calibration using PIT
[    0.000000] tsc: Detected 2894.532 MHz processor
[    0.073583] clocksource: refined-jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 7645519600211568 ns
[    0.183110] clocksource: hpet: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 133484873504 ns
[    0.203134] clocksource: tsc-early: mask: 0xffffffffffffffff max_cycles: 0x29b91578521, max_idle_ns: 440795257552 ns
[    0.363639] clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 7645041785100000 ns
[    0.393489] clocksource: Switched to clocksource tsc-early
[    0.402132] clocksource: acpi_pm: mask: 0xffffff max_cycles: 0xffffff, max_idle_ns: 2085701024 ns
[    1.747482] tsc: Refined TSC clocksource calibration: 2916.653 MHz
[    1.747497] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x2a0ab6c778e, max_idle_ns: 440795302966 ns
[    1.747550] clocksource: Switched to clocksource tsc
[    2.707557] clocksource: timekeeping watchdog on CPU4: Marking clocksource 'tsc' as unstable because the skew is too large:
[    2.707561] clocksource:                       'hpet' wd_nsec: 483806670 wd_now: 228c2e8 wd_last: 1bf0f69 mask: ffffffff
[    2.707563] clocksource:                       'tsc' cs_nsec: 480142269 cs_now: 8af50cb38 cs_last: 85bd841a9 mask: ffffffffffffffff
[    2.707564] clocksource:                       'tsc' is current clocksource.
[    2.707569] tsc: Marking TSC unstable due to clocksource watchdog
[    2.707577] TSC found unstable after boot, most likely due to broken BIOS. Use 'tsc=unstable'.
[    2.707683] clocksource: Checking clocksource tsc synchronization from CPU 6 to CPUs 0-1,4,11-12,14.
[    2.963157] clocksource: Switched to clocksource hpet
[   13.143631] kvm: SMP vm created on host with unstable TSC; guest TSC will not be reliable
root@pve-2:~# dmesg | grep -i -e tsc -e clocksource
[    0.000000] tsc: Fast TSC calibration using PIT
[    0.000000] tsc: Detected 2894.646 MHz processor
[    0.073721] clocksource: refined-jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 7645519600211568 ns
[    0.183429] clocksource: hpet: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 133484873504 ns
[    0.203452] clocksource: tsc-early: mask: 0xffffffffffffffff max_cycles: 0x29b9816cba9, max_idle_ns: 440795228300 ns
[    0.364043] clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 7645041785100000 ns
[    0.394903] clocksource: Switched to clocksource tsc-early
[    0.402440] clocksource: acpi_pm: mask: 0xffffff max_cycles: 0xffffff, max_idle_ns: 2085701024 ns
[    1.427510] tsc: Refined TSC clocksource calibration: 2916.910 MHz
[    1.427523] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x2a0ba9bf788, max_idle_ns: 440795338659 ns
[    1.427582] clocksource: Switched to clocksource tsc
[    2.227472] clocksource: timekeeping watchdog on CPU3: Marking clocksource 'tsc' as unstable because the skew is too large:
[    2.227474] clocksource:                       'hpet' wd_nsec: 511739899 wd_now: 1bfc10f wd_last: 14ff33f mask: ffffffff
[    2.227476] clocksource:                       'tsc' cs_nsec: 507818877 cs_now: bffb52e1c cs_last: ba76aea15 mask: ffffffffffffffff
[    2.227477] clocksource:                       'tsc' is current clocksource.
[    2.227482] tsc: Marking TSC unstable due to clocksource watchdog
[    2.227490] TSC found unstable after boot, most likely due to broken BIOS. Use 'tsc=unstable'.
[    2.227604] clocksource: Checking clocksource tsc synchronization from CPU 0 to CPUs 1,3,7,9-10,12-13.
[    2.767474] clocksource: Switched to clocksource hpet
[   13.901839] kvm: SMP vm created on host with unstable TSC; guest TSC will not be reliable
root@pve-2:~#  cat /sys/devices/system/clocksource/clocksource0/current_clocksource
hpet
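
To put a number on "the skew is too large", the wd_nsec and cs_nsec values from the two watchdog reports above can be compared directly. This is only a rough sketch of the arithmetic; skew_pct is a hypothetical helper name, not a kernel or PVE tool, and it assumes awk is available.

```shell
# Percent by which the interval measured by the watchdog (hpet, wd_nsec)
# exceeds the interval measured by the TSC (cs_nsec).
# skew_pct is a made-up helper for illustration only.
skew_pct() {
    awk -v wd="$1" -v cs="$2" 'BEGIN { printf "%.3f\n", (wd - cs) / cs * 100 }'
}

skew_pct 483806670 480142269   # first boot, watchdog on CPU4
skew_pct 511739899 507818877   # second boot, watchdog on CPU3
```

Both boots come out at a little under 0.8%, which means the TSC and the HPET disagreed by several milliseconds over an interval of about half a second.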

Adding tsc=unstable to the kernel command line in GRUB did not fix the problem either.
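
For reference, this is roughly how a kernel parameter such as tsc=unstable is added on a GRUB-booted Debian/PVE host. It is a minimal sketch that assumes the stock GRUB_CMDLINE_LINUX_DEFAULT="quiet" line; if the line already carries other parameters, keep them. Installs that boot through systemd-boot use a different mechanism, which is not shown here.

```shell
# /etc/default/grub (excerpt); "quiet" is assumed to be the existing value
GRUB_CMDLINE_LINUX_DEFAULT="quiet tsc=unstable"

# Then, as root, regenerate the GRUB configuration and reboot:
#   update-grub
#   reboot
# After the reboot, confirm that the parameter reached the kernel:
#   grep -o 'tsc=[a-z]*' /proc/cmdline
```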

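I tried asking the manufacturer for a BIOS update, but the BIOS they sent was older than the one already on the board, so I gave up on that too.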

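I had already given up in despair, but for the past few nights I kept dreaming about this problem. In one of those dreams an idea struck me, so today I dug a spare power supply out of a box and, as a last-ditch attempt, swapped it in.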

And with that, it was fixed.

root@pve-2:~# dmesg | grep -i -e tsc -e clocksource
[    0.000000] tsc: Fast TSC calibration using PIT
[    0.000000] tsc: Detected 2894.587 MHz processor
[    0.073311] clocksource: refined-jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 7645519600211568 ns
[    0.182392] clocksource: hpet: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 133484873504 ns
[    0.202416] clocksource: tsc-early: mask: 0xffffffffffffffff max_cycles: 0x29b949523a8, max_idle_ns: 440795242157 ns
[    0.362916] clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 7645041785100000 ns
[    0.393775] clocksource: Switched to clocksource tsc-early
[    0.401391] clocksource: acpi_pm: mask: 0xffffff max_cycles: 0xffffff, max_idle_ns: 2085701024 ns
[    1.526428] tsc: Refined TSC clocksource calibration: 2894.557 MHz
[    1.526436] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x29b92d45331, max_idle_ns: 440795334534 ns
[    1.526473] clocksource: Switched to clocksource tsc
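
One detail in the logs stands out when the three boots are compared side by side: the gap between the fast calibration ("Detected ... MHz processor") and the refined calibration ("Refined TSC clocksource calibration"). The sketch below is only arithmetic on the numbers printed above. The row labels and the calib_gap helper are made up for illustration, and the comparison does not by itself explain why the power supply mattered.

```shell
# Input columns: label, fast calibration (MHz), refined calibration (MHz).
# Prints the label and how far the refined value sits from the fast one, in percent.
# calib_gap is a hypothetical helper name; the labels are illustrative.
calib_gap() {
    awk '{ printf "%s %.3f\n", $1, ($3 - $2) / $2 * 100 }'
}

calib_gap <<'EOF'
old-psu-boot1 2894.532 2916.653
old-psu-boot2 2894.646 2916.910
new-psu       2894.587 2894.557
EOF
```

With the old power supply the two calibrations disagree by a little under 0.8% on both boots, about the same size as the skew the watchdog later reported. With the replacement power supply they agree to within roughly 0.001%, and the TSC stays selected as the clocksource.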