Commit 40c36e2

aegl authored and KAGA-KOKO committed
x86/mce: Fix incorrect "Machine check from unknown source" message
Some injection testing resulted in the following console log:

  mce: [Hardware Error]: CPU 22: Machine Check Exception: f Bank 1: bd80000000100134
  mce: [Hardware Error]: RIP 10:<ffffffffc05292dd> {pmem_do_bvec+0x11d/0x330 [nd_pmem]}
  mce: [Hardware Error]: TSC c51a63035d52 ADDR 3234bc4000 MISC 88
  mce: [Hardware Error]: PROCESSOR 0:50654 TIME 1526502199 SOCKET 0 APIC 38 microcode 2000043
  mce: [Hardware Error]: Run the above through 'mcelog --ascii'
  Kernel panic - not syncing: Machine check from unknown source

This confused everybody because the first line quite clearly shows
that we found a logged error in "Bank 1", while the last line says
"unknown source".

The problem is that the Linux code doesn't do the right thing
for a local machine check that results in a fatal error.

It turns out that we know very early in the handler whether the
machine check is fatal. The call to mce_no_way_out() has checked
all the banks for the CPU that took the local machine check. If
it says we must crash, we can do so right away with the right
messages.

We do scan all the banks again. This means that we might initially
not see a problem, but during the second scan find something fatal.
If this happens we print a slightly different message (so I can
see if it actually ever happens).

[ bp: Remove unneeded severity assignment. ]

Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Ashok Raj <ashok.raj@intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Cc: linux-edac <linux-edac@vger.kernel.org>
Cc: stable@vger.kernel.org # 4.2
Link: http://lkml.kernel.org/r/52e049a497e86fd0b71c529651def8871c804df0.1527283897.git.tony.luck@intel.com
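Editorial note: the status word in the first log line (bd80000000100134) can be decoded by hand to confirm that Bank 1 really did hold a valid, uncorrected error. The sketch below is not part of the patch. The helper names (status_bit, mca_error_code) and the enum are made up for illustration; the bit positions are the architectural MCi_STATUS layout, and bits 56 and 55 carry the S and AR meaning only on CPUs that advertise software error recovery in MCG_CAP.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Architectural MCi_STATUS bit positions. The enum names are
 * illustrative only and do not come from the kernel sources.
 */
enum {
	MCI_BIT_VAL   = 63, /* register holds a valid error */
	MCI_BIT_OVER  = 62, /* an earlier error was overwritten */
	MCI_BIT_UC    = 61, /* uncorrected error */
	MCI_BIT_EN    = 60, /* error reporting was enabled */
	MCI_BIT_MISCV = 59, /* MCi_MISC contents are valid */
	MCI_BIT_ADDRV = 58, /* MCi_ADDR contents are valid */
	MCI_BIT_PCC   = 57, /* processor context corrupt */
	MCI_BIT_S     = 56, /* signaled (assumes MCG_CAP.SER_P is set) */
	MCI_BIT_AR    = 55, /* action required (assumes MCG_CAP.SER_P is set) */
};

/* Test a single bit of a 64-bit status value. */
static bool status_bit(uint64_t status, unsigned int bit)
{
	assert(bit < 64);
	return (status >> bit) & 1;
}

/* The MCA error code lives in the low 16 bits of MCi_STATUS. */
static uint16_t mca_error_code(uint64_t status)
{
	return (uint16_t)(status & 0xffff);
}
```

For the logged value, VAL, UC, EN, MISCV, ADDRV, S and AR are set, OVER and PCC are clear, and the MCA error code is 0x0134. That matches the ADDR and MISC fields being printed in the log and supports the observation that the error source was known.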
1 parent 1f74c8a commit 40c36e2

1 file changed: +18 -8 lines changed
arch/x86/kernel/cpu/mcheck/mce.c

Lines changed: 18 additions & 8 deletions
@@ -1207,13 +1207,18 @@ void do_machine_check(struct pt_regs *regs, long error_code)
 	lmce = m.mcgstatus & MCG_STATUS_LMCES;
 
 	/*
+	 * Local machine check may already know that we have to panic.
+	 * Broadcast machine check begins rendezvous in mce_start()
 	 * Go through all banks in exclusion of the other CPUs. This way we
 	 * don't report duplicated events on shared banks because the first one
-	 * to see it will clear it. If this is a Local MCE, then no need to
-	 * perform rendezvous.
+	 * to see it will clear it.
 	 */
-	if (!lmce)
+	if (lmce) {
+		if (no_way_out)
+			mce_panic("Fatal local machine check", &m, msg);
+	} else {
 		order = mce_start(&no_way_out);
+	}
 
 	for (i = 0; i < cfg->banks; i++) {
 		__clear_bit(i, toclear);
@@ -1289,12 +1294,17 @@ void do_machine_check(struct pt_regs *regs, long error_code)
 			no_way_out = worst >= MCE_PANIC_SEVERITY;
 	} else {
 		/*
-		 * Local MCE skipped calling mce_reign()
-		 * If we found a fatal error, we need to panic here.
+		 * If there was a fatal machine check we should have
+		 * already called mce_panic earlier in this function.
+		 * Since we re-read the banks, we might have found
+		 * something new. Check again to see if we found a
+		 * fatal error. We call "mce_severity()" again to
+		 * make sure we have the right "msg".
 		 */
-		if (worst >= MCE_PANIC_SEVERITY && mca_cfg.tolerant < 3)
-			mce_panic("Machine check from unknown source",
-				  NULL, NULL);
+		if (worst >= MCE_PANIC_SEVERITY && mca_cfg.tolerant < 3) {
+			mce_severity(&m, cfg->tolerant, &msg, true);
+			mce_panic("Local fatal machine check!", &m, msg);
+		}
 	}
 
 	/*
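Editorial note: the two hunks above give the local machine check path two separate panic points with two different messages. The standalone model below is a rough sketch of that decision logic only, not kernel code. The enum values, the function local_mce_outcome() and its parameters are hypothetical stand-ins for the handler's lmce, no_way_out, worst and tolerant state, and broadcast machine checks (which go through the mce_start() rendezvous) are deliberately left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical severity scale; only the ordering matters here. */
enum sev { SEV_NONE = 0, SEV_RECOVERABLE = 1, SEV_PANIC = 2 };

/* Which of the two panic points in the patched local path fires. */
enum outcome { NO_LOCAL_PANIC, PANIC_EARLY, PANIC_AFTER_RESCAN };

static enum outcome local_mce_outcome(bool lmce, bool no_way_out,
				      enum sev worst_on_rescan, int tolerant)
{
	assert(tolerant >= 0 && tolerant <= 3);

	/* Broadcast machine checks are not modelled in this sketch. */
	if (!lmce)
		return NO_LOCAL_PANIC;

	/* First hunk: the early bank check already decided we must crash. */
	if (no_way_out)
		return PANIC_EARLY;

	/* Second hunk: re-reading the banks turned up something fatal. */
	if (worst_on_rescan >= SEV_PANIC && tolerant < 3)
		return PANIC_AFTER_RESCAN;

	return NO_LOCAL_PANIC;
}

/* Message strings as they appear in the patch; NULL means no panic. */
static const char *outcome_message(enum outcome o)
{
	switch (o) {
	case PANIC_EARLY:
		return "Fatal local machine check";
	case PANIC_AFTER_RESCAN:
		return "Local fatal machine check!";
	default:
		return NULL;
	}
}
```

In this model, the failure from the commit message (a local machine check that the early bank check already graded as fatal) maps to PANIC_EARLY. The PANIC_AFTER_RESCAN case is the one the commit message says it wants to be able to tell apart in logs.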
