Towards an improved error reporting infrastructure for Linux. Unified error handling - A worthy goal? These delays include asynchronous hardware reporting of the machine check event. How can a machine check for accessing erroneous memory contents be asynchronous? That’s the stuff Andi Kleen and co. Memory errors are classified as either soft (transient) or hard (permanent). ISTM you want to map a known bad page there instead.

Take a look here: While the specifics of how hardware and the kernel might implement memory poisoning vary, the general concept is as follows. It is not recommended to use them for planning purposes. That’s the stuff Andi Kleen and co.

Posted Sep 1, ECC is able to recover from multibit/multibyte errors. This document is dated June, so it’s not like it’s ancient. While single-bit data errors can be corrected via ECC, multi-bit data errors cannot. Usenix Annual Tech Conference: the famous Google memory error study.
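To make the single-bit versus multi-bit distinction concrete, here is a minimal sketch of a SECDED (single error correct, double error detect) code, using an extended Hamming(8,4) layout. This is a toy under stated assumptions: real memory ECC uses much wider codewords, and the function names `encode` and `decode` are invented for this illustration.

```python
def encode(nibble):
    """Encode 4 data bits as an 8-bit extended Hamming codeword (list of bits).

    Index 0 holds the overall parity bit; indices 1..7 are the classic
    Hamming positions (1, 2, 4 are parity bits; 3, 5, 6, 7 are data bits).
    """
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]   # covers positions 1, 3, 5, 7
    code[2] = code[3] ^ code[6] ^ code[7]   # covers positions 2, 3, 6, 7
    code[4] = code[5] ^ code[6] ^ code[7]   # covers positions 4, 5, 6, 7
    code[0] = sum(code[1:]) % 2             # overall parity over positions 1..7
    return code


def decode(code):
    """Return (status, nibble); status is 'ok', 'corrected' or 'uncorrectable'."""
    code = list(code)
    # The syndrome is the XOR of the positions of all set bits; it is 0 for
    # a valid codeword and equals the flipped position after a single flip.
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos
    overall_bad = sum(code) % 2 == 1
    if syndrome == 0 and not overall_bad:
        status = "ok"
    elif overall_bad:
        # Odd number of flips: treat as a single-bit error and repair it.
        # A syndrome of 0 here means the overall parity bit itself flipped.
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome with good overall parity: two bits flipped.
        return "uncorrectable", None
    nibble = code[3] | code[5] << 1 | code[6] << 2 | code[7] << 3
    return status, nibble


word = encode(0b1011)

one_flip = list(word)
one_flip[5] ^= 1
print(decode(one_flip))

two_flips = list(word)
two_flips[2] ^= 1
two_flips[6] ^= 1
print(decode(two_flips))
```

The second case is the one poisoning is meant for: the error is detected but the data cannot be repaired, so the only options are to flag it or to stop.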

For users:

It is in volume 3. Ongoing evolution of Linux x86 machine check handling at LinuxCon. How can the CPU continue executing and generate a machine check at some arbitrarily later time? Intel’s recent preview of its Xeon processor codenamed Nehalem-EX promises support for memory poisoning.


Unlike clean pages, dirty pages in these caches have differences between the memory and disk copies. Since page flags are currently in short supply, this choice was not made without consternation and debate by kernel hackers.

In either case, the hardware doesn’t immediately cause a machine check but rather flags the data unit as poisoned until read or consumed.
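The deferred behaviour described here can be modelled in a few lines. The following is only a toy sketch, not how any processor or kernel implements poisoning; `ToyMemory`, `mark_poisoned` and `PoisonedDataError` are hypothetical names chosen for the illustration.

```python
class PoisonedDataError(Exception):
    """Stands in for the machine check raised when poisoned data is consumed."""


class ToyMemory:
    def __init__(self, size):
        self.cells = [0] * size
        self.poisoned = set()   # addresses flagged as holding bad data

    def mark_poisoned(self, addr):
        # An uncorrectable error was detected at addr. Nothing is raised
        # yet: the data unit is only flagged.
        self.poisoned.add(addr)

    def read(self, addr):
        # The error surfaces only when the flagged data is actually consumed.
        if addr in self.poisoned:
            raise PoisonedDataError(f"consumed poisoned data at address {addr}")
        return self.cells[addr]


mem = ToyMemory(4)
mem.mark_poisoned(2)        # flagged, but execution continues
print(mem.read(1))          # unaffected data is still readable
try:
    mem.read(2)
except PoisonedDataError as exc:
    print("machine check:", exc)
```

If the flagged address is never read, no error is ever raised, which is the point of deferring the machine check until consumption.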

The OS can then take appropriate action, like killing the process with the corrupted data or logging the event properly. One downside to the ever-increasing memory size available on computers is an increase in memory failures. Try to keep everything running as smoothly as possible, bringing down only the affected tasks, if any. Note that this property would be system dependent: not all systems would necessarily be this imprecise.

Introduction to platform hardware errors on modern x86 machines including detailed flows and recent improvements to the Linux x86 machine check handling, with a focus on memory errors.


ISTM you want to map a known bad page there instead. Towards an improved error reporting infrastructure for Linux. For “Action Optional” machine checks that can happen asynchronously to program execution (such as due to scrubbing), the OS can queue up a handler to go deal with the affected page, either by poisoning it or unmapping it or what-have-you.
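The "queue up a handler" idea for asynchronous events can be sketched as a simple deferred-work pattern. This is an illustrative model only, assuming a single pending queue; `ToyKernel`, `report_action_optional` and `run_deferred_work` are hypothetical names and do not correspond to kernel functions.

```python
from collections import deque


class ToyKernel:
    def __init__(self):
        self.pending = deque()  # page frame numbers waiting to be handled
        self.isolated = set()   # pages that have been taken out of use
        self.log = []

    def report_action_optional(self, pfn):
        # Called from the (modelled) machine check context: do as little
        # as possible here and defer the real work.
        self.pending.append(pfn)

    def run_deferred_work(self):
        # Runs later, in an ordinary context, and deals with each page.
        while self.pending:
            pfn = self.pending.popleft()
            self.isolated.add(pfn)
            self.log.append(f"page {pfn:#x} isolated")


kernel = ToyKernel()
kernel.report_action_optional(0x1234)   # e.g. found by a memory scrubber
kernel.report_action_optional(0x5678)
print(sorted(kernel.isolated))          # still empty: nothing handled yet
kernel.run_deferred_work()
print(kernel.log)
```

The gap between reporting and handling in this model is the same kind of delay the discussion attributes to deferred handler execution: a page stays in use until the queued work actually runs.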

To offset this increased error rate, recent processors have included support for “poisoned” memory, an adaptive method for flagging and recovering from memory errors. Machine check handling on Linux (paper, slides) for Linux Kongress. These delays include asynchronous hardware reporting of the machine check event, and delayed execution of the handler via a workqueue.


mcelog — further reading

Huge pages fail since reverse mapping is not supported to identify the process which owns the page. This is in addition to the mcelog test suite included with the source (make test). Includes an overview of modern mcelog. Posted Aug 27. Posted Aug 28: This allows system software to perform recovery action on certain class of uncorrected errors and continue ... If I’m not mistaken, that’s the processor family this article was referring to.

Perhaps this is handled properly, but by just unmapping, aren’t you running the risk that some later memory allocation by that process might get the same virtual address, and thus instead of a SIGBUS the process keeps running with corrupted memory? Posted Dec 4, I found a different