Software memory bit-flip detection for platforms without ECC

osgx picture osgx · May 11, 2014 · Viewed 10k times · Source

Most available desktop (cheap) x86 platforms now still nave no ECC memory support (Error Checking & Correction). But the rate of memory bit-flip errors is still growing (not the best SO thread, Large scale CERN 2007 study "Data integrity": "Bit Error Rate of 10-12 for their memory modules ... observed error rate is 4 orders of magnitude lower than expected"; 2009 Google's "DRAM Errors in the Wild: A Large-Scale Field Study"). For current hardware with data-intensive load (8 GB/s of reading) this means that single bit flip may occur every minute (10-12 vendors BER from CERN07) or once in two days (10-16 BER from CERN07). Google09 says that there can be up to 25000-75000 one-bit FIT per Mbit (failures in time per billion hours), which is equal to 1 - 5 bit errors per hour for 8GB of RAM ("mean correctable error rates of 2000–6000 per GB per year").

So, I want to know, is it possible to add some kind of software error detection in system-wide manner (check both user and kernel memory). For example, create a patch for Linux kernel and/or to system compiler to add some checksumming of every memory page, and try to detect silent memory corruptions (bit-flips) by regular recomputing of checksums?

For example, can we see all writes to memory (both from user and kernel space), to distinguish between intended memory changes from in-memory bit flips? Or can we somehow instrument all codes with some helper?

I understand that any kind of software memory ECC may cost a lot of performance and will not catch all errors, but I think it can be useful to detect at least some memory bit-flips early, before they will be reused in later computations or stored to hard drive.

I also understand that better way of data protection from memory bitflips is to switch to ECC hardware, but most PC there are still non-ECC.

Answer

BraveNewCurrency picture BraveNewCurrency · Jun 18, 2014

The thing is, ECC is dirt cheap compared to "software ECC countermeasures". You can easily detect if they have ECC modules and complain (or print a warning) when they don't.

http://www.cyberciti.biz/faq/ecc-memory-modules/

For example, can we see all writes to memory (both from user and kernel space), to distinguish between intended memory changes from in-memory bit flips? Or can we somehow instrument all codes with some helper?

Er, you you will never "see" the bit-flips on the bus. They are literally caused by a particle hitting RAM, flipping a bit. Only much later can you notice that you read out something different than your wrote in. To detect this only via the bus, you would need a duplicate copy of all your RAM (i.e. create a shadow copy of what is in your real RAM, so you can verify every read returns what was written to that location.)

try to detect silent memory corruptions (bit-flips) by regular recomputing of checksums?

The Redis guy has a nice write-up on an algorithm for testing RAM for problems. http://antirez.com/news/43 But this is really looking for RAM errors, not random bit-flips.

If "recompute checksums" only works when you are NOT writing to the memory. That might be "good enough" but you'll need to figure out which pages are not being written to.

To catch 100% of the errors, every write must be pre-ceeded by computing the checksum of that block of memory, then comparing it to the recorded checksum (to make sure that block hasn't degraded in RAM). Only then is it safe to do the write and then update the checksum. As you can imagine, the performance of this will be horrible (at least 100x slower) performance.

I understand that any kind of software memory ECC may cost a lot of performance and will not catch all errors, but I think it can be useful to detect at least some memory bit-flips early, before they will be reused in later computations or stored to hard drive.

Well, there is a simple method to detect 100% of the errors, at a cost of 50% performance: Just run the computation on 2 boxes at once (or on one box at two different times, maybe with a RAM test in between if you are paranoid.) If the results differ, you have detected an error.

See also:

https://www.linuxquestions.org/questions/linux-hardware-18/how-to-detect-ecc-memory-errors-under-linux-886011/