The other contributor to reliability
It seems to be rather common to focus on complete device failure as the major contributor to reliability. The impact of non-recoverable read errors (NRRE) tends to be underappreciated in this regard, since it depends on the system architecture.
In this section, I’ll use hard disks to illustrate the situation, but as I’ll show in subsequent posts, the analysis applies to solid state devices as well.
Storage devices aren’t always capable of returning all the data stored on them. In storage systems, the term “non-recoverable read error” refers to the failure of a read operation to return the requested data after exhausting all means to recover it. Thus, such errors represent data loss events at the device level. They are also referred to as unrecoverable bit errors, although when such an error occurs, more than one bit is lost: typically an entire data sector (512 bytes in most storage devices). They are distinct from undetected read errors, which are silent data corruption events.
In storage devices, non-recoverable read errors are typically specified by the error interval, such as less than 1 event per some number of bits transferred (or read). For example, a specification might be < 1 event in 10^14 bits transferred. I find that specifying a maximum number of errors in an interval is not very meaningful statistically. Further, while 10^14 bits seems really large, I should point out that there are 0.08 × 10^14 bits in a terabyte!
A more meaningful specification
Let’s try to find units that make the rate of occurrence of non-recoverable read errors easier to understand than events per bits transferred.
Why not a non-recoverable read error rate specified as per TB transferred?
“Consumer” refers to low-cost consumer-grade hard disks, “enterprise capacity” to the low-cost, capacity-optimized enterprise-grade disks, and “enterprise high performance” to 10,000 and 15,000 RPM enterprise-grade disks. The typical non-recoverable read error interval specifications are listed in the second row. The third row lists the equivalent specifications using my proposed non-recoverable read error rate per TB (NRRE/TB) method.
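The conversion from an error interval to a per-TB rate is simple arithmetic, and a minimal sketch may help make it concrete. It assumes a decimal terabyte (10^12 bytes) and treats the "< 1 event per N bits" specification as an upper bound on the expected number of events. The function name `nrre_per_tb` is mine, for illustration only. The two intervals shown are the ones that correspond to the 8%/TB and 0.08%/TB figures quoted in this post.

```python
# Assumption: decimal terabyte, i.e. 10^12 bytes x 8 bits per byte.
BITS_PER_TB = 8 * 10**12


def nrre_per_tb(interval_bits: float) -> float:
    """Upper bound on expected non-recoverable read errors per TB read,
    for a spec of '< 1 event per interval_bits bits transferred'."""
    return BITS_PER_TB / interval_bits


# Intervals matching the per-TB figures quoted in the text.
for interval in (1e14, 1e16):
    rate = nrre_per_tb(interval)
    print(f"< 1 event in {interval:.0e} bits -> {rate:.2%} per TB")
```

Expressed this way, the specification scales linearly with the amount of data read, which is what makes it easy to reason about.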
The truth hurts
So why hasn’t the industry adopted such a useful specification? Look at the NRRE/TB for consumer disks – it’s 8%/TB! My opinion is that such a specification is too close to the truth, and makes manufacturers feel uncomfortable. (It also appears they have a low opinion of the mathematical capabilities of their customers!)
The specification for enterprise disks at 0.08%/TB is not that comforting either.
A system-oriented specification
I also find it instructive to look at specifications from an application point of view. In the case of storage systems, this means looking at how non-recoverable read errors impact system reliability.
Non-recoverable read errors per drive capacity
Storage systems commonly use some form of redundancy (such as RAID) to protect against device loss. Once a sufficient number of device failures have accumulated, all the redundancy is exhausted and all of the remaining contents must be read successfully to avoid data loss. (We can call this an “unprotected rebuild”, since the system data is rebuilt with no protection.) Since we need to read all the data, why not have a specification which expresses the probability of being able to successfully read the contents of an entire drive?
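As a rough illustration of what such a specification could look like, here is a minimal sketch. It rests on assumptions of mine, not on anything stated above: errors are independent, the interval specification is treated as a constant rate (a Poisson model), and the drive capacities are hypothetical. Under those assumptions, the probability of reading the whole drive without a non-recoverable read error is exp(-expected errors). Since the specification is an upper bound on the error rate, the result is best read as a lower bound on the success probability.

```python
import math

# Assumption: decimal terabyte, i.e. 10^12 bytes x 8 bits per byte.
BITS_PER_TB = 8 * 10**12


def p_read_whole_drive(capacity_tb: float, interval_bits: float) -> float:
    """Probability of reading an entire drive with no non-recoverable
    read error, assuming independent errors at a constant rate of
    one event per interval_bits bits (Poisson model)."""
    expected_errors = capacity_tb * BITS_PER_TB / interval_bits
    return math.exp(-expected_errors)


# Hypothetical drive capacities, using the 10^14-bit interval from the text.
for capacity_tb in (1, 4, 10):
    p = p_read_whole_drive(capacity_tb, 1e14)
    print(f"{capacity_tb:>2} TB drive: P(full read succeeds) = {p:.3f}")
```

The point of the sketch is only that the probability falls as capacity grows for a fixed error interval, which is why an unprotected rebuild becomes riskier with larger drives.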