Why this is important
Standalone disks
With a standalone disk (that is, one not in a RAID system), the impact of a non-recoverable read error depends on the application and on the average data rate. A common application for consumer-grade hard disks is in digital video recorders. Given that the error rate on cable and satellite inputs (reported at 440 errors/TB) is significantly worse than the 0.08 NRER/TB for the HDD, the HDD impact can be safely ignored. (Thus hard disk manufacturers have little incentive to improve the behavior here.)
Now consider a consumer-grade hard disk in a PC application. If we assume the HDD is used for 2,000 hours per year and that the average data rate is about 100kB/s, then about 0.7TB/year (on the order of 1TB/year) is transferred. At 0.08 NRER/TB, this would be roughly a 5 to 6%/year chance of a non-recoverable read error. So, we do expect to see non-recoverable read errors in this application (although they may be blamed on software).
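The PC estimate above can be checked with a few lines of Python. This is only a sketch: it assumes read errors arrive independently at a constant rate per TB (a Poisson model, which the text does not state), and the function and constant names are made up for illustration.

```python
import math

# Figures taken from the text
HOURS_PER_YEAR = 2_000
DATA_RATE_BYTES_PER_S = 100e3   # 100 kB/s
ERRORS_PER_TB = 0.08            # consumer-grade HDD

def annual_tb(hours, bytes_per_second):
    """Data transferred per year, in TB (1 TB = 1e12 bytes)."""
    return hours * 3600 * bytes_per_second / 1e12

def p_at_least_one_error(tb_read, errors_per_tb):
    """Chance of one or more errors, ASSUMING a Poisson error process."""
    return 1 - math.exp(-tb_read * errors_per_tb)

tb_per_year = annual_tb(HOURS_PER_YEAR, DATA_RATE_BYTES_PER_S)
p_error = p_at_least_one_error(tb_per_year, ERRORS_PER_TB)

print(f"{tb_per_year:.2f} TB/year, {p_error:.1%} chance of an error per year")
```

Under these assumptions the yearly transfer comes out a little under 1TB and the yearly error probability lands between 5% and 6%.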
In storage systems
When an array has accumulated sufficient failures to become unprotected, recovering all the data requires reading the full capacity of the remaining drives error-free. Consider reconstructing all the data from an unprotected array of five 3TB consumer-grade disks (good luck with that!). Reading 15TB at 8% NRRE/TB means we expect to see 1.2 non-recoverable errors on average. In other words, the operation is more likely to fail than to succeed.
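The claim that the reconstruction is more likely to fail than to succeed follows from the expected error count. Here is a minimal sketch of that step, again assuming independent errors at a constant rate per TB (a Poisson model, which is my assumption and not stated in the text); the function names are illustrative only.

```python
import math

def expected_read_errors(disks, tb_per_disk, errors_per_tb):
    """Expected non-recoverable read errors when reading every disk in full."""
    return disks * tb_per_disk * errors_per_tb

def p_error_free_read(expected_errors):
    """Chance of zero errors, ASSUMING a Poisson error process."""
    return math.exp(-expected_errors)

# Five 3TB consumer-grade disks at 0.08 errors/TB, as in the text
expected = expected_read_errors(disks=5, tb_per_disk=3, errors_per_tb=0.08)
p_success = p_error_free_read(expected)

print(f"expected errors: {expected:.1f}, chance of a clean read: {p_success:.0%}")
```

With 1.2 expected errors, the chance of reading all 15TB cleanly is about 30% under this model, so failure is the more likely outcome.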
Another observation is that conventional RAID doesn’t provide the device failure protection commonly expected. As we saw in the consumer-grade hard disk example, RAID 5 could not be claimed to provide single disk failure protection, since the probability of data loss per year was greater than 2%. So, if the NRRE/TB is large enough, then we should subtract 1 from the expected device failure protection in a conventional RAID scheme. That is, RAID 5 protects against 0 device failures (not a very useful concept). RAID 6, which has 2 devices’ worth of redundancy, provides only single device failure protection, and so on.
In the early days of RAID, the concept of “scrubbing” was introduced to deal with the non-recoverable read error problem. Scrubbing involves periodically verifying that the data on each storage unit can be read without error. When the number of non-recoverable read errors per drive capacity was much smaller, this technique worked quite well. However, this is no longer the case. Note that in the above analyses, I have always assumed that the array had no latent non-recoverable read errors when the reconstruction operation began.
In the case of consumer-grade hard disks, we have already reached the point where scrubbing can’t make the strip loss probability low enough to ignore with RAID 5. It may still be of use to limit the density of non-recoverable read errors. Thus, non-recoverable read errors are forcing systems to adopt RAID schemes that protect against multiple disk losses, while reserving part of that protection for non-recoverable read errors during rebuild. In effect, the net disk failure protection is one less than the theoretical limit.
We have shown that the reliability of a storage system depends on the ability of a storage device to successfully read data, in addition to the device’s probability of outright failure. In the RAID 5 example with consumer-grade hard disks, the system reliability was dominated by non-recoverable read errors. Thus, we can’t say one device is more reliable than the next if we only consider the device failure rate. If we insist on reading data from the device (it is not of much use otherwise), then we must also consider the non-recoverable read error rate and how it is manifested in the system.
Things to ponder
Before we head off into the next section, we should ask: what would the impact of NRER/TB be if we used the device at higher I/O rates?