Series SSD: 2. System Level Reliability

Quick RAID data loss estimate

Now that we understand the mechanisms contributing to data loss, we can do a quick estimate of the probability of data loss. What follows is an example of how to compute this for the RAID 5 array example from above. Let’s assume that the disks are specified at 1 Million hours MTBF (mean time between failures). While this seems to be a very long time, it’s really not very good.

An aside on MTBF – reliability through obscurity

I find the use of MTBF for component reliability gives a false sense of security. There are about 8,760 hours in a year, so while 1M hours is about 114 years, it means the probability of failure is just a little less than 1% per year, which isn’t that fantastic. That’s why I prefer the annual failure rate, or AFR. This is the probability of failure per year, or 8,760/MTBF (assuming a 100% duty cycle, which is a good assumption for enterprise applications). Thus, our 1M-hour MTBF is an AFR of 0.88%. That looks a lot like the non-recoverable read error rates from part 1 of this series.
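The MTBF-to-AFR conversion is simple enough to sketch in a few lines of Python (the function name is mine, not a standard API):

```python
# Convert a component MTBF spec (in hours) to an annual failure rate (AFR),
# assuming a 100% duty cycle as in the text.
HOURS_PER_YEAR = 8760

def mtbf_to_afr(mtbf_hours: float) -> float:
    """AFR as a fraction per year: operating hours per year / MTBF."""
    return HOURS_PER_YEAR / mtbf_hours

afr = mtbf_to_afr(1_000_000)  # the 1M-hour MTBF disk from the text
print(f"AFR = {afr:.2%}")     # about 0.88% per year
```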

So, we have an AFR of about 0.9% and a total of 7 disks in the array (we don’t need to consider the spare, since it holds no data), each with an NRRE rate of 8% per TB.

We can compute the probability of data loss if we make a few important assumptions:

  1. That the probability of a disk failure is independent of time
  2. That disk failures are independent of each other
  3. That the probability of a non-recoverable read error is independent of time
  4. That non-recoverable read errors on separate disks are independent of each other

If these conditions are met, then the probability of losing at least one disk per year, \(P_{Disk1}\), in an array of \(n\) disks is given by the binomial:

$$ P_{Disk1} = 1 - {n \choose 0}AFR^0\left(1-AFR\right)^{\left(n-0\right)}$$ Eqn. 1
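Since the \(k=0\) binomial term reduces to \(\left(1-AFR\right)^n\), Eqn. 1 is a one-liner; here is a quick check for the 7-disk array, using the 0.876% AFR derived from the 1M-hour MTBF above:

```python
def p_at_least_one_failure(n_disks: int, afr: float) -> float:
    # Eqn. 1: the k = 0 binomial term is just (1 - AFR)^n, so this is
    # 1 minus the probability that no disk fails during the year.
    return 1.0 - (1.0 - afr) ** n_disks

# 7 data disks, each with the 0.876% AFR from the text
p_disk1 = p_at_least_one_failure(7, 8760 / 1_000_000)
print(f"P_Disk1 = {p_disk1:.2%}")  # roughly 6% per year
```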

If the capacity of each disk is \(C\) TB, then on a single disk failure, the probability of encountering a non-recoverable read error during the rebuild is (assuming the array was free of latent non-recoverable read errors at the start of the rebuild):

$$ P_{StripLoss} = 1 - {C \choose 0} NRRE^0 \left(1-NRRE\right)^{\left(C-0\right)}.$$ Eqn. 2
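Eqn. 2 has the same shape as Eqn. 1, with capacity in TB playing the role of the trial count. A minimal sketch, using the 8%/TB NRRE rate from the text and a hypothetical 2 TB disk capacity (the capacity is my placeholder, not a value from the article):

```python
def p_strip_loss(capacity_tb: float, nrre_per_tb: float) -> float:
    # Eqn. 2: probability of hitting at least one non-recoverable read
    # error while reading capacity_tb terabytes back during the rebuild.
    return 1.0 - (1.0 - nrre_per_tb) ** capacity_tb

# Hypothetical 2 TB disks at the 8%/TB NRRE rate quoted in the text.
p_strip = p_strip_loss(2, 0.08)
print(f"P_StripLoss = {p_strip:.1%}")  # about 15% per rebuild
```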

The probability of losing a second disk during the rebuild depends on the time it takes to read all the data from the remaining disks and write the data to the spare disk. (If a spare isn’t available, then we need to add the time to obtain the spare). Call the rebuild time in hours \(Rb_H\), then the probability of a disk loss in \(Rb_H\) hours is

$$ P_{DiskRb} =\frac{AFR \cdot Rb_H}{8760}.$$ Eqn. 3

The probability of a second disk loss in the array is then

$$P_{Disk2} = 1 - {n-1 \choose 0} {P_{DiskRb}}^0\left(1-P_{DiskRb}\right)^{\left(n-1-0\right)}.$$ Eqn. 4
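The rebuild-window failure probability can be sketched the same way: scale the AFR down to the rebuild window, then ask whether at least one of the remaining \(n-1\) disks fails in it. The 24-hour rebuild time below is a placeholder assumption, not a value from the article:

```python
def p_disk_during_rebuild(afr: float, rebuild_hours: float) -> float:
    # Eqn. 3: scale the annual failure rate down to the rebuild window.
    return afr * rebuild_hours / 8760

def p_second_disk(n_disks: int, afr: float, rebuild_hours: float) -> float:
    # At least one of the remaining n - 1 disks fails during the rebuild.
    p_rb = p_disk_during_rebuild(afr, rebuild_hours)
    return 1.0 - (1.0 - p_rb) ** (n_disks - 1)

# Hypothetical 24-hour rebuild of the 7-disk array from the text.
p_disk2 = p_second_disk(7, 0.00876, 24)
print(f"P_Disk2 = {p_disk2:.2e}")  # on the order of 1e-4
```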

We can now compute the probability of an array data loss event, which is just

$$ P_{ArrayLoss} = P_{Disk1} * P_{Disk2}.$$ Eqn. 5

Finally, the total probability of data loss is roughly

$$ P_{DataLoss} = P_{ArrayLoss} + P_{StripLoss}.$$ Eqn. 6
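Putting Eqns. 1 through 6 together gives a short end-to-end sketch. The 2 TB capacity and 24-hour rebuild time are placeholder assumptions of mine, not values from the article’s table:

```python
AFR = 8760 / 1_000_000   # AFR from the 1M-hour MTBF spec
N = 7                    # data disks in the RAID 5 array
NRRE = 0.08              # non-recoverable read errors per TB, from part 1
C_TB = 2                 # hypothetical disk capacity in TB
RB_H = 24                # hypothetical rebuild time in hours

p_disk1 = 1 - (1 - AFR) ** N            # Eqn. 1: first disk loss per year
p_strip = 1 - (1 - NRRE) ** C_TB        # Eqn. 2: NRRE during rebuild
p_rb = AFR * RB_H / 8760                # Eqn. 3: disk loss in rebuild window
p_disk2 = 1 - (1 - p_rb) ** (N - 1)     # Eqn. 4: second disk loss
p_array = p_disk1 * p_disk2             # Eqn. 5: array loss
p_loss = p_array + p_strip              # Eqn. 6: total data loss

print(f"P_ArrayLoss = {p_array:.2e}, P_StripLoss = {p_strip:.1%}")
print(f"P_DataLoss = {p_loss:.1%}")  # dominated by the strip-loss term
```

With these assumed values the array-loss term is several orders of magnitude smaller than the strip-loss term, consistent with the conclusion drawn from Table 2 below.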

In Table 2 below I have summarized the results for consumer hard disks.

Table 2. Quick RAID 5 data loss estimate for consumer hard disks.

We can clearly see that data loss is dominated by strip loss events. This loss rate is unacceptable, which is why RAID 5 is not useful for such a configuration, and a greater degree of protection is required.
