# Quick RAID data loss estimate

Now that we understand the mechanisms contributing to data loss, we can do a quick estimate of the probability of data loss. What follows is an example of how to compute this for the RAID 5 array example from above. Let’s assume that the disks are specified at 1 million hours MTBF (mean time between failures). While this seems to be a very long time, it’s really not very good.

## An aside on MTBF – reliability through obscurity

I find the use of MTBF for component reliability to give a false sense of security. This is because there are only about 8,760 hours in a year. So while 1M hours is about 114 years, it means the probability of failure is just a little less than 1% per year, which isn’t that fantastic. That’s why I prefer the annual failure rate, or AFR. This is the probability of failure per year, or 8,760/MTBF with MTBF in hours (assuming 100% duty cycle, which is a good assumption for enterprise applications). Thus, our 1M-hour MTBF is an AFR of 0.88%. That looks a lot like the non-recoverable read error rates from part 1 of this series.
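As a quick check of that arithmetic, the conversion can be sketched in Python. The function names are illustrative, and the exponential variant is an added assumption (a constant failure rate), included only to show that the simple ratio is a close approximation at these magnitudes.

```python
import math

HOURS_PER_YEAR = 8760  # 365 days * 24 hours

def afr_from_mtbf(mtbf_hours: float) -> float:
    """Annual failure rate as the simple ratio used in the text (100% duty cycle)."""
    return HOURS_PER_YEAR / mtbf_hours

def afr_exponential(mtbf_hours: float) -> float:
    """Assumption: constant failure rate, so P(fail within a year) = 1 - exp(-t / MTBF)."""
    return 1.0 - math.exp(-HOURS_PER_YEAR / mtbf_hours)

mtbf = 1_000_000  # hours, the disk specification assumed in the text
print(f"MTBF in years:     {mtbf / HOURS_PER_YEAR:.1f}")
print(f"AFR (ratio):       {afr_from_mtbf(mtbf):.2%}")
print(f"AFR (exponential): {afr_exponential(mtbf):.2%}")
```

Both estimates come out a little under 0.9% per year, so the simpler ratio is adequate for a rough estimate like this one.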

So, we have an AFR of about 0.9% and a total of 7 disks in the array (we don’t need to consider the spare since it has no data on it), each with a non-recoverable read error (NRRE) probability of 8% per TB.

We can compute the probability of data loss if we make a few important assumptions:

- That the probability of a disk failure is independent of time
- That disk failures are independent of each other
- That the probability of a non-recoverable read error is independent of time
- That non-recoverable read errors on separate disks are independent of each other

If these conditions are met, then the probability of losing at least one disk per year, \(P_{Disk1}\), in an array of \(n\) disks is one minus the binomial probability of zero failures:

$$ P_{Disk1} = 1 - {n \choose 0} AFR^0 \left(1-AFR\right)^{\left(n-0\right)}$$ | Eqn. 1 |
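To make this concrete, here is a minimal Python sketch that evaluates the first-disk-loss probability for the 7-disk example. The helper name `p_at_least_one` is hypothetical; the inputs are the AFR and disk count quoted above.

```python
from math import comb

def p_at_least_one(n: int, p: float) -> float:
    """1 minus the binomial probability of exactly zero events in n independent trials."""
    p_none = comb(n, 0) * p**0 * (1.0 - p) ** (n - 0)
    return 1.0 - p_none

AFR = 8760 / 1_000_000  # 0.876% per year, from the 1M-hour MTBF
N_DISKS = 7             # data-bearing disks in the example array

p_disk1 = p_at_least_one(N_DISKS, AFR)
print(f"P_Disk1 = {p_disk1:.4f}")        # about 0.06
print(f"n * AFR = {N_DISKS * AFR:.4f}")  # first-order approximation, slightly higher
```

For small failure rates the result is close to simply multiplying the AFR by the number of disks, which is a handy sanity check.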

If the capacity of each disk is \(C\) TB, then on a single disk failure, the probability of encountering a non-recoverable read error during the rebuild is (assuming the array was free of latent non-recoverable read errors at the start of the rebuild):

$$ P_{StripLoss} = 1 - {C \choose 0} NRRE^0 \left(1-NRRE\right)^{\left(C-0\right)}.$$ | Eqn. 2 |
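The strip-loss probability grows quickly with capacity, as a short Python sketch shows. The capacities in the loop are hypothetical values chosen for illustration, not figures from the example array; only the 8% per TB rate comes from the text.

```python
def p_strip_loss(capacity_tb: float, nrre_per_tb: float) -> float:
    """Probability of at least one non-recoverable read error while reading capacity_tb TB."""
    # The binomial coefficient and the zeroth power are both 1,
    # so only the "no error in any TB" term remains.
    return 1.0 - (1.0 - nrre_per_tb) ** capacity_tb

NRRE = 0.08  # 8% per TB, the figure used in the text

for capacity_tb in (1, 2, 4):  # hypothetical capacities, not taken from the text
    print(f"C = {capacity_tb} TB -> P_StripLoss = {p_strip_loss(capacity_tb, NRRE):.4f}")
```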

The probability of losing a second disk during the rebuild depends on the time it takes to read all the data from the remaining disks and write the data to the spare disk. (If a spare isn’t available, then we need to add the time to obtain the spare.) Call the rebuild time in hours \(Rb_H\); then the probability of a given disk failing within \(Rb_H\) hours is

$$ P_{DiskRb} = \frac{AFR \cdot Rb_H}{8760}.$$ | Eqn. 3 |

The probability of a second disk loss in the array, that is, of at least one of the remaining \(n-1\) disks failing during the rebuild, is then

$$P_{Disk2} = 1 - {n-1 \choose 0} {P_{DiskRb}}^0\left(1-P_{DiskRb}\right)^{\left(n-1-0\right)}.$$ | Eqn. 4 |
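A sketch of the rebuild-window step in Python, reading the second-disk probability as the chance that at least one of the \(n-1\) surviving disks fails within \(Rb_H\) hours. The 24-hour rebuild time is a hypothetical value for illustration; the text does not specify one here.

```python
HOURS_PER_YEAR = 8760

def p_disk_rebuild(afr: float, rebuild_hours: float) -> float:
    """Probability that one given disk fails during the rebuild window."""
    return afr * rebuild_hours / HOURS_PER_YEAR

def p_second_disk(n: int, afr: float, rebuild_hours: float) -> float:
    """Probability that at least one of the n - 1 surviving disks fails during the rebuild."""
    p = p_disk_rebuild(afr, rebuild_hours)
    return 1.0 - (1.0 - p) ** (n - 1)

AFR = 8760 / 1_000_000  # from the 1M-hour MTBF
N_DISKS = 7             # data-bearing disks in the example array
REBUILD_HOURS = 24.0    # hypothetical rebuild time, not taken from the text

print(f"P_DiskRb = {p_disk_rebuild(AFR, REBUILD_HOURS):.3e}")
print(f"P_Disk2  = {p_second_disk(N_DISKS, AFR, REBUILD_HOURS):.3e}")
```

Because the rebuild window is short compared with a year, this probability is tiny, on the order of one in ten thousand under these assumptions.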

We can now compute the probability of an array-loss event (a first disk failure followed by a second failure during the rebuild), which is just

$$ P_{ArrayLoss} = P_{Disk1} \cdot P_{Disk2}.$$ | Eqn. 5 |

Finally, the total probability of data loss is roughly

$$ P_{DataLoss} = P_{ArrayLoss} + P_{StripLoss}.$$ | Eqn. 6 |

In Table 2 below I have summarized the results for consumer hard disks.

The data loss is clearly dominated by strip loss events. This loss rate is unacceptable, which is why RAID 5 is not useful for such a setup, and a greater degree of protection is required.
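The dominance of strip loss can be checked with a rough end-to-end sketch that chains the steps above. This is not a reproduction of the table: the 2 TB capacity and 24-hour rebuild time are hypothetical inputs, and the function and key names are made up for illustration.

```python
HOURS_PER_YEAR = 8760

def p_at_least_one(trials: float, p: float) -> float:
    """Probability of one or more events in `trials` independent trials."""
    return 1.0 - (1.0 - p) ** trials

def raid5_data_loss(n, mtbf_hours, nrre_per_tb, capacity_tb, rebuild_hours):
    """Chain the estimate: first disk loss, strip loss, second disk loss, total."""
    afr = HOURS_PER_YEAR / mtbf_hours
    p_disk1 = p_at_least_one(n, afr)                     # any disk fails in a year
    p_strip = p_at_least_one(capacity_tb, nrre_per_tb)   # read error during rebuild
    p_disk_rb = afr * rebuild_hours / HOURS_PER_YEAR     # one disk fails in the window
    p_disk2 = p_at_least_one(n - 1, p_disk_rb)           # any survivor fails in the window
    p_array = p_disk1 * p_disk2
    return {
        "P_Disk1": p_disk1,
        "P_StripLoss": p_strip,
        "P_Disk2": p_disk2,
        "P_ArrayLoss": p_array,
        "P_DataLoss": p_array + p_strip,
    }

result = raid5_data_loss(
    n=7,                   # data-bearing disks, from the text
    mtbf_hours=1_000_000,  # from the text
    nrre_per_tb=0.08,      # from the text
    capacity_tb=2.0,       # hypothetical
    rebuild_hours=24.0,    # hypothetical
)
for name, value in result.items():
    print(f"{name:12s} {value:.3e}")
```

Under these assumptions the array-loss term is several orders of magnitude smaller than the strip-loss term, so the total is essentially the strip-loss probability.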