Series SSD: 2. System Level Reliability

In this series, we are considering the use of SSDs in storage systems. Thus, we need to consider reliability at the system level. In most storage systems, the storage devices are configured as a RAID (Redundant Array of Independent Disks). This allows the system to preserve data in the presence of failures.

To assess the impact of RAID on reliability, we need to understand the data loss mechanisms. RAID 5 is a common configuration that protects an array against data loss in the event of a single failure. You can think of it as adding a parity unit to a set of data units. In the examples below, we will use numbers to refer to data units, ‘P’ to refer to parity units, and ‘S’ to refer to spare units. Going forward, we’ll generically refer to all such redundancies as parity units. (In this exercise, we can safely ignore the parity rotation aspects of RAID 5.)
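To make the parity idea concrete, here is a minimal sketch. It assumes the parity unit is the byte-wise XOR of the data units, which is the usual way to picture RAID 5 parity. The helper name xor_units and the toy 4-byte units are made up for illustration.

```python
from functools import reduce
from operator import xor

def xor_units(units):
    """Byte-wise XOR of equal-length units (hypothetical helper)."""
    return bytes(reduce(xor, column) for column in zip(*units))

# Six toy data units, labeled 1 through 6 as in the examples in this post.
data = [bytes([label] * 4) for label in range(1, 7)]
parity = xor_units(data)

# Lose unit 4 (index 3), then rebuild it from the five survivors plus parity.
survivors = data[:3] + data[4:] + [parity]
rebuilt = xor_units(survivors)

print(rebuilt == data[3])
```

Any single missing unit can be recovered this way, because XOR-ing the survivors cancels everything except the missing unit.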

Array Loss

Since RAID 5 protects against single failures, we need to examine the impact of double failures. There are three different types of double failures. First, the array can lose one disk, then lose a second disk before the missing data has been rebuilt onto a spare disk. This is commonly referred to as an array loss event.

The array loss video above shows what can happen. In this case, we have a RAID 5 system with 6 data disks (labeled 1 through 6), one parity disk (labeled P), and one spare disk (labeled S). Once disk 4 fails, there is enough information in the array to recover all of disk 4’s data and rebuild it onto the spare. However, were another disk to fail during this rebuild operation (which typically takes many hours), then some of disk 4’s data would not be recoverable, and we would have what is called an array loss event.
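For a rough feel of how likely a second failure is during the rebuild window, the sketch below assumes disks fail independently at a constant rate (an exponential lifetime model). That model, the function name, and the numbers plugged in (a 1,000,000 hour MTTF and a 12 hour rebuild) are illustrative assumptions, not measurements from this series.

```python
import math

def p_second_failure(surviving_disks, rebuild_hours, mttf_hours):
    """Chance that at least one surviving disk fails before the rebuild ends.

    Assumes independent disks with a constant failure rate of 1 / mttf_hours.
    """
    combined_rate = surviving_disks / mttf_hours
    # 1 - exp(-rate * time), written with expm1 to stay accurate for tiny values
    return -math.expm1(-combined_rate * rebuild_hours)

# After disk 4 fails, 5 data disks and the parity disk remain: 6 survivors.
p_array_loss = p_second_failure(surviving_disks=6,
                                rebuild_hours=12.0,
                                mttf_hours=1_000_000.0)
print(f"{p_array_loss:.2e}")
```

Under these assumptions the probability scales almost linearly with the rebuild time, which is why long rebuilds matter.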

Strip Loss

A second mode of data loss occurs when the array has lost one disk and subsequently encounters a non-recoverable read error while rebuilding the missing data. Because a RAID 5 array is unprotected after a single disk loss, the data affected by that read error cannot be reconstructed. This is commonly referred to as a strip loss event, where some portion of a strip’s worth of data is lost.
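The exposure here comes from having to read every surviving disk in full during the rebuild. As a back-of-the-envelope sketch, assume read errors strike each bit independently with a fixed probability. The function name and the values used (1 TB disks and one non-recoverable error per 10^15 bits read) are placeholders chosen for illustration, not figures from this series.

```python
import math

def p_strip_loss(surviving_disks, disk_bytes, ure_per_bit):
    """Chance of at least one non-recoverable read error during a full rebuild.

    Assumes every bit of every surviving disk is read once and that errors
    are independent from bit to bit.
    """
    bits_read = surviving_disks * disk_bytes * 8
    # 1 - (1 - ure_per_bit) ** bits_read, computed in a numerically stable way
    return -math.expm1(bits_read * math.log1p(-ure_per_bit))

p_strip = p_strip_loss(surviving_disks=6, disk_bytes=1e12, ure_per_bit=1e-15)
print(f"{p_strip:.3f}")
```

With these placeholder values the result comes out at a few percent per rebuild, and it grows with both disk capacity and array width.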

The strip loss video above shows an example of such a data loss. Here, the array configuration is the same as in video 1. As in the array loss case, we start with disk 4 failing and its data being rebuilt onto the spare disk. However, instead of an entire disk failing during the rebuild process, a non-recoverable read error is encountered on disk 6. The parity strip containing this non-recoverable read error has now lost 2 pieces of data: the data on disk 4 and the data on disk 6. Thus, there is insufficient information to reconstruct the associated data from disk 4. At the end of the process, we are missing the data on disk 6 from the non-recoverable read error, and the data on disk 4 that couldn’t be reconstructed. While the total amount of data lost is smaller than in an array loss, it is still a data loss event, and may be just as catastrophic as an array loss. Such an event is called a strip loss.
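The scenario in the video can be mimicked with a toy rebuild loop. This is only a sketch: it assumes XOR parity, models each unit as a single integer, and uses made-up names (strips, unreadable, rebuild) plus an arbitrary choice of which strip holds the read error.

```python
from functools import reduce
from operator import xor

NUM_DATA = 6      # data disks 1 through 6
NUM_STRIPS = 5    # toy array with five parity strips

# Each strip holds six data units followed by their XOR parity unit.
strips = []
for s in range(NUM_STRIPS):
    units = [(s * 16 + d) & 0xFF for d in range(1, NUM_DATA + 1)]
    strips.append(units + [reduce(xor, units)])

FAILED_DISK = 3          # index of disk 4
unreadable = {(2, 5)}    # read error on disk 6 (index 5) in strip 2

def rebuild(strips, failed_disk, unreadable):
    """Rebuild the failed disk strip by strip; report strips that are lost."""
    rebuilt, lost = {}, []
    for s, strip in enumerate(strips):
        others = [d for d in range(len(strip)) if d != failed_disk]
        if any((s, d) in unreadable for d in others):
            lost.append(s)   # two units missing: parity cannot recover either
            continue
        rebuilt[s] = reduce(xor, (strip[d] for d in others))
    return rebuilt, lost

rebuilt, lost = rebuild(strips, FAILED_DISK, unreadable)
print(sorted(rebuilt), lost)
```

Every strip except the one with the read error is rebuilt correctly; in that one strip, both the disk 4 unit and the disk 6 unit are gone.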

There is a third combination of two failures that leads to data loss. The array can have 2 non-recoverable read errors in the same parity strip. Having non-recoverable read errors line up like this can be ignored to first order, since in hard disks such errors do not exhibit a tendency to correlate between drives, and the number of strips in an array is extremely large. Essentially, the probability is something like the square of the non-recoverable error rate, and thus can be safely ignored here. So, when estimating the probability of data loss in a system, we need only consider array loss and strip loss.
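The "square of the error rate" argument can be checked with a quick binomial calculation. The sketch below assumes each unit in a strip is unreadable independently with the same small probability; the value 1e-9 is a placeholder, not a measured rate, and the function name is made up.

```python
import math

def p_two_or_more_bad(units_per_strip, p_unit_bad):
    """Chance that at least two units in one strip are unreadable.

    Sums the binomial terms for k >= 2 directly, which avoids the
    cancellation error of computing 1 - P(0 bad) - P(1 bad) for tiny p.
    """
    n, p = units_per_strip, p_unit_bad
    return sum(math.comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(2, n + 1))

p_unit = 1e-9                            # hypothetical per-unit error chance
p_double = p_two_or_more_bad(7, p_unit)  # 6 data units + 1 parity unit
print(f"{p_double:.1e} vs single-unit {p_unit:.1e}")
```

The leading term is C(7, 2) times p squared, so for any small p the double-error case is many orders of magnitude rarer than a single error.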

Notes:
1. I have had systems people tell me that strip loss should be thought of as less serious than array loss: since most arrays aren’t filled to capacity, the lost strip likely missed user data. I don’t feel this is a wise argument. Even in a lightly filled array, it’s like throwing a rock at a building and hoping not to hit a window. You might get lucky. However, if one of your drives has failed, your luck is already suspect.
