System reliability targets
Now that we have some motivation, let’s delve into system reliability targets. I feel it is important to express reliability with the customer in mind. It should be readily apparent to the customer what sort of behavior to expect. Thus, units should be chosen that make this abundantly clear and reflect how the customer experiences failures. Often, this will be different from the way the systems and components fail. It’s the system/component architect’s job to bridge this gap.
Failure events/probabilities for a system should be expressed per unit time.
As mentioned above, this is how customers experience failure events. They want to know how many times per year the system is down, needs maintenance, etc. They don’t measure failures per byte transferred, per IO or other such metric. So, that’s how we should express it for them.
Failure units should be as transparent as possible so that the customer can quickly determine the behavior.
For example, we should choose to represent device failure using annual failure rate (AFR), as opposed to mean time between failures (MTBF). While the conversion is obvious to engineers, MTBF gives a false impression of reliability. An MTBF of 1 million hours sounds infinite. However, since there are nearly 10,000 hours in a year (8,760 hours for those who wish to be explicit), the equivalent AFR is about 1%/year (0.876% for those who think failures can be measured with such precision). Both values accurately reflect the failure rate; however, the AFR makes the reliability much clearer.
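The conversion above is a one-liner. Here is a sketch in Python; the exponential (constant failure rate) model is my assumption, and for large MTBF it reduces to the familiar 8,760 / MTBF approximation used in the text:

```python
import math

HOURS_PER_YEAR = 8760

def mtbf_to_afr(mtbf_hours: float) -> float:
    """Annual failure rate as a fraction (0.01 == 1%/year).

    Assumes a constant failure rate, so AFR = 1 - exp(-8760 / MTBF),
    which is approximately 8760 / MTBF when MTBF is large.
    """
    return 1 - math.exp(-HOURS_PER_YEAR / mtbf_hours)

# An MTBF of 1 million hours works out to roughly 0.87%/year.
print(f"{mtbf_to_afr(1_000_000):.3%}")
```

Note how the approximation only holds for MTBF values much larger than a year; an MTBF of 8,760 hours gives an AFR of about 63%, not 100%.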
Reliability specification types
There are different ways to specify reliability for storage systems. They are all related to each other, thus none are incorrect. However, as we shall see, some provide better clarity. Here are three that are pretty common.
- Product based
- User based
- Program based
Table 2 describes an example SSD storage system, which consists of a single RAID 5 array of SSDs. There are 9 SSDs in the array: 7 data, 1 parity and 1 spare (yes, the parity and spare can be spread across the array, but that doesn’t change the calculations). We’ll assume that the only failures we need to consider are full SSD failures — we’ll ignore non-recoverable read errors for this example. This will keep things simple, if not accurate.
Product-based specifications are the easiest to compute, and are the ones usually communicated to customers. However, I find they aren’t the best for architecting systems. For example, they don’t clearly show warranty expense or expected customer impact events in the field. While a given product may seem reliable in isolation, the program may not make financial sense.
Let’s consider the example of table 2. We have our highly reliable 1.7 million hour MTBF SSDs (that’s the SSD product specification), and we have configured them as RAID 5, so this array should be incredibly reliable. If we just look at array loss (using the method of post 2 in this series), this works out to a probability of 7×10⁻⁶ per year. However, field service is still required when an SSD fails (it needs replacing, even though no data is lost). With 9 SSDs, there is a 4% chance of an SSD loss per year. So the product specifications are a data loss rate of 7×10⁻⁶ per year and a 4% chance of field service to replace an SSD.
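The field-service number can be reproduced directly from the per-SSD MTBF. This is a sketch, not the full array-loss derivation (that requires the rebuild-window math of the earlier post); the exponential failure model is my assumption:

```python
import math

HOURS_PER_YEAR = 8760
N_SSDS = 9            # 7 data + 1 parity + 1 spare, per the example
MTBF_HOURS = 1.7e6    # per-SSD MTBF from the example

# Per-SSD annual failure rate, assuming a constant failure rate.
afr_ssd = 1 - math.exp(-HOURS_PER_YEAR / MTBF_HOURS)   # about 0.5%/year

# Probability that at least one of the 9 SSDs fails in a year,
# i.e. the annual chance of a field-service event.
p_service = 1 - (1 - afr_ssd) ** N_SSDS

print(f"per-SSD AFR: {afr_ssd:.2%}")
print(f"annual field-service probability: {p_service:.1%}")
```

This gives roughly 4–5% per year, consistent with the ~4% quoted above.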
For user-based specifications, we also need to know the size of the typical user deployment. Let’s assume a typical user installation will deploy 50 arrays. Now the user’s probability of data loss is 4×10⁻⁴ per year, and they will see your maintenance person twice a year. As you can see, the customer has a different view of overall reliability than the simple product numbers suggest. For example, they may find the loss of an SSD every six months unsettling. On the other hand, they might really like your maintenance person.
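Scaling the product-level numbers to a deployment is simple multiplication for small probabilities. A sketch, taking the per-array figures from the example as given:

```python
N_ARRAYS = 50              # assumed typical user deployment
P_DATA_LOSS_ARRAY = 7e-6   # per-array annual data-loss probability (from the text)
P_SERVICE_ARRAY = 0.04     # per-array annual SSD-replacement probability (from the text)

# Probability of at least one data-loss event across the deployment per year.
# For small p this is essentially N * p.
p_user_loss = 1 - (1 - P_DATA_LOSS_ARRAY) ** N_ARRAYS

# Expected maintenance visits per year across the deployment.
service_visits = N_ARRAYS * P_SERVICE_ARRAY

print(f"user data-loss probability: {p_user_loss:.1e}/year")
print(f"expected service visits: {service_visits:.1f}/year")
```

This reproduces the 4×10⁻⁴/year data-loss figure and the two visits per year.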
The third way is to look at the full program. For these specifications, we need to know the ship rate, the program lifetime and the product lifetime so we can compute the install base as a function of time. Let’s assume this is a large enterprise program, with targets of 50,000 arrays shipped per year, a 3-year program life and a 3-year product life. Thus, at the peak we have 150,000 units in the field. In that year, we expect 1 customer data loss event and 6,600 SSD losses. For the full program, we expect 3 data loss events and 20,000 maintenance events. So, even though the product and user reliability seem quite high, we still must expect to deal with data loss in the field. Gleaning such insights is why I prefer to design using the program approach.
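The program-level arithmetic above can be sketched as follows. The key observation is that every shipped array contributes one product lifetime of field exposure, so total unit-years is ship rate × program life × product life; the per-array rates are taken from the example:

```python
SHIP_RATE = 50_000          # arrays shipped per year (assumed program target)
PROGRAM_YEARS = 3           # years the program ships
PRODUCT_LIFE_YEARS = 3      # years each array stays in service
P_DATA_LOSS = 7e-6          # per-array, per-year (from the text)
P_SERVICE = 0.044           # per-array, per-year (9 SSDs at ~0.5% AFR each)

# Peak install base: shipments accumulate until the oldest units retire.
peak_install_base = SHIP_RATE * min(PROGRAM_YEARS, PRODUCT_LIFE_YEARS)

# Total field exposure over the whole program, in array-years.
unit_years = SHIP_RATE * PROGRAM_YEARS * PRODUCT_LIFE_YEARS

print(f"peak install base: {peak_install_base:,}")
print(f"peak-year data-loss events: {peak_install_base * P_DATA_LOSS:.1f}")
print(f"program data-loss events: {unit_years * P_DATA_LOSS:.1f}")
print(f"program maintenance events: {unit_years * P_SERVICE:,.0f}")
```

With 450,000 array-years of exposure, the program-wide expectations of roughly 3 data-loss events and ~20,000 maintenance events fall out directly.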