System design targets
The description above looked only at data loss, and we started from the array design. When designing a system, I start by generating program targets and then work down to the requirements for the product. The prior example considered only data loss caused by full SSD failure. Obviously, there are many other factors to consider. From a program perspective, I am interested in customer impact events, warranty expense and customer experience. The severity of a given customer impact event determines its target. For data loss, 1 event per year might be acceptable for a large program. If not, then in the above example we’d need something stronger than RAID 5.
Here is a short list of how I rank some types of customer impact events in order of severity, starting with the most severe:
- Data corruption (these are undetected by the system, and found by the user)
- Data loss
- Loss of access (maintenance is required to recover access)
- Loss of performance
- Unscheduled preventative maintenance
- Scheduled preventative maintenance
Each of these would have a separate target. The upper events have the capability of causing a customer’s business to fail, while the lower events mostly affect warranty expense and customer impressions of reliability.
Program reliability target estimator
I’d like to share a simple estimator I devised to provide an example of how this might be done for an enterprise SSD system. Given all the unknowns, we are only looking for accuracy of an order of magnitude (how well do we really know a component’s failure rate?). We’ll also ignore effects such as the active duty cycle and read/write ratio here.
In keeping with the theme of “speed kills”, we’ll just look at non-recoverable read errors here.
Table 3 lists the program assumptions and targets. Assume we are using enterprise-grade SSDs that perform 40,000 4kB IOPS. Each system has a target useful life (and likely warranty) of 5 years. The program goal is to ship 250,000 SSD units during the program (50,000 units per year).
For NRRE events, we need to compute the field usage for the program. Since NRREs occur at the sector level, this means determining the number of sector operations. Most SSDs use 512B sectors, so there are 8 in a 4kB IO. (Some SSDs use compression/deduplication internally, which adds complexity that we’ll skip here.) Note that the total number of sector operations in the field will be 1×10^19!
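The field usage figure can be reproduced with a few lines of arithmetic. This is only a rough sketch: it assumes every one of the 250,000 units runs at the full 40,000 IOPS for its entire 5-year life (consistent with ignoring duty cycle above), and the names used are illustrative, not taken from Table 3.

```python
# Order-of-magnitude estimate of field sector operations for the program.
# Assumption: every unit runs at full IOPS for its whole useful life.

IOPS_PER_SSD = 40_000            # 4kB IOs per second, per SSD
SECTORS_PER_IO = 4096 // 512     # 8 sectors of 512B in a 4kB IO
LIFE_YEARS = 5                   # target useful life
UNITS_SHIPPED = 250_000          # total units over the program
SECONDS_PER_YEAR = 365 * 24 * 3600


def field_sector_ops(iops=IOPS_PER_SSD, sectors_per_io=SECTORS_PER_IO,
                     years=LIFE_YEARS, units=UNITS_SHIPPED):
    """Total sector operations performed by all units over their life."""
    return iops * sectors_per_io * SECONDS_PER_YEAR * years * units


total_ops = field_sector_ops()
print(f"{total_ops:.2e}")  # 1.26e+19
```

Since we only care about the order of magnitude, 1.26×10^19 rounds to the 10^19 quoted above.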
Our chosen targets are 50 loss-of-access events for the full program, and 1 data loss event for the full program. We can’t set these to 0, as that would have no statistical meaning. However, you might try to set them to 0.1 or 0.01, although such tightening will potentially increase the program cost.
In the example of table 3, we have a total of 250,000 units in the field over 5 years. Thus the program consists of 1.5 million unit-years. I find unit-years to be a convenient sizing metric.
Now that we have created our system specification, we need to design a test to see if its targets are met. Let’s start with a typical HDD-type qualification test. These often comprise 1,000 units for 1,000 hours (sort of a 40 days and 40 nights test). One issue for SSDs is that we can only test retention out to 1,000 hours in such a test. While one can consider accelerated testing, such an approach requires validation. I will cover this in a subsequent post.
We have 50+ years of experience with hard disks, and the industry was slow to adopt them. It took 20 years from the first ship until there were 1 million units in the field. (They are currently shipping at over 500 million units per year.) So, it is prudent to hold SSDs to higher testing standards until sufficient field experience has been amassed.
Table 4 shows the test parameters for this example. The test duration is 1,000 hours and uses 1,000 SSDs. Such a test requires time to read the data, so assume we can achieve an 80% duty cycle for writes. Thus, there are 32,000 effective IOPS for this test. In 1,000 hours, we can therefore test 10^14 sectors. While this seems like a rather large number, recall from Table 3 that the field will perform 10^19 sector operations. In the last row, we see that this test is more than 5 orders of magnitude short of the program target of 1 data loss event per program. Thus, it will be rather difficult to extrapolate the program behavior from the result of one such test. Ideally, we expect to have 0 errors during such a test, but that won’t give great confidence at the program target. This test can write 5×10^17 bits, which is shy of my 10^18 bits NRRE interval specification for enterprise SSDs.
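To put a number on how little a clean test tells us, one common approach (my choice of illustration here, not something taken from Table 4) is the zero-failure Poisson bound: if no events are seen in N trials, the event rate is below -ln(1 - C)/N with confidence C, which is the familiar "rule of three" at 95%. The sketch below takes the roughly 10^14 tested and 10^19 field sector operations quoted in the text as inputs; the function name is hypothetical.

```python
import math

# Assumption: errors are independent and rare, so a Poisson model applies.


def zero_failure_rate_bound(trials, confidence=0.95):
    """Upper bound on the per-trial event rate after zero events in `trials`."""
    return -math.log(1.0 - confidence) / trials


TEST_SECTOR_OPS = 1e14    # sectors exercised by the 1,000 x 1,000 hour test
FIELD_SECTOR_OPS = 1e19   # sector operations expected in the field

rate_bound = zero_failure_rate_bound(TEST_SECTOR_OPS)
field_events_bound = rate_bound * FIELD_SECTOR_OPS

print(f"rate bound per sector op: {rate_bound:.1e}")
print(f"field events not ruled out: {field_events_bound:,.0f}")
```

Under these assumptions, a test with zero errors only shows that the program would see fewer than about 300,000 such events at 95% confidence, which is a long way from a target of 1.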
This result shouldn’t be that surprising. The test has 1,000 units for 1,000 hours, which is about 100 unit-years, while the program is 1.5 million unit-years.
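The same comparison can be made in unit-years. This is a minimal sketch that takes the 1.5 million unit-year program size stated above as given; the helper name is illustrative.

```python
# Compare the qualification test to the program in unit-years.

HOURS_PER_YEAR = 24 * 365.25


def unit_years(units, hours):
    """Convert a population of units run for `hours` into unit-years."""
    return units * hours / HOURS_PER_YEAR


test_unit_years = unit_years(units=1_000, hours=1_000)
program_unit_years = 1.5e6   # program size as stated in the text
ratio = program_unit_years / test_unit_years

print(f"test: {test_unit_years:.0f} unit-years")
print(f"program is about {ratio:,.0f}x larger than the test")
```

The test comes out at roughly 114 unit-years, so the program is about four orders of magnitude larger than what the qualification test exercises.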
The irony is that this test is too short to give sufficient confidence, yet is so expensive that few manufacturers will undertake such a test with enterprise SSDs.