Common oversights with SSD specifications

It is commonly assumed that solid state disks are more reliable than hard disks because they have no moving parts. While the evidence for this is lacking, we’ll look at the impact of non-recoverable read errors on reliability.

SSD Specifications

The reliability specifications for SSDs differ somewhat from those of hard disks. The relevant JEDEC standard document is JESD218. I won’t give a detailed review of this document, but will just pick out some of the relevant information. JEDEC uses the term UBER (uncorrectable bit error rate), which relates to the non-recoverable read error rate (NRRE) as:

$$NRRE = \frac{1}{UBER}$$

Remember, these types of errors (UBER or NRRE) are data loss events.

We should note that JESD218 defines two classes of SSDs – client and enterprise. Client devices are specified to operate only 8 hours/day at 40 °C, while enterprise devices operate 24 hours/day at 55 °C.

SSD and HDD reliability comparison

SSDs support much higher small block random IO rates than HDDs. This higher performance in these workloads is the main reason to use SSDs in a storage system (the sequential IO performance is often no better than HDD). Therefore, it is valid to consider reliability for an all small block IO workload.

Assume a workload of all 4kB random read IOs, such as is typical of on-line transaction processing. (NRRE behavior is measured during read operations.) Let’s see how the SSDs stack up using the JEDEC JESD218 specifications for NRRE (UBER in SSD parlance).
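To make the arithmetic concrete before walking through the table, here is a small Python sketch converting a sustained 4kB read IOPS figure into an effective data rate. The IOPS values used are illustrative assumptions of my own, not the figures from table 1.

```python
# Sketch: effective data rate for a 100% random 4kB read workload.
# The IOPS figures below are illustrative assumptions, not table 1's values.

BLOCK_BYTES = 4096  # 4kB per IO

def data_rate_mb_per_s(iops, block_bytes=BLOCK_BYTES):
    """Effective sustained data rate in MB/s (decimal megabytes)."""
    return iops * block_bytes / 1e6

print(data_rate_mb_per_s(300))     # HDD-class IOPS: about 1.2 MB/s
print(data_rate_mb_per_s(30_000))  # SSD-class IOPS: about 123 MB/s
```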

In table 1, C HDD refers to a consumer HDD, EC HDD to a capacity-optimized enterprise HDD, and EP HDD to a performance-optimized enterprise HDD. C SSD refers to a client SSD and E SSD to an enterprise SSD. The second row lists typical performance for sustained 100% random 4kB reads in IOs per second (IOPS). The SSDs clearly outclass the HDDs here. However, with great performance comes great reliability requirements!

Requiring systems and users to track bits transferred reminds me of an old management school adage — “don’t make your problem my problem”. SSD vendors would do well to keep this in mind.

The third row shows the effective data rate in MB/s. The fourth row lists the NRRE specification for each class of device. The SSD values are from Table 1 in JESD218 (converted from UBER). Note that JEDEC appears to have cloned the EC HDD and EP HDD specifications here; it seems they felt that the proper reliability metric was failures per bit transferred, a recurring theme with SSDs. But businesses measure failures per unit time, not per bit transferred, and that is also how systems are engineered and warranted. We need to examine the impact the JEDEC specification has on failures per unit time.

The fifth row lists the typical duty cycle in each category, averaged over a year. We now have enough information to compute the failures per unit time. I have chosen to express this as the mean number of years between non-recoverable read errors in the sixth row, with duty cycle effects accounted for in the seventh row. The interval between NRREs is greater than a year for all HDDs, reaching 40 years for performance-optimized enterprise HDDs. However, for both client and enterprise SSDs, the interval is substantially less than 1 year!
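The failures-per-unit-time computation behind those rows can be sketched as follows. The IOPS and duty cycle numbers here are illustrative assumptions on my part, not the table's values; the 1e-16 UBER is the JESD218 enterprise target, and the HDD figure assumes an NRRE specification of 1 error per 1e16 bits.

```python
# Sketch: mean years between non-recoverable read errors for a 100%
# random 4kB read workload. Device numbers are illustrative assumptions.

SECONDS_PER_YEAR = 365.25 * 24 * 3600
BITS_PER_IO = 4096 * 8  # 4kB reads

def years_between_nrre(iops, uber, duty_cycle=1.0):
    """Mean wall-clock years between NRREs.

    iops       -- sustained random 4kB read IOPS
    uber       -- uncorrectable bit error rate (errors per bit read)
    duty_cycle -- fraction of the year spent running this workload
    """
    errors_per_active_second = iops * BITS_PER_IO * uber
    active_seconds_between_errors = 1.0 / errors_per_active_second
    # A lower duty cycle stretches the wall-clock interval between errors.
    return active_seconds_between_errors / (SECONDS_PER_YEAR * duty_cycle)

# Illustrative enterprise SSD: 30,000 IOPS at the JESD218 enterprise
# UBER target of 1e-16, running around the clock: well under a year.
print(years_between_nrre(30_000, 1e-16))
# Illustrative EP HDD: 300 IOPS at an NRRE of 1 per 1e16 bits
# (UBER = 1e-16): decades between errors.
print(years_between_nrre(300, 1e-16))
```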

Speed Kills

I see this result as a major oversight in the creation of SSD specifications. If SSDs operate at these specifications, then we expect them to produce non-recoverable read errors more frequently than equivalent HDDs. I propose that the specification for NRRE (UBER for the SSD inclined) should be chosen such that the failures per unit time are the same as the equivalent class HDD. Thus, the last row shows my proposed scaled NRRE targets for SSDs. Note, SSDs are just as reliable as HDDs if they meet these specifications — they will be no more reliable. Claiming higher reliability will require more stringent targets than my proposals.
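To show how such a scaled target could be derived, here is a sketch that equates errors per unit time for the two devices and solves for the SSD's UBER. The specific IOPS and UBER numbers are illustrative assumptions, not the table's values.

```python
# Sketch: scale an SSD's UBER target so its failures per unit time
# match those of the HDD it replaces. Setting
#   ssd_iops * bits_per_io * ssd_uber * ssd_duty
#     = hdd_iops * bits_per_io * hdd_uber * hdd_duty
# and solving for ssd_uber; the block size cancels out.

def scaled_uber_target(hdd_uber, hdd_iops, ssd_iops,
                       hdd_duty=1.0, ssd_duty=1.0):
    """UBER an SSD must meet to match an HDD's NRREs per unit time."""
    return hdd_uber * (hdd_iops * hdd_duty) / (ssd_iops * ssd_duty)

# Illustrative: an SSD doing 100x the IOPS of the HDD it replaces
# needs a UBER target 100x more stringent (about 1e-18 here).
print(scaled_uber_target(1e-16, hdd_iops=300, ssd_iops=30_000))
```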

Now, some may feel that the above analysis is being rather harsh, and that JESD218 sets the UBER targets at end of life, thus the situation isn’t this dire. I would respond with the following points:

1. They left this analysis out of the document, so they haven’t shown they even considered it.
2. The UBER definition is hazy.
3. I am not aware of any published data (other than mine) showing that the UBER has actually been measured as a function of device behavior. I will show such data in upcoming posts.
Notes:
1. I have seen failure analysis reports on many hard disk programs over the years. Typically, mechanically related failures account for less than half the total. Electronic and microcode failures are quite common (ever had a cell phone break?). So, the no-moving-parts argument needs to be backed up with field data.
2. JEDEC Solid State Technology Association, “Solid-State Drive (SSD) Requirements and Endurance Test Method”, JESD218, September 2010.