A philosophical point
When testing devices, it is usually best to test beyond the specifications.
This is especially important when architecting systems.
This allows for a better understanding of behavior in the region of the failure limits.
I also find it better to interpolate behavior than to extrapolate it.
Finally, we may learn something interesting!
The test methodology
I have measured the error rate behavior for a set of MLC NAND SSDs with real-time aging. I have looked at 5xnm, 4xnm and 3xnm class consumer-grade devices. I haven’t had the chance to measure 2xnm devices yet. I chose consumer-grade flash as this is the dominant flash on the market, and one of the goals is to learn how to use such devices in enterprise systems.
I have created a test suite to measure the error rate surface in multiple dimensions. It can measure the bit error rate as a function of the data age (retention), the P-E cycle count (endurance), the number of reads since the last write (read disturb) and temperature. I'm not looking at other effects, such as write disturb, here. Off-the-shelf SSDs with a special microcode load are used. The special features include: turning off wear leveling, providing direct read, write and erase operations, and raw (no ECC) reads. The latter allows a bit-for-bit compare to be used to determine the bit error rate. The SSD controller still determines how the write, read and erase operations are performed. All the SSDs were purchased through retail channels, and thus should be representative of what is available.
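The bit-for-bit compare is conceptually simple: write a known pattern, read it back raw (bypassing the ECC), XOR the two buffers, and count the differing bits. A minimal sketch in Python, with hypothetical function names (the real test suite drives the SSD through its special microcode interface, which is not shown here):

```python
# Sketch of the bit-for-bit compare used to compute the raw bit error
# rate (RBER). The buffers would come from a direct write and a raw
# (no-ECC) read of the same flash page.

def count_bit_errors(written: bytes, read_back: bytes) -> int:
    """XOR each byte pair and count the set bits, i.e. the bits that differ."""
    assert len(written) == len(read_back)
    return sum(bin(w ^ r).count("1") for w, r in zip(written, read_back))

def raw_ber(written: bytes, read_back: bytes) -> float:
    """Raw bit error rate: differing bits divided by total bits compared."""
    total_bits = len(written) * 8
    return count_bit_errors(written, read_back) / total_bits
```

For example, a two-byte page written as `b"\xff\x00"` and read back as `b"\xfe\x00"` has one differing bit in sixteen, giving a raw BER of 0.0625. Sweeping this measurement across retention time, P-E cycles, read count and temperature yields the error rate surface.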
In the test procedure, the cycling and aging are all performed at the same temperature, as is likely to be the case in an enterprise application. This is distinct from many other test methods, where the device is cycled at room temperature, and then the retention is measured at high temperature. For example, JEDEC JESD218 specifies this type of test. I don't feel that this approach adequately reflects the operational environment. Enterprise SSDs are typically installed in rack-mount systems in a data center, where the temperature is controlled. I think the test method should replicate the operating conditions as closely as practicable. A further concern with JESD218 is the assumption that accelerated testing is valid, and that the Arrhenius model applies with a 1.1 eV activation energy. You can tell from the surface equation of chapter 5 that the observed temperature dependence is not Arrhenius. I will cover this in detail in a later chapter.
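To make the assumption concrete: under the Arrhenius model, the acceleration factor between a use temperature and a stress temperature is exp((Ea/k)(1/T_use − 1/T_stress)), with temperatures in kelvin and k the Boltzmann constant in eV/K. A short sketch (the 40 °C / 85 °C temperature pair is chosen purely for illustration, not taken from any specification):

```python
import math

BOLTZMANN_EV = 8.617e-5  # Boltzmann constant, eV/K

def arrhenius_af(ea_ev: float, t_use_c: float, t_stress_c: float) -> float:
    """Arrhenius acceleration factor between a use and a stress temperature.

    Temperatures are given in Celsius and converted to kelvin.
    """
    t_use = t_use_c + 273.15
    t_stress = t_stress_c + 273.15
    return math.exp((ea_ev / BOLTZMANN_EV) * (1.0 / t_use - 1.0 / t_stress))
```

With Ea = 1.1 eV, accelerating from 40 °C to 85 °C gives a factor of roughly 170, i.e. a short high-temperature bake is taken to stand in for months of room-temperature retention. That entire extrapolation collapses if the actual temperature dependence is not Arrhenius, which is the point at issue.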
In enterprise applications, SSDs aren't usually "rode hard and put up wet". That is to say, they aren't used heavily and then left idle. This might be more true of a light IT workload, such as a laptop, where a device might be used for a short period, with long idle periods. However, if the laptop were used hard and then shut down, the SSD might indeed be "put up wet", in that it has no time to perform post-activity cleanup once the power is removed. It might also heat up for a while as the cooling fans are turned off.
When I presented some of this data at the Non-volatile Memories Workshop in March 2012, a couple of people from flash vendors offered the opinion that my test methodology was flawed. I won't identify them or their companies, but they said measuring non-recoverable read errors (NRRE, i.e. sector failures) was not a valid method for determining SSD reliability. It is clear to me that measuring sector failures is a valid measure of system reliability. NRRE tests of storage media reliability have been used in systems for decades (hard disk, tape, etc.). (If a retention failure doesn't result in a sector loss, then what exactly is the error?) I bring this anecdote up because I think it will help my readers understand why there may be large differences between how the SSDs behave and how they are specified. If the manufacturers don't believe measuring non-recoverable read errors is valid, then perhaps they themselves don't do it. This by itself could explain how the measured behavior deviates from the specification.