Series SSD 5: The error rate surface for MLC NAND flash

Discussion

With this device, obtaining 1 year retention at the specified UBER will limit the P-E cycles to about 1,000. This SSD might be able to meet its target specification if it were to relocate blocks well before the data has aged 1 year. This approach works only while power is applied, requires a real-time clock to determine the actual data age, and reduces the available P-E cycles. The manufacturer doesn’t specify whether the device performs such relocation, however.
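To make the trade-off concrete, here is a minimal sketch of an age-based relocation policy. It is not how this device (or any particular SSD) works; the `Block` record, the function names, and the 90-day threshold are all hypothetical. It assumes the controller can timestamp each block from a real-time clock, and that every relocation costs roughly one P-E cycle for each block's worth of data moved.

```python
from dataclasses import dataclass

@dataclass
class Block:
    block_id: int
    written_at_days: float  # timestamp from a real-time clock (hypothetical field)
    pe_cycles: int          # P-E cycles consumed so far

def blocks_due_for_relocation(blocks, now_days, max_age_days):
    """Return the blocks whose data is at least max_age_days old."""
    return [b for b in blocks if now_days - b.written_at_days >= max_age_days]

def refresh_cycles_consumed(service_life_days, max_age_days):
    """Extra P-E cycles spent per block of static data over the service life,
    assuming one relocation every max_age_days while the device is powered."""
    return int(service_life_days // max_age_days)

if __name__ == "__main__":
    blocks = [Block(0, 0.0, 10), Block(1, 80.0, 10)]
    due = blocks_due_for_relocation(blocks, now_days=100.0, max_age_days=90.0)
    print([b.block_id for b in due])
    print(refresh_cycles_consumed(5 * 365, 90))
```

Shortening the relocation interval lowers the data age the ECC has to cope with, but every relocation comes out of the same limited P-E budget, and none of it happens while the device sits unpowered.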

I should also point out that the surface of Figures 5 and 6 isn’t even at the limit of the temperature range! Thus, the device is even farther from meeting its specification than we’ve determined here.

Unfortunately, this device doesn’t appear to be unique. I have measured quite a number of SSDs, and not one of them has met its stated UBER specification at the P-E cycle and data age limits. It is interesting to speculate as to why this might be the case.

Possible explanations

Conjecture 1. The device is faulty. Thus the surface is not representative.

Response 1: The device did pass whatever tests the manufacturer performed, since it was purchased through a retail channel. Plus, I have seen similar behavior with a number of devices. Thus it does seem representative, or I’ve been rather fortunate to have selected a set of defective devices. This experience raises the question of how an end user could have determined that such devices were faulty.

Conjecture 2. The device was fine when it shipped.

Response 2: If so, SSDs can go bad between shipment and installation, which means we have some new failure mechanism to discover… And this mechanism causes SSDs to age prematurely.

Conjecture 3. The long-term retention is different. These devices were only measured out to a few hundred hours of retention, and the extrapolation is faulty.

Response 3: Fair enough, the long-term retention was extrapolated. However, so are most manufacturers’ tests. It’s not likely that they test to 1 year actual retention (chime in vendors if you do real-time aging – I’d love to know your test methodology), so their data is subject to the same criticism. I have taken the liberty of publishing the data openly so you can see it and critique it – the manufacturers haven’t. Until they openly publish their reliability tests, there is no basis to assume that their data, or their method of extrapolating the retention, is more accurate.

Conjecture 4. The device was P-E cycled too rapidly in the test, and the access method isn’t representative of how the device is used.

Response 4: SSDs are deployed to improve performance, thus vendors need to assume that they will be used to their maximum extent. Another point to ponder – if this test is deemed to be pushing the device too hard, then we should be concerned about our ability to test these devices (and by “we” I include the manufacturers).

Conjecture 5. This is an older generation (3xnm) device. The newer devices are likely to be better, and they have stronger ECC.

Response 5: These devices still didn’t meet their specifications. Still, it is likely that improvements have been made in many areas in newer generation devices. However, as the device geometry shrinks, it will take significant gains just to maintain the behavior of the prior generations. For example, increasing the ECC from 15 bit correction to 30 bit correction in the example of Figure 6 would allow it to just meet the 1 year retention specification. However, this doesn’t mean a 30 bit ECC would deliver the same results at smaller geometries.
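To show why ECC strength matters so much, here is a rough sketch of how correction capability maps to UBER. It is not the model behind Figure 6. It assumes bit errors are independent with a fixed raw bit error rate (real flash only approximates this), and it approximates UBER as the probability of an uncorrectable codeword divided by the codeword length in bits. The codeword size and raw bit error rate are hypothetical values chosen for illustration, not measurements from this device.

```python
import math

def log_binom_pmf(n, k, p):
    """Log-probability of exactly k bit errors in n bits (i.i.d. errors)."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log1p(-p))

def codeword_failure_prob(n_bits, t, rber):
    """Probability that a codeword has more than t bit errors,
    i.e. more than a t-bit-correcting ECC can repair."""
    if not 0.0 < rber < 1.0:
        raise ValueError("rber must be strictly between 0 and 1")
    return sum(math.exp(log_binom_pmf(n_bits, k, rber))
               for k in range(t + 1, n_bits + 1))

def uber(n_bits, t, rber):
    """Approximate UBER: uncorrectable codewords per bit read."""
    return codeword_failure_prob(n_bits, t, rber) / n_bits

if __name__ == "__main__":
    N_BITS = 512 * 8  # hypothetical codeword: 512-byte sector, parity bits ignored
    RBER = 1e-3       # hypothetical raw bit error rate
    for t in (15, 30):
        print(f"t={t:2d}  UBER ~ {uber(N_BITS, t, RBER):.3e}")
```

Under these assumptions the UBER falls steeply as the correction strength grows at a fixed raw bit error rate. The raw bit error rate itself rises with P-E cycles, data age, and shrinking geometry, so a stronger ECC buys margin only for as long as the raw rate stays put.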

Conjecture 6. The surface is accurate.

Response 6: If this is the case, and I believe that it is, we need to understand why the behavior is so different than expected. One might argue that this is fine with the manufacturer, since with wear leveling, the device is likely to fail out of warranty. However, I think the situation arises from a profound misunderstanding of the reliability characteristics of MLC flash, as we will explore in the next posts. When I show the temperature model it will highlight the inadequacies of the standard accelerated test procedures.

Parting thoughts

I am curious as to why SSD manufacturers and flash suppliers are so secretive about their reliability data. It makes me wonder if they have something to hide, or if they simply haven’t explored it thoroughly. I am also troubled that the trade press seems to only care about device performance when they review SSDs. I think they would do well not to assume that devices are necessarily reliable, and start performing their own tests.

In the next post I will go into details of how the test data is generated. After that we’ll begin to get into the specifics of the device data. I hope you’ve found this instructive so far, and are adequately prepared for what is to come.
