Series SSD 5: The error rate surface for MLC NAND flash

Some measured surfaces

Now let’s take a look at some measured surfaces. For the tests presented here, the cycling and aging were both performed at the same temperature, as is likely to be the case in an enterprise application.

Log view

Figure 3 below is a log view of a measured surface at 70C from a 3xnm flash device. This surface was measured out to 12,000 cycles and 175H. Ages beyond 175H are extrapolated via equation 4.

[Image: ber_sl2]
Fig. 3. Relative bit error surface from a 3xnm class flash device at 70C. The horizontal axis is the age of the data since written in hours. The depth axis is the number of P-E cycles for the data. The vertical axis is the relative bit error rate, where the bit error rate at a P-E cycle count of 1 and an age of 0 hours is 1. All axes are logarithmic.

While it may seem remarkable how similar this surface is to the expected surface, it may not be that unexpected. After all, this is a log-log-log plot, so many functional forms will look similar on such a chart. (A case in point is that the functional form used to create figure 1 is not equation 1.) Nonetheless, it is nice to see that it behaves much as expected. Unfortunately, it is also easy to see how many orders of magnitude the bit error rate increases from 1 P-E cycle at a short age to 3,000 cycles at 1 year (8,760H) – about 20,000×.

3xnm surface details

From this point on, we will look at the surface using linear axes for the age and P-E cycles so that it’s easier to see the details of the shape.

The 3xnm devices specify 1 year retention at 3,000 P-E cycles and a UBER of 10⁻¹⁵. This UBER corresponds to a sector failure rate of 4.1×10⁻¹² for 512 byte sectors. These 3xnm devices employ a BCH 15 error correction code at the sector level. This code adds 195 check bits to each sector, and can correct any 15 error bits out of the total 4,291 bits. Assuming errors are random, this corresponds to a bit error rate of 3.4×10⁻⁴. Maximum operating temperature is specified as 70C.
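If it helps to see the arithmetic, here is a minimal sketch of the UBER-to-BER conversion, assuming independent random bit errors and the BCH 15 sector code described above. The function names and the use of scipy are my own, not part of the original analysis.

```python
# Sketch only: convert a UBER target into the raw bit error rate at which a
# BCH-15 protected 512-byte sector just meets that target, assuming bit
# errors are independent and uniformly random.
from scipy.stats import binom
from scipy.optimize import brentq

SECTOR_DATA_BITS = 512 * 8          # 4,096 user bits per sector
SECTOR_CODED_BITS = 4096 + 195      # 4,291 bits including the BCH check bits
CORRECTABLE_BITS = 15               # BCH 15 corrects up to 15 bit errors per sector

def sector_failure_rate(ber):
    """Probability that a sector sees more than 15 bit errors (uncorrectable)."""
    return binom.sf(CORRECTABLE_BITS, SECTOR_CODED_BITS, ber)

def ber_for_uber(uber):
    """Raw bit error rate at which the post-correction UBER target is just met."""
    target = uber * SECTOR_DATA_BITS          # sector failure rate implied by the UBER
    return brentq(lambda p: sector_failure_rate(p) - target, 1e-6, 1e-2)

print(ber_for_uber(1e-15))   # ~3.4e-4, the target worked out above
print(ber_for_uber(1e-17))   # ~2.5e-4, the HDD-equivalent target used later
```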

[Image: ber_s3]
Fig. 4. A 40C bit error rate surface from a 3xnm device. The horizontal axis is the age of the data since written in hours. The depth axis is the number of P-E cycles for the data. The vertical axis is the relative bit error rate. The 3,000 cycle grid line is highlighted in yellow, and the bit error rate contour corresponding to the UBER is highlighted in white.

Figure 4 shows the surface of a 3xnm device at 40C. The grid line for 3,000 P-E cycles is highlighted in yellow. The contour for a bit error rate of 3.4×10⁻⁴, which corresponds to the UBER target, is highlighted in white. This plot goes out to 2,000 hours, and you can see that the bit error rate at 3,000 cycles is already approaching the bit error rate limit even at this relatively short age.

The surface here was measured out to 14,000 cycles and 200 hours. Ages longer than this are extrapolated. Even so, we can see that this device is not going to meet its specifications.

Reliability given the surface

As we worked out above, the target bit error rate for this device is 3.4×10⁻⁴, corresponding to the UBER specification of 10⁻¹⁵. It is instructive to use the bit error rate surfaces to determine the actual data age limit for a device at the specified cycle count. We’ll use this technique later to study the value of designing a storage system to maximize the utilization of the flash devices.
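As a sketch of that technique: if the measured surface is available as a function of cycle count and age, the retention limit is simply the age at which that function crosses the target bit error rate at the specified cycle count. The helper below assumes a hypothetical callable for the surface; the fitted surface model itself (e.g. equation 4) is not reproduced here, so the example surface is purely an illustrative placeholder, not the measured 3xnm data.

```python
# Sketch only: read the retention limit off a bit error rate surface by
# finding the age at which the surface crosses the target BER at a given
# P-E cycle count. `ber_surface` is assumed to increase monotonically with age.
from scipy.optimize import brentq

def retention_limit_hours(ber_surface, target_ber, cycles=3000,
                          min_age=1.0, max_age=1.0e5):
    """Age (hours) at which ber_surface(cycles, age) reaches target_ber."""
    return brentq(lambda age: ber_surface(cycles, age) - target_ber,
                  min_age, max_age)

def toy_surface(cycles, age_hours, ber0=1e-7):
    """Placeholder power-law surface, for illustrating the method only."""
    return ber0 * cycles**0.6 * (1.0 + age_hours)**0.3

print(retention_limit_hours(toy_surface, 3.4e-4))  # crossing age for the toy surface
```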

[Image: ber_s4]
Figure 5. Bit error rate surface showing the retention limit at the specified UBER of 10⁻¹⁵ and 3,000 cycles at 40C.

Figure 5 shows the surface for another 3xnm device. The UBER target contour is highlighted in white, and the 3,000 P-E cycle limit is highlighted in yellow. The data age where these two lines cross is the maximum allowable retention at this temperature (40C). The figure shows that instead of meeting the manufacturer’s specification of 1 year retention at 70C, this device only delivers 1,900H of retention (22% of the rated life at 70C).

[Image: ber_s5]
Figure 6. Bit error rate surface showing the retention limit at UBER = 10⁻¹⁷, which would provide the same reliability as a hard disk.

As I showed in post 3, the UBER specification required to match HDD reliability for this market segment is 10⁻¹⁷. The corresponding bit error rate for this device is 2.5×10⁻⁴. One reason that SSD manufacturers haven’t taken to using my proposed specification should be obvious from Figure 6: the measured retention limit drops from 1,900 hours to only 1,300 hours.

Let’s examine the reliability in more detail. This device clearly doesn’t meet its specified targets. As explained previously, the bit error rate (ber) at 3,000 P-E cycles and 8,760 hours should be no more than 3.4×10⁻⁴. However, for this surface the ber at 3,000 P-E cycles and 8,760 hours is 1.3×10⁻³, which is nearly 4 times the ber corresponding to the specified UBER. At this bit error rate, the sector failure rate is 2.4×10⁻⁴ and the UBER is 6×10⁻⁸, which is 60,000,000 times higher than specified!
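Again as a sketch, the same binomial-tail arithmetic used earlier (assuming independent random bit errors and the BCH 15 code over 4,291-bit sectors) reproduces these numbers to within rounding:

```python
# Sketch only: sector failure rate and UBER implied by the BER read off this
# surface at 3,000 P-E cycles and 8,760 hours, assuming independent random
# bit errors and a BCH-15 code over 4,291-bit sectors.
from scipy.stats import binom

measured_ber = 1.3e-3
sector_fail = binom.sf(15, 4291, measured_ber)   # roughly 2.3e-4 to 2.4e-4
uber = sector_fail / 4096                        # ~6e-8 per user bit read
print(sector_fail, uber, uber / 1e-15)           # last ratio is in the tens of millions
```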

Important questions for SSD vendors include the following: how will a device behave if it performs as we have seen here? Will it be able to determine that it won’t meet its endurance/retention specifications? If it can, what action will it take? Will the user be informed (e.g. via S.M.A.R.T.)? Will the device behavior change? If not, there is no way for an end user to detect that this has occurred on a wear-leveled SSD. What will be observed is an increasing probability of sector failure, which can become quite large and may not be detected until it is too late.
