Update July 2014: modified bit-error equation
I have modified the MLC equation based on examination of further data.
Expected behavior of the MLC error rate surface
NAND flash has rather complex bit error rate behavior compared with magnetic recording. In the case of hard disks, the bit error rate behavior tends to be a constant, without strong dependence on other factors. For a given bit, the error rate doesn’t depend on the number of write cycles or the age of the data. Unfortunately, the same can’t be said for flash. Flash has a complex multi-dimensional error rate surface. Some of the factors contributing to the bit error rate in NAND flash include the program-erase (P-E) cycle count, the age of the data (time since it was written), temperature, and read disturb.
Figure 1 shows an expected relative bit error rate surface for MLC flash in terms of data age and Program-Erase (P-E) cycle count. I anticipated the surface having such a shape based on the behavior of flash. An MLC device can be driven to failure at short data ages simply by cycling. This can be seen as moving in depth along the left side of the chart in Figure 1. As the P-E cycle count increases, the bit error rate can grow high enough that it becomes uncorrectable. The point at which this occurs depends on the ECC employed. You can see this annotation if you mouse over the figure.
However, as the data age increases, the error rate must rise more steeply with P-E cycle count. Thus, along the right side of the chart, the bit error rate must increase faster than along the left side. Again, if you mouse over the figure you can see the annotations highlighting this. So this is roughly what one should expect for bit error rate as a function of age and P-E cycle count.
Field knowledge of the surface
Flash is presumed to be very reliable because it has no moving parts and because it appears to perform well in the field. In 2008, essentially all flash was deployed in consumer applications, and this is still essentially true today. We need to exercise a good deal of caution when applying field experience in consumer applications, such as smart phones and cameras, to enterprise applications. For your consideration, I have highlighted the region of the bit error rate surface that consumer applications experience in Figure 2.
Consumer applications are very different from enterprise applications. Consider a 10 megapixel camera using a 4GB flash card. The average size of a JPEG image might be about 4MB, so about 1,000 such images fit on the card. Filling the card 10 times (a P-E cycle count of 10) requires 10,000 images. If the average consumer took 10 pictures a day, it would take 1,000 days, or nearly 3 years, to reach this point. Thus the digital photo consumer isn’t going to push flash to a high cycle count. Note that many consumer-level DSLR cameras only have a rated shutter life of 50,000 images, which would be only 50 fills of the example flash card.
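The camera arithmetic is easy to rerun with other assumptions. The following is only a rough sketch: the function name and parameters are mine, it uses decimal units (4GB = 4 × 10^9 bytes), and it ignores write amplification and file system overhead entirely.

```python
def camera_card_wear(card_bytes, image_bytes, images_per_day, target_fills):
    """Estimate how long a photographer needs to reach a given fill count.

    Assumes every image is the same size and that one fill of the card
    corresponds to one P-E cycle (no write amplification).
    """
    images_per_fill = card_bytes // image_bytes
    total_images = images_per_fill * target_fills
    days = total_images / images_per_day
    return images_per_fill, total_images, days


# Inputs taken from the example in the text.
images_per_fill, total_images, days = camera_card_wear(
    card_bytes=4 * 10**9,   # 4GB card
    image_bytes=4 * 10**6,  # ~4MB JPEG
    images_per_day=10,
    target_fills=10,
)
years = days / 365.0

# Fills of the card reached over a rated shutter life of 50,000 images.
shutter_life_fills = 50_000 / images_per_fill

print(f"{images_per_fill} images per fill, {total_images} images total")
print(f"{days:.0f} days ({years:.1f} years) to reach the target fill count")
print(f"{shutter_life_fills:.0f} fills over the rated shutter life")
```

Changing images_per_day or the card size shows how hard it is for a camera to reach even a few hundred cycles.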
Another common application is syncing a smartphone or a tablet. If we assume a heavy user updates 5% of the capacity a week, then it will take 10 weeks to do a full cycle of the device (assuming that the device has a write amplification factor of 2x). Thus we get 5 full cycles per year. The device would have to last 5 years for the number of fills to even reach 25. It is not likely that most consumer devices are kept this long.
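The syncing estimate can be sketched the same way. The helper below is hypothetical and deliberately simple: it treats the weekly update fraction and the write amplification factor as constants, and assumes a 52-week year.

```python
def full_cycles_per_year(fraction_updated_per_week, write_amplification,
                         weeks_per_year=52):
    """Return (weeks per full device cycle, full cycles per year).

    The flash sees the host writes multiplied by the write amplification
    factor, so that product is the fraction of capacity cycled each week.
    """
    nand_fraction_per_week = fraction_updated_per_week * write_amplification
    weeks_per_cycle = 1.0 / nand_fraction_per_week
    cycles_per_year = weeks_per_year / weeks_per_cycle
    return weeks_per_cycle, cycles_per_year


# Heavy user from the text: 5% of capacity per week, 2x write amplification.
weeks_per_cycle, cycles_per_year = full_cycles_per_year(0.05, 2.0)
cycles_in_five_years = 5 * cycles_per_year

print(f"{weeks_per_cycle:.0f} weeks per full cycle")
print(f"{cycles_per_year:.1f} full cycles per year")
print(f"{cycles_in_five_years:.0f} full cycles after 5 years")
```

With a 52-week year the result is slightly above the rounded figure of 5 cycles per year used in the text, which does not change the conclusion.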
Consumer behavior should be considered as well. Many flash storage devices are inexpensive (USB keys, flash cards), and have short warranties. Given the rate of decline in the cost per GB of flash, it is very likely that most of these devices aren’t in service very long. Further, it is likely that many of the devices which fail are simply disposed of, not returned. Thus, the failures won’t be seen by the manufacturers. These two factors (early retirement and unreported failure) combine to give a false impression of higher reliability. Thus, looking at device return failure rates won’t provide an accurate measure of field reliability. I have seen HDD manufacturers make this same mistake.
It seems difficult for a consumer application to put significant stress on the P-E cycle count of flash storage. I have tried to illustrate this as the color overlay in Figure 2. As I pointed out in post 3, enterprise applications using flash storage push it to the performance limit. This is because the criteria used to select flash are quite different for these applications. Flash is selected for consumer applications because it is less expensive in small capacities than HDD storage. The least expensive HDD is around $50, but flash can be purchased in small quantities for much less. I call this “cost of first byte sensitivity”. In the enterprise, flash is selected for performance, as opposed to the cost of first byte.
Since the selection criteria for flash are quite different for consumer and enterprise applications, it shouldn’t come as a surprise that the flash usage should be different as well.