Series SSD: 4. SSD Wear Leveling and Reliability

Blocks at risk due to wear leveling

One issue with wear leveling is that it assumes the end-of-life cycle count is either known a priori or can be determined during operation. The shape of the wear-leveled access histogram creates a side effect that can impact the data reliability of the device. Let’s examine the situation using the P-E cycle histogram and the CDF. The P-E histogram shows the number of blocks at a given P-E cycle count. The CDF shows the probability of a non-recoverable read error (NRRE) for a block at a given P-E cycle count.

Equivalently, the CDF gives the percentage of blocks that have failed at or below a given P-E cycle count, and the histogram gives the percentage of blocks at a given cycle count.

Exposure to non-recoverable read errors occurs when the tail of the P-E cycle histogram crosses the CDF. The shaded curve is the product of the P-E histogram and the CDF. Thus the area is an estimate of the blocks at risk. (The idea here is to get a quick feeling for the situation, not to do complex math.)
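To make the "product of the histogram and the CDF" concrete, here is a minimal sketch. It assumes, purely for illustration, that both the P-E cycle histogram and the failure CDF have Gaussian shapes. The function name `blocks_at_risk` and every number in it are hypothetical, not measured device data.

```python
from statistics import NormalDist

def blocks_at_risk(histogram, failure_cdf, max_cycles=10000, step=10):
    """Approximate the shaded area: sum of histogram(c) * CDF(c) over cycle count c.

    histogram   -- NormalDist standing in for the P-E cycle histogram (assumed shape)
    failure_cdf -- NormalDist whose .cdf() stands in for the NRRE CDF (assumed shape)
    """
    area = 0.0
    for cycles in range(0, max_cycles + 1, step):
        area += histogram.pdf(cycles) * failure_cdf.cdf(cycles) * step
    return area

# Hypothetical CDF at some data age: half the blocks fail by 5000 cycles.
failure = NormalDist(mu=5000, sigma=500)

# The same histogram shape early and late in the device's life.
early = blocks_at_risk(NormalDist(mu=2000, sigma=200), failure)
late = blocks_at_risk(NormalDist(mu=4000, sigma=200), failure)

print(f"fraction of blocks at risk, early in life: {early:.2e}")
print(f"fraction of blocks at risk, late in life:  {late:.2e}")
```

The result is a fraction of the block population, so it stays between 0 and 1. It is negligible while the histogram sits well below the CDF and grows as the histogram's tail moves into it.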

Figure 7. Overlap of the P-E cycle histogram and the CDF. The red curve is the wear-leveled access histogram. The blue curve is the CDF at some data age. The area shaded red represents the blocks at risk.

This is shown in Figure 7. The important feature to notice is that the shape of the P-E cycle histogram determines how rapidly the NRRE exposure increases with cycle count. The raw (not wear-leveled) histogram encounters the CDF earlier, but its blocks at risk grow more slowly. The steep high-cycle-count side of the wear-leveled histogram causes a more rapid rise in blocks at risk for NRRE. The steeper the tail, the more rapidly the at-risk area (and thus the error exposure) grows.

“Improved” wear leveling

Ironically, the better the wear leveling system is at increasing the total device P-E cycles, the more sudden the onset of the wear out overlap becomes.

Figure 8. Comparison of blocks at risk with two different wear leveling approaches. The red curve shows a P-E cycle histogram for a first approach, and the orange curve the histogram for an improved approach. The blue curve is the CDF at some data age. The red area is the blocks at risk for the first approach, and the orange shaded area the blocks at risk for the improved approach.

Figure 8 shows a comparison of a first wear leveling method and a second, improved method. The second method is deemed to be improved because its P-E cycle histogram is narrower. Thus, this method can extract more total P-E cycles from the device before the first block reaches the P-E cycle target. However, this presumes that the P-E cycle target is properly chosen. There is not much field data, and little published information on how each vendor chooses this target. As can be seen from Figure 8, the second method has more exposure to non-recoverable read errors than the less efficient method as it approaches the CDF.
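The comparison can be sketched numerically. This is a toy model under loud assumptions: both histograms are Gaussian, each is run until its high-cycle tail (taken here as the mean plus three spreads) reaches the same P-E target, and the CDF is a Gaussian CDF. `PE_TARGET`, `FAILURE_CDF`, and the spreads are hypothetical values chosen only to show the trend.

```python
from statistics import NormalDist

FAILURE_CDF = NormalDist(mu=5000, sigma=500)  # hypothetical CDF at some data age
PE_TARGET = 4400                              # hypothetical P-E cycle target

def histogram_at_target(spread, tail_sigmas=3.0):
    """Gaussian histogram whose high-cycle tail (mean + 3 spreads) sits at the target."""
    return NormalDist(mu=PE_TARGET - tail_sigmas * spread, sigma=spread)

def at_risk_fraction(histogram, step=5):
    """Sum histogram x CDF across the histogram's range (+/- 8 spreads)."""
    low = int(histogram.mean - 8 * histogram.stdev)
    high = int(histogram.mean + 8 * histogram.stdev)
    return sum(histogram.pdf(c) * FAILURE_CDF.cdf(c) * step
               for c in range(low, high + 1, step))

first = histogram_at_target(spread=400)     # wider histogram (first method)
improved = histogram_at_target(spread=100)  # narrower histogram (improved method)

for name, hist in (("first", first), ("improved", improved)):
    print(f"{name:>8}: mean cycles {hist.mean:.0f}, "
          f"fraction at risk {at_risk_fraction(hist):.4f}")
```

In this toy model the narrower histogram reaches a higher mean cycle count before its tail hits the target, which is the efficiency gain. For the same reason more of its population sits close to the CDF, so its at-risk fraction is the larger of the two.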

The “perfect” wear leveling algorithm from a pure cycle count perspective would be a delta function, with all the blocks at the same cycle count. Thus, the device would deliver the full cycle count capacity (P-E cycle limit × number of blocks). However, this would also have the most sudden onset of the overlap: every block reaches the CDF at once, so the fraction of blocks at risk is simply the value of the CDF at that P-E cycle count.
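As a quick illustration of the capacity argument, the sketch below computes how much of the ideal capacity (P-E cycle limit × number of blocks) has been delivered at the moment the most-worn block reaches the limit. The helper `delivered_fraction` and the three five-block populations are made up for the example.

```python
def delivered_fraction(block_cycles, pe_limit):
    """Share of the ideal capacity (pe_limit x blocks) actually used.

    block_cycles -- P-E cycle count of each block when the most-worn block
                    has just reached pe_limit (hypothetical snapshot)
    """
    if max(block_cycles) > pe_limit:
        raise ValueError("no block may exceed the P-E cycle limit")
    return sum(block_cycles) / (pe_limit * len(block_cycles))

PE_LIMIT = 3000  # hypothetical P-E cycle limit

uneven = [3000, 2400, 1800, 1200, 600]     # wide histogram
narrow = [3000, 2950, 2900, 2850, 2800]    # narrow histogram
delta = [3000, 3000, 3000, 3000, 3000]     # "perfect" wear leveling

for name, blocks in (("uneven", uneven), ("narrow", narrow), ("delta", delta)):
    print(f"{name:>6}: {delivered_fraction(blocks, PE_LIMIT):.1%} of ideal capacity")
```

Only the delta-function case delivers the full capacity. It is also the case where every block meets the CDF at the same cycle count.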

Recall I said there was no free lunch here! Wear leveling improves the early life behavior by reducing the impact of high write rate blocks. However, it creates issues as the device is used, since it makes detection of wear out much more critical. The rate of growth of the blocks at risk with cycle count determines how precise the knowledge of the CDF must be.

Wear-out detection affects reliability

Wear-out detection is complicated by the fact that the exposure changes as the data ages, so the population of blocks at risk grows silently. We can see this in Figure 9.

Figure 9. Effect of data aging on blocks at risk. The red curve shows a P-E cycle histogram for a first approach, and the orange curve the histogram for an improved approach. The blue curve is the CDF at a first data age. The cyan curve is the CDF at an older data age.

Compare the shaded areas of Figure 8 with those of Figure 9. The areas in Figure 9 are for an older data age, and thus they are larger. As you can see, the overlap grows much faster for the narrower distribution.
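Here is a rough sketch of the aging effect, again only to get a feeling for it. It assumes that aging can be modeled as the CDF shifting toward lower cycle counts. That is my simplification for the example, and the real dependence on data age is device specific. The block populations and CDF parameters are hypothetical.

```python
from statistics import NormalDist

def expected_blocks_at_risk(block_cycles, failure_cdf):
    """Discrete version of histogram x CDF: sum the CDF over every block."""
    return sum(failure_cdf.cdf(cycles) for cycles in block_cycles)

def evenly_spread(low, high, count):
    """Hypothetical population: `count` blocks spaced evenly from low to high cycles."""
    step = (high - low) / (count - 1)
    return [low + i * step for i in range(count)]

# Both populations have their most-worn block at 4400 cycles (made-up numbers).
first = evenly_spread(2000, 4400, 1000)     # wider histogram
improved = evenly_spread(3800, 4400, 1000)  # narrower histogram

# Assumption: older data is modeled as the same CDF shifted to lower cycle counts.
younger = NormalDist(mu=5000, sigma=500)
older = NormalDist(mu=4500, sigma=500)

for name, blocks in (("first", first), ("improved", improved)):
    before = expected_blocks_at_risk(blocks, younger)
    after = expected_blocks_at_risk(blocks, older)
    print(f"{name:>8}: {before:6.1f} -> {after:6.1f} blocks at risk "
          f"(+{after - before:.1f} per 1000 blocks)")
```

With these numbers the narrower population gains more blocks at risk for the same shift in the CDF, because more of its blocks sit near the CDF's rising edge.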

The more astute among you may have noted that the access histogram is likely to have a spread of data ages, thus there is a set of CDF curves that determine the overlap. This is indeed the case. Many wear leveling systems are designed to minimize the spread of ages, and to limit the maximum age. This can be accomplished by relocating (rewriting) data after a certain age, either explicitly as a separate process, or just as part of the normal wear-leveling procedure. Sadly, the free lunch issue comes up again, as this is achieved by rewriting the oldest data, which eats into the P-E cycles. This makes it one of the contributors to what is called write amplification. Further, it requires the device to be powered on.
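The cost of age-limited relocation can be sketched with a toy simulation. It assumes a set of cold blocks that the host never rewrites, a policy that rewrites any block whose data reaches a maximum age, and a steady host write rate elsewhere on the device. The function `refresh_write_amplification` and all parameters are hypothetical, and the sketch counts only this one contributor to write amplification.

```python
def refresh_write_amplification(cold_blocks, host_writes_per_day, max_age_days, days):
    """Write amplification from age-based relocation alone (toy model).

    Returns (host writes + refresh writes) / host writes, in block writes.
    """
    ages = [0] * cold_blocks          # data age, in days, of each cold block
    refresh_writes = 0
    for _ in range(days):
        for i in range(cold_blocks):
            ages[i] += 1
            if ages[i] >= max_age_days:
                ages[i] = 0           # relocate: the data is rewritten, costing a P-E cycle
                refresh_writes += 1
    host_writes = host_writes_per_day * days
    return (host_writes + refresh_writes) / host_writes

# Made-up workload: 1000 cold blocks, 50 host block writes per day, one 360-day year.
relaxed = refresh_write_amplification(1000, 50, max_age_days=90, days=360)
strict = refresh_write_amplification(1000, 50, max_age_days=30, days=360)

print(f"max age 90 days: write amplification {relaxed:.2f}")
print(f"max age 30 days: write amplification {strict:.2f}")
```

Tightening the maximum age keeps the data younger, but each tightening shows up directly as extra P-E cycles. The loop also only advances while the device is powered, which matches the last point above.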

As the flash device geometry shrinks, the cycle count the cells can endure decreases. This will put additional pressure on wear leveling to extract the maximum cycle life from devices, which means running closer to the cliff in the CDF (where the blocks at risk hit the NRRE target). Thus, the determination of the CDF must be more precise as well.
