Category Archives: Solid State Disk

Series SSD: 4. SSD Wear Leveling and Reliability

SSD Wear Leveling and Reliability

In the prior post, we explored issues with SSD NRRE specifications and testing. In this post, we’ll take a look at how the internal architecture of SSDs affects reliability and the ability of integrators to test flash.

NAND flash has a few limitations that must be overcome to create a useful storage device. The memory cells have finite write endurance and finite data retention, and don’t support direct overwrite. SSD vendors use a technique called wear leveling to address these issues. The goal is to increase the useful lifetime of the device. However, according to the 2nd law of thermodynamics, there is no such thing as a free lunch. (OK, that’s not precisely what it says, but this result can probably be derived from the 2nd law.) Thus, we should expect side effects! We shall examine these here in detail.
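To make the idea concrete, here is a minimal sketch of one simple form of wear leveling. It is a toy model and not any vendor's algorithm: the class name `ToyWearLeveler`, the one-logical-block-per-physical-block layout, and the policy of always choosing the least-erased free block are all assumptions made for illustration. Real devices also deal with pages within blocks, garbage collection, and spare capacity, all of which are ignored here.

```python
class ToyWearLeveler:
    """Toy model (hypothetical): each write goes to the least-erased free block."""

    def __init__(self, num_blocks):
        self.erase_counts = [0] * num_blocks  # erase cycles seen by each physical block
        self.free = set(range(num_blocks))    # physical blocks that are erased and writable
        self.mapping = {}                     # logical block -> physical block

    def write(self, logical):
        # Flash has no direct overwrite, so new data always lands in a fresh block.
        # Pick the least-worn free block; ties go to the lowest block number.
        target = min(self.free, key=lambda b: (self.erase_counts[b], b))
        self.free.remove(target)
        old = self.mapping.get(logical)
        self.mapping[logical] = target
        if old is not None:
            # The stale copy is erased (one more cycle) and returned to the free pool.
            self.erase_counts[old] += 1
            self.free.add(old)
        return target


leveler = ToyWearLeveler(num_blocks=4)
for _ in range(8):
    leveler.write(0)  # rewrite the same logical block repeatedly
print(leveler.erase_counts)
```

In this toy, eight writes to a single logical address are spread across all four physical blocks instead of wearing out one of them. It also shows one side effect: the physical location of a given logical block changes on every write, so the host no longer controls which cells its data occupies.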


Series SSD: 3. The Problem with SSD Specifications

Common oversights with SSD specifications

It is commonly assumed that solid state disks are more reliable than hard disks because they have no moving parts. While the evidence for this is lacking, we’ll look at the impact of non-recoverable read errors on reliability.

I have seen failure analysis reports on many hard disk programs over the years. Typically, mechanical failures account for less than half the total. Electronic and microcode failures are quite common (ever had a cell phone break?). So, the non-mechanical argument needs to be backed up with field data.

Series SSD: Solid State Disks in Storage Systems

I will cover a number of topics relating to solid state disks (SSDs) in storage systems in a series of posts. My hope is to stimulate discussion in the industry, and to encourage the open exchange of reliability information for solid state disks.

Disclosure: I previously worked in the HDD business, and have a background in solid state physics. However, I have been working on storage systems for at least a dozen years and have no vested interest in one technology over another.

Solid state storage is beginning to be widely deployed in IT storage systems. NAND flash has largely displaced hard disk drives (HDDs) in mobile consumer applications, effectively eliminating the use of HDDs in the sub-1.8” form factors. Based on these events, it is commonly assumed that solid state storage is poised to make substantial inroads in the IT storage space. However, there are both technological and business constraints that dictate how this process will proceed. In this series, I will focus on the technological constraints as they relate to storage systems.

Reliability

Unfortunately, most of the industry seems to be more concerned with the performance of solid state disks, taking the reliability for granted. There is a dearth of published information on SSD reliability, and I aim to rectify that here. In this series, I will examine the technological impacts of solid state storage on systems. I will show how to determine which parameters are important for system reliability. Actual device data will be presented and the impact on reliability analyzed. We will find out if the reliability of SSDs is as great as widely assumed.

Testability

Another topic for exploration is the testability of SSDs. The current crop of devices leaves much to be desired in this regard. In fact, I am amazed that system vendors who would never accept a hard disk that could not be tested would willingly accept such SSDs. Perhaps this is due to the small volumes of SSDs. However, I do not expect this situation to last, and believe there should be an industry-wide requirement for fully testable SSDs.

Data

Over the course of the series, I will share a wealth of data that I have gathered from SSDs. Much of this data will be raw, and I haven’t had time to analyze all of it. Thus, I encourage the community to help in this regard. It is my hope that this presentation will be valuable to the industry, and I encourage SSD vendors and flash manufacturers to openly share their data.

I will not identify particular component vendors here – so don’t ask me to. It isn’t my intention to cast aspersions on any manufacturer. I would much prefer them to be more open with such data. I have looked at a number of vendors’ parts, so if you recognize yours, you are welcome to share this information.

Outline

I plan to post a new section each week. Here is the current planned outline for this series, although this is subject to change:

  1. Hard errors and reliability – things that go bump in the drive
  2. System level reliability targets – it’s not what you think
  3. The problem with SSD specifications – speed kills
  4. SSD wear leveling and reliability – in flash errors never sleep
  5. The error rate surface for multi-level cell flash – and an equation in 4 dimensions
  6. The test methodology – how we get all the data
  7. Bit error rate: cycling data (endurance) – we see error fountains
  8. Flash temperature data – differences from the Arrhenius model
  9. Bit error rate: aging data (retention) – ski jumps
  10. Bit error rate: dwell time – into the fifth dimension
  11. Error uniformity – is the mean meaningless?
  12. Error fountains – wronging a few writes
  13. Error uniformity – population effects
  14. Caching with MLC flash – playing to your strengths
  15. A proposal for a testable flash device

I will continue to run reliability tests on SSDs during the course of this series, and am open to suggestions for modifications to the tests. I have much more data than I can possibly share on this site. If anyone in academia would like to help analyze the data, please contact me and we’ll see if we can arrange something.

Some of this information has been previously presented in my tutorials at FAST ’11 and NVMW 2012.