I will cover a number of topics relating to solid state disks (SSDs) in storage systems in a series of posts. My hope is to stimulate discussion in the industry, and to encourage the open exchange of reliability information for solid state disks.
Disclosure: I previously worked in the HDD business, and have a background in solid state physics. However, I have been working on storage systems for at least a dozen years and have no vested interest in one technology over another.
Solid state storage is beginning to be widely deployed in IT storage systems. NAND flash has largely displaced hard disk drives (HDDs) in mobile consumer applications, effectively eliminating HDDs in sub-1.8” form factors. Based on this trend, it is commonly assumed that solid state storage is poised to make substantial inroads in the IT storage space. However, both technological and business constraints dictate how this process will proceed. In this series, I will focus on the technological constraints as they relate to storage systems.
Unfortunately, most of the industry seems more concerned with the performance of solid state disks, taking reliability for granted. There is a dearth of published information on SSD reliability, and I aim to rectify that here. In this series, I will examine the technological impacts of solid state storage on systems, show how to determine which parameters are important for system reliability, present actual device data, and analyze its impact on reliability. We will find out whether the reliability of SSDs is as great as widely assumed.
Another topic for exploration is the testability of SSDs. The current crop of devices leaves much to be desired in this regard. In fact, I am amazed that system vendors who would never accept a hard disk they could not test will willingly accept such SSDs. Perhaps this is due to the small volumes of SSDs. However, I do not expect this situation to last, and believe there should be an industry-wide requirement for fully testable SSDs.
Over the course of the series, I will share a wealth of data that I have gathered from SSDs. Much of this data will be raw, and I haven’t had time to analyze all of it. Thus, I encourage the community to help in this regard. It is my hope that this presentation will be valuable to the industry, and I encourage SSD vendors and flash manufacturers to openly share their data.
I will not identify particular component vendors here – so don’t ask me to. It isn’t my intention to cast aspersions on any manufacturer; I would much prefer that they be more open with such data. I have looked at a number of vendors’ parts, so if you recognize yours, you are welcome to share that information.
I plan to post a new section each week. Here is the current planned outline for this series, although this is subject to change:
- Hard errors and reliability – things that go bump in the drive
- System level reliability targets – it’s not what you think
- The problem with SSD specifications – speed kills
- SSD wear leveling and reliability – in flash errors never sleep
- The error rate surface for multi-level cell flash – and an equation in 4 dimensions
- The test methodology – how we get all the data
- Bit error rate: cycling data (endurance) – we see error fountains
- Flash temperature data – differences from the Arrhenius model
- Bit error rate: aging data (retention) – ski jumps
- Bit error rate: dwell time – into the fifth dimension
- Error uniformity – is the mean meaningless?
- Error fountains – wronging a few writes
- Error uniformity – population effects
- Caching with MLC flash – playing to your strengths
- A proposal for a testable flash device
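For readers unfamiliar with the Arrhenius model mentioned in the outline above, it predicts how temperature accelerates failure mechanisms such as flash charge loss. A minimal sketch of the standard acceleration-factor calculation follows; the 1.1 eV activation energy is a commonly quoted assumption for flash data retention, not a value measured in this series (whether real parts actually follow this model is exactly what the temperature data will examine):

```python
import math

BOLTZMANN_EV = 8.617e-5  # Boltzmann constant in eV/K

def arrhenius_acceleration(t_use_c, t_stress_c, ea_ev=1.1):
    """Arrhenius acceleration factor between a use temperature and a
    stress temperature, both in degrees Celsius.

    ea_ev is the activation energy in eV; 1.1 eV is a commonly quoted
    assumption for flash data retention and may not match real devices.
    """
    t_use_k = t_use_c + 273.15
    t_stress_k = t_stress_c + 273.15
    return math.exp((ea_ev / BOLTZMANN_EV) * (1.0 / t_use_k - 1.0 / t_stress_k))

# Example: how much faster does retention loss proceed during an
# 85 °C bake compared to 40 °C operating conditions?
af = arrhenius_acceleration(40, 85)
```

Under these assumptions, a modest temperature increase yields a large acceleration factor (on the order of 100x here), which is why retention testing relies on elevated-temperature bakes rather than real-time aging.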
I will continue to run reliability tests on SSDs during the course of this series, and am open to suggestions for modifications to the tests. I have much more data than I can possibly share on this site. If anyone in academia would like to help analyze the data, please contact me and we’ll see if we can arrange something.