Flash Temperature Testing and Modeling

Temperature Testing

This is an expanded version of the temperature modeling section from my Flash Memory Summit 2014 Tutorial T1.

An accurate temperature model is vital for flash devices, as most vendors rely on accelerated temperature testing to verify retention capabilities. I tested flash at the SSD level, as this is how the devices are integrated into storage systems. All tests were performed using devices supporting a host-managed interface.

As such, I limited the temperature testing range to 30C to 100C. The SSDs had an operational temperature limit of 70C, and most testing was done below this limit. Once devices were tested above this spec they were retired, as other components in the SSD may have been stressed.

One very important point is that the devices were both operated and aged at the target temperature. This is in contrast to most tests, where the device is aged at high temperature only to accelerate the retention test. However, devices in storage systems experience the operational temperature at all times, so there may be temperature-induced stresses that are missed when aging only at high temperature.

I used a temperature controlled oven with temperature probes to monitor the real-time temperature. The oven was equipped with feedthroughs for the interface and power cables.

Host Managed Interface

All the data here was obtained using SSDs supporting a host-managed interface (HMI). This interface provides direct, physical block addressing to the host. The host is responsible for all wear leveling actions. The host has control of read, write, erase and raw read (without ECC correction applied). The accesses are over the standard device interface (e.g. SATA). Such an interface is vital for testing of flash devices by system integrators, and is of significant value during operation as well. I think it should be a standard, as systems using it can be faster, more robust, and better able to maximize the capabilities of the SSDs.

All the data writes were performed as full erase block stripes, and always written in page-sequential order.

Spoiler: The Temperature Model doesn’t turn out to be Arrhenius!

The Arrhenius Temperature Model

The Arrhenius model is widely assumed to be accurate for NAND flash. For accelerated temperature testing, the model is expressed as an acceleration factor on the time to failure:

$$ a_f = \exp {\left( {\frac {-E_A}{k} \left( \frac {1}{T_1} - \frac {1}{T_2}\right) }\right)}$$ Eqn. 1
EA is the activation energy in eV
k is Boltzmann’s constant
T1 is the first temperature in K
T2 is the second temperature in K

The Arrhenius model assumes there is a single activation energy responsible for device failure. In NAND flash, charge de-trapping is widely modeled with EA = 1.1eV. With this value, a 125C test would have a 936x acceleration factor relative to a 55C operating temperature. This means that a test for 1 year of retention at 55C would require only 9 hours at 125C. Take note: we will revisit this point later!
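As a sanity check on those numbers, the acceleration factor of Eqn. 1 is easy to compute directly. A minimal sketch (the function name and structure are my own, not from any particular library):

```python
import math

BOLTZMANN_EV = 8.617e-5  # Boltzmann's constant in eV/K

def arrhenius_af(t_test_c, t_op_c, ea_ev=1.1):
    """Acceleration factor of Eqn. 1 between a test temperature and an
    operating temperature (both in Celsius), for activation energy ea_ev."""
    t1 = t_test_c + 273.15  # test temperature in K
    t2 = t_op_c + 273.15    # operating temperature in K
    return math.exp((-ea_ev / BOLTZMANN_EV) * (1.0 / t1 - 1.0 / t2))

af = arrhenius_af(125, 55)   # roughly 936x for EA = 1.1 eV
test_hours = 365 * 24 / af   # 1 year at 55C compresses to about 9 hours at 125C
```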

Failure Definition

In a flash cell, failure occurs when the detected value of a bit changes from its intended value (up or down). In an SSD-type device, the sector ECC protects against some number of bit changes in the sector. Here, a failure is when a sector is lost (a non-recoverable read error, or NRRE) because more bits changed than the ECC can correct. Equivalently, failure is when the bit error rate (ber) for a sector exceeds the target value that the ECC can correct.

Looking at Eqn. 1, it should be apparent that the number of bits of correction in the ECC isn’t a parameter, so the acceleration factor is independent of the bit error rate target chosen. Therefore, if we see temperature behavior which doesn’t give a constant af with bit error rate, the model can’t be Arrhenius.
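A toy example makes this test concrete (every curve parameter below is hypothetical, chosen only for illustration): if the high-temperature ber curve is a power law in data age and the low-temperature curve is a linear rescaling of it with a nonzero offset, the ratio of times to reach a target ber depends on the target, which is incompatible with Eqn. 1:

```python
def time_to_target(target_ber, h, k, b1=0.0):
    """Age at which a power-law curve ber(A) = h * A**k + b1 reaches target_ber."""
    return ((target_ber - b1) / h) ** (1.0 / k)

# Hypothetical hot-temperature curve: ber_hot(A) = 1e-4 * A**0.35
h_hot, k_exp = 1e-4, 0.35
# Hypothetical cold curve via a linear scaling with offset:
# ber_cold(A) = m*ber_hot(A) + q = (m*h_hot)*A**0.35 + q
m, q = 0.3, 2e-4

def accel_factor(target_ber):
    """Ratio of cold-to-hot time to reach a given ber target."""
    t_hot = time_to_target(target_ber, h_hot, k_exp)
    t_cold = time_to_target(target_ber, m * h_hot, k_exp, b1=q)
    return t_cold / t_hot

af_low = accel_factor(1e-3)   # acceleration factor at a low ber target
af_high = accel_factor(2e-3)  # at a higher target the factor differs
```

Because `af_low` and `af_high` are not equal, no single activation energy in Eqn. 1 can reproduce this pair of curves.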

One should always validate the temperature model!

Short testing times are a strong motivator for using the 1.1eV Arrhenius model in the face of evidence to the contrary. Other error mechanisms are known to exist, including stress-induced leakage current, where defects in the insulating layer line up to provide leakage paths. This mechanism has been observed to behave as if it had a negative activation energy, appearing to anneal out at higher temperatures. That by itself should be a warning that the Arrhenius model is not necessarily valid.

We should also consider whether observing Arrhenius behavior at the gate level would result in Arrhenius behavior at the SSD level. There are significant differences in the way failures are measured. In an SSD, failure occurs when enough bits have changed to overwhelm the ECC. These will be the weakest bits, not the average bits! It could be the cells which have more defects, were weakly written, suffer more inter-cell interference, etc. Effectively, we are measuring the tails of the distribution, whereas cell-level measurements tend to give us the mean. So, it isn’t clear that we would expect the NRRE to be Arrhenius even if the gate level is! Just another reason we need to confirm any temperature model.

To confirm the Arrhenius model, we will measure the time to reach a given bit error rate as a function of temperature. Remember that this will be the time to failure.

Measured Temperature Behavior of ber

The following are measured bit error rate vs data age curves taken at multiple temperatures. The device here is 3xnm, and was operating at its rated limit of 3,000 program-erase cycles.

Fig. 1. Measured bit error rate at 100C, 3,000 PE cycles. The X axis is the data age in hours. The Y axis is the log of the bit error rate. The dashed line indicates the bit error rate equivalent to the bit error correction limit of the device ECC.

Figure 1 above shows the raw bit error data at a temperature of 100C and 3,000 program-erase cycles. The dashed line indicates the bit error rate equivalent of the error correction limit of the device ECC. The ber increases steeply at short data ages, then flattens off. It crosses the ECC limit at a data age of around 122 hours. There are two gaps visible in the data. These are the idle intervals used to separate read disturb effects from aging effects. Note that the ber exceeds the ECC capability. The raw-read command of the host-managed interface allows reading the data with the ECC off. Comparison of the raw data with the written data allows data to be collected at error counts past the ECC correction capability.
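That comparison is just an XOR and a population count over the sector. A minimal sketch (this helper is my own illustration, not part of the HMI command set):

```python
def bit_error_rate(written: bytes, raw_read: bytes) -> float:
    """Fraction of bits that differ between the known written pattern
    and a raw read taken with ECC correction disabled."""
    if len(written) != len(raw_read):
        raise ValueError("buffers must be the same length")
    # XOR each byte pair and count the set bits (flipped bits)
    flipped = sum(bin(w ^ r).count("1") for w, r in zip(written, raw_read))
    return flipped / (8 * len(written))

# One flipped bit out of 16: 0x0F vs 0x0E differ in the low bit
ber = bit_error_rate(bytes([0xFF, 0x0F]), bytes([0xFF, 0x0E]))  # 0.0625
```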

Fig. 2 Measured bit error rate at 100C, 3,000 PE cycles. The x-axis is the data age in hours and the Y axis is the log of the bit error rate. The solid line indicates the functional fit to the data.

Figure 2 above adds the functional fit of the form from Eqn. 1 of SSD post 5. This fit is a power law in age and reads:

$$E = hA^kR^g + b_1$$ SSD 5 Eqn. 1 recap
A is the data age
R is the number of reads since the block was written
h is a scaling term
k is the aging exponent
g is the read count exponent
b1 is a constant at a fixed cycle count
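The fit also gives a direct route to the time to failure: hold R fixed and invert the model for the age at which E reaches the ECC limit. A sketch with hypothetical fit parameters (the real h, k, g, b1 come from the regression against the measured data):

```python
def ber_model(age, reads, h, k, g, b1):
    """SSD post 5, Eqn. 1: E = h * A**k * R**g + b1."""
    return h * age**k * reads**g + b1

def age_at_ber(target, reads, h, k, g, b1):
    """Invert the model for the data age at which ber reaches target,
    holding the read count fixed."""
    return ((target - b1) / (h * reads**g)) ** (1.0 / k)

# Hypothetical parameters, for illustration only
h, k, g, b1 = 2e-5, 0.35, 0.05, 1e-4
ttf = age_at_ber(1e-3, 1000, h, k, g, b1)  # age at which ber hits 1e-3
```

Plugging `ttf` back into `ber_model` recovers the target, which is a quick self-consistency check on the inversion.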

The solid line is the functional fit. It is a bit high out to about 30 hours, but is quite good thereafter. It even does a good job of modeling the idle intervals.

Fig. 3 Measured bit error rate at 90C and 100C, 3,000 PE cycles. The x-axis is the data age in hours and the Y axis is the log of the bit error rate. The solid line indicates the functional fit to the data.

Figure 3 above adds the 90C data set and fit. Once again, the fit quality is quite good. Here, it takes about 400 hours to reach the ECC limit. A close examination of the data here makes the case for Arrhenius behavior very tenuous. Given the shape of the curves and the change with temperature, there is no way that the ratio of times to a given ber is independent of the bit error rate chosen. Not convinced? Then let’s nail it down.

Fig. 4 Measured bit error rate at 40C to 100C, 3,000 PE cycles with fits.

Figure 4 above adds data from 40C, 60C and 70C and the associated fits. Note that the fits are again quite good, which is nice for a parametric model. In examining the data sets, it looks a lot like temperature shifts are primarily vertical offsets on the chart. Since the Y axis is logarithmic, a vertical offset corresponds to a linear (multiplicative) scaling of the bit error rate with temperature.

Fig. 5 Measured bit error rate at 40C to 100C, 3,000 PE cycles with the 100C fit shifted to 40C.

Figure 5 above shows what happens when the 100C fit is shifted vertically to overlay the 40C data. The fit is also quite good. In fact, the only significant difference appears to be due to changing behavior in the idle intervals. (A subject for a later post on read disturb!) So, we expect the following scaling law for the bit error rate as a function of temperature (age and cycles held constant):

$$ber \left(T_1\right) = m \times ber \left( T_2 \right) + q$$ Eqn. 2
T1 is the first temperature
T2 is the second temperature
m is the scale factor
q is a constant
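Eqn. 2 has only two unknowns, so (m, q) can be solved from ber readings taken at two common data ages at the two temperatures. A minimal sketch with synthetic numbers (illustration only; in practice one would fit over all matched points):

```python
def solve_eqn2(ber_t2_pair, ber_t1_pair):
    """Solve ber(T1) = m * ber(T2) + q for (m, q), given the ber measured
    at two common data ages at each of the two temperatures."""
    x1, x2 = ber_t2_pair  # e.g. the 100C readings at ages A and B
    y1, y2 = ber_t1_pair  # e.g. the 40C readings at the same ages
    m = (y2 - y1) / (x2 - x1)
    q = y1 - m * x1
    return m, q

# Synthetic example constructed with a true scaling of m = 0.3, q = 2e-4
m, q = solve_eqn2((1e-3, 4e-3), (5e-4, 1.4e-3))
```

With (m, q) in hand, the high-temperature fit can be shifted to predict the low-temperature curve, which is exactly the overlay shown in Figure 5.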
I covered one such example in my 2013 Flash Memory Summit talk on Retention and Endurance Monitoring.
