Vol. III · No. 24

An Evening with a Failing Flash Chip

The smart thermostat on the bench in front of me has been, in a slow way, dying for the last fourteen months. Its owner, a colleague of mine who lives in a flat in Whitefield, brought it over last weekend because the thermostat had begun, in his words, to lose its mind every few days. It would reboot in the middle of the night and come up with all of its programmed schedules wiped, the clock reset to January 1st, 2020, and a small red LED blinking on the front panel that indicated, depending on which paragraph of the user manual you read, either a network fault or a hardware failure.

My colleague had, before bringing it to me, done all of the things a reasonably technical end user would do. He had power-cycled it. He had factory-reset it. He had updated its firmware, three times, hoping that one of the updates would address the problem. He had even, at one point, replaced the lithium coin cell that backs up the RTC, on the theory that the clock-reset symptom suggested a battery problem. The thermostat had continued, every few days, to lose its mind.

I had a fair guess what the problem was before I opened the case. The wear-out behaviour of NOR flash, in a device that writes to it as often as a thermostat does, is something I have seen perhaps thirty times in the field, and it has a particular signature: the device works, until it does not, and the not-working manifests as the loss of recent state in a way that suggests the flash is no longer accepting writes reliably. The thermostat's symptoms fit the signature exactly. I opened the case to confirm.

The thermostat's small flash chip

The thermostat is built around a Nordic nRF52840, a small ARM Cortex-M4 with integrated Bluetooth and a generous amount of on-chip flash. It also has, mounted next to the SoC, an external 4 megabit SPI NOR flash from Macronix — the MX25L series, in the SOP-8 package, that one sees in approximately half of the consumer IoT devices on the market. The external flash is, on this device, where the thermostat stores its schedules, its calibration data, its connection history, and a small ring buffer of recent operational logs. The on-chip flash is reserved for the firmware itself.

The schedules and the logs, I had assumed before opening the case, were the things being written to often enough to wear out the part. I was, in this assumption, half right.

The MX25L's datasheet specifies, for each of its 128 erasable 4 KB sectors (a 4 megabit part is 512 KB in total), an endurance of one hundred thousand erase cycles. This number is, by the conventions of the industry, a minimum, and is given at twenty-five degrees Celsius. Real-world endurance, at higher temperatures and across the manufacturing variation of any actual batch of parts, is usually somewhat better: perhaps two to four hundred thousand cycles for any individual sector, before the part begins to fail in a statistically detectable way.

A thermostat that writes its operational state to flash once a minute, in a single sector, will reach one hundred thousand cycles in approximately seventy days. The thermostat in front of me had been in service for fourteen months. The math, if the schedules and the logs had been written to a single sector each minute, would have predicted total failure around month three.
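The seventy-day figure is easy to check. The following is a small sketch of the arithmetic, assuming one erase of the same sector on every write; the function name is illustrative and does not come from the thermostat's firmware.

```python
SECONDS_PER_DAY = 86_400


def days_to_reach(cycles: int, interval_s: float) -> float:
    """Days until a sector erased once every `interval_s` seconds reaches `cycles`."""
    return cycles * interval_s / SECONDS_PER_DAY


# Rated endurance of 100,000 cycles, one erase per minute, a single sector.
print(f"{days_to_reach(100_000, 60):.1f} days")  # 69.4 days
```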

The thermostat had survived for fourteen months, in part, because its firmware did the thing one is supposed to do with NOR flash: it spread the writes across the available sectors, using a small home-grown wear-leveling layer, so that no single sector accumulated cycles faster than the others. The wear-leveling layer was not particularly sophisticated, but it was adequate to the task, and it had done its job for four hundred and twenty days before the wear hit the threshold at which the failures started to be visible.
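The firmware's own leveler is not reproduced here. What follows is a minimal sketch of the general idea, assuming a plain round-robin policy over a hypothetical pool of sixteen sectors; the class name, the method name, and the pool size are all illustrative.

```python
class RoundRobinLeveler:
    """Rotates writes through a pool of sectors so erase counts stay even."""

    def __init__(self, sector_count: int) -> None:
        self.erase_counts = [0] * sector_count
        self._next = 0

    def next_sector(self) -> int:
        """Pick the sector for the next write and count one erase against it."""
        sector = self._next
        self.erase_counts[sector] += 1
        self._next = (sector + 1) % len(self.erase_counts)
        return sector


leveler = RoundRobinLeveler(sector_count=16)  # pool size is an assumption
for _ in range(1_000):
    leveler.next_sector()

# No sector ends up more than one erase ahead of any other.
print(min(leveler.erase_counts), max(leveler.erase_counts))  # 62 63
```

Spreading the writes over N sectors divides the per-sector erase rate by N, which is the whole of the trick.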

What "failure" actually looks like

When a NOR flash sector reaches the end of its endurance, it does not stop working all at once. It begins, instead, to fail in a particular and irritating way. The erase operation, which is the part of the cycle that does the most physical damage to the floating gates in the flash cells, becomes unreliable. The chip's status register will report that the erase has completed successfully, but a subsequent read of the sector will reveal that some of the bits are not, as they are supposed to be, all ones. They are, instead, ones with a small admixture of zeroes, in a pattern that grows worse with each successive cycle.

If the firmware does not verify, after each erase, that the sector has actually erased, the firmware will write a fresh block of data into a sector that contains residual zeroes, and the data will be corrupted in a way that depends on the exact pattern of the residual zeroes. If the firmware does verify, it will detect the failure and (in a well-designed system) mark the sector as bad and migrate the data to another sector. The thermostat's firmware, as I confirmed by reading the disassembly, did the verify but did not have anywhere to migrate the data to, because its wear-leveling layer had no provision for retiring sectors. When a sector failed the verify, the firmware logged an error to its operational log (which was, itself, in another wearing-out sector) and continued, on the next minute, to attempt to write to the same failed sector. The eventual reboot was caused by a different problem entirely — a memory corruption that I will get to in a moment — but the reset of the schedules at boot was caused by the wear.
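The missing piece, retiring a sector once it fails the post-erase verify, can be sketched against a toy flash model. This is an illustration under stated assumptions, not the thermostat's code: `SimFlash`, `SectorPool`, the 4 KB sector size, and the single stuck bit in a worn sector are all invented for the example.

```python
ERASED = 0xFF
SECTOR_SIZE = 4096  # assumed 4 KB sectors


class SimFlash:
    """Toy NOR flash: erase sets bits to 1, programming can only clear them."""

    def __init__(self, sector_count: int, worn: frozenset = frozenset()) -> None:
        self.sectors = [bytearray([ERASED] * SECTOR_SIZE) for _ in range(sector_count)]
        self.worn = worn  # sectors that no longer erase cleanly

    def erase(self, sector: int) -> None:
        self.sectors[sector][:] = bytes([ERASED]) * SECTOR_SIZE
        if sector in self.worn:
            self.sectors[sector][0] &= 0xFE  # one bit refuses to return to 1

    def program(self, sector: int, offset: int, data: bytes) -> None:
        for i, byte in enumerate(data):
            self.sectors[sector][offset + i] &= byte

    def read(self, sector: int) -> bytes:
        return bytes(self.sectors[sector])


class SectorPool:
    """Hands out sectors that pass an erase verify and retires those that fail."""

    def __init__(self, flash: SimFlash, sector_count: int) -> None:
        self.flash = flash
        self.good = list(range(sector_count))
        self.bad = []

    def erase_verified(self) -> int:
        while self.good:
            sector = self.good[0]
            self.flash.erase(sector)
            if all(b == ERASED for b in self.flash.read(sector)):
                return sector
            self.bad.append(self.good.pop(0))  # retire it and never retry
        raise RuntimeError("no usable sectors left")

    def store(self, payload: bytes) -> int:
        sector = self.erase_verified()
        self.flash.program(sector, 0, payload)
        return sector


flash = SimFlash(sector_count=4, worn=frozenset({0}))
pool = SectorPool(flash, sector_count=4)
print(pool.store(b"schedule"), pool.bad)  # 1 [0]
```

The difference from the behaviour described above is the `bad` list: a sector that fails the verify is taken out of rotation rather than being tried again a minute later.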

The other thing the firmware was writing

I had, when I started this exercise, assumed that the schedules and the logs were the high-write data on this device. They were not. The thing the firmware was writing most often, by a wide margin, was a small calibration block — about thirty-two bytes — that updated every time the device received an over-the-air sensor reading from a paired temperature probe in the next room. The OTA reading came in roughly every fifteen seconds. The calibration block was written, on every reading, to a fixed location in the flash.

The fixed location, I want to emphasise, was not subject to the wear-leveling layer. The wear-leveling layer applied only to the schedules and the logs. The calibration block was treated, by the firmware author, as something that "rarely changed," and was therefore left in a fixed location for simplicity. In the original design, the calibration block was updated only when the user manually recalibrated the sensor. In a later firmware revision — added, I am fairly confident, by an engineer who had not read the original design notes — the calibration block had become a place where the device cached its latest sensor reading for fast access on reboot.

This is the kind of mistake I see, in firmware reviews, perhaps once a quarter. The original author of the storage layer makes a careful decision: data that changes often goes through the wear-leveler, and data that rarely changes goes to a fixed location. The decision is correct in its context. Six months later, a different engineer, working on a different feature, finds the fixed location and uses it for a different purpose, without realising that the rules have changed. The change is small. It does not, in the moment, cause any visible problem. The flash, however, knows what has happened, and is patient.

In the case of the thermostat in front of me, the calibration block had been re-written, by my calculation, approximately 2.4 million times in fourteen months. The sector containing it had failed at roughly cycle 280,000 (about 49 days in, at one write every fifteen seconds), and the firmware had been writing into a sector with a growing number of stuck bits ever since. The corruption had eventually propagated into a neighbouring data structure that the firmware used to validate its memory, which had triggered the reboot, which had reset the schedules, which had prompted my colleague to call me.
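These figures can be checked the same way, taking the fifteen-second interval at face value and assuming every reading costs one erase of the fixed sector. The helper names and the average month length are assumptions made for the sketch.

```python
SECONDS_PER_DAY = 86_400
DAYS_PER_MONTH = 365.25 / 12  # average month length, an assumption


def writes_after(days: float, interval_s: float) -> float:
    """Number of writes issued in `days` at one write every `interval_s` seconds."""
    return days * SECONDS_PER_DAY / interval_s


def day_of_write(n: int, interval_s: float) -> float:
    """Days elapsed when the n-th write is issued."""
    return n * interval_s / SECONDS_PER_DAY


total = writes_after(14 * DAYS_PER_MONTH, 15)
print(f"{total / 1e6:.2f} million writes in fourteen months")  # 2.45 million
print(f"cycle 280,000 reached after {day_of_write(280_000, 15):.1f} days")  # 48.6 days
```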

On the small lies of endurance specs

The MX25L datasheet, as I noted, specifies one hundred thousand erase cycles per sector. The number is, in the strict sense, accurate. It is also, in the way it is usually invoked in firmware design discussions, slightly misleading.

The way the number is usually invoked is something like this: "the flash can take a hundred thousand writes, so we can write once a minute for sixty-nine days before we have to worry." This is the calculation I have made, in early design meetings, more times than I can count. It is wrong in two ways. First, it conflates writes with erases; in NOR flash, a single sector can accept many writes between erases (the granularity of programming is much smaller than the granularity of erasing), so the relationship between write rate and erase rate depends on the data pattern. Second, and more importantly, it treats the hundred-thousand-cycle number as a wall, rather than as the floor of a distribution.
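The gap between writes and erases is easiest to see with a small record appended slot by slot into one sector, which is erased only when it is full. The sketch below assumes a 4 KB sector and the thirty-two-byte record size mentioned earlier; `SlotLog` is a hypothetical name, and nothing here is taken from the thermostat's firmware.

```python
class SlotLog:
    """Appends fixed-size records into one sector, erasing only when it is full."""

    def __init__(self, sector_size: int = 4096, record_size: int = 32) -> None:
        self.slots = sector_size // record_size  # 128 records per erase
        self.used = 0  # the sector is assumed to start out erased
        self.erase_count = 0

    def append(self) -> None:
        if self.used == self.slots:
            self.erase_count += 1  # sector full: erase, then start again at slot 0
            self.used = 0
        self.used += 1


log = SlotLog()
for _ in range(5_760):  # one day of writes at one every fifteen seconds
    log.append()

print(log.slots, log.erase_count)  # 128 44
```

Under these assumptions a day of fifteen-second writes costs 44 erases rather than 5,760, which is why the write rate alone says little about how fast a sector wears.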

The actual behaviour, as I have observed it across perhaps fifty devices, is that NOR flash begins to fail in subtle ways at somewhere between thirty and seventy percent of the rated endurance, that the failures are concentrated in a particular subset of the sectors (the ones with the most marginal cells, from the original manufacturing), and that a well-designed firmware will tolerate these failures and continue to function, while a poorly designed firmware will fall over visibly within weeks of the first failure.

The thermostat in front of me is an example of the second kind. The firmware was, in many ways, well written. The wear-leveling layer was thoughtful. The CRC checks were diligent. The OTA update path was well-tested. But the firmware had been extended, over its life, by people who did not have a full picture of how the storage layer worked, and the extensions had, slowly and quietly, eaten through the endurance of one particular sector.

What I told my colleague

The thermostat is, in any practical sense, dead. The flash chip is replaceable in principle (it is in a SOP-8 package, on the top side of the board, and with a hot-air rework station I could swap it out in twenty minutes), but the replacement chip would need to be re-programmed with a calibration block whose original contents I do not have, and the device would, in any case, continue to be written to at the same rate, and would fail again, by my calculation, in another fourteen months.

I told my colleague this. I also told him that the manufacturer of the thermostat had, three months earlier, released a firmware update that moved the calibration block into the wear-leveled storage region. The update notes for that firmware, when I dug them up, did not mention the change. It was described as "improvements to reliability." I have a guess as to why the change was made, and a guess as to how the manufacturer figured out it needed to be made, which I will leave to the reader.

My colleague has, in the meantime, gone back to the older mechanical thermostat that the smart one replaced. The mechanical thermostat does not have a flash chip. It does not have a clock. It does not need a firmware update. It will, by my estimation, outlive me.

A note on the prose. The reporting in this magazine is one writer’s — mistakes are her own, and corrections are welcome. Where code appears, it has been tested on at least the hardware named. Where photographs appear, they are the author’s unless captioned otherwise.