Vol. III · No. 24 · The weekly

The Bootloader That Outlived Its Author

The piece of code I want to write about today was committed to a private SVN repository on the afternoon of November 14th, 2009, by an engineer whose name I will not repeat, in a small office above a bakery in the Koramangala neighbourhood of Bangalore. The commit message read, in its entirety, "first cut, works on the dev board, needs review." The engineer was twenty-six years old. The bootloader he had written was four hundred and eleven lines of assembly and C, targeting a now-discontinued ARM7TDMI part, and was intended to bring up the first hardware revision of a small industrial sensor that the company planned to ship the following spring.

The sensor shipped. The bootloader was, in the way these things go, never reviewed, because the engineer was promoted off the project within six weeks, and the engineer who replaced him had three other projects on his desk and trusted, reasonably enough, that the bootloader that was bringing up the device every time it powered on was probably good enough. Over the next five years, the bootloader was ported to two new chip families, expanded to handle a more sophisticated update format, and pressed into service for four additional product lines. The core of it — the first hundred lines, the part that sets up the stack and clears the BSS and configures the clock tree — was, by 2014, running on perhaps two million devices, in factories and refineries and food-processing plants across South and Southeast Asia.

The engineer who wrote it died in a car accident in 2017. He was thirty-four. He had been working, at the time, in a different industry entirely, on a problem that had nothing to do with embedded systems. He had not touched the bootloader in eight years. He had, as far as I have been able to determine from a long conversation with his widow, no idea how many devices it was still running on.

The bootloader is still running. By my count — and I have spent the better part of three weeks on this — it is now installed on something in excess of nine million units. Most of those units are still in service. The bootloader has not been substantively modified since 2014.

I have been thinking, for the last several months, about what one owes to the people who will inherit one's code.

What the bootloader does

The bootloader, in its essential function, does five things, in this order. It runs from a fixed location in flash. It configures the chip's clock tree, bringing the core up from its post-reset frequency to its operating frequency, and configuring the peripheral clocks. It initialises a small amount of SRAM, sets up the stack pointer, and clears the BSS section. It checks, by computing and verifying a CRC, that the application image in the upper portion of flash is intact. If the image is intact, it jumps to the application's entry point. If it is not, or if a particular GPIO is held low at boot, it falls into a small recovery loop that listens on a UART for a new image.
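The last two of those steps, the integrity check and the choice between application and recovery, can be sketched in C that runs on an ordinary host. This is an illustration under stated assumptions, not the original source: the names boot_decide, crc32_ieee and boot_action are hypothetical, and the article does not say which CRC the bootloader uses, so the common reflected CRC-32 polynomial is assumed here. The clock and SRAM steps are specific to the hardware and are left out.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical outcome of the boot decision; names are illustrative. */
enum boot_action { BOOT_JUMP_TO_APP, BOOT_ENTER_RECOVERY };

/* Bitwise CRC-32 with the reflected polynomial 0xEDB88320.
 * Assumed for this sketch; the real bootloader's CRC is not specified. */
uint32_t crc32_ieee(const uint8_t *data, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; bit++) {
            if (crc & 1u)
                crc = (crc >> 1) ^ 0xEDB88320u;
            else
                crc = crc >> 1;
        }
    }
    return crc ^ 0xFFFFFFFFu;
}

/* Verify the image, then choose between the application and recovery.
 * recovery_pin_low models the GPIO that forces recovery when held low.
 * Checking the pin before the CRC is an assumption of this sketch. */
enum boot_action boot_decide(const uint8_t *image, size_t len,
                             uint32_t expected_crc, int recovery_pin_low)
{
    if (recovery_pin_low)
        return BOOT_ENTER_RECOVERY;
    if (crc32_ieee(image, len) != expected_crc)
        return BOOT_ENTER_RECOVERY;
    return BOOT_JUMP_TO_APP;
}
```

On a real part the expected CRC would be stored somewhere alongside the image; where and how is not stated in anything I have described, so the sketch simply takes it as a parameter.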

This is, by any reasonable measure, what a bootloader is supposed to do. The reason I am writing about this particular one, fifteen years after its first commit, is that its author managed, in his first cut, to do all five of these things in a way that has not, in nine million field deployments, ever failed in a manner attributable to the bootloader itself.

Every device that has died in the field — and many have died, because nine million devices over fifteen years includes a great many failure modes — has died for some reason that is not the bootloader's fault. The flash has worn out, after a sufficiently long career of writes. The crystal has drifted, in a hot enough environment, far enough out of spec that the clock-tree configuration cannot lock. The voltage regulator has failed. The PCB has corroded. The bootloader has continued to attempt to do its job, and has, in cases where the underlying hardware was still functional, succeeded.

This is a rare property. I have written, over fourteen years, perhaps fifteen bootloaders. None of mine has had this property. I have spent the last three weeks trying to figure out what the author of the 2009 bootloader knew that I do not, and the answer has surprised me in some respects.

A short tour of the source

The bootloader is not, by modern standards, an example of beautiful code. The variable names are short and not consistently capitalised. There are no doc comments. There is exactly one block comment, near the top of the main function, which reads: // init clocks first, see datasheet section 6.3, do not change order. The function boot_main is one hundred and seventy lines long, and contains a goto. The target chip had no branch predictor, but the author has nevertheless arranged the conditionals so that the common path falls through and the failure paths jump.
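The shape of that control flow, with the expected case running straight down the function and the failures branching away to a shared exit, looks roughly like the sketch below. It is a hypothetical reduction, not an excerpt: boot_main_sketch, boot_status and the two checks are invented for illustration, and a compiler is free to lay the branches out differently from the source.

```c
#include <stdbool.h>

enum boot_result { BOOT_APP = 0, BOOT_RECOVERY = 1 };

/* Inputs stand in for hardware checks; all names are hypothetical. */
struct boot_status {
    bool image_crc_ok;      /* application image passed its CRC check */
    bool recovery_pin_low;  /* recovery GPIO held low at boot */
};

enum boot_result boot_main_sketch(const struct boot_status *s)
{
    /* Each test is phrased so that the expected case skips the branch
     * and execution continues straight down the function. */
    if (s->recovery_pin_low)
        goto recovery;
    if (!s->image_crc_ok)
        goto recovery;

    return BOOT_APP;        /* common path: reached by falling through */

recovery:
    return BOOT_RECOVERY;   /* single shared failure exit */
}
```

A single failure label also keeps the recovery entry in one place, which is one plausible reason for a goto in a function of that length.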

What the bootloader does have, in abundance, is a kind of paranoid clarity about the small physical realities of bringing up a chip from cold. The clock-tree configuration is bracketed, on either side, by checks that the PLL has actually locked, with timeouts on each check. The CRC verification is performed twice, on two different passes of the application image, and the results are compared, in case a bit flips during the first pass. The jump to the application's entry point is preceded by a careful sequence in which the bootloader disables every peripheral it has configured, restores every register it has touched to its post-reset value, and only then sets the program counter.
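Two of those habits, the bounded wait for PLL lock and the repeated integrity check, are easy to show in miniature. The sketch below is host-runnable C under loud assumptions: the function-pointer hooks, the fake_pll simulator and the byte-sum stand-in for a CRC are all hypothetical, and on real hardware the lock test would read a status register and the image would be read through a volatile pointer so that the second pass is not optimised away.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical hook reporting whether the PLL lock bit is set. */
typedef bool (*pll_locked_fn)(void *ctx);

/* Poll for lock at most max_polls times, so a dead PLL cannot hang boot. */
bool wait_for_pll_lock(pll_locked_fn is_locked, void *ctx, uint32_t max_polls)
{
    for (uint32_t i = 0; i < max_polls; i++) {
        if (is_locked(ctx))
            return true;
    }
    return false;  /* timed out: the caller decides how to fail */
}

/* Hypothetical checksum hook, standing in for the bootloader's CRC. */
typedef uint32_t (*checksum_fn)(const uint8_t *data, size_t len);

/* Run the checksum in two separate passes and accept the image only if
 * the passes agree with each other and with the expected value. */
bool verify_twice(checksum_fn sum, const uint8_t *image, size_t len,
                  uint32_t expected)
{
    uint32_t first = sum(image, len);
    uint32_t second = sum(image, len);
    return first == second && second == expected;
}

/* Simulated PLL for host testing: locks after a fixed number of polls. */
struct fake_pll { uint32_t polls_until_lock; };

bool fake_pll_locked(void *ctx)
{
    struct fake_pll *pll = (struct fake_pll *)ctx;
    if (pll->polls_until_lock == 0)
        return true;
    pll->polls_until_lock--;
    return false;
}

/* Trivial byte sum used only to exercise verify_twice in this sketch. */
uint32_t sum_bytes(const uint8_t *data, size_t len)
{
    uint32_t total = 0;
    for (size_t i = 0; i < len; i++)
        total += data[i];
    return total;
}
```

The point of the timeout is that the failure becomes a value the caller can act on, rather than a loop that never returns.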

None of this is, on its own, remarkable. What is remarkable is that all of it is there, in the first cut, in 2009. The author did not learn, over five years of field failures, that the PLL sometimes does not lock, that a bit sometimes flips during a CRC pass, and that the application sometimes inherits a peripheral the bootloader forgot to disable. He started by assuming that all of these things could happen, and he wrote code that handled all of them.

I asked his widow, when we spoke, whether her late husband had been an unusually careful person in other domains. She told me, with a small laugh, that he had been the kind of man who triple-checked the lock on the apartment door, and who would not eat at a restaurant whose kitchen he had not, at some point, personally inspected.

What I think the lesson is

I have been trying, for some weeks, to articulate the lesson here in a way that does not lapse into either hagiography or platitude. Both are easy. Both miss what I think is the actually useful point.

The actually useful point is that the bootloader was written by a man who, in 2009, in a small office above a bakery, assumed that he was writing it for people he would never meet. He was. The engineers who have maintained the bootloader since — and there have been several, of whom I am, for some of the more recent ports, one — have never met him. The factory technicians who flash the bootloader onto new devices have never met him. The end users of the sensors, in their refineries and food-processing plants, have certainly never heard his name. He wrote code, in 2009, as if all of these people existed and would, in the fullness of time, depend on his work.

This is not a common attitude. Most of the firmware I see, in code reviews and in audits and in the wreckage of failed products, is written as if the only person who will ever read it is the person writing it. The variable names are short because the author knows what they mean. The comments are absent because the logic is clear to the author. The error paths are unhandled because the author cannot imagine the error actually occurring. The code is, in a precise sense, written for an audience of one, and the audience of one is the author, in the moment of writing.

The 2009 bootloader is written for an audience of two: the author, and an unspecified future engineer who does not yet exist and may never read it, but who, if they do, will need to understand what is going on and why. The second audience is what gives the code its particular shape — the careful comments at the points where the order matters, the explicit references to datasheet sections, the paranoid double-checks at every place where the hardware can lie. The author was, in the moment of writing, in conversation with someone he could not see.

On writing for the unseen

I have started, in my own bootloaders, to try to write for the second audience. I am not yet very good at it. The instinct to write for an audience of one is strong, and the deadlines that surround a bring-up are not friendly to the patient elaboration of comments and the careful naming of variables and the explicit handling of edge cases that probably will not occur. I find, over and over, that the code I commit on a Friday afternoon is the code of the audience of one, and the code I commit on a Sunday morning, when I have had time to think, is sometimes the code of the audience of two.

The 2009 bootloader was committed at 4:17 PM on a Saturday, according to the SVN log. I do not know what kind of Saturday it had been for the author. I do know that the code is the code of an engineer who had, at some point in his short career, learned to write for the unseen audience, and who could do it, by 2009, even on a Saturday afternoon when he was probably also tired and would probably rather have been doing something else.

The bootloader will, by my estimation, still be running on some non-trivial number of devices in 2035. By then it will have been in service for twenty-six years. The chip families it targets will be long out of production. The companies that built the sensors will have been acquired and folded and re-spun. The author will have been dead for nearly two decades. The bootloader will continue, at every power-on, to do its small, careful job, written for engineers who, by then, will not have been born when it was first committed.

I would like, before I retire, to write one piece of code that has this property. I do not know whether I will manage it. The discipline is harder than it looks.

A note on the prose. The reporting in this magazine is one writer’s — mistakes are her own, and corrections are welcome. Where code appears, it has been tested on at least the hardware named. Where photographs appear, they are the author’s unless captioned otherwise.