When Triple-Redundancy isn't enough

The IEEE Spectrum article "Space Station: Internal NASA Reports Explain Origins of June Computer Crisis" details the root cause of the total failure of the International Space Station computers. It is fascinating and frightening at the same time!

I recall a similarly baffling situation in a previous life when I worked on embedded controllers for medical diagnostic equipment. The instrument was temperature controlled internally, with heaters on top of the diagnostic channels and fans underneath to keep the electronics cool. One machine deployed for clinical trials would repeatedly get too hot. This is a bad thing for the chemistry involved, as the reactions have to occur within a pretty tight temperature window to be considered valid. The field support engineers couldn't reliably reproduce the heat conditions, and we were preparing to scramble a team of hardware engineers, software engineers, and chemists to debug the problem.

Well, it turned out this machine was in a basement room with a western exposure and windows along the top of the exterior wall. Every day, sometime after noon, the sun would shine through the windows, which didn't have blinds or curtains, and heat up the instrument. The solution, some curtains (or, as it turned out for the short term, some cardboard), didn't become obvious until one insightful support engineer asked to run some tests during the day rather than after hours.

This article really interests me because of the psychology of the people involved in trying to solve the problem. I'm not being anti-Russian or pro-American, but designing complex software and mechanical systems is hard work - and honest mistakes (though whether the mistake on the ISS was an honest one is debatable) do happen.

In my world today, TDD and unit testing are held up by many as the solution to all our software development woes. I do believe unit tests have a prominent position and are very, very important assets... but nothing replaces insight, understanding, context, and zero tolerance for finger-pointing - all non-technical skills.
