Good luck NASA, or "The Dreaded No Repro"

Poor NASA. Yesterday's shuttle launch was scrubbed at the last minute due to a malfunctioning fuel sensor. This particular sensor is responsible for shutting down the engines if the shuttle runs out of fuel. Apparently, their engine can run without fuel, but it's a very bad thing. They have two other redundant sensors in the system that are working, but they made the right call in postponing the mission until the problem was understood.

This isn't the first time NASA engineers have seen this particular problem. Back in April, as I understand it, the same sensor malfunctioned briefly. Then it worked. Then it malfunctioned. Then they replaced a whole bunch of "magic shuttle parts" and it worked again. Problem solved... or so they thought. The key is that they never figured out why the original part malfunctioned. This is the dreaded no repro bug.

I face these daily as a software tester. You find a bad bug, you narrow down the steps, you file a detailed bug report, and the developer resolves it as "no repro" because they can't create the problem in their own test environment. There are a million possibilities for why this happened: insufficient repro information in the bug report, a misinterpretation of the repro steps on the developer's part (possibly due to ambiguous repro information), significant changes to the product since the bug was found, a different configuration environment (e.g., you're running Windows XP, but the dev had Windows Server 2003 installed), sunspots, etc.
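The configuration-mismatch case is the easiest one to rule out if both sides record what they are running. Here is a minimal sketch of that idea in Python. The function names and the build identifier are made up for illustration; they are not part of any real bug-tracking tool, and a real snapshot would likely need more fields than these.

```python
import json
import platform
import sys


def environment_snapshot(build_id):
    """Collect basic facts about the machine a bug was found (or not found) on.

    build_id is a hypothetical label for the product build under test.
    """
    return {
        "build_id": build_id,
        "os": platform.system(),
        "os_release": platform.release(),
        "os_version": platform.version(),
        "machine": platform.machine(),
        "python": sys.version.split()[0],
    }


def diff_environments(tester, developer):
    """Return only the fields whose values differ between two snapshots.

    Each differing field maps to a (tester_value, developer_value) pair.
    """
    keys = sorted(set(tester) | set(developer))
    return {
        key: (tester.get(key), developer.get(key))
        for key in keys
        if tester.get(key) != developer.get(key)
    }


if __name__ == "__main__":
    # Paste this output into the bug report alongside the repro steps.
    print(json.dumps(environment_snapshot("hypothetical-build-1234"), indent=2))
```

If the tester and the developer each attach a snapshot, `diff_environments` shows at a glance whether "no repro" might simply mean "different OS" or "different build". It won't help with ambiguous repro steps or sunspots.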

At this point, you have several options:

  1. Blindly trust the developer (and if you do, I have some prime land in Florida to sell you) and close the bug
  2. Try to reproduce the bug yourself again. This may be difficult if you've already updated your system to a newer build than the one you found the bug on. Or it might just take a long time, or it might work! If you can reproduce it, provide remote access to the developer or ask them to come look at your machine in person.
  3. Go to the developer's machine and try to reproduce it there
  4. Broadcast the bug report to your team to see if anyone else has supporting data or has seen the bug themselves

Of course, which path you choose depends on lots of things, too many to go into here. In the end, you need to reproduce the problem and understand the core issue before you can have confidence in the fix. It's important to remember that you can always prove something doesn't work, but it's really hard, and often impossible, to prove something does work. This is the big challenge NASA faces today.

Good luck, NASA.
