The sordid life of software bugs
- Triage – (French, from trier, to sort)
2. the assigning of priority order to projects on the basis of where funds and other resources can be best used, are most needed, or are most likely to achieve success.
In over 20 years of professional software development, I’ve never yet worked on a software project that had sufficient time and resources to do everything we wanted. One of the most painful tasks that project leads face is making tough choices about what features to add and remove, and what bugs to fix. Alas, Expression Web is no exception.
When you’re dealing with an application with so many features, it’s not really a surprise that there might be a few bugs. The real question is, given our limited resources and time, how do we decide which bugs to fix as we individually review each bug? Before I answer that, let me back up a step and review the various ways that bugs are found and submitted for triage.
Bugs that are filed by our users, usually via the Microsoft Connect website at https://connect.microsoft.com/expression – we have a very knowledgeable core base of users who have found some of the oddest, as well as some of the most common, bugs in our program. While we would, of course, prefer that our users never experience a single bug, we’re very grateful for those users who have reported bugs to us, particularly when those users go to the extra effort of doing some investigations on their own, reporting additional data that can help us reproduce the bug and fix it.
Bugs that have been postponed from version 1 or version 2 – the bug may have been postponed because it came in too late to fix or it was in an area of code that we hadn’t planned to invest in. If it was important enough, though, it got held over and re-triaged in the version we’re currently developing.
Bugs in features developed for version 1 or version 2, discovered by someone on the Expression Web team (usually by our test team or by other team members) – even features that have been in the product for a couple of releases can still have a bug crop up through some new set of circumstances or someone on the team saying, “What will happen if I do...?”
Bugs in new features, discovered by our test team – as new features are developed, our test team puts those new features through a rigorous set of automated tests and manual tests, and they inevitably find several bugs in the process.
Regressions – Occasionally, we’ll change code in one area of the program, either to fix a bug or implement new functionality, which may cause a problem in another area of the program. These bugs are flagged as regressions – something that used to work that no longer does.
All of these bugs are added to our bug database and they are considered, and occasionally reconsidered, in daily triage sessions. These triage sessions include one or more representatives from the development team, the test team, and the program management team. Discussions in our triage meetings can be fairly extensive, and occasionally heated, as we tend to be very passionate about the quality of our program, as well as the importance of meeting our deadlines.
So how do we actually decide which of these bugs to fix? We look at several factors:
What is the priority? – Every bug is given a priority rating of 1, 2, or 3. A priority of 1 means that the bug is very important and that we need to fix it in the current milestone; we would even hold back the program from shipping, if necessary. A priority of 3 means that this is a “nice to have” fix that we would like to do if time permits. Sadly, time rarely permits, and most Priority 3 bugs remain unfixed.
What is the severity? – Every bug is given a severity rating of 1, 2, 3, or 4. A severity of 1 means that there is significant user impact – that the bug can cause a crash or data loss, or has security and privacy implications. A severity of 4 means that there is minimal user impact – this could be something like awkward wording on an error message that few of our users will ever encounter.
What is the risk? – The risk of fixing a bug has to do with where the fix is in the code and how likely it is to cause a regression in that feature or in some other feature. Evaluating the risk is highly subjective, but developers and testers alike develop over time a pretty good idea of where our potential “problem” code is and where the “safe” code is. All other factors being equal, we’re far more likely to approve a fix in code that’s safe than one in code that’s high risk, particularly as we get closer to the end of the project.
What is the cost? – As with risk, the cost of fixing a bug is subjective, but it is still an important consideration. If, for example, a developer has left out a little-used parameter from a PHP command, such that it doesn’t show up in our IntelliSense list, that’s a bug with pretty low severity and priority ratings. On the other hand, I can have this fixed and tested in less than an hour, which means that it will likely be accepted by our triage team even with its lower priority. The test cost is just as important as the development cost, since there are times when a bug may be easy for a developer to fix, but the fix may be in an area that affects several core scenarios, requiring a full test pass to validate the fix.
What is the likelihood that our users will encounter this bug? – The more likely it is that our users will encounter a bug, the more likely it is that we will fix it. A bug that hits 1% of our users roughly 1% of the time is not very likely to be fixed. If this hits everybody, though, and it’s something that they will encounter fairly frequently, we would definitely want to fix it.
Is this a regression? – A bug in a feature that used to work that now no longer does will be given a higher priority than if that same bug had been there all along and had just never been found. We’re actively investing in automated tests of both the basic functionality of our product and the high priority and high risk bugs that we find. Our goal, over time, is to reduce the number of regression bugs introduced in the product by catching them as soon as the developer makes the mistake.
Where are we in the development process? – We often get bug reports, either from our internal users and testers or from our Public Beta users, very late in the program lifecycle. The later we are in our development process, the higher our triage bar gets. By the time we get to the final weeks of the project, we are usually only willing to consider recall-class bugs – bugs so devastating that we would literally recall the program if we found that we had submitted it for release with the bug included. One of the most frustrating scenarios we face in triage is rejecting a bug that we know we would have fixed had we found it just a few weeks earlier. As noted above, we’re investing heavily in improving our automated test suite so that, in future releases, we can take bugs fairly late in the development process with a higher confidence level that the fix won’t cause a problem elsewhere in the product.
Is this a security or data loss issue? – We take both of these scenarios very seriously, more seriously, perhaps, than just about any other kind of bug. If someone can compromise your data or your machine because of a bug in our code, we cannot, and will not, ship the program with that bug. If a bug can cause you to lose data that you have just spent hours entering, then we will do our best to fix it.
Is this bug in a core investment area? – In each release, we pick a few areas of the code that we intend to invest in heavily. If a reported bug is in one of those core investment areas, we will look on it a bit more favorably than we otherwise would.
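To make the list above a little more concrete, here is a minimal sketch of how a single triage record might be written down in Python. This is purely illustrative: the `BugReport` class, its field names, the `Level` values, and the `exposure` helper are all assumptions made for this example, not the schema of our actual bug database. The only details taken from the criteria above are the rating ranges (priority 1 to 3, severity 1 to 4) and the "1% of users, 1% of the time" case.

```python
from dataclasses import dataclass
from enum import Enum


class Level(Enum):
    """Coarse, subjective rating (hypothetical scale for this sketch)."""
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"


@dataclass(frozen=True)
class BugReport:
    """One triage record; field names are illustrative, not a real schema."""
    title: str
    priority: int                # 1 = fix in this milestone ... 3 = nice to have
    severity: int                # 1 = crash, data loss, security ... 4 = minimal impact
    risk: Level                  # chance that the fix causes a regression
    cost: Level                  # development cost plus test cost
    likelihood: Level            # how likely users are to hit the bug
    is_regression: bool
    late_in_cycle: bool
    security_or_data_loss: bool
    core_investment_area: bool

    def __post_init__(self):
        # Reject ratings outside the ranges described in the criteria.
        if self.priority not in (1, 2, 3):
            raise ValueError("priority must be 1, 2, or 3")
        if self.severity not in (1, 2, 3, 4):
            raise ValueError("severity must be 1, 2, 3, or 4")


def exposure(fraction_of_users: float, fraction_of_time: float) -> float:
    """Rough share of all usage in which the bug shows up."""
    return fraction_of_users * fraction_of_time


# The "1% of users, roughly 1% of the time" case from the likelihood criterion.
rare = exposure(0.01, 0.01)
print(f"{rare:.4%}")  # 0.0100%, i.e. about 1 in 10,000
```

Multiplying the two fractions is a deliberate simplification, but it shows why such a bug sits so low on the list: it surfaces in roughly one out of every ten thousand uses.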
An example of a really painful triage decision we had to make in Expression Web 2 was the issue of link fixup for PHP include files. A bug came in fairly late in our development process regarding including a PHP file in an HTML file, then renaming that PHP file. In other link fixup scenarios, the program recognizes that you have renamed a file that is referenced by other web pages and asks you if you want to automatically adjust the links to point to the renamed file. For included PHP files, this process was failing.
After careful investigation, we had the following answers for the various triage criteria:
- Priority = 1. This is a core scenario.
- Severity = 2. This wasn’t a crash, a security issue, or data loss but it was pretty important functionality.
- Risk = High. The change affected code that was central to quite a few core scenarios and it was in code that had had problems in the past. The risk of regression was quite high.
- Cost = High. The development cost was medium but the test cost in this case was pretty high.
- Likelihood = High. This is a core scenario that anyone using PHP includes is likely to encounter.
- Regression = No. This was new functionality, so no regression.
- Where are we in the development process = Late.
- Security or data loss = No.
- Core investment area = Yes.
Earlier in the development cycle, this bug would have been a no-brainer to fix. It’s a core scenario that is likely to affect many of our users, in an area where we’re investing significant resources. End users can work around this issue by using the Find and Replace feature, or by manually updating their links, but that’s not as convenient and it’s contrary to our behavior with other types of included files. However, given the risk, the cost, and how late in the cycle the bug was reviewed, we ultimately decided that, painful though it was, we just could not take this fix for Expression Web 2. I’m pleased to say that the bug has already been reconsidered for, and fixed in, Expression Web 3, and that the fix will be available in our first preview release.
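For readers who think better in code, the same trade-off can be sketched in a few lines of Python. The answers in `php_include_bug` are copied from the list above; everything else, including the `triage` function, its ordering of checks, and the "early"/"late" stage labels, is a made-up rule of thumb for illustration. Real triage is a discussion among people, not a formula.

```python
# Answers for the PHP include link-fixup bug, copied from the list above.
php_include_bug = {
    "priority": 1,
    "severity": 2,
    "risk": "high",
    "cost": "high",
    "likelihood": "high",
    "regression": False,
    "security_or_data_loss": False,
    "core_investment_area": True,
}


def triage(bug: dict, stage: str) -> str:
    """Hypothetical rule of thumb; returns "fix" or "postpone".

    stage is "early" or "late", made-up labels for where we are in the cycle.
    """
    if stage not in ("early", "late"):
        raise ValueError("stage must be 'early' or 'late'")
    if bug["security_or_data_loss"]:
        # Security and data loss issues outrank every other consideration.
        return "fix"
    if stage == "late" and (bug["risk"] == "high" or bug["cost"] == "high"):
        # The triage bar rises as the ship date approaches.
        return "postpone"
    if bug["priority"] == 1 or bug["likelihood"] == "high":
        return "fix"
    return "postpone"


for stage in ("early", "late"):
    print(stage, "->", triage(php_include_bug, stage))
```

With these assumed rules, the same set of answers comes out as a fix early in the cycle and as a postponement late in it, which mirrors the decision described above.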
As you can imagine, seeing the list of criteria, not to mention seeing how subjective many of them are, our triage sessions are often contentious. At Microsoft, one of the core attributes we value and cultivate is passion and we definitely have a lot of passion around which bugs we want to fix and which we don’t. At the end of the day, though, the one thing that holds us together is that we also have a passion for doing the right thing for our end users. Sometimes, the right thing is to not fix the bug, as with the example I cited above. We just could not take the risk of destabilizing the program, potentially causing any number of other problems for our end users, painful though that decision was. Sometimes, though, and these are happier decisions, the right thing is to fix the bug, as was the case when we re-triaged the bug for Expression Web 3.
If you are ever wondering whether you should submit a bug report to the Expression Web team, the answer is an unqualified yes. We need this kind of feedback from you, our end users, in order to help us meet our goals for shipping a quality product. The detail you provide in your bug report could be the final clue that helps us track down the cause of the bug and yours could be the report that helps us fix a bug that thousands of our users could encounter. We need the feedback from you, both positive and negative, in order to help us assess whether we’re meeting your needs and our own goals. Without you, and your support, we cannot succeed. We are always grateful for any feedback you can provide and will always do whatever we can to meet your needs.
Paul Bartholomew
Development Lead
Expression Web
Comments
Anonymous
September 25, 2008
The secret life of software bugs, exposed! is a very interesting look into the software bug process within