Larry's rules of software engineering #2: Measuring testers by test metrics doesn't.

This one’s likely to get a bit controversial. :)

There is an unfortunate tendency among test leads to measure the performance of their testers by the number of bugs they report.

As best as I’ve been able to figure out, the logic works like this:

Test Manager 1: “Hey, we want to have concrete metrics to help in the performance reviews of our testers. How can we go about doing that?”
Test Manager 2: “Well, the best testers are the ones that file the most bugs, right?”
Test Manager 1: “Hey that makes sense. We’ll measure the testers by the number of bugs they submit!”
Test Manager 2: “Hmm. But the testers could game the system if we do that – they could file dozens of bogus bugs to increase their bug count…”
Test Manager 1: “You’re right. How do we prevent that then? – I know, let’s just measure them by the bugs that are resolved “fixed” – the bugs marked “won’t fix”, “by design” or “not reproducible” won’t count against the metric.”
Test Manager 2: “That sounds like it’ll work, I’ll send the email out to the test team right away.”

Sounds good, right? After all, the testers are going to be rated by an absolute value based on the number of real bugs they find – not the bogus ones, but real bugs that require fixes to the product.

The problem is that this idea falls apart in reality.

Testers are given a huge incentive to find nit-picking bugs: instead of hunting for significant bugs in the product, they go after whatever bugs will pad their bug count. And they get very combative with the developers if the developers dare to resolve their bugs as anything other than “fixed”.

So let’s see how one scenario plays out using a straightforward example:

My app pops up a dialog box with the following:

            Plsae enter you password: _______________

Where the edit control is misaligned with the text.

Without a review metric, most testers would file a bug with a title of “Multiple errors in password dialog box” which then would call out the spelling error and the alignment error on the edit control.

They might also file a separate localization bug because there’s not enough room between the prompt and the edit control (separate because it falls under a different bug category).

But if the tester has their performance review based on the number of bugs they file, they now have an incentive to file as many bugs as possible. So the one bug morphs into two bugs – one for the spelling error, the other for the misaligned edit control. 

This version of the problem is a total and complete nit – it’s not significantly more work for me to resolve one bug than it is to resolve two, so it’s not a big deal.

But what happens when the problem isn’t a real bug? Remember, bugs that are resolved “won’t fix” or “by design” don’t count against the metric, precisely so that testers can’t flood the bug database with bogus bugs to artificially inflate their bug counts.

Tester: “When you create a file while logged on as an administrator, the owner field of the security descriptor on the file is set to BUILTIN\Administrators, not the current user.”
Me: “Yup, that’s the way it’s supposed to work, so I’m resolving the bug as by design. This is because NT considers all administrators as idempotent, so when a member of BUILTIN\Administrators creates a file, the owner is set to the group to allow any administrator to change the DACL on the file.”

Normally the discussion ends here. But when the tester’s going to have their performance review score based on the number of bugs they submit, they have an incentive to challenge every bug resolution that isn’t “Fixed”. So the interchange continues:

Tester: “It’s not by design. Show me where the specification for your feature says that the owner of a file is set to the BUILTIN\Administrators account”.
Me: “My spec doesn’t. This is the way that NT works; it’s a feature of the underlying system.”
Tester: “Well then I’ll file a bug against your spec since it doesn’t document this.”
Me: “Hold on – my spec shouldn’t be required to explain all of the intricacies of the security infrastructure of the operating system – if you have a problem, take it up with the NT documentation people”.
Tester: “No, it’s YOUR problem – your spec is inadequate, fix your specification. I’ll only accept the “by design” resolution if you can show me the NT specification that describes this behavior.”
Me: “Sigh. Ok, file the spec bug and I’ll see what I can do.”

So I have two choices – either I document all these subtle internal behaviors (and security has a bunch of really subtle internal behaviors, especially relating to ACL inheritance) or I chase down the NT program manager responsible and file bugs against that program manager. Neither of which gets us closer to shipping the product. It may make the NT documentation better, but that’s not one of MY review goals.
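As an aside, here’s a minimal sketch, mine rather than anything from the actual bug report, of how you could see the behavior we were arguing about: it reads the owner SID off a file’s security descriptor with the Win32 security APIs and prints it. The file path is just a placeholder, and on later versions of Windows the default owner may be the creating user rather than BUILTIN\Administrators.

    /* Sketch: print the owner SID of a file.  Link with advapi32.lib. */
    #include <windows.h>
    #include <aclapi.h>
    #include <sddl.h>
    #include <stdio.h>

    int main(void)
    {
        /* Hypothetical file created while logged on as an administrator. */
        wchar_t path[] = L"C:\\test.txt";
        PSID owner = NULL;
        PSECURITY_DESCRIPTOR sd = NULL;

        DWORD err = GetNamedSecurityInfoW(path, SE_FILE_OBJECT,
                                          OWNER_SECURITY_INFORMATION,
                                          &owner, NULL, NULL, NULL, &sd);
        if (err != ERROR_SUCCESS) {
            fprintf(stderr, "GetNamedSecurityInfo failed: %lu\n", err);
            return 1;
        }

        LPWSTR ownerString = NULL;
        if (ConvertSidToStringSidW(owner, &ownerString)) {
            /* On the systems described above, a file created by a member of the
               Administrators group typically shows S-1-5-32-544
               (BUILTIN\Administrators) here, not the creating user's SID. */
            wprintf(L"Owner SID: %ls\n", ownerString);
            LocalFree(ownerString);
        }

        LocalFree(sd);
        return 0;
    }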

In addition, it turns out that the “most bugs filed” metric is often flawed in the first place. The tester who files the most bugs isn’t necessarily the best tester on the project. Often the tester who is most valuable to the team is the one who goes the extra mile, investigates the underlying causes of bugs, and files bugs with detailed information about those causes. They’re not the most prolific testers, because they spend the time to verify that they have a clean reproduction and good information about what is going wrong. Instead of spending that time finding nit bugs, they spend it making sure that the bugs they find are high quality: the bugs that would have stopped us from shipping, not the “the florblybloop isn’t set when I twiddle the frobjet” bugs.

I’m not saying that metrics are bad. They’re not. But basing people’s annual performance reviews on those metrics is a recipe for disaster.

Somewhat later: After I wrote the original version of this, a couple of other developers and I discussed it a bit at lunch. One of them, Alan Ludwig, pointed out that one of the things I missed in my discussion above is that there should be two halves of a performance review:

            MEASUREMENT: Give me a number that represents the quality of the work that the employee is doing.
            EVALUATION: Given the measurement, is the employee doing a good job or a bad job? In other words, you need to assign a value to the metric: how relevant is the metric to the person’s performance?

He went on to discuss the fact that any metric is worthless unless it is periodically reevaluated to determine how relevant it still is; a metric is only as good as its validity.

One other comment that was made was that absolute bug count metrics cannot be a measure of the worth of a tester. The tester that spends two weeks and comes up with four buffer overflow errors in my code is likely to be more valuable to my team than the tester that spends the same two weeks and comes up with 20 trivial bugs. Using the severity field of the bug report was suggested as a metric, but Alan pointed out that this only worked if the severity field actually had significant meaning, and it often doesn’t (it’s often very difficult to determine the relative severity of a bug, and often the setting of the severity field is left to the tester, which has the potential for abuse unless all bugs are externally triaged, which doesn’t always happen).

By the end of the discussion, we had all agreed that bug counts were an interesting metric, but they couldn’t be the only metric.

Edit: To remove extra <p> tags :(

Comments

  • Anonymous
    April 20, 2004
    The comment has been removed

  • Anonymous
    April 20, 2004
    I'll bite on the controversy.

    That's a bit one-sided (says Drew the tester).
    Blindly-applied metrics like that can motivate developers to be unreasonable, too. When a dev manager says "if you have more than X open bugs you can't do new work on project Y" (yes- this can happen), we have devs who refuse to fix real bugs, resolving "won't fix" or "by design" or even "no repro" so that their bug counts can remain low. That's incredibly frustrating as a tester. Especially when the dev is in a different org or even a separate division of the company - we testers can't even get away with schmoozing to get the fix then.
    For the record, I tend to agree that lumping several related problems into one bug is easier to track. The flip side is that I've also had bugs resolved as fixed when not all of the problems in the bug report were fixed yet. The right style seems to depend on the developer the bug is assigned to as much as the tester that filed the bug.

    I'm not so sure the validity of the metrics is the problem either. A true measurement could still be meaningless. IMHO, the root of the problem is determining how to make information from data.

  • Anonymous
    April 20, 2004
    The comment has been removed

  • Anonymous
    April 26, 2004
    Being an STE myself, I must agree with Mr. Osterman. I have to say that basing evaluations of a tester solely on bug stats seems completely short-sighted, even if you are taking into account the priority and/or severity of the bugs found.

    There are other aspects to being a software tester than writing bugs; there are quantifiable stats such as test cases written, test scenarios written, test cases completed, and test scenarios completed, just to name a few.

    Evaluation becomes more difficult when we attempt to quantify the more intangible quality criteria such as leadership, communication, thoroughness, etc.

  • Anonymous
    April 26, 2004
    I guess you meant "omnipotent", not "idempotent".

  • Anonymous
    April 26, 2004
    No, I meant idempotent :) From Dictionary.Com:
    http://dictionary.reference.com/search?q=idempotent

    Having said that, a better word would have been "interchangeable". My usage follows from the jargon usage (a header file is considered idempotent if it can be safely included twice - thus w.r.t. order of inclusion it is interchangeable), but that's a stretch.
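    A header is idempotent in that sense when the usual include guard makes a second inclusion a no-op. A tiny hypothetical example (the names are made up):

        /* myheader.h - made idempotent by the usual include guard, so
           including it twice in the same translation unit is harmless. */
        #ifndef MYHEADER_H
        #define MYHEADER_H

        void do_something(void);   /* declarations go here */

        #endif /* MYHEADER_H */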

  • Anonymous
    May 03, 2004
    The comment has been removed

  • Anonymous
    May 03, 2004
    An interesting point Quentin.
    I think that in general, most metrics of this type ignore quality differences between different members of a given development team, which is, of course, silly.

    Which gets back to the question of "Is the metric valid?" In the case you're describing above, it's not clear that it is. In other circumstances, it may be (the testers might be testing a completely new product developed by developers whose defect rate is constant). It depends on the circumstances.

  • Anonymous
    May 05, 2004
    Chiming in way late here, following a link from Joel's page of last week; not sure if you've discussed this further elsewhere, but:
    Valorie, isn't the customer of the tester the customer of the business? So rather than bug counts or dev reviews, shouldn't the tester's rewards be based on bug reports filed by clients?

    This'd have the positive impact that the quality of the team is measured, rather than the quality of the dev or of that individual tester - both of which are subjective measures anyway.

    It would also make your assessment of severity easier: a spelling mistake mentioned by your 10,000-seat client may have more business impact than when the 20-seat client screams at you about the false positive on file save.

  • Anonymous
    May 07, 2004
    The comment has been removed

  • Anonymous
    June 19, 2004
    The comment has been removed

  • Anonymous
    June 19, 2004
    Correction, sorry, it should read: "http://www.mozilla.org should be http://www.mozilla.org/", i.e. he only wanted the slash added.

  • Anonymous
    July 25, 2004
    Great post, Larry. Thanks for picking up this "burning issue" that test leads/managers worldwide face today.

    My thoughts are:
    Though it has been fiercely fought, extensively debated, and beaten to death that we should not measure a tester's performance only on the basis of the number of bugs they log, I think that if we were to put a metric around testing to make it SMART, the number of bugs may be the closest one can come. Among the several work products testers produce, bugs are the single most important and have a direct impact on the quality of the final product that goes out to the customer.

    So are there any efforts to classify the bugs logged in a standardized and normalized way, so that we can compare two bugs? Once we have a decent framework for classifying bugs that takes care of all the variables that make one bug different from another, it would be fair to compare the performance of two testers.

    Having said all this, I strongly believe that there should be a way to quantify and measure the output of a tester.



    Shrini

  • Anonymous
    June 20, 2005
    Joel just sent me an email letting me know that the first edition of "Best Software Writing, I" has now...

  • Anonymous
    June 25, 2005
    John Gruber makes an appearance in the soon-to-be-released book The Best Software Writing I, which was put together by Joel Spolsky.

  • Anonymous
    July 25, 2005

    I have just finished reading a book compiled, edited and introduced by Joel Spolsky and released by Apress. "The Best Software Writing I" is a collection of some of the best articles on software development, and management written on various w

  • Anonymous
    March 15, 2006
    Well, this year I didn't miss the anniversary of my first blog post.
    I still can't quite believe it's...

  • Anonymous
    November 02, 2006
    Reading Antonio Quiros again, explaining his vision of project management, in the

  • Anonymous
    April 10, 2007
    PingBack from http://www.jtheo.it/2007/04/11/a-proposito-di-software-joel-spolsky/

  • Anonymous
    April 21, 2007
    PingBack from http://www.livejournal.com/users/jace/385188.html

  • Anonymous
    April 21, 2007
    PingBack from http://www.ljseek.com/introduction-to-best-software-writing-i_61114144.html

  • Anonymous
    June 05, 2007
    PingBack from http://blog.rushchak.com/index.php/2007/06/06/joel-spolsky-best-soft-book/

  • Anonymous
    April 17, 2008
    PingBack from http://www.testingperspective.com/blog/?p=10

  • Anonymous
    June 09, 2008
    PingBack from http://cotoha.info/thoughts/good_reading_the_best_software_writing_i/
