What makes a test suite good?
Many people enjoy splitting testing up into a myriad of test types: Acceptance Tests, Functional Tests, Integration Tests, Performance Tests, Technical Tests, Unit Tests. I have myself been guilty of such terminology as “embedded integration tests” and “requirement tests”. However, what unites the tests is more important than what divides them. The divisions are fuzzy, and they should be.
All tests have but two purposes: To tell you if you’ve completed a new requirement, and to ensure that you haven’t broken something that worked. There are three fundamental properties of a good test suite: Coverage, Robustness and Speed.
The properties of a good test suite
Coverage: I use the term coverage with some apprehension, as it has an existing and problematic definition. Test coverage to most people means line and/or branch coverage, that is: what percentage of your code is executed when you run the test suite? This metric can be misleading, and it is probably not the goal you want. Instead, I propose a different definition of coverage: coverage is the percentage of the bugs you introduce into your code that are detected by your test suite. Stated in a different way: the more tests that keep passing when a change has in fact introduced a defect (false negatives), the lower your coverage. Improving your line or branch coverage may or may not improve the chance that your test suite catches a defect, so spending time on it may or may not make you safer. Chances are that if your line coverage is over 70%, there are better things to spend your time on than improving it further. And those things may in fact improve your line coverage as a result.

Robustness: The problem with the easy focus on line and branch coverage that tools give us is that it tends to hurt other characteristics of a good test suite. If you add a test to make sure that all the internals of your system are exercised, chances are good that this test will break because of a non-destructive change. I’ve found that teams with high test coverage always seem to run into the problem of the Fragile Test. The fragility of a test suite can be described as the number of changes that break a test even though they did not introduce a defect. Stated in a different way: the more tests that fail when a change did not introduce a defect (false positives), the lower your robustness. You make tests more robust by testing the outcome and not the mechanism; the sketch below shows the difference. Incidentally, I have found that mock objects seem to make my tests more fragile.

Speed: So a test breaks. What happens now? Presumably, you try to isolate the behavior that breaks, maybe by running a smaller suite of tests. Then you make some changes to the code (if there was an actual bug) or the test (if it was a false positive), and you run the failing test again to check whether the problem is fixed. Repeat until done. Then you run the whole suite again to check that you didn’t introduce some other problem. There are a few critical thresholds for tests when it comes to execution time. More than about 2 seconds, and I check my email. More than 10 seconds, and I try to respond to email. More than 20 seconds, and I start working on two tasks in parallel. More than 1 minute, and I go for a cup of coffee. Each of these secondary effects is ten times as time consuming as the test run itself. This means: let’s say I introduce a bug that takes 5 attempts to fix, and in fixing it I introduce another bug that I detect when I run the full suite and that takes another 2 attempts to fix. So I run the full suite three times: first, after fixing the first bug, and after fixing the second bug. And I run a single test about 7 times. If running the full suite takes a minute (10 minutes for coffee each time) and running a single test takes 10 seconds (100 seconds to answer an email each time), this will have taken me 3 × 10 minutes + 7 × 100 seconds = 2500 seconds, or about three quarters of an hour. If running a single test takes 1 second (no interruption) and running the suite takes 10 seconds (I’ll watch it for that long), the test time will be less than a minute. But I didn’t get to write those three emails.
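To make “testing the outcome and not the mechanism” concrete, here is a minimal sketch in Java (JUnit 4; the ShoppingCart and Item classes are invented for this illustration):

    import static org.junit.Assert.assertEquals;
    import org.junit.Test;
    import java.util.ArrayList;
    import java.util.List;

    public class ShoppingCartTest {

        // Minimal production code, just enough to run the sketch.
        static class Item {
            final String name;
            final int price;
            Item(String name, int price) { this.name = name; this.price = price; }
        }

        static class ShoppingCart {
            private final List<Item> items = new ArrayList<Item>();
            void add(Item item) { items.add(item); }
            int total() {
                int sum = 0;
                for (Item item : items) sum += item.price;
                return sum;
            }
        }

        // Outcome-based: asserts what the cart produces, not how.
        // The cart can later cache totals or delegate to a price service
        // without breaking this test.
        @Test
        public void totalIsTheSumOfItemPrices() {
            ShoppingCart cart = new ShoppingCart();
            cart.add(new Item("book", 300));
            cart.add(new Item("pen", 20));
            assertEquals(320, cart.total());
        }

        // A mechanism-based alternative would use a mock to verify, say,
        // that total() asks a price service exactly once per item. That
        // test fails as soon as the implementation batches the calls,
        // even though no defect was introduced.
    }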
The universal test
The suggested difference between an integration test and a unit test is the time it takes to run the test. The difference in running time is caused by the fact that an integration test has more setup and more realistic infrastructure. However, we usually want to test the same scenarios. I would like to submit to you, gentle reader, that it is not only possible, but quite feasible to write a test that can be used both as a “unit test”, running with a fast, in-memory implementation, and as an “integration test”, using the target infrastructure. This achieves the goals of high coverage, good robustness, and the right speed, by focusing on what the system is supposed to do, and using the infrastructure setup as a point of variation.
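As an illustration, here is a minimal sketch in Java of what such a test could look like (JUnit 4; PersonRepository and the other names are invented for this example). The scenario is written once against an interface, and subclasses supply the implementation, which becomes the point of variation:

    import static org.junit.Assert.assertEquals;
    import org.junit.Test;
    import java.util.HashMap;
    import java.util.Map;

    public abstract class PersonRepositoryTest {

        interface PersonRepository {
            void save(Person person);
            Person findByName(String name);
        }

        static class Person {
            private final String name;
            Person(String name) { this.name = name; }
            String getName() { return name; }
        }

        // The point of variation: subclasses decide what to run against.
        protected abstract PersonRepository createRepository();

        // The scenario itself is written only once.
        @Test
        public void savedPersonCanBeFoundByName() {
            PersonRepository repository = createRepository();
            repository.save(new Person("Johannes"));
            assertEquals("Johannes", repository.findByName("Johannes").getName());
        }

        // Fast "unit test" variant: runs in memory on every change.
        public static class InMemoryTest extends PersonRepositoryTest {
            protected PersonRepository createRepository() {
                return new InMemoryPersonRepository();
            }
        }

        static class InMemoryPersonRepository implements PersonRepository {
            private final Map<String, Person> people = new HashMap<String, Person>();
            public void save(Person person) { people.put(person.getName(), person); }
            public Person findByName(String name) { return people.get(name); }
        }

        // The slow "integration test" variant would subclass the same way,
        // returning e.g. a JDBC-backed repository wired to a test database.
    }

The fast variant can run on every change, while the infrastructure-backed variant runs before check-in or on the build server.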
Comments:
Johannes Brodwall - May 21, 2007
Hi, Anne Marie
Thanks for the comment. I agree that in principle, TDD will give us high line coverage. In practice, I have not seen a code base that gets more than 90% or so (including my own). I don’t know why this always happens. In Java, checked exceptions seem to be part of the culprit, however.
However, getting 100% line coverage doesn’t mean that you get perfect coverage in the sense of reducing the chance of a bug sneaking through.
About Universal Tests: I think they can be taken a lot further than people are doing today, and I think it will be beneficial. I am looking forward to working more on this topic. Hope to get your comments. Did you see the test cases I linked in?
Anne Marie - May 19, 2007
Since this is a very interesting discussion, I just have to share my opinions on some of the subjects.
Test coverage: If we develop test-driven (and even better: behavior-driven), line/branch coverage should (in theory) never be a problem, since you will only write code that you have already written a test for. In theory that should give you 100% test coverage PLUS your tests are to the point. Why can’t we achieve this? Is it because we aren’t doing real TDD/BDD? Is it actually possible to develop the whole system test-driven?
Test coverage II: Don’t introduce special test frameworks that test getters and setters just because you can’t write useful tests for them by hand and leaving them untested lowers your test coverage (in the “old” meaning of the word). If your getters/setters aren’t used indirectly by your other tests, and you don’t want to test them explicitly, then why do you have them there at all? In addition to being absolutely pointless, these tests have a tendency to get in your way when you are actually developing real functionality in your system.
Mock objects: A good rule of thumb is to NOT use mock objects if what you really need is to stub out some interface. Implement a stub instead. This will often be both more robust and more reusable. Use mock objects when you actually want to test that the class you are testing calls a method on another class.
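A minimal sketch of the difference (JUnit 4 and Mockito-style mocks; the MailServer and Subscription classes are invented for illustration):

    import static org.junit.Assert.assertTrue;
    import static org.mockito.Mockito.mock;
    import static org.mockito.Mockito.verify;
    import org.junit.Test;

    public class SubscriptionTest {

        interface MailServer {
            void send(String to, String subject);
        }

        static class Subscription {
            private final MailServer mailServer;
            private boolean expired;
            Subscription(MailServer mailServer) { this.mailServer = mailServer; }
            void expire() {
                expired = true;
                mailServer.send("member@example.com", "Your subscription has expired");
            }
            boolean isExpired() { return expired; }
        }

        // Stub: the test only needs *a* MailServer so the code can run.
        // It asserts state, not interactions, and is reusable across tests.
        static class StubMailServer implements MailServer {
            public void send(String to, String subject) { /* ignore */ }
        }

        @Test
        public void expiringMarksTheSubscriptionExpired() {
            Subscription subscription = new Subscription(new StubMailServer());
            subscription.expire();
            assertTrue(subscription.isExpired());
        }

        // Mock: here the call to the other class *is* what we want to test.
        @Test
        public void expiringNotifiesTheMember() {
            MailServer mailServer = mock(MailServer.class);
            new Subscription(mailServer).expire();
            verify(mailServer).send("member@example.com", "Your subscription has expired");
        }
    }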
Universal test: This seems like a very good idea, but the (few) times I have tried to achieve this, for some reason or another we couldn’t do it. It’s a cost/value question, I guess.