Much has been written on the failure of counting lines of code (LOC) as a metric for productivity. I won’t expand on this further, except to note that what I consider the most compelling argument against LOC is that it penalises a superior design that can be expressed more succinctly. But if we don’t use LOC, what do we use?

I’m of the opinion that the best low level metric for the development of new features is the rate at which unit tests are being added or modified. Unit tests form an executable specification of the behaviour of software which is built up over time. Each test specifies a behaviour of the software independent of the amount of code we actually end up using to provide the behaviour. We can therefore correlate development progress with the number of unit tests we have.
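
As a small illustration of tests acting as an executable specification, here is a sketch in Python using the standard `unittest` module. The `ShoppingCart` class, its methods and the `run_spec` helper are hypothetical names invented purely for this example. The point is that each test names one behaviour, and the number of tests would stay the same whether the implementation took five lines or fifty.

```python
import unittest


class ShoppingCart:
    """Hypothetical class, invented only to illustrate the idea."""

    def __init__(self):
        self._items = {}  # name -> (price, quantity)

    def add(self, name, price, quantity=1):
        if quantity <= 0:
            raise ValueError("quantity must be positive")
        _, current_qty = self._items.get(name, (price, 0))
        self._items[name] = (price, current_qty + quantity)

    def total(self):
        return sum(price * qty for price, qty in self._items.values())


class ShoppingCartSpec(unittest.TestCase):
    """Each test specifies exactly one behaviour of the cart."""

    def test_new_cart_has_zero_total(self):
        self.assertEqual(ShoppingCart().total(), 0)

    def test_total_sums_price_times_quantity(self):
        cart = ShoppingCart()
        cart.add("pen", 2, quantity=3)
        cart.add("book", 10)
        self.assertEqual(cart.total(), 16)

    def test_adding_same_item_accumulates_quantity(self):
        cart = ShoppingCart()
        cart.add("pen", 2)
        cart.add("pen", 2, quantity=2)
        self.assertEqual(cart.total(), 6)

    def test_non_positive_quantity_is_rejected(self):
        with self.assertRaises(ValueError):
            ShoppingCart().add("pen", 2, quantity=0)


def run_spec():
    """Run the specification and return the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(ShoppingCartSpec)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

Calling `run_spec()` runs four tests, one per specified behaviour, and that count is what the metric would record for this piece of work.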

That sounds great: a metric that relates directly to what we want. Unfortunately it’s not really that simple. Tracking the number of unit tests produced for new features may track the development of those features, but not all development activity is adding features. Maintenance is a critical concern, and when modifying a system we may not end up modifying many tests at all. In some cases we may even remove them as they become irrelevant to the new shape of the application. Implying that a developer has negative productivity in this scenario is clearly invalid, yet this is what the metric would have us believe.
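
To make the "negative productivity" problem concrete, here is a minimal sketch of how a naive velocity figure might be computed from periodic test counts. The `Snapshot` record, the `net_velocity` function and every number in `history` are made up for illustration; they are not measurements from a real project.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    """Test count at the end of a period (hypothetical record shape)."""
    period: str
    test_count: int


def net_velocity(snapshots):
    """Return (period, net change in test count) for each period after the first."""
    return [
        (later.period, later.test_count - earlier.test_count)
        for earlier, later in zip(snapshots, snapshots[1:])
    ]


# Invented figures, for illustration only.
history = [
    Snapshot("week 1", 120),
    Snapshot("week 2", 141),  # new feature work
    Snapshot("week 3", 143),  # mostly maintenance
    Snapshot("week 4", 131),  # refactoring made some tests irrelevant
]

velocity = net_velocity(history)
```

With these invented figures the fourth week comes out at -12, even though the refactoring that removed those tests may well have been the most valuable work of the month.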

Defect resolution is similarly problematic. You may write some tests to prevent the recurrence of an issue, but the number of tests added per unit time will be significantly below that seen when developing new code.

Additionally, there are some types of code that do not readily lend themselves to unit testing, such as UI or integration code. Developers are still productive when writing this code, yet would not be writing tests.

To return to my contention that unit tests are “the best low level metric for the development of new features”, I make this claim with an assumption that the metric is used primarily for tracking new development and that it is used in context with other information. In particular it is important to understand that the raw unit test numbers are by themselves potentially misleading. This is due to differing styles of writing tests and differing levels of complexity.

I personally favour many small tests that each validate a single criterion, with multiple tests for each element of an action that I wish to validate. On raw numbers this would indicate hugely increased productivity over someone who writes larger, heavier tests. I would argue that such larger tests are less effective, but not that they necessarily represent many times lower productivity. Scaling to account for these differences may be necessary, and the resultant numbers will need intelligent interpretation. Using the metric on its own would also be open to abuse by less scrupulous developers, who could write many unnecessary tests.
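
The difference in style, and one possible way of scaling for it, can be sketched as follows. Counting assertions alongside test functions is my own assumption about what such a scaling might look like, not an established method, and it is just as open to abuse as the raw count. The `Account` tests are hypothetical and are only parsed with Python’s standard `ast` module, never executed.

```python
import ast

# Style 1: many small tests, one criterion each (hypothetical example).
SMALL_TESTS = '''
def test_withdraw_reduces_balance():
    account = Account(100)
    account.withdraw(30)
    assert account.balance == 70

def test_withdraw_records_transaction():
    account = Account(100)
    account.withdraw(30)
    assert account.history == [-30]

def test_withdraw_returns_amount():
    account = Account(100)
    assert account.withdraw(30) == 30
'''

# Style 2: one larger test covering the same three criteria.
LARGE_TEST = '''
def test_withdraw():
    account = Account(100)
    assert account.withdraw(30) == 30
    assert account.balance == 70
    assert account.history == [-30]
'''


def count_tests_and_asserts(source):
    """Return (number of test functions, number of assert statements in them)."""
    tree = ast.parse(source)
    tests = [
        node for node in ast.walk(tree)
        if isinstance(node, ast.FunctionDef) and node.name.startswith("test_")
    ]
    asserts = sum(
        isinstance(node, ast.Assert)
        for test in tests
        for node in ast.walk(test)
    )
    return len(tests), asserts
```

Here the raw test count differs by a factor of three (three tests against one), while the number of criteria checked is identical (three assertions in each style). That gap is why the raw numbers need interpretation rather than direct comparison.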

What this ultimately means is that unit test velocity can be an effective guide, provided that it is correctly interpreted. As such it is a good local measure to help teams track their progress. Individual developers may consider their own velocity to identify potential problems. This provides team members and direct project management with useful information to track the state of their project and take early corrective action. However, given the limitations above, it is not a suitable measure against which ultimate performance can be judged.