Of the opinions I hold about software development, probably the most counter-intuitive is that software that fails fast is more robust. After all, if software fails all the time it can hardly be considered robust. However, there are a number of reasons I hold to the practice of failing fast and consider it vital to building quality software.

The first thing to note is that there is a distinct difference between failing and breaking. Failing is being unable to complete an action because required conditions are not met. Failing is normal and expected. Your application will have to deal with failures caused by users, other systems and even itself. Breaking is different. An application that is broken may not appear incorrect but will produce incorrect behaviour. This incorrect behaviour may not be noticed for some time, at which point it may be extremely expensive or impossible to fix.

Failing fast is therefore about detecting errors as early as possible and dealing with them. Generally this means preventing an action from being performed if its pre-conditions are not met. Validation on forms is an excellent example of this. Attempts to submit the form when it is in an invalid state fail immediately and the user receives a notification. Also important in this example is that the software itself does not crash and the user is able to rectify issues and continue.
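As a minimal sketch of the form example, here is how that might look in Python. The `SignupForm` class, its fields and the two rules are hypothetical, chosen only for illustration. The point is that `submit` checks its pre-conditions first and, if they are not met, fails immediately with a message the user can act on.

```python
from dataclasses import dataclass


@dataclass
class SignupForm:
    # Hypothetical form: both fields hold raw text as typed by the user.
    email: str
    age: str


def validate(form: SignupForm) -> list[str]:
    """Return a list of problems; an empty list means the form may be submitted."""
    problems = []
    if "@" not in form.email:
        problems.append("Email address must contain '@'.")
    if not form.age.isdigit():
        problems.append("Age must be a whole number.")
    return problems


def submit(form: SignupForm) -> str:
    problems = validate(form)
    if problems:
        # Fail immediately: nothing is saved, and the user can correct and retry.
        return "Not submitted: " + " ".join(problems)
    return "Submitted."
```

Nothing crashes here. An invalid form is simply refused, and the same form can be submitted again once it has been corrected.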

In most cases software will be able to deal with a failure case without terminating. In an interactive environment this can be by notifying the user and giving them the ability to correct the problem, as in the example above. In service applications this can be by cleanly refusing to process an invalid request without affecting the ability of the service to continue processing valid requests. The invalid request may then be retried (to potentially rectify timing-related failures), logged, or otherwise captured for later rectification. Allowing an invalid request to proceed could corrupt the service state in ways that are not immediately apparent. For instance, a request that incorrectly sets a negative tax rate could, if not detected and caused to fail, have significant financial consequences, up to and including jeopardising the viability of the business.
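To make the tax rate case concrete, here is a rough sketch in Python. The names (`InvalidRequest`, `set_tax_rate`, `process_all`) and the assumption that a valid rate lies between 0 and 1 are mine, purely for illustration. The invalid request is rejected before any state is touched, and the service carries on with the requests that follow.

```python
class InvalidRequest(Exception):
    """Raised when a request fails its pre-conditions (hypothetical name)."""


def set_tax_rate(state: dict, rate: float) -> None:
    # Check the pre-condition before touching any state.
    # Assumption for this sketch: a valid rate is a fraction between 0 and 1.
    if not 0.0 <= rate <= 1.0:
        raise InvalidRequest(f"tax rate must be between 0 and 1, got {rate}")
    state["tax_rate"] = rate


def process_all(state: dict, rates: list[float]) -> list[str]:
    """Process each request; rejected ones are recorded, the rest still run."""
    rejected = []
    for rate in rates:
        try:
            set_tax_rate(state, rate)
        except InvalidRequest as error:
            # Capture the failure for later rectification instead of hiding it.
            rejected.append(str(error))
    return rejected
```

Given the requests `[0.2, -0.5, 0.15]`, the negative rate is refused and recorded, while the two valid requests are applied as normal.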

Although the ideal case is that software recovers from failures and continues, this is not always possible. There are times when the software gets to a state from which there is no legitimate recovery path. At this point clean termination of the software is the appropriate path. Once the software cannot be recovered, further action will only result in incorrect behaviour. Users hate it when their software terminates. But they hate it far more when it lies to them about how it is behaving.

In my experience there are generally two sub-optimal practices used for dealing with failure conditions. The first is to just ignore the possibility of failure and proceed along the so-called "happy case". This is generally the most popular because it's easy, right up until the point where your assumptions fail and everything falls in a flaming heap. The second is to make your software far too desperate to stay alive. I've seen cases where there are multiple points built into the software where all exceptions are caught and ignored. Everything may be broken, but the users won't be told. They'll just have to find out when they discover that the software has misbehaved and all their data is lost or corrupt. These practices are generally combined to produce extremely fragile software that looks solid if you never examine its results.
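The catch-and-ignore habit is easiest to see side by side with the alternative. This is a hypothetical Python illustration rather than code from any real system: the first function swallows every exception and quietly returns a plausible-looking value, while the second handles only the failure it anticipates and reports it.

```python
def parse_quantity_swallowing(text: str) -> int:
    # Anti-pattern: every exception is caught and ignored, so bad input
    # silently becomes 0 and the caller never learns anything went wrong.
    try:
        return int(text)
    except Exception:
        return 0


def parse_quantity_failing_fast(text: str) -> int:
    # Only the anticipated failure is handled, and it is reported, not hidden.
    try:
        quantity = int(text)
    except ValueError:
        raise ValueError(f"quantity is not a whole number: {text!r}") from None
    if quantity < 0:
        raise ValueError(f"quantity must not be negative: {quantity}")
    return quantity
```

With the first version an order for "ten" items becomes an order for none, and nobody is told. With the second the caller finds out straight away and can ask for a correction.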

In my opinion there are generally three types of appropriate response to a failure:

  • If the error is in user input and is correctable, notify the user and seek correct input. This may include cases where the input is invalid due to the current state of the system or where the input is invalid in all contexts.
  • Where a request to a service is invalid, reject the request before processing it. This ensures the service can continue to process valid requests. Appropriate logging of the failure should be performed. This logging may be nothing more than an error result (for instance from a SQL Server when invalid SQL is provided) or might involve explicit notification and logging of the request (such as in a custom business service). The nature of the response is context dependent but the error should be made known in an explicit fashion.
  • Where software determines that it cannot continue it should terminate gracefully and provide notification. This is not the preferred option but it is important that this decision be made when required. Explicit failure at this point prevents incorrect behaviour. It also allows the error notification to be more explicit as to the cause of the termination which aids in defect resolution.
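The third response, graceful termination, might be sketched like this in Python. The ledger check and the `UnrecoverableState` exception are hypothetical stand-ins for whatever condition your software cannot legitimately recover from. What matters is that the program states exactly why it is stopping and then stops, rather than carrying on with state it cannot trust.

```python
import sys


class UnrecoverableState(Exception):
    """Hypothetical marker for states with no legitimate recovery path."""


def check_ledger(balance_total: int, entries_total: int) -> None:
    # Assumed invariant for this sketch: the two totals must always agree.
    if balance_total != entries_total:
        raise UnrecoverableState(
            f"ledger totals disagree: balances={balance_total}, entries={entries_total}"
        )


def main(balance_total: int, entries_total: int) -> int:
    try:
        check_ledger(balance_total, entries_total)
    except UnrecoverableState as error:
        # Say exactly why we are stopping, then report a non-zero exit code.
        print(f"fatal: {error}", file=sys.stderr)
        return 1
    print("ledger consistent, continuing")
    return 0
```

A real entry point would pass the result to `sys.exit`, for example `sys.exit(main(100, 90))`, so that whatever launched the program also sees that it failed.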

I am generally opposed to automatic resolution of errors. There are a number of reasons for this:

  • It disguises the fact that the defect exists, making detection and resolution more difficult.
  • In many cases automatic resolution will be a guess as to the correct result and this guess may be wrong. This could result in behaviour worse than the original error.
  • Errors often occur due to edge cases not anticipated in the main code. There is no reason to assume that they will be anticipated and resolvable in the resolution code.

In summary:

  • Errors detected earlier are easier to fix.
  • Detecting errors early prevents incorrect behaviour and data corruption.
  • Assuming everything is OK or ignoring errors produces bad outcomes.
  • Consider the possible failure cases and either prompt for correction, reject the request or terminate the software.