I spend a lot of my time getting annoyed by errors in other people’s software (e.g. Windows): errors which, when you see them, make you wonder how on earth they could have been overlooked. But recently I released a piece of software which contained a major bug (it was only a small mistake, but the consequences were big).
So I set about thinking: what sequences of actions lead, in my experience, to software which works? A lot of these are obvious, yet I’ve often enough found myself not following them, because of time or other pressure. And the result is stuff which doesn’t work.
(1) Unit test scripts: Make them easy to run. For one product I work on a lot, there are a whole bunch of test scripts, testing all sorts of classes. In fact there are over 21k lines of unit tests! This is a good thing. But sometimes the person running them has to compare the value printed by the program with the expected value (i.e. has to know the expected value, which is not easy 2 years after the program was written). And not all classes are tested. But there are still a good few which do good tests and print “ok” if the result is correct. This is good, but it’s so much work to run them all. The solution is to chain them all together, as is easy to do with JUnit, and create one simple command or click to test them all. If it’s simple and convenient and creates value, people will do it.
Also, having a framework into which to put tests – for example, having a convention that a class called “X” has a test class called “XTest”, and that methods like “operationY” on “X” have a method “testOperationY” in “XTest” – encourages people not to be scared to write tests. (But forcing people to write tests, e.g. one test per method, is a waste of time. Not every method needs a test.)
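To make the “one simple command” idea concrete, here is a minimal sketch in plain Java, without JUnit, so that it stays self-contained. It assumes the naming convention described above; `Adder`, `AdderTest` and `RunAllTests` are hypothetical names invented for illustration, and a real project would more likely use JUnit's own suite support.

```java
import java.lang.reflect.Method;
import java.util.ArrayList;
import java.util.List;

// Hypothetical class under test, standing in for "X".
class Adder {
    int add(int a, int b) { return a + b; }
}

// Test class following the "XTest" / "testOperationY" convention.
// Each test checks its own result, so nobody has to know the expected value.
class AdderTest {
    public void testAdd() {
        if (new Adder().add(2, 3) != 5) throw new AssertionError("2 + 3 should be 5");
    }
    public void testAddNegative() {
        if (new Adder().add(-2, 2) != 0) throw new AssertionError("-2 + 2 should be 0");
    }
}

final class RunAllTests {
    /** Runs every no-argument method whose name starts with "test"; returns the names of the failures. */
    static List<String> run(Class<?>... testClasses) {
        List<String> failures = new ArrayList<>();
        for (Class<?> testClass : testClasses) {
            for (Method m : testClass.getDeclaredMethods()) {
                if (!m.getName().startsWith("test") || m.getParameterCount() != 0) continue;
                String name = testClass.getSimpleName() + "." + m.getName();
                try {
                    var constructor = testClass.getDeclaredConstructor();
                    constructor.setAccessible(true);
                    m.setAccessible(true);
                    m.invoke(constructor.newInstance());
                    System.out.println("ok      " + name);
                } catch (Throwable t) {
                    // Reflection wraps the test's own exception; unwrap it for the report.
                    Throwable cause = t.getCause() != null ? t.getCause() : t;
                    System.out.println("FAILED  " + name + ": " + cause);
                    failures.add(name);
                }
            }
        }
        return failures;
    }

    public static void main(String[] args) {
        // One command runs everything; add further XTest classes to this list.
        List<String> failures = run(AdderTest.class);
        if (!failures.isEmpty()) System.exit(1);
    }
}
```

The non-zero exit code on failure means a release script can refuse to continue when any test fails.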
(2) Know what the important features are. Most websites have a great many features. It’s impossible to test them all without restricting oneself to 6 month release cycles and 1 month test phases. But there are usually a bunch of features which would be show-stoppers if they didn’t work. Can a new user register? Can they upload a photo (for a photo website)? Can they send an SMS (for an SMS website)? Write these show-stopping features down. Before the release, go through and test them on the pre-production server. After the release, test them again on the live system. Writing them down helps one not to forget the ones one can’t be bothered to test.
It doesn’t matter if this list is long. Maybe there really are a ton of features which simply cannot not work. Then you’d better have tested them all.
(3) For unimportant operations, ignore failure. Recently I wrote a program which writes a ZIP file. As a small extra feature, if a file in the ZIP hasn’t changed since the last time the program ran, the timestamp of the file in the new ZIP file should be the same as in the old one. This isn’t a very important feature, but it’s there. Once, when it ran, there was a file I/O problem reading the old file, and the program aborted. But this isn’t an important enough feature to abort execution: the program should have continued, and just given all files in the new ZIP a new timestamp.
Consider this when writing all code: if this goes wrong, does it matter? If not, when something goes wrong (any Throwable), log the exception and continue. You’ll kick yourself if failure of one part doesn’t matter, yet it brings down the whole program.
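As a rough sketch of “log the exception and continue”, the unimportant step can be wrapped in a small helper. The names `BestEffort`, `Step` and `attempt` are made up for this example, and the failing ZIP read is simulated rather than real.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

final class BestEffort {
    private static final Logger LOG = Logger.getLogger(BestEffort.class.getName());

    /** An optional operation that may throw anything. */
    interface Step { void run() throws Throwable; }

    /** Runs an optional step; on any Throwable, logs it and returns false instead of propagating. */
    static boolean attempt(String description, Step step) {
        try {
            step.run();
            return true;
        } catch (Throwable t) {
            LOG.log(Level.WARNING, "Optional step failed, continuing: " + description, t);
            return false;
        }
    }

    public static void main(String[] args) {
        // Simulated I/O failure while reading the old ZIP (hypothetical step).
        boolean reused = attempt("reuse timestamps from old ZIP", () -> {
            throw new java.io.IOException("cannot read old.zip");
        });
        // The important work carries on either way.
        System.out.println("timestamps reused: " + reused + "; writing new ZIP regardless");
    }
}
```

The boolean result lets the caller fall back to the simple behaviour, here giving every file a new timestamp.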
(4) Restart everything. You’ve only changed one small piece of code, so why incur the cost of restarting all Apaches and all robots? Well, software’s strange, and any change, however localized, can break any functionality. Any programmer knows this to be true. If you don’t restart everything, how will you know everything still works? How will you test it?
(5) Look at the log files after releasing. Even if, after a restart of the live servers, everything seems fine, what are the users seeing? They’re testing different paths than you. If you log uncaught Exceptions, take a look at the log file before the release, and again after the restart after the release, and see if there are more errors. For example, SQL errors which weren’t there beforehand. This could alert you to a problem you’ve overlooked.
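The before-and-after comparison can be partly automated. The sketch below assumes that each logged error line contains the fully qualified exception class name, and that the two log excerpts cover comparable time spans; `LogErrorDiff` and its methods are hypothetical names, not part of any logging library.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LogErrorDiff {
    // Assumed log format: the exception's class name appears somewhere on the line,
    // e.g. "... java.sql.SQLException: Unknown column 'nmae'".
    private static final Pattern EXCEPTION_NAME =
            Pattern.compile("\\b((?:[a-z_][a-z0-9_]*\\.)+[A-Z]\\w*(?:Exception|Error))\\b");

    /** Counts how often each exception class name occurs in the given log lines. */
    static Map<String, Integer> countByType(List<String> logLines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : logLines) {
            Matcher m = EXCEPTION_NAME.matcher(line);
            if (m.find()) counts.merge(m.group(1), 1, Integer::sum);
        }
        return counts;
    }

    /** Returns the exception types seen more often after the release than before, with the increase. */
    static Map<String, Integer> increased(List<String> before, List<String> after) {
        Map<String, Integer> old = countByType(before);
        Map<String, Integer> result = new TreeMap<>();
        countByType(after).forEach((type, n) -> {
            int delta = n - old.getOrDefault(type, 0);
            if (delta > 0) result.put(type, delta);
        });
        return result;
    }
}
```

An exception type which appears in the result but was absent before the release, such as a new SQL error, is the first thing worth looking at.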
(6) Static checks are good. Programming languages such as Smalltalk and LISP popularized the notion that it’s cool to do everything, such as method lookup, at runtime. “It gives you more flexibility.” While this is certainly true, there are a lot of errors which you’ll then only find at runtime. (The same is true of SQL strings in program code: you will only know if you’ve misspelled a column name in the SQL when you run that particular piece of code.) This does not help to minimize your errors. I appreciate that putting code live which hasn’t even been run once is hardly a good idea, but I’ve seen it happen often enough.
Java and Hibernate are a good combination in this respect. If the Java program compiles then you know you’ve got all your variable names, function names, type-casts and Exception checking right. If the Hibernate program starts then you know the classes map to existing tables correctly. (But HQL queries, represented as strings within your program, are bad again, as you could make a spelling mistake, and it will only cause an exception when the particular code is executed.)
If one has to have SQL strings in the program, so that an error in one will only be detected once the code path is executed, maybe a prepare of the statement can be placed in a static initializer of the class? That way, at least, one will find out about the problem when the class is loaded (most likely at the start of the program’s execution).
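One way to try this idea out is sketched below, as a single explicit startup check rather than per-class static initializers, which avoids needing a database connection at class-load time. `SqlStartupCheck` and `Preparer` are hypothetical names. Whether a prepare really catches a misspelled column depends on the JDBC driver and database, since some drivers only validate the statement when it is executed, so treat this as an assumption to verify on your own setup.

```java
import java.sql.SQLException;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class SqlStartupCheck {
    /** Anything that can prepare a statement, e.g. a JDBC connection wrapped in a lambda. */
    interface Preparer { void prepare(String sql) throws SQLException; }

    /** Prepares every statement once and returns the ones that failed, mapped to the error message. */
    static Map<String, String> findBroken(List<String> statements, Preparer preparer) {
        Map<String, String> broken = new LinkedHashMap<>();
        for (String sql : statements) {
            try {
                preparer.prepare(sql);
            } catch (SQLException e) {
                // Collect rather than abort, so one run reports every broken statement.
                broken.put(sql, e.getMessage());
            }
        }
        return broken;
    }
}
```

With a real `java.sql.Connection` called `conn`, the preparer could be `sql -> conn.prepareStatement(sql).close()`, and the program could refuse to start if the returned map is not empty.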
(7) Be aggressive about cleaning old code. The more code there is, the more complex a system is to understand. If one has a new chat system, why is code which communicates with the old chat system still there? What if that code relies on classes which you’re about to change? What if it communicates with an old chat server and the results aren’t displayed anywhere any more, and then that old chat server goes away? The motto “clean code that works” does not involve having 100k lines of old junk around, which no one understands, which no one wants to take the time to learn (as it’s no longer relevant), and which will break randomly.
(8) Compile all of the program. It’s obvious that, before one releases a Java program, one does a “clean all” then a compile, just to check that one hasn’t changed a class and forgotten to recompile a client of it, which would result in a runtime error, e.g. a NoSuchMethodError. Why doesn’t one do the same in scripting languages? Admittedly scripting language compilers don’t check as much, but they still check some things (e.g. syntactic correctness). One “unit test” should be to go through every program file – every library, every CGI, every PHP page – and do a compile check on it.
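A compile check over every script file might look roughly like the sketch below. It relies on the syntax-check modes `php -l` and `perl -c`, and assumes both interpreters are on the PATH; the mapping from file extension to interpreter, and the name `SyntaxCheckAll`, are assumptions to adapt to your own code base.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

final class SyntaxCheckAll {
    /** Returns the syntax-check command for a file, or an empty list if the extension is unknown. */
    static List<String> commandFor(Path file) {
        String name = file.getFileName().toString();
        if (name.endsWith(".php")) return List.of("php", "-l", file.toString());
        if (name.endsWith(".pl") || name.endsWith(".pm")) return List.of("perl", "-c", file.toString());
        return List.of();
    }

    /** Checks every known file under root; returns the files whose check exited non-zero. */
    static List<Path> findBroken(Path root) throws IOException, InterruptedException {
        List<Path> files;
        try (Stream<Path> walk = Files.walk(root)) {
            files = walk.filter(Files::isRegularFile).collect(Collectors.toList());
        }
        List<Path> broken = new ArrayList<>();
        for (Path file : files) {
            List<String> command = commandFor(file);
            if (command.isEmpty()) continue; // not a script type we know how to check
            // inheritIO lets the interpreter's own error message reach the console.
            Process process = new ProcessBuilder(command).inheritIO().start();
            if (process.waitFor() != 0) broken.add(file);
        }
        return broken;
    }
}
```

Run as part of the one-command test suite, this catches a stray syntax error in a rarely visited page before a user does.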
(9) Release emails. It’s easier to delete an email which one’s not interested in, than to find out information from an email one didn’t receive and doesn’t know was ever sent. If a service like a website suddenly breaks, it’s important to fix it as soon as possible, and that probably means contacting the person who caused it to break. Before (not after) a release, write an email to all concerned – operations engineers, software developers, support agents, managers – and let them know that a change is going live.
(10) Be contactable. There’s nothing worse, for creating a perception of negativity, than when someone’s made an error, and you can’t contact them. Make sure mobiles are on loud. If you’re not reading email for some reason, make sure you’ve told everyone in advance, and set up an auto-responder. Want to be contacted less? Make fewer errors.
(11) Monitoring. For each robot and front-end program, one needs to define which conditions are acceptable and which are not, e.g. which logs must be written by a correctly running program, and which logs must not be written. Monitor them. This takes quite a lot of effort: a) setting up the monitoring software, b) defining what are acceptable and unacceptable conditions, c) tuning the software to actually produce logs which are usefully monitorable. But it’s necessary. If it’s not done, there will be errors written to the logs and nobody will see them.
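The “must be written” and “must not be written” rules can be expressed quite directly. The following is only a sketch of the rule-checking part, with `LogMonitor` and the example patterns invented for illustration; real monitoring software would also handle log rotation, time windows and alert delivery.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

final class LogMonitor {
    private final List<Pattern> required = new ArrayList<>();
    private final List<Pattern> forbidden = new ArrayList<>();

    /** A correctly running program must write at least one line matching this regex. */
    LogMonitor mustContain(String regex) { required.add(Pattern.compile(regex)); return this; }

    /** A correctly running program must never write a line matching this regex. */
    LogMonitor mustNotContain(String regex) { forbidden.add(Pattern.compile(regex)); return this; }

    /** Returns one alert per violation; an empty list means the log looks healthy. */
    List<String> check(List<String> logLines) {
        List<String> alerts = new ArrayList<>();
        for (Pattern p : required) {
            if (logLines.stream().noneMatch(line -> p.matcher(line).find()))
                alerts.add("missing expected log entry: " + p.pattern());
        }
        for (Pattern p : forbidden) {
            for (String line : logLines) {
                if (p.matcher(line).find()) alerts.add("unexpected log entry: " + line);
            }
        }
        return alerts;
    }
}
```

For example, `new LogMonitor().mustContain("heartbeat").mustNotContain("SQLException")` would raise an alert both when a robot stops writing its heartbeat line and when an SQL error shows up in its log.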