On writing deadlock-free and composable software

By Adrian Smith4 Feb 20181400 words7 mins to read

The trouble is, deadlock-free code and composable code are fundamentally conflicting objectives. You can only know if you have a deadlock by analyzing the entire codebase.

If you create the function updateArtEvent(..) that acquires a lock on one particular resource, and another function updateArtVenue(..) that acquires a lock on a different resource, then write one piece of code to process a user action within a transaction that calls the first then the second, and another piece of code that calls them in the opposite order, you've got a deadlock. If you just look at the implementation of the two functions in isolation, you won't be able to determine that there's a deadlock.

This isn't an easy situation, and I must admit I don't really know what the solution is.

I think the best blogs are those which tell you stuff. However alas I tend to blog about stuff which is perplexing me at the time, i.e. stuff I don't know either. If you know the solution, feel free to leave the answer in the comments :-)

Here are some potential solutions, but they both come with their own problems.

Solution: Write only: Consistent lock acquisition order

At one company, we had a number of Lucene indexes. To write to a Lucene index, you have to acquire a write-lock for the whole index. Lucene supports transactions. So for a user operation involving e.g. art events and art venues, I'd open a database transaction, open a Lucene transaction for the event index, and a Lucene transaction for the venue index. Write the necessary changes there. If all goes well, commit everything, otherwise roll back everything.

Functions would just acquire the locks as they needed them. Meaning that it often happened that a certain user action firstly acquired the event index lock, later the venue index lock, and other user actions would acquire the locks in the opposite order.

Everything was fine in dev, but the live system would get "stuck" in this deadlock situation fairly regularly. Lucene has no deadlock detection in the way that databases do, so the system would simply sit there waiting forever. The application server was up, and other requests not involving those indexes could be processed, so monitoring and restart processes didn't trigger. But those Lucene indexes were locked; and no more art events or venues could be changed until I restarted the system manually.

I noticed that, in this particular system, the Lucene indexes were written to but never queried in the same transaction. This was a piece of luck.

We had a home-written small Transaction object which simply acted as a facade against the various transactional resources such as databases, various Lucene indexes, etc. (This was intended as a "poor man's" distributed transaction; the commit() method simply did a for loop over the underlying resources and called commit() on them: the class was very small and was sufficient for our needs.)

I altered our Transaction object, to change the way Lucene index writes worked.

Deadlocks happen if resources are acquired in different orders. Deadlock can't happen if you acquire the locks in a consistent order. If multiple sessions first want lock A and then later lock B, the later sessions will just wait for the first session. A bit of waiting, but no deadlock.

Altering our Transaction object to do this processing meant that the individual logic classes could still "write" to whichever indexes they wanted at whatever time they wanted. Individual functions did not need to know about each other, and composing code could run the functions in any order. The code was still composable.

But this approach only works if you don't want to read the information that you've just written to such an index within the session doing the writing. So this isn't a solution for every situation.

But while altering our 20 line Transaction class to add a few lines to maintain certain writes in a List was easy enough if you wrote the Transaction class yourself, it might not be so easy, or indeed possible at all, if you're using a huge framework :-)

Solution: Lock escalation

Normally fine-grained locking is considered preferable to course-grained locking. For example MySQL InnoDB introduced row-level locking, vs. the previous MyISAM table-level locking. The finer-grained row-level locking meant that if e.g. two user accounts were changed, if they were in different rows, they could changed in parallel, rather than one change waiting for the other to finish via the table-level lock.

The definition of deadlock requires there to be multiple resources in play, and for them to be acquired in a different order. So the irony is that course-grained do have an advantage, in that the fewer locks there are, the lower chance of deadlock. If there is only one resource, there is only one order in which to acquire the resource, therefore there can be no deadlock.

So one option would be to acquire a sort of "global lock" before doing any write operation to any of these resources which have the capability to have a deadlock.

This reduces concurrency to a single write operation across the whole system. That is far from ideal. However, having a system which crashes due to deadlocks is even worse. It's better to have a slow working system, than a system that doesn't work.

The "global lock" has to be in a central place which every part of the system can access. A lock within your software e.g. JVM is not sufficient, as there may be multiple instances of your program running, if you decide to scale it.

A "named lock" in a database is ideal for this purpose. Databases allow you to not only lock rows with e.g. "select for update", but also lock anything else you can name. PostgreSQL calls these Advisory Locks. The temporal scope of the lock should be the scope of changing things, which is the transaction.

You want to the "lock name" to be unique within your application (again, so you don't have to worry about coordinating lock names between different independent functions). For example the full Java class name getClass().getName(). (PostgreSQL only allows the user of numbers, not names, so e.g. getClass().getName().hashCode() will give you a sufficiently unique number for the lock.)

P.S. The situation gets even worse, in that it's not really possible to determine, in advance, if your system is going to suffer from deadlocks or not. Tests on individual functions will not find deadlocks, as the individual functions do not cause deadlocks until they're composed. Even a test on a single process, composing multiple individual functions, will not cause a deadlock, as there is no resource contention. Even testing a single process multiple times in parallel will not find a deadlock, as a single composition will acquire the locks in a consistent order, so again, no deadlock.

This article was written by Adrian Smith on 4 Feb 2018

Follow me: Facebook | Twitter | Email

More on: Software Architecture | Coding