Handle errors gracefully by offering idempotent APIs

REST provides no way to group individual requests together into a transaction (in the way that multiple modifications to a database can be grouped into a database transaction, or that a distributed J2EE system can use a distributed transaction). Making all APIs idempotent goes some way toward compensating for this shortcoming.

Motivation

A client may well want to access more than one REST API to carry out an action on behalf of the user. For example, if the user buys a product, the client may wish to charge the user via a payment API and then create the access rights to the product, both via REST APIs to different services.

Clearly it is imperative that either both of these actions succeed or neither does. If only the payment API call succeeded, the user would have paid for nothing; if only the access rights API call succeeded, the user would get the product for free.

Further, it is also imperative that, if the actions succeed, they succeed only once. If the payment occurred twice, the user would have paid double; if the access rights were created twice, the user would have access to one product for free.

Whichever order the actions are performed in, there is the possibility that the first will succeed and the second will fail.

In REST APIs, where there is no facility for grouping different calls into a transaction, there is no way to roll back the first call if the second call fails. Therefore, the second call must be made to succeed.

However, in order to make the second call succeed (and succeed only once), it is not clear whether it must be retried:

If the request failed, it must be retried.
If the request succeeded, but the delivery of its response failed, the call must not be repeated.

In the case of a network failure, the client may have received an error from the second service like “connection reset by peer”. With such an error, it is not possible to determine whether the request failed or the response failed. So whether the client must or must not retry the call depends on information that the client cannot know.
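The dilemma can be illustrated with a small simulation. This is only a sketch: the service class, its insert method and the drop_request and drop_response switches are hypothetical names invented here to imitate the two kinds of network failure against a traditional, non-idempotent “insert” call.

```python
class LostResponse(Exception):
    """Stands in for an error such as "connection reset by peer"."""


class NaiveAccessRightsService:
    """Hypothetical non-idempotent service: every call inserts a new row."""

    def __init__(self):
        self.rows = []

    def insert(self, user, product, drop_request=False, drop_response=False):
        if drop_request:
            # The request never reached the service; nothing is stored.
            raise LostResponse("connection reset by peer")
        self.rows.append((user, product))
        if drop_response:
            # The row was stored, but the client never hears about it.
            raise LostResponse("connection reset by peer")
        return len(self.rows)


def insert_with_blind_retry(service, user, product, **failure):
    """Retry once on a network error, without knowing what happened."""
    try:
        return service.insert(user, product, **failure)
    except LostResponse:
        return service.insert(user, product)


request_lost = NaiveAccessRightsService()
insert_with_blind_retry(request_lost, "alice", "book", drop_request=True)

response_lost = NaiveAccessRightsService()
insert_with_blind_retry(response_lost, "alice", "book", drop_response=True)

print(len(request_lost.rows))   # 1: the retry was needed
print(len(response_lost.rows))  # 2: the retry created a duplicate
```

The client sees the same exception in both runs, yet retrying is correct in the first and harmful in the second.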

Solution

The solution comes from realizing that such a traditional “insert access rights” call could actually be more verbosely, but more accurately, named thus:

I want the access rights for this user to this product to exist afterwards, and I know that there are no access rights beforehand, therefore I request that brand new access rights be added to the database.

In fact the client is asserting too much: the client need not care whether the access rights exist beforehand; it cares only that the access rights exist afterwards. The scope of the call can be reduced to the second half, simply “after the call, there must be access rights”.

For the server to be able to perform such a call (to create access rights only if they don't already exist), the server needs some way to identify whether the access rights already exist. Thus, rather than the classic “create and return server-generated ID”, the client generates an identifier for the access rights and includes it in the request. The call then becomes:

Create or update the access rights (with the client-supplied identifier)

The normal reason for using traditional “create, and return ID” calls is that, for IDs based on sequential numbers, only the server can allocate a number and guarantee that it is unique. This problem is solved by not using sequential numbers, and instead using UUIDs as identifiers for objects. UUIDs are specifically designed to be unique values created by systems which know nothing of each other, and which may even be offline at the time. Depending on the UUID version, they combine factors such as random numbers, the current time, MAC addresses, etc., to create unique values which don't collide, without clients having to communicate with one another or rely on a central authority.
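As a minimal sketch of client-side ID generation, the following uses the uuid module from the Python standard library. The request body and its field names are assumptions made for illustration, not part of any real API. uuid4 produces a random UUID; uuid1 would instead combine the current time with the host's hardware address.

```python
import uuid

# The client picks the identifier itself, before any request is sent.
access_rights_id = uuid.uuid4()

# Hypothetical request payload; the field names are illustrative only.
request_body = {
    "id": str(access_rights_id),
    "user": "alice",
    "product": "book",
}

# The same identifier is reused on every retry of this logical request.
retry_body = dict(request_body)

print(request_body["id"])
```

The important design point is that the identifier is generated once per logical request and stored by the client, so that a retry carries the same value.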

(As an additional bonus, two calls in traditional systems, “create” and “update”, become one call in such an idempotent system, saving on documentation overhead.)

If the client performs the sequence “charge user” and “create or edit access rights (with the client-supplied UUID)”, and the second fails with a network error, the solution to the above conundrum is that the client must retry the call.

If the access rights don't exist (because the previous request failed), the access rights will be created.
If the access rights exist (because the previous request succeeded, but the delivery of its response failed), the access rights will be updated. (In case the call is a retry, this is effectively a “no operation”.)
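The two cases above can be sketched with an in-memory stand-in for the access rights service. The class name, the put method and the example UUID are hypothetical; a real service would sit behind HTTP and a database.

```python
class AccessRightsStore:
    """Hypothetical in-memory stand-in for the access rights service."""

    def __init__(self):
        self.by_id = {}

    def put(self, rights_id, user, product):
        """Create or update; returns "created" or "updated"."""
        outcome = "updated" if rights_id in self.by_id else "created"
        self.by_id[rights_id] = {"user": user, "product": product}
        return outcome


store = AccessRightsStore()
rights_id = "0b9f6c1e-5a34-4e0e-9a57-2f1d3c8f7a10"  # client-generated UUID

first = store.put(rights_id, "alice", "book")
retry = store.put(rights_id, "alice", "book")  # e.g. after a lost response

print(first, retry, len(store.by_id))  # created updated 1
```

However many times the call is repeated, exactly one set of access rights exists afterwards.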

Similarly to “create or edit object with this UUID”, the “delete” call is no longer an assertion by the client that “I believe this object exists before the call; afterwards it should no longer exist”, where an error “404 object doesn't exist” would be a reasonable response if the object did not exist. Analogously to “create or edit”, it is replaced with a call that makes an assertion only about the final desired state: “after the call, there is no object with this identifier”. If the object exists, it is deleted by the server. If it doesn't exist, nothing happens. If such a delete call is part of a sequence of actions by the client, the client can retry the call if it fails due to a network problem.
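A minimal sketch of such a delete, again using a plain dictionary as a hypothetical stand-in for the server's storage:

```python
def delete_access_rights(by_id, rights_id):
    """Ensure no object with this identifier exists after the call."""
    # pop with a default never raises, so a missing object is not an error.
    by_id.pop(rights_id, None)


rights = {"rights-1": {"user": "alice", "product": "book"}}

delete_access_rights(rights, "rights-1")  # object exists: it is removed
delete_access_rights(rights, "rights-1")  # retry: nothing happens

print(rights)  # {}
```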

Notes

In calls such as “create or edit”, the word “or” takes its meaning from the C programming language and its derivatives. Programmers familiar with this class of language understand the expression “A or B” (written A || B in C) to mean “try A, and try B only if A doesn't succeed”. In these languages, an “or” used in this sense is called a “short-circuit” operator.
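Python's or operator short-circuits in the same way, so the reading of “create or edit” can be sketched directly. The create and edit functions here are hypothetical helpers that report success as a boolean.

```python
calls = []


def create(store, key, value):
    """Succeeds only if the key does not exist yet."""
    calls.append("create")
    if key in store:
        return False
    store[key] = value
    return True


def edit(store, key, value):
    """Overwrites the existing value."""
    calls.append("edit")
    store[key] = value
    return True


store = {}
create(store, "k", 1) or edit(store, "k", 1)  # create succeeds, edit is skipped
create(store, "k", 2) or edit(store, "k", 2)  # create fails, so edit runs

print(calls)  # ['create', 'create', 'edit']
```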

In terms of HTTP verbs, offer only GET (for reading), PUT (for asserting that an object should exist after the call) and DELETE (for asserting that the object should not exist after the call). These verbs are all idempotent according to section 9.1.2 of the HTTP/1.1 specification (RFC 2616). Do not offer any calls with the verb POST. This verb is used for creating new objects and is not idempotent, so the client has no way of determining whether the call succeeded in the case of a network failure.
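The verb rules might be sketched as a single dispatch function. This is not a real web framework: the handle function, its parameters and the dictionary used for storage are assumptions for illustration, and the particular status codes (201, 200, 204, 405) are one reasonable choice rather than something the text above prescribes.

```python
def handle(store, method, object_id, body=None):
    """Return an (HTTP status code, body) pair for one request."""
    if method == "GET":
        if object_id in store:
            return 200, store[object_id]
        return 404, None
    if method == "PUT":
        # Created on the first call, overwritten on any repeat.
        status = 200 if object_id in store else 201
        store[object_id] = body
        return status, body
    if method == "DELETE":
        # Succeeds whether or not the object existed.
        store.pop(object_id, None)
        return 204, None
    # POST and every other verb are deliberately not offered.
    return 405, None


store = {}
print(handle(store, "PUT", "id-1", {"product": "book"})[0])  # 201
print(handle(store, "PUT", "id-1", {"product": "book"})[0])  # 200
print(handle(store, "DELETE", "id-1")[0])                    # 204
print(handle(store, "DELETE", "id-1")[0])                    # 204
print(handle(store, "POST", "id-1", {})[0])                  # 405
```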

As discussed, the solution to calling multiple APIs which may fail is to retry idempotent API calls. Therefore, the client must be in a position where it can reasonably retry them. Interactive applications such as an iPhone app, or a backend responding to API calls from an interactive application, must deliver success or error within a reasonable amount of time, so they cannot keep retrying failed calls themselves. Therefore, such interactive systems, or backends to interactive systems, should add such requests to a queue. The system that processes entries on the queue has the facility to retry the request at a later time.
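A queue worker of this kind could look roughly as follows. Everything here is a sketch under assumptions: TemporaryError, drain and max_attempts are hypothetical names, the queue lives in memory rather than in durable storage, and a real worker would wait between attempts instead of retrying immediately.

```python
from collections import deque


class TemporaryError(Exception):
    """Hypothetical marker for errors worth retrying (network, outage)."""


def drain(queue, send, max_attempts=5):
    """Process queued requests, re-queueing those that fail temporarily."""
    abandoned = []
    while queue:
        request, attempts = queue.popleft()
        try:
            send(request)
        except TemporaryError:
            if attempts + 1 < max_attempts:
                # Safe only because the queued calls are idempotent.
                queue.append((request, attempts + 1))
            else:
                abandoned.append(request)
    return abandoned


delivered = []
failures_left = {"count": 2}


def flaky_send(request):
    """Fails twice with a temporary error, then succeeds."""
    if failures_left["count"] > 0:
        failures_left["count"] -= 1
        raise TemporaryError("connection reset by peer")
    delivered.append(request)


queue = deque([("PUT access-rights/id-1", 0)])
abandoned = drain(queue, flaky_send)
print(delivered, abandoned)
```

The interactive caller only has to enqueue the request and can return to the user straight away.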

It is worth noting that this algorithm only works in the case of temporary errors such as network problems, server unavailability, etc. For permanent errors, for example that the user has insufficient funds to purchase the product, a retry will not lead to success. For such errors, it is imperative that the client either checks in advance all conditions that could lead to permanent errors, or simulates a rollback by calling other idempotent APIs, for example “delete access rights of product” in case the access rights have already been created but payment has permanently failed.
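The simulated rollback might be sketched like this. PermanentError, purchase and the charge callback are hypothetical names, and the dictionary again stands in for the access rights service; temporary errors are left out here, since they would be handled by retrying.

```python
class PermanentError(Exception):
    """Hypothetical marker for errors that a retry cannot fix."""


def purchase(rights_store, rights_id, user, product, charge):
    """Grant access, then charge; undo the grant if payment can never succeed."""
    # Idempotent "create or update" of the access rights.
    rights_store[rights_id] = {"user": user, "product": product}
    try:
        charge(user, product)
    except PermanentError:
        # Simulated rollback via another idempotent call (delete).
        rights_store.pop(rights_id, None)
        return False
    return True


def declined(user, product):
    raise PermanentError("insufficient funds")


store = {}
ok = purchase(store, "id-1", "alice", "book", declined)
print(ok, store)  # False {}
```

Because the compensating delete is itself idempotent, it can be retried on network failure just like any other call.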

Alternative approaches

Alternative implementations of idempotent APIs have the client specify a “request ID” with each call. The server stores a log of all requests along with this “request ID”. For each call, the API software looks in the log for the “request ID”. If it is present, the software does not perform the action, but simply returns the stored result. In this case “provision, and return server-generated ID” can be implemented in a non-idempotent way, and “made” idempotent by a software wrapper recording the responses together with the “request ID”. However, prefer not to introduce additional artificial IDs unless necessary.
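Such a wrapper can be sketched in a few lines. RequestLog, provision and the request IDs are hypothetical, and the log is kept in memory here, whereas a real server would have to persist it.

```python
import itertools


class RequestLog:
    """Hypothetical wrapper that replays stored results by request ID."""

    def __init__(self, action):
        self.action = action
        self.results = {}

    def call(self, request_id, *args):
        if request_id not in self.results:
            # First time this request ID is seen: perform the action once.
            self.results[request_id] = self.action(*args)
        return self.results[request_id]


next_id = itertools.count(1)


def provision(user, product):
    """Non-idempotent: each call allocates a fresh server-generated ID."""
    return next(next_id)


api = RequestLog(provision)
first = api.call("req-42", "alice", "book")
retry = api.call("req-42", "alice", "book")  # replayed from the log
other = api.call("req-43", "bob", "book")

print(first, retry, other)  # 1 1 2
```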

J2EE implementations, which are not REST APIs, pass a hidden extra parameter when performing a remote procedure call: a global transaction ID. All conforming J2EE implementations, even across network boundaries and even from different J2EE server vendors, can communicate, and can either roll back all their work together or commit all their work together. This happens via the “two-phase commit” algorithm. Although this is clearly a superior approach to idempotent REST APIs from a purely technical perspective, it requires that all parties have a J2EE-compliant environment. Do not impose such restrictions on clients; clients should be free to use PHP or node.js or whatever system they wish.

There are alternative ways to implement distributed transactions without using J2EE, for example Spring XA transactions. However, they also require all clients to use a compatible system, which is, as above, too restrictive.
