Wednesday, June 17, 2009

Retrying transactions with exponential backoff

The method I described in my previous post about retrying transactions could use some improvement. One deficiency, which only becomes relevant on highly congested servers, is the constant retry interval. When two or more transactions try to commit at the same time and fail, the code from the last post will most likely make them retry at the same time as well. Random runtime events (process/thread scheduler decisions, JIT compiler invocations, etc.) might help avoid a collision, but in general transactions that collided once may well collide again. And again. And then again, until the specified maximum number of retries is reached.


The general method to alleviate such problems is to randomize the retry intervals. The best-known algorithm in this category is exponential backoff. Here is one way to implement it for the utility method from my previous post (full code in gss):


public T tryExecute(final Callable<T> command) throws Exception {
    T returnValue = null;
    // Schedule a Future task to call the command after delay milliseconds.
    int delay = 0;
    ScheduledExecutorService executor = Executors.newScheduledThreadPool(1);
    for (int i = 0; i < TRANSACTION_RETRIES; i++) {
        final int retry = i;
        ScheduledFuture<T> future = executor.schedule(new Callable<T>() {

            @Override
            public T call() throws Exception {
                return command.call();
            }
        }, delay, TimeUnit.MILLISECONDS);

        try {
            returnValue = future.get();
            break;
        } catch (ExecutionException e) {
            Throwable cause = e.getCause();
            // Give up on unrelated failures or when the retries are exhausted.
            if (!(cause instanceof EJBTransactionRolledbackException) ||
                    retry == TRANSACTION_RETRIES - 1) {
                executor.shutdownNow();
                throw new Exception(cause);
            }
            // Random delay in [MIN_TIMEOUT, MIN_TIMEOUT * (i + 2)).
            delay = MIN_TIMEOUT + (int) (MIN_TIMEOUT * Math.random() * (i + 1));
            String origCause = cause.getCause() == null ?
                    cause.getClass().getName() :
                    cause.getCause().getClass().getName();
            logger.info("Transaction retry #" + (i+1) + " scheduled in " + delay +
                    " msec due to " + origCause);
        }

    }
    executor.shutdownNow();
    return returnValue;
}

The notable changes in this version are the delayed invocation of the Callable and the calculation of the retry interval. The former is accomplished with a ScheduledFuture, from the excellent utilities in java.util.concurrent, which gets executed after a random time period. The interval calculation can be implemented in a number of ways, with or without monotonically increasing intervals. With the formula above, the delay before retry i + 1 is drawn at random between MIN_TIMEOUT and MIN_TIMEOUT * (i + 2), so the window of possible delays widens with every failed attempt while the actual delay may still be shorter than the previous one. I opted for this formula since it eliminated collisions fastest in my tests, much faster than the monotonically increasing formulas I tried. The MIN_TIMEOUT constant is more of a black art. It should be tuned to the particular hardware and network setup in order to attain maximum efficiency: the minimum number of retries with the minimum interval between retries.
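
To make the interval calculation easier to compare with other variants, here is a minimal sketch that isolates it. The class name BackoffDelays, the doublingDelay variant, the maxDelay cap and the MIN_TIMEOUT value of 100 are all assumptions made for illustration; only postDelay mirrors the formula used in the method above (with a java.util.Random instance instead of Math.random(), so that runs can be seeded). Note that in the formula from the post the upper bound of the delay grows linearly with the retry count, whereas in the doubling variant it grows exponentially until it hits the cap.

```java
import java.util.Random;

// Hypothetical helper class for illustration; not part of the original code.
class BackoffDelays {
    // Assumed value; the right setting depends on hardware and network setup.
    static final int MIN_TIMEOUT = 100;

    // Formula from the post: a random delay in
    // [MIN_TIMEOUT, MIN_TIMEOUT * (i + 2)) for zero-based attempt i.
    static int postDelay(int i, Random random) {
        return MIN_TIMEOUT + (int) (MIN_TIMEOUT * random.nextDouble() * (i + 1));
    }

    // Assumed alternative: a random delay in [MIN_TIMEOUT, upper), where
    // upper = min(maxDelay, MIN_TIMEOUT * 2^(i + 1)).
    // Expects maxDelay >= MIN_TIMEOUT.
    static int doublingDelay(int i, int maxDelay, Random random) {
        int shift = Math.min(i + 1, 20); // bound the shift to avoid overflow
        long upper = Math.min((long) maxDelay, ((long) MIN_TIMEOUT) << shift);
        return MIN_TIMEOUT + (int) ((upper - MIN_TIMEOUT) * random.nextDouble());
    }

    public static void main(String[] args) {
        Random random = new Random();
        for (int i = 0; i < 5; i++) {
            System.out.println("retry #" + (i + 1)
                    + ": post formula " + postDelay(i, random) + " msec"
                    + ", doubling variant " + doublingDelay(i, 5000, random) + " msec");
        }
    }
}
```

Which of the two behaves better under contention is something to measure on the actual setup; the sketch only shows how the ranges differ.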

Another issue, raised in a comment on my previous post, was that the exception I am using as a guard, EJBTransactionRolledbackException, may be too generic for this purpose. This is definitely true: in this case EJBTransactionRolledbackException just wraps an OptimisticLockException, as one would expect, which in turn wraps the Hibernate-specific StaleObjectStateException that is thrown initially. However, contention on a database table, index or row does not only result in optimistic locking violations. It can also produce deadlocks, when one transaction holds exclusive locks on resources that are needed by the other and vice versa. Deadlock detection is performed by the DBMS, which breaks the deadlock by forcing a rollback, typically of one of the transactions involved. This time, however, the initial exception (at least in my testing; this might be DBMS- or JDBC driver-specific) is a GenericJDBCException, which gets wrapped in a PersistenceException, which in turn is put inside an EJBTransactionRolledbackException before being received by our call site. Therefore, the generic nature of the guard is not a problem in this case. On the contrary, it covers more relevant cases, and one might even argue that there is no better service we can offer the caller than retrying on all such occasions.
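
Since the interesting exception can sit two or more levels down the wrapping chain, it may help to walk the whole chain, either to log the initial exception or to narrow the guard if the generic one ever proves too broad. The following is a minimal sketch under that assumption; the class Causes and its methods are hypothetical helpers, not part of the code above, and plain JDK exceptions stand in for the EJB, JPA and Hibernate classes so that the example runs on its own.

```java
// Hypothetical helper for illustration; not part of the original code.
class Causes {
    // Upper bound on the chain length, as a guard against cyclic cause chains.
    private static final int MAX_DEPTH = 32;

    // True if t, or any exception in its cause chain, is an instance of type.
    static boolean hasCause(Throwable t, Class<? extends Throwable> type) {
        int depth = 0;
        while (t != null && depth++ < MAX_DEPTH) {
            if (type.isInstance(t)) {
                return true;
            }
            t = t.getCause();
        }
        return false;
    }

    // The innermost exception in the chain (t itself if it has no cause).
    static Throwable rootCause(Throwable t) {
        Throwable root = t;
        int depth = 0;
        while (root.getCause() != null && depth++ < MAX_DEPTH) {
            root = root.getCause();
        }
        return root;
    }

    public static void main(String[] args) {
        // Stand-ins for StaleObjectStateException -> OptimisticLockException
        // -> EJBTransactionRolledbackException.
        Throwable initial = new IllegalStateException("stale object");
        Throwable received = new RuntimeException(new IllegalArgumentException(initial));

        System.out.println(hasCause(received, IllegalStateException.class));
        System.out.println(rootCause(received).getClass().getName());
    }
}
```

In the retry method, a call like rootCause(cause) could replace the one-level cause.getCause() lookup used for the log message, so that the log names the exception that was thrown initially.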

1 comment:

Wladimir Tavares said...

That's funny! I'm writing a very similar piece of code for dealing with Hibernate transactions. And I remember the Ethernet Backoff algorithm, from Computer Networks classes.
Very good insight.
