Friday, 21 August 2009

Distribution instead of Buddy Replication

People have often commented on Buddy Replication (from JBoss Cache) not being available in Infinispan, and have asked how Infinispan’s far superior distribution mode works. I’ve decided to write this article to discuss the main differences from a high level. For deeper technical details, please visit the Infinispan wiki.

Scalability versus high availability These two concepts are often at odds with one another, even though they are commonly lumped together. What is usually good for scalability isn’t always good for high availability, and vice versa. When it comes to clustering servers, high availability often means simply maintaining more copies, so that if nodes fail - and with commodity hardware, this is expected - state is not lost. An extreme case of this is replicated mode, available in both JBoss Cache and Infinispan, where each node is a clone of its neighbour. This provides very high availability, but unfortunately, this does not scale well. Assume you have 2GB per node. Discounting overhead, with replicated mode, you can only address 2GB of space, regardless of how large the cluster is. Even if you had 100 nodes - seemingly 200GB of space! - you’d still only be able to address 2GB since each node maintains a redundant copy. Further, since every node needs a copy, a lot of network traffic is generated as the cluster size grows.

Enter Buddy Replication Buddy Replication (BR) was originally devised as a solution to this scalability problem. BR does not replicate state to every other node in the cluster. Instead, it chooses a fixed number of 'backup' nodes and only replicates to these backups. The number of backups is configurable, but in general it means that the number of backups is fixed. BR improved scalability significantly and showed near-linear scalability with increasing cluster size. This means that as more nodes are added to a cluster, the space available grows linearly as does the available computing power if measured in transactions per second.

But Buddy Replication doesn’t help everybody! BR was specifically designed around the HTTP session caching use-case for the JBoss Application Server, and heavily optimised accordingly. As a result, session affinity is mandated, and applications that do not use session affinity can be prone to a lot of data gravitation and 'thrashing' - data is moved back and forth across a cluster as different nodes attempt to claim 'ownership' of state. Of course this is not a problem with JBoss AS and HTTP session caching - session affinity is recommended, available on most load balancer hardware and/or software, is taken for granted, and is a well-understood and employed paradigm for web-based applications.

So we had to get better Just solving the HTTP session caching use-case wasn’t enough. A well-performing data grid needs to to better, and crucially, session affinity cannot be taken for granted. And this was the primary reason for not porting BR to Infinispan. As such, Infinispan does not and will not support BR as it is too restrictive.

Distribution Distribution is a new cache mode in Infinispan. It is also the default clustered mode - as opposed to replication, which isn’t scalable. Distribution makes use of familiar concepts in data grids, such as consistent hashing, call proxying and local caching of remote lookups. What this leads to is a design that does scale well - fixed number of replicas for each cache entry, just like BR - but no requirement for session affinity.

What about co-locating state? Co-location of state - moving entries about as a single block - was automatic and implicit with BR. Since each node always picked a backup node for all its state, one could visualize all of the state on a given node as a single block. Thus, colocation was trivial and automatic: whatever you put in Node1 will always be together, even if Node1 eventually dies and the state is accessed on Node2. However, this meant that state cannot be evenly balanced across a cluster since the data blocks are very coarse grained. With distribution, colocation is not implicit. In part due to the use of consistent hashing to determine where each cached entry resides, and also in part due to the finer-grained cache structure of Infinispan - key/value pairs instead of a tree-structure - this leads to individual entries as the granularity of state blocks. This means nodes can be far better balanced across a cluster. However, it does mean that certain optimizations which rely on co-location - such as keeping related entries close together - is a little more tricky.

One approach to co-locate state would be to use containers as values. For example, put all entries that should be colocated together into a HashMap. Then store the HashMap in the cache. But that is coarse-grained and ugly as an approach, and will mean that the entire HashMap would need to be locked and serialized as a single atomic unit, which can be expensive if this map is large.

Another approach is to use Infinispan’s AtomicMap API. This powerful API lets you group entries together, so they will always be colocated, locked together, but replication will be much finer-grained, allowing only deltas to the map to be replicated. So that makes replication fast and performant, but it still means everything is locked as a single atomic unit. While this is necessary for certain applications, it isn’t always be desirable.

One more solution is to implement your own ConsistentHash algorithm - perhaps extending DefaultConsistentHash. This implementation would have knowledge of your object model, and hashes related instances such that they are located together in the hash space. By far the most complex mechanism, but if performance and co-location really is a hard requirement then you cannot get better than this approach.

In summary:

Buddy Replication

Near-linear scalability
Session affinity mandatory
Co-location automatic
Applicable to a specific set of use cases due to the session affinity requirement

Distribution

Near-linear scalability
No session affinity needed
Co-location requires special treatment, ranging in complexity based on performance and locking requirements. By default, no co-location is provided
Applicable to a far wider range of use cases, and hence the default highly scalable clustered mode in Infinispan

Hopefully this article has sufficiently interested you in distribution, and has whetted your appetite for more. I would recommend the Infinispan wiki which has a wealth of information including interactive tutorials and GUI demos, design documents and API documentation. And of course you can’t beat downloading Infinispan and trying it out, or grabbing the source code and looking through the implementation details.

Cheers Manik

Posted by Manik Surtani on 2009-08-21
Tags: buddy replication partitioning distribution

Friday, 21 August 2009

Defining cache configurations via CacheManager in Beta1

Infinispan’s first beta release is just around the corner and in preparation, I’d like to introduce to the Infinispan users an important API change in org.infinispan.manager.CacheManager class that will be part of this beta release.

As a result of the development of the Infinispan second level cache provider for Hibernate, we have discovered that the CacheManager API for definition and retrieval of Configuration instances was a bit limited. So, for this coming release, the following method has been deleted:

void defineCache(String cacheName, Configuration configurationOverride)

And instead, the following two methods have been added:

Configuration defineConfiguration(String cacheName, Configuration configurationOverride);
Configuration defineConfiguration(String cacheName, String templateCacheName,
   Configuration configurationOverride);

The primary driver for this change has been the development of the Infinispan cache provider, where we wanted to enable users to configure or override most commonly modified Infinispan parameters via hibernate configuration file. This would avoid users having to modify different files for the most commonly modified parameters, hence improving usability of the Infinispan cache provider. However, to be able to implement this, we needed CacheManager’s API to be enhanced so that:

Existing defined cache configurations could be overriden. This enables use cases like this: Sample Infinispan cache provider configuration will contain a generic cache definition to be used for entities. Via hibernate configuration file, users could redefine the maximum number of entries to be allowed before eviction kicks in for all entities. The code would look something like this:

// Assume that 'cache-provider-configs.xml' contains
// a named cache for entities called 'entity'
CacheManager cacheManager = new DefaultCacheManager(
   "/home/me/infinispan/cache-provider-configs.xml");
Configuration overridingConfiguration = new Configuration();
overridingConfiguration.setEvictionMaxEntries(20000); // max entries to 20.000
// Override existing 'entity' configuration so that eviction max entries are 20.000.
cacheManager.defineConfiguration("entity", overridingConfiguration);

Be able to define new cache configurations based on the configuration of a given cache instance, optionally applying some overrides. This enables uses cases like the following: A user wants to define eviction wake up interval for a specific entity which is different to the wake up interval used for the rest of entities.

// Assume that 'cache-provider-configs.xml' contains
// a named cache for entities called 'entity'
CacheManager cacheManager = new DefaultCacheManager(
   "/home/me/infinispan/cache-provider-configs.xml");
Configuration overridingConfiguration = new Configuration();
// set wake up interval to 240 seconds
overridingConfiguration.setEvictionWakeUpInterval(240000L);
// Create a new cache configuration for com.acme.Person entity
// based on 'entity' configuration, overriding the wake up interval to be 240 seconds
cacheManager.defineConfiguration("com.acme.Person", "entity", overridingConfiguration);

Another limitation of the previous API, which we’ve solved with this API change, is that in the past the only way to get a cache’s Configuration required the cache to be started because the only way to get the Configuration instance was from the Cache API. However, with this API change, we can now retrieve a cache’s Configuration instance via the CacheManager API. Example:

// Assume that 'cache-provider-configs.xml' contains
// a named cache for entities called 'entity'
CacheManager cacheManager = new DefaultCacheManager(
   "/home/me/infinispan/cache-provider-configs.xml");
// Pass a brand new Configuration instance without overrides
// and it will return the given cache name's Configuration
Configuration entityConfiguration = cacheManager.defineConfiguration("entity",
new Configuration());

If you would like to provide any feedback to this post, either respond to this blog entry or go to Infinispan’s user forums.

Posted by Galder Zamarreño on 2009-08-21
Tags: configuration

Wednesday, 12 August 2009

Coalesced Asynchronous Cache Store

As we prepare for Infinispan’s beta release, let me introduce to you one of the recent enhancements implemented which improves the way the current asynchronous (or write-behind) cache store works.

Right until now, the asynchronous cache store simply queued modifications, while a set of threads would apply them. However, if the queue contained N put operations on the same key, these threads would apply each and every modification one after the other, which is not very efficient.

Thanks to the excellent feedback from the Infinispan community, we’ve now improved the asynchronous cache store so that it coalesces changes and only applies the latest modification on a key. So, if N put operations on the same key are queued, only the last modification will be applied to the cache store.

Internally, the asynchronous concurrent queueing mechanism used performs in O(1) by keeping an map with the latest values for each key. So, this maps acts like the queue but there’s a not a need for a queue as such, we only care about making sure the latest values are stored hence, order is not important.

Note that the way threads apply these modifications is that they start working as soon as there are any changes available and so to see these changes coalesced, the system needs to be relatively busy or a lot of changes on the same key need to happen in a relatively short period of time. We could have made these threads work periodically, i.e. every X seconds, but by doing that, we would be letting modifications pile up and the time between operations and the cache store updates would go up, hence increasing the chance that the cache store is outdated.

Finally, there’s no configuration modifications required to get the asynchronous cache store to work in the coalesced way, it just works like this out-of-the-box. Example:

<?xml version="1.0" encoding="UTF-8"?>
<infinispan xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="urn:infinispan:config:4.0">
  <namedCache name="persistentCache">
     <loaders passivation="false" shared="false" preload="true">
        <loader class="org.infinispan.loaders.file.FileCacheStore" fetchPersistentState="true" ignoreModifications="false" purgeOnStartup="false">
           <properties>
              <property name="location" value="/tmp"/>
           </properties>
           <async enabled="true" threadPoolSize="10"/>
        </loader>
     </loaders>
  </namedCache>
</infinispan>

Posted by Galder Zamarreño on 2009-08-12
Tags: cache stores asynchronous

Thursday, 30 July 2009

More JUG talks

Adrian Cole and I recently presented at a JUGs in Krakow and Dublin, on Infinispan and JClouds.

Krakow was great - and hot: 39 degree weather! (Don’t we just love global warming?) Anna Kolodziejczyk at Proidea (who also organises JDD) organised the event, which attracted an excited and interactive audience of about 40 people. Dublin, beautiful as always, was a polar opposite in terms of weather: a cloudy, windy and wet day, living up to its reputation. Luan O’Carroll of DUBJUG organised the event, at the plush Odessa Club. Again, an inquisitive audience of about 35 people attended, with a lot of questions on the future of data storage in clouds. Thomas Diesler of JBoss OSGi fame made a surprise guest appearance too.

In general, the talks have been very well received and have provoked thought and discussion. As requested by many, I will soon be recording a podcast of this talk and will make it available on this blog.

Apart from a JUG I hope to organise soon in London, the next time I speak about Infinispan publicly will be at JBoss World in September. Hope to see you there!

Cheers Manik

Posted by Manik Surtani on 2009-07-30
Tags: JUGs

Monday, 27 July 2009

Increase transactional throughput with deadlock detection

Deadlock detection is a new feature in Infinispan. It is about increasing the number of transactions that can be concurrently processed. Let’s start with the problem first (the deadlock) then discuss some design details and performance.

So, the by-the-book deadlock example is the following:

Transaction one (T1) performs following operation sequence: (write key_1,write key_2)
Transaction two (T2) performs following sequence: (write key_2, write key_1).

Now, if the T1 and T2 happen at the same time and both have executed first operation, then they will wait for each other virtually forever to release owned locks on keys. In the real world, the waiting period is defined by a lock acquisition timeout (LAT) - which defaults to 10 seconds - that allows the system to overcome such scenarios and respond to the user one way (successful) or the other(failure): so after a period of LAT one (or both) transaction will rollback, allowing the other to continue working.

Deadlocks are bad for both system’s throughput and user experience. System throughput is affected because during the deadlock period (which might extend up to LAT) no other thread will be able to update neither key_1 nor key_2. Even worse, access to any other keys that were modified by T1 or T2 will be similarly restricted. User experience is altered by the fact that the call(s) will freeze for the entire deadlock period, and also there’s a chance that both T1 and T2 will rollback by timing out.

As a side note, in the previous example, if the code running the transactions would(and can) enforce any sort of ordering on the keys accessed within the transaction, then the deadlock would be avoided. E.g. if the application code would order the operation based on the lexicographic ordering of keys, both T1 and T2 would execute the following sequence: (write key_1,write key_2), and so no deadlock would result. This is a best practice and should be followed whenever possible. Enough with the theory! The way Infinispan performs deadlock detection is based on an algorithm designed by Jason Greene and Manik Surtani, which is detailed here. The basic idea is to split the LAT in smaller cycles, as it follows:

lock(int lockAcquisitionTimeout) {
while (currentTime < startTime + timeout) {
 if (acquire(smallTimeout)) break;
 testForDeadlock(globalTransaction, key);
}
}

What testForDeadlock(globalTransaction, key) does is check weather there is another transaction that satisfies both conditions:

holds a lock on key and
intends to lock on a key that is currently called by this transaction.

If such a transaction is found then this is a deadlock, and one of the running transactions will be interrupted: the decision of which transaction will interrupt is based on coin toss, a random number that is associated with each transaction. This will ensure that only one transaction will rollback, and the decision is deterministic: nodes and transactions do not need to communicate with each other to determine the outcome.

Deadlock detection in Infinispan works in two flavors: determining deadlocks on transactions that spread over several caches and deadlock detection in transactions running on a single(local) cache.

Let’s see some performance figures as well. A class for benchmarking performance of deadlock detection functionality was created and can be seen here. Test description (from javadoc):

We use a fixed size pool of keys (KEY_POOL_SIZE) on which each transaction operates. A number of threads (THREAD_COUNT) repeatedly starts transactions and tries to acquire locks on a random subset of this pool, by executing put operations on each key. If all locks were successfully acquired then the tx tries to commit: only if it succeeds this tx is counted as successful. The number of elements in this subset is the transaction size (TX_SIZE). The greater transaction size is, the higher chance for deadlock situation to occur. On each thread these transactions are being repeatedly executed (each time on a different, random key set) for a given time interval (BENCHMARK_DURATION). At the end, the number of successful transactions from each thread is cumulated, and this defines throughput (successful tx) per time unit (by default one minute).

Disclaimer: The following figures are for a scenario especially designed to force very high contention. This is not typical, and you shouldn’t expect to see this level of increase in performance for applications with lower contention (which most likely is the case). Please feel free tune the above benchmark class to fit the contention level of your application; sharing your experience would be very useful!

Following diagram shows the performance degradation resulting from running the deadlock detection code by itslef in a scenario where no contention/deadlocks are present. http://2.bp.blogspot.com/_ISQfVF8ALAQ/Sm2h_re8qKI/AAAAAAAABqA/bsNgEyCkcYw/s1600-h/DLD_replicated.JPG[] Some clues on when to enable deadlock detection. A high number of transaction rolling back due to org.infinispan.util.concurrent.TimeoutException is an indicator that this functionality might help. TimeoutException might be caused by other causes as well, but deadlocks will always result in this exception being thrown. Generally, when you have a high contention on a set of keys, deadlock detection may help. But the best way is not to guess the performance improvement but to benchmark and monitor it: you can have access to statistics (e.g. number of deadlocks detected) through JMX, as it is exposed via the DeadlockDetectingLockManager MBean.