Tuesday, 12 May 2009

Implementing a performant, thread-safe ordered data container

To achieve efficient ordering of entries in the DataContainer interface for configurations that support eviction, there was a need for a linked HashMap implementation that was thread-safe and performant. Below, I specifically discuss the implementations of the FIFODataContainer and LRUDataContainer in Infinispan 4.0.x. Wherever this document references FIFODataContainer, this also applies to LRUDataContainer - which extends FIFODataContainer. The only difference is that LRUDataContainer updates links whenever an entry is visited as well as added.

After analysing and considering a few different approaches, the one I settled on is a subset of the algorithms described by H. Sundell and P. Tsigas in their 2008 paper titled Lock-Free Deques and Doubly Linked Lists, combined with the approach used by Sun’s JDK6 for reference marking in ConcurrentSkipListMap's implementation.

Reference marking? What’s that?

Compare-and-swap (CAS) is a common technique today for atomically updating a variable or a reference without the use of a mutual exclusion mechanism like a lock. But this only works when you modify a single memory location at a time, be it a reference or a primitive. Sometimes you need to atomically update two separate bits of information in a single go, such as a reference, as well as some information about that reference. Hence reference marking. In C, this is sometimes done by making use of the assumption that an entire word in memory is not needed to store a pointer to another memory location, and some bits of this word can be used to store additional flags via bitmasking. This allows for atomic updates of both the reference and this extra information using a single CAS operation.

This is possible in Java too using AtomicMarkableReference, but is usually considered overly complex, slow and space-inefficient. Instead, what we do is borrow a technique from ConcurrentSkipListMap and use an intermediate, delegating entry. While this adds a little more complexity in traversal (you need to be aware of the presence of these marker entries when traversing the linked list), this performs better than an AtomicMarkableReference.

In this specific implementation, the 'extra information' stored in a reference is the fact that the entry holding the reference is in the process of being deleted. It is a common problem with lock-free linked lists when you have concurrent insert and delete operations that the newly inserted entry gets deleted as well, since it attaches itself to the entry being concurrently deleted. When the entry to be removed marks its references, however, this makes other threads aware of the fact and cause CAS operations on the reference to fail and retry.

Performance

Aside from maintaining order of entries and being thread-safe, performance was one of the other goals. The target is to achieve constant-time performance - O(1) - for all operations on DataContainer.

The clhttp://3.bp.blogspot.com/_ca0W9t-Ryos/SgGwA2iw_vI/AAAAAAAAAKA/TpVMWo2Rq9U/s1600-h/FIFODataContainer.jpeg[image]ass diagram (click to view in full-size) depicts the FIFODataContainer class. At its heart the FIFODataContainer mimics a JDK ConcurrentHashMap (CHM), making use of hashtable-like lockable segments. Unlike the segments in CHM, however, these segments are simpler as they support a much smaller set of operations.

Retrieving data from the container The use of segments allow for constant-time thread-safe get() and containsKey() operations. Iterators obtained from the DataContainer - which implements Iterable, and hence usable in for-each loops - and keySet() are immutable, thread-safe and efficient, using traversal of the linked list - making use of getNext() and getPrev() helpers. See below for details. Traversal is efficient and constant-time.

Updating the container When removing an entry, remove() locks the segment in question, removes the entry, and unlinks the entry. Both operations are thread-safe and constant-time. Locking the segment and removing the entry is pretty straightforward. Unlinking involves marking references, and then an attempt at CAS’ing next and previous references to bypass the removed entry. Order here is important - updates to the next reference needs to happen first, read on for more details as to why.

When performing a put(), the entry is created, segment locked and entry inserted into the segment. The entry is then inserted at the tail of the linked list. Again, both operations are thread-safe and constant-time. Linking at the tail involves careful CAS’ing of references on the new entry, the tail dummy entry and the former last entry in the list.

Maintaining a lock-free, concurrent doubly linked list It is important to note that the entries in this implementation are doubly linked. This is critical since, unlike the JDK’s ConcurrentSkipListMap, we use a hashtable to look up entries to be removed, to achieve constant time performance in lookup. Locating the parent entry to update a reference needs to be constant-time as well, and hence the need for a previous reference. Doubly-linked lists make things much trickier though, as there two references to update atomically (yes, that sounds wrong!)

Crucially, what we do not care about - and do not support - is reverse-order traversal. This means that we only really care about maintaining accuracy in the forward direction of the linked list, and treat the previous reference as an approximation to an entry somewhere behind the current entry. Previous references can then be corrected - using the correctPrev() helper method described below - to locate the precise entry behind the current entry. By placing greater importance on the forward direction of the list, this allows us to reliably CAS the forward reference even if the previous reference CAS fails. It is hence critical that whenever any references are updated, the next reference is CAS’d first, and only on this success the previous reference CAS is attempted. The same order applies with marking references. Also, it is important that any process that touches an entry that observes that the next pointer is marked but the previous pointer is not, attempts to mark the previous pointer before attempting any further steps.

The specific functions we need to expose, to support DataContainer operations, are:

void linkAtEnd(LinkedEntry entry);
  
  void unlink(LinkedEntry entry);
  
  LinkedEntry correctPrev(LinkedEntry suggestedPrev, LinkedEntry current);
  
  LinkedEntry getNext(LinkedEntry current);
  
  LinkedEntry getPrev(LinkedEntry current);

These are exposed as protected final methods, usable by FIFODataContainer and its subclasses. The implementations themselves use a combination of CAS’s on a LinkedEntry’s next and previous pointers, marking references, and helper correction of previous pointers when using getNext() and getPrevious() to traverse the list. Note that it is important that only the last two methods are used when traversing rather than directly accessing a LinkedEntry’s references - since markers need to be stepped over and links corrected.

Please refer to Lock-Free Deques and Doubly Linked Lists for details of the algorithm steps.


Sunday, 10 May 2009

jclouds-s3 beta released

image

jclouds-s3 is the glue between Infinispan and Amazon S3. jclouds provides a cachestore plugin that allows you to persist your infinispan cluster to S3.

Over the last few months, jclouds has evolved with Infinispan’s concurrent context.

Under the hood, jclouds is made for infinispan. Its non-blocking engine and FutureCommand design were crafted to meet the challenge of Infinispan’s grueling integration tests.

jclouds-s3 is also quite user-friendly, exposing services to S3 in a simple Map interface.

As we are now in beta, please do try out jclouds-s3 and let us know where we can improve. If you have some spare cycles, feel free to lend a hand :)

Regardless, we hope you enjoy the product.


Saturday, 02 May 2009

Keep it in the cloud

Corporate slaves often spend months of paper work only to find their machine obsolete before its powered on.   Forward thinking individuals need playgrounds to try out new ideas.  Enterprise 2.0 projects have to scale with their user base.  In short, there’s a lot of demand for flexible infrastructure.  Cloud infrastructure is one way to fill that order.

One popular cloud infrastructure provider is Amazon EC2.  EC2 is basically a pay-as-you-go datacenter.  You pay for CPU, storage, and network resources.  Using the open-source Infinispan data grid, you have a good chance of linear performance as your application needs change.  Using EC2, you can instantly bring on hosts to support that need.  Great match, right?  What’s next?

Assuming your data is important, you will need to persist your Infinispan cluster somewhere.   That said, Amazon charges you for traffic that goes in and out of their network…​ this could get expensive.  So, the next bit is controlling these costs. 

Amazon offers a storage service called S3.  Transferring data between EC2 and S3 is free; you only pay for data parking.  In short, there is a way to control these I/O costs: S3. 

Infinispan will save your cluster to S3 when configured with its high-performance JClouds plug-in.  You specify the S3 Bucket and your AWS credentials and Infinispan does the rest.

In summary, not only does Infinispan shred your license costs, but we also help cut your persistence costs, too!

So, go ahead: Keep it in the cloud!


Friday, 01 May 2009

Interview about Infinispan on OpenSourceReleaseFeed

OpenSourceReleaseFeed have just published an interview with yours truly on Infinispan. Check it out here!

Cheers Manik


Tuesday, 28 April 2009

Infinispan: the Start of a New Era in Open Source Data Grids

Over the past few months we’ve been flying under the radar preparing for the launch of a new, open source, highly scalable distributed data grid platform. We’ve finally got to a stage where we can announce it publicly and I would like to say that Infinispan is now ready to take on the world!

The way we write computer software is changing. The demise of the Quake Rule has made hardware manufacturers cram more cores on a CPU, more CPUs in a server. To achieve the levels of throughput and resilience that modern applications demand, compute grids are becoming increasingly popular. All this serves to exacerbate existing database bottlenecks; hence the need for a data grid platform.

imageSo why is Infinispan sexy?

  1. Massive heap - If you have 100 blade servers, and each node has 2GB of space to dedicate to a replicated cache, you end up with 2 GB of total data. Every server is just a copy. On the other hand, with a distributed grid - assuming you want 1 copy per data item - you get a 100 GB memory backed virtual heap that is efficiently accessible from anywhere in the grid. Session affinity is not required, so you don’t need fancy load balancing policies. Of course you can still use them for further optimisation. If a server fails, the grid simply creates new copies of the lost data, and puts them on other servers. This means that applications looking for ultimate performance are no longer forced to delegate the majority of their data lookups to a large single database server - that massive bottleneck that exists in over 80% of enterprise applications!

  2. Extreme scalability - Since data is evenly distributed, there is essentially no major limit to the size of the grid, except group communication on the network - which is minimised to just discovery of new nodes. All data access patterns use peer-to-peer communication where nodes directly speak to each other, which scales very well.

  3. Very fast and lightweight core - The internal data structures of Infinispan are simple, very lightweight and heavily optimised for high concurrency. Early benchmarks have indicated 3-5 times less memory usage, and around 50% better CPU performance than the latest and greatest JBoss Cache release. Unlike other popular, competing commercial software, Infinispan scales when there are many local threads accessing the grid at the same time. Even though non-clustered caching (LOCAL mode) is not its primary goal, Infinispan still is very competitive here.

  4. Not Just for Java (PHP, Python, Ruby, C, etc.) - The roadmap has a plan for a language-independent server module. This will support both the popular memcached protocol - with existing clients for almost every popular programming language - as well as an optimised Infinispan-specific protocol. This means that Infinispan is not just useful to Java. Any major website or application that wants to take advantage of a fast data grid will be able to do so.

  5. Support for Compute Grids - Also on the roadmap is the ability to pass a Runnable around the grid. You will be able to push complex processing towards the server where data is local, and pull back results using a Future. This map/reduce style paradigm is common in applications where a large amount of data is needed to compute relatively small results.

  6. Management is key! - When you start thinking about running a grid on several hundred servers, management is no longer an extra, it becomes a necessity. This is on Infinispan’s roadmap. We aim to provide rich tooling in this area, with many integration opportunities.

  7. Competition is Proprietary - All of the major, viable competitors in the space are not open-source, and are very expensive. Enough said. :-)

What are data grids?http://www.arcatoglobal.com/images/ag_serverfarm.jpg[image]

Data grids are, to put it simply, highly concurrent distributed data structures. Data grids typically allow you to address a large amount of memory and store data in a way that it is quick to access. They also tend to feature low latency retrieval, and maintain adequate copies across a network to provide resilience to server failure.

As such, at its core, Infinispan presents a humble data structure. But this is also a high specialised data structure, tuned to and geared for a great degree of concurrency - especially on multi-CPU/multi-core architectures. Most of the internals are essentially lock- and synchronization-free, favouring state-of-the-art non-blocking algorithms and techniques wherever possible. This translates to a data structure that is extremely quick even when it deals with a large number of concurrent accesses.

Beyond this, Infinispan is also a distributed data structure. It farms data out across a cluster of in-memory containers. It does so with a configurable degree of redundancy and various parameters to tune the performance-versus-resilience trade-off. Local "L1" caches are also maintained for quick reads of frequently accessed data.

Further, Infinispan supports JTA transactions. It also offers eviction strategies to ensure individual nodes do not run out of memory and passivation/overflow to disk. Warm-starts using preloads are also supported.

JBoss Cache and Infinispan

So where does Infinispan stand against the competition? Let’s start with JBoss Cache. It is no surprise that there are many similarities between JBoss Cache and Infinispan, given that they share the same minds! Infinispan is an evolution of JBoss Cache in that it borrows ideas, designs and some code, but for all practical purposes it is a brand new project and a new, much more streamlined codebase.

JBoss Cache has evolved from a basic replicated tree structure to include custom, high performance marshalling (in version 1.4), Buddy Replication (1.4), a new simplified API (2.X), high concurrency MVCC locking (3.0.X) and a new non-blocking state transfer mechanism (3.1.X). These were all incremental steps, but it is time for a quantum leap.

Hence Infinispan. Infinispan is a whole new project - not just JBoss Cache 4.0! - because it is far wider in scope and goals - not to mention target audience. Here are a few points summarising the differences:

  • JBoss Cache is a clustering library. Infinispan’s goal is to be a data grid platform, complete with management and migration tooling.

  • JBoss Cache’s focus has been on clustering, using replication. This has allowed it to scale to several 10s (occasionally even over 100) nodes. Infinispan’s goals are far greater - to scale to grids of several 100’s of nodes, eventually exceeding 1000’s of nodes. This is achieved using consistent hash based data distribution.

  • Infinispan’s data structure design is significantly different to that of JBoss Cache. This is to help achieve the target CPU and memory performance. Internally, data is stored in a flat, map-like container rather than a tree. That said, a tree-like compatibility layer - implemented on top of the flat container - is provided to aid migration from JBoss Cache.

  • JBoss Cache traditionally competed against other frameworks like EHCache and Terracotta. Infinispan, on the other hand, goes head to head against Oracle’s Coherence, Gemfire and Gigaspaces.

I have put up some FAQs on the project. A project roadmap is also available, as well as a 5-minute guide to using Infinispan.

Have a look at JIRA or grab the code from our Subversion repository to see where we are with things. If you are interested in participating in Infinispan, be sure to read our community page.

I look forward to your feedback!

Cheers Manik


back to top