Monday, 25 August 2014

Partitioned clusters tell no lies!

The problem

You are happily running a 10-node cluster. You want failover and speed and are using distributed mode with 2 copies of data for each key (numOwners=2). But disaster strikes: a switch in your network crashes and 5 of your nodes can’t reach the other 5 anymore ! Now there are two independent clusters, each containing 5 nodes, which we are smartly going to name P1 and P2. Both P1 and P2 continue to serve user requests (puts and gets) as usual.

This cluster split in two or more parts is called partitioning or split brain. And it’s bad for business, as in really bad ! Bob and Alice share a bank account stored in the cache. Bob updates his account on P1, then Alice reads it from P2: she sees a stale value of Bob’s account (or even no value for Bob’s account, depending on how the split looks like). This is a consistency issue, as there’s an inconsistent view of the data between the two partitions.

Our solution

In Infinispan 7.0.0.Beta1 we added support for reacting to split brains: if nodes leave, Infinispan acknowledges that data might have been lost and denies user access to such data. We won’t deny access to all the data, but just the data that might have been affected by the partitioning. Or, more formally: Infinispan sacrifices data availability in order to offer consistency (PC in Brewer’s CAP theorem). For now partition handling is disabled by default, however we do intend to make it the default in an upcoming release: running with partition handling off is like running with scissors: do it at your own risk and only if you (don’t) know what you’re doing.

How we do it

A partition is assumed to happen when numOwners or more nodes disappear at the same time. When this happen two (or more) partitions form which are not aware of each other. Each such partition does not start a rebalance, but enters in degraded mode:

  • request (read and writes) for entries that have all the copies on nodes within this partition are honored

  • requests for entries that are partially or totally owned by nodes that disappeared are rejected through an AvailabilityException

To exemplify, consider the initial cluster C0=\{A,B,C,D}, A,B,C,D - nodes, configured in distributed mode with numOwners=2. Further on, the cluster contains k1, k2 and k3 keys such that owners(k1) = \{A,B}, owners(k2) = \{B,C} and owners(k3) = \{C,D}. Then a partition happens C1=\{A,B} and C2=\{C,D}, the degraded mode exhibits the following behavior:

  • on C1, k1 is available for read/write, k2 (partially owned) and k3 (not owned) are not available and accessing them results in an AvailabilityException

  • on C2, k1 and k2  are not available for read/write, k3 is available

A relevant aspect of the partition handling process is the fact that when a split brain happens, the resulting partitions rely on the original consistent hash function (the one that existed before the split brain) in order to calculate key ownership. So it doesn’t matter if k1, k2 or k3 already exists in the cluster or not, as the availability is strictly determined by the consistent hash and not by the key existence.

If at a further point in time the initial partition C0 forms again as a result of the network healing and C1 and C2 partitions being merged back together, then C0 exists the degraded mode becoming fully available again.

Configuration for partition handling functionality

In order to enable partition handling within the XML configuration:

The same can be achieved programmatically:

The actual implementation is work in progress and Beta2 will contain further improvements which we will publish here!

Cheers, Mircea Markus

Posted by Mircea Markus on 2014-08-25
Tags: split brain partition handling availability

News

Tags

JUGs alpha as7 asymmetric clusters asynchronous beta c++ cdi chat clustering community conference configuration console data grids data-as-a-service database devoxx distributed executors docker event functional grouping and aggregation hotrod infinispan java 8 jboss cache jcache jclouds jcp jdg jpa judcon kubernetes listeners meetup minor release off-heap openshift performance presentations product protostream radargun radegast recruit release release 8.2 9.0 final release candidate remote query replication queue rest query security spring streams transactions vert.x workshop 8.1.0 API DSL Hibernate-Search Ickle Infinispan Query JP-QL JSON JUGs JavaOne LGPL License NoSQL Open Source Protobuf SCM administration affinity algorithms alpha amazon annotations announcement archetype archetypes as5 as7 asl2 asynchronous atomic maps atomic objects availability aws beer benchmark benchmarks berkeleydb beta beta release blogger book breizh camp buddy replication bugfix c# c++ c3p0 cache benchmark framework cache store cache stores cachestore cassandra cdi cep certification cli cloud storage clustered cache configuration clustered counters clustered locks codemotion codename colocation command line interface community comparison compose concurrency conference conferences configuration console counter cpp-client cpu creative cross site replication csharp custom commands daas data container data entry data grids data structures data-as-a-service deadlock detection demo deployment dev-preview devnation devoxx distributed executors distributed queries distribution docker documentation domain mode dotnet-client dzone refcard ec2 ehcache embedded query equivalence event eviction example externalizers failover faq final fine grained flags flink full-text functional future garbage collection geecon getAll gigaspaces git github gke google graalvm greach conf gsoc hackergarten hadoop hbase health hibernate hibernate ogm hibernate search hot rod hotrod hql http/2 ide index indexing india infinispan infinispan 8 infoq internationalization interoperability interview introduction iteration javascript jboss as 5 jboss asylum jboss cache jbossworld jbug jcache jclouds jcp jdbc jdg jgroups jopr jpa js-client jsr 107 jsr 347 jta judcon kafka kubernetes lambda language leveldb license listeners loader local mode lock striping locking logging lucene mac management map reduce marshalling maven memcached memory migration minikube minishift minor release modules mongodb monitoring multi-tenancy nashorn native near caching netty node.js nodejs nosqlunit off-heap openshift operator oracle osgi overhead paas paid support partition handling partitioning performance persistence podcast presentations protostream public speaking push api putAll python quarkus query quick start radargun radegast react reactive red hat redis rehashing releaase release release candidate remote remote events remote query replication rest rest query roadmap rocksdb ruby s3 scattered cache scripting second level cache provider security segmented server shell site snowcamp spark split brain spring spring boot spring-session stable standards state transfer statistics storage store store by reference store by value streams substratevm synchronization syntax highlighting testing tomcat transactions uneven load user groups user guide vagrant versioning vert.x video videos virtual nodes vote voxxed voxxed days milano wallpaper websocket websockets wildfly workshop xsd xsite yarn zulip

back to top