Infinispan Spark connector 0.1 released!

Dear users,

The Infinispan connector for Apache Spark has just been made available as a Spark Package!

What is it?

The Infinispan Spark connector allows tight integration with Apache Spark, allowing Spark jobs to be run against data stored in the Infinispan Server, exposing any cache as an RDD, and also writing data from any key/value RDD to a cache. It’s also possible to create a DStream backed by cache events and to save any key-value DStream to a cache.

The minimum version required is Infinispan 8.0.0.Beta3.

Giving it a spin with Docker

A handy docker image that contains an Infinispan cluster co-located with an Apache Spark standalone cluster is the fastest way to try the connector. Start by launching the container that hosts the Spark Master:

And then run as many worker nodes as you want:

Using the shell

The Apache Spark shell is a convenient way to quickly run jobs in an interactive fashion. Taking advantage of the fact that Spark is already installed in the docker containers (and thus the shell), let’s attach to the master:

Once inside, a Spark shell can be launched by:

That’s all it’s needed. The shell grabs the Infinispan connector and its dependencies from and exposes them in the classpath.

Generating data and writing to Infinispan

Let’s obtain a list of words from the Linux dictionary, and generate 1k random 4-word phrases. Paste the commands in the shell:

From the phrases, we’ll create a key value RDD (Long, String):

To save to Infinispan:

Obtaining facts about data

To be able to explore data in the cache, the first step is to create an infinispan RDD:

As an example job, let’s calculate a histogram showing the distribution of word lengths in the phrases. This is simply a sequence of transformations expressed by:

This pipeline yields:

2 chars words: 10 occurrences 3 chars words: 37 occurrences 4 chars words: 133 occurrences 5 chars words: 219 occurrences 6 chars words: 373 occurrences 7 chars words: 428 occurrences 8 chars words: 510 occurrences 9 chars words: 508 occurrences 10 chars words: 471 occurrences 11 chars words: 380 occurrences 12 chars words: 309 occurrences 13 chars words: 238 occurrences


Now let’s find similar words using the Levenshtein distance algorithm. For that we need to define a function that will calculate the edit distance between two strings. As usual, paste in the shell:

Empowered by the Levenshtein distance implementation, we need another function that given a word, will find in the cache similar words according to the provided maximum edit distance:

Sample usage:

Where to go from here

And that concludes this first post on Infinispan-Spark integration. Be sure to check the Twitter demo for non-shell usages of the connector, including Java and Scala API.

And it goes without saying, your feedback is much appreciated! :)



JUGs alpha as7 asymmetric clusters asynchronous beta c++ cdi chat clustering community conference configuration console data grids data-as-a-service database devoxx distributed executors docker event functional grouping and aggregation hotrod infinispan java 8 jboss cache jcache jclouds jcp jdg jpa judcon kubernetes listeners meetup minor release off-heap openshift performance presentations product protostream radargun radegast recruit release release 8.2 9.0 final release candidate remote query replication queue rest query security spring streams transactions vert.x workshop 8.1.0 API DSL Hibernate-Search Ickle Infinispan Query JP-QL JSON JUGs JavaOne LGPL License NoSQL Open Source Protobuf SCM administration affinity algorithms alpha amazon anchored keys annotations announcement archetype archetypes as5 as7 asl2 asynchronous atomic maps atomic objects availability aws beer benchmark benchmarks berkeleydb beta beta release blogger book breizh camp buddy replication bugfix c# c++ c3p0 cache benchmark framework cache store cache stores cachestore cassandra cdi cep certification cli cloud storage clustered cache configuration clustered counters clustered locks codemotion codename colocation command line interface community comparison compose concurrency conference conferences configuration console counter cpp-client cpu creative cross site replication csharp custom commands daas data container data entry data grids data structures data-as-a-service deadlock detection demo deployment dev-preview development devnation devoxx distributed executors distributed queries distribution docker documentation domain mode dotnet-client dzone refcard ec2 ehcache embedded embedded query equivalence event eviction example externalizers failover faq final fine grained flags flink full-text functional future garbage collection geecon getAll gigaspaces git github gke google graalvm greach conf gsoc hackergarten hadoop hbase health hibernate hibernate ogm hibernate search hot rod hotrod hql http/2 ide index indexing india infinispan infinispan 8 infoq internationalization interoperability interview introduction iteration javascript jboss as 5 jboss asylum jboss cache jbossworld jbug jcache jclouds jcp jdbc jdg jgroups jopr jpa js-client jsr 107 jsr 347 jta judcon kafka kubernetes lambda language learning leveldb license listeners loader local mode lock striping locking logging lucene mac management map reduce marshalling maven memcached memory migration minikube minishift minor release modules mongodb monitoring multi-tenancy nashorn native near caching netty node.js nodejs non-blocking nosqlunit off-heap openshift operator oracle osgi overhead paas paid support partition handling partitioning performance persistence podcast presentation presentations protostream public speaking push api putAll python quarkus query quick start radargun radegast react reactive red hat redis rehashing releaase release release candidate remote remote events remote query replication rest rest query roadmap rocksdb ruby s3 scattered cache scripting second level cache provider security segmented server shell site snowcamp spark split brain spring spring boot spring-session stable standards state transfer statistics storage store store by reference store by value streams substratevm synchronization syntax highlighting tdc testing tomcat transactions tutorial uneven load user groups user guide vagrant versioning vert.x video videos virtual nodes vote voxxed voxxed days milano wallpaper websocket websockets wildfly workshop xsd xsite yarn zulip
Posted by Gustavo on 2015-08-17
Tags: release spark
back to top