Infinispan Spark connector 0.5 released!

The Infinispan Spark connector offers seamless integration between Apache Spark and Infinispan Servers. Apart from supporting Infinispan 9.0.0.Final and Spark 2.1.0, this release brings many usability improvements and support for another major Spark API.

Configuration changes

The connector no longer uses a java.util.Properties object to hold the configuration; that is now the job of org.infinispan.spark.config.ConnectorConfiguration, which is type safe and friendly to both Java and Scala:
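A minimal sketch of the new configuration, assuming the builder-style methods setServerList and setCacheName described in the connector documentation (server address and cache name are illustrative):

```scala
import org.infinispan.spark.config.ConnectorConfiguration

// Illustrative server address and cache name
val config = new ConnectorConfiguration()
  .setServerList("infinispan-host:11222")
  .setCacheName("exampleCache")
```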


Filtering by query String

The previous version introduced the ability to filter an InfinispanRDD by providing a Query instance, but that required going through the query DSL, which in turn required a properly configured remote cache.

It’s now possible to simply use an Ickle query string:
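A sketch of the string-based filter, assuming the filterByQuery operation on InfinispanRDD and the Hotel entity shown in the Protocol Buffers section below (server address, cache name and query are illustrative):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.infinispan.spark.config.ConnectorConfiguration
import org.infinispan.spark.rdd.InfinispanRDD

val sc = new SparkContext(new SparkConf().setAppName("ickle-filter-sample"))

val config = new ConnectorConfiguration()
  .setServerList("infinispan-host:11222")
  .setCacheName("hotels")

val rdd = new InfinispanRDD[String, Hotel](sc, config)

// The Ickle query string is executed in the server, so only matching entries reach Spark
val wellRated = rdd.filterByQuery[Hotel]("FROM Hotel h WHERE h.rating >= 3")
```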

Improved support for Protocol Buffers

Support for reading from a cache with protobuf encoding was already present in the previous connector version; it is now also possible to write using protobuf encoding and to have protobuf schema registration handled automatically.

To see this in practice, consider an arbitrary RDD of (Integer, Hotel) pairs that does not originate from Infinispan, where Hotel is given by:
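A sketch of such an entity in Scala, assuming protostream's @ProtoField annotations are what drive the schema auto-generation (the annotation placement via @BeanProperty getters and the field numbers are illustrative):

```scala
import org.infinispan.protostream.annotations.ProtoField

import scala.annotation.meta.beanGetter
import scala.beans.BeanProperty

// Illustrative entity: @ProtoField marks the properties that form the protobuf message
class Hotel extends Serializable {

  @(ProtoField @beanGetter)(number = 1, required = true)
  @BeanProperty var name: String = _

  @(ProtoField @beanGetter)(number = 2, required = true)
  @BeanProperty var rating: Int = _
}
```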

To write this RDD to Infinispan, it's just a matter of doing:
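A sketch of the write, assuming the configuration methods addProtoAnnotatedClass and setAutoRegisterProto and the writeToInfinispan operation that the connector's implicits add to pair RDDs (hotelsRDD stands for the pre-existing RDD of (Integer, Hotel) pairs):

```scala
import org.infinispan.spark._
import org.infinispan.spark.config.ConnectorConfiguration

val config = new ConnectorConfiguration()
  .setServerList("infinispan-host:11222")
  .setCacheName("hotels")
  .addProtoAnnotatedClass(classOf[Hotel]) // entity whose .proto schema and marshaller are generated
  .setAutoRegisterProto()                 // register the generated schema in the server

// hotelsRDD: RDD[(Integer, Hotel)] produced elsewhere in the job
hotelsRDD.writeToInfinispan(config)
```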

Internally, the connector triggers the auto-generation of the .proto file and message marshallers for the configured entity(ies), and registers the schemas in the server before writing.

Splitter is now pluggable

The Splitter is the interface responsible for creating one or more partitions from an Infinispan cache, each partition covering one or more segments. The InfinispanRDD can now be created with a custom implementation of Splitter, allowing for different data partitioning strategies during job processing.
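As a sketch (the exact way to plug the splitter in is an assumption based on the connector's API), a custom implementation could be supplied when creating the RDD:

```scala
import org.infinispan.spark.rdd.InfinispanRDD

// customSplitter is a hypothetical implementation of org.infinispan.spark.rdd.Splitter
// that maps cache segments to Spark partitions with its own strategy;
// sc and config are the SparkContext and ConnectorConfiguration from the earlier examples
val rdd = new InfinispanRDD[String, Hotel](sc, config, customSplitter)
```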

Goodbye Scala 2.10

Scala 2.10 support was removed; Scala 2.11 is currently the only supported version. Scala 2.12 support will follow Spark's own adoption of it (https://issues.apache.org/jira/browse/SPARK-14220).


Streams with initial state

It is possible to configure the InfinispanInputDStream with an extra boolean parameter to receive the current cache state as events.
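A sketch, assuming the InfinispanInputDStream constructor takes the streaming context, storage level, connector configuration and the boolean flag in that order:

```scala
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.infinispan.spark.stream.InfinispanInputDStream

// sc and config are the SparkContext and ConnectorConfiguration from the earlier examples
val ssc = new StreamingContext(sc, Seconds(1))

// Passing 'true' as the last parameter delivers the current cache contents as
// create events before the live change events start flowing
val stream = new InfinispanInputDStream[String, Hotel](ssc, StorageLevel.MEMORY_ONLY, config, true)
```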


Dataset support

The Infinispan Spark connector now ships with support for Spark's Dataset API, including predicate pushdown similar to rdd.filterByQuery. The entry point of this API is the Spark session:
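For reference, a standard Spark session can be obtained as usual (application name and master are illustrative):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("infinispan-dataset-sample")
  .master("local[*]")
  .getOrCreate()
```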

To create an Infinispan-based DataFrame, the "infinispan" data source needs to be used, along with the usual connector configuration:
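A sketch of the read, assuming the ConnectorConfiguration settings can be handed to the data source as options (toStringsMap is assumed here as that bridge) and that the annotated entity class lets the schema be inferred:

```scala
import org.infinispan.spark.config.ConnectorConfiguration

val config = new ConnectorConfiguration()
  .setServerList("infinispan-host:11222")
  .setCacheName("hotels")
  .addProtoAnnotatedClass(classOf[Hotel]) // lets the data source derive the DataFrame schema

val hotelsDF = spark.read
  .format("infinispan")        // the connector's data source name
  .options(config.toStringsMap)
  .load()
```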

From here it’s possible to use the untyped API, for example:
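An illustrative operation with the untyped API, using the columns of the hypothetical Hotel entity:

```scala
// Predicate pushdown applies: only well rated hotels are fetched from the server
hotelsDF.filter(hotelsDF("rating") >= 4)
  .select("name", "rating")
  .show()
```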

or execute SQL queries by registering a temporary view:
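For example (view name and query are illustrative):

```scala
// Register a temporary view backed by the Infinispan data source and query it with Spark SQL
hotelsDF.createOrReplaceTempView("hotels")

spark.sql("SELECT name, rating FROM hotels WHERE rating >= 4").show()
```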

In both cases above, the predicates and the required columns are converted to an Infinispan Ickle filter, so data is filtered at the source rather than during the Spark processing phase.

For the full list of changes, see the release notes. For more information about the connector, the official documentation is the place to go. Also check the Twitter data processing sample, and to report bugs or request new features, use the project JIRA.

Posted by Gustavo on 2017-04-03
Tags: spark server