Friday, 02 November 2018
Infinispan Spark, Infinispan Hadoop and Infinispan Kafka have a new fresh release each!
The native Apache Spark connector now supports Infinispan 9.4.x and Spark 2.3.2, and it exposes Infinispan’s new transcoding capabilities, enabling the InfinispanRDD and InfinispanDStream to operate with multiple data formats. For more details see the documentation.
The connector that allows accessing Infinispan using standard Input/OutputFormat interfaces now offers compatibility with Infinispan 9.4.x and has been certified to run with the Hadoop 3.1.1 runtime. For more details about this connector, see the user manual. Also make sure to check the docker based demos: Infinispan + Yarn and Infinispan + Apache Flink.
Tags: release kafka spark hadoop
Monday, 01 February 2016
The recording of the talk presented last Wednesday in London is available! Thanks to everyone who joined and to the London JBUG organizers for the awesome new venue!
Tags: functional presentations spark hadoop streams video
Monday, 21 September 2015
The version 0.1 of the Infinispan Hadoop connector has just been made available!
The connector will host several integrations with Hadoop related projects, and in this first release it supports converting Infinispan server into a Hadoop compliant data source, by providing an implementation of InputFormat and OutputFormat.
A Hadoop InputFormat is a specification of how a certain data source can be partitioned and how to read data from each of the partitions. Conversely, OutputFormat is used to write.
Looking closely at the Hadoop’s InputFormat interface, we can see two methods:
List<InputSplit> getSplits(JobContext context); RecordReader<K,V> createRecordReader(InputSplit split,TaskAttemptContext context);
The first method defines essentially a data partitioner, calculating one or more InputSplits that contain information about a certain partition of the data. With possession of a InputSplit, one can use it to obtain a RecordReader to iterate over the data. These two operations allow for parallelization of data processing across multiple nodes, and that’s how Hadoop map reduce achieves a high throughput over large datasets.
In Infinispan terms, each partition is a set of segments on a certain server, and a record reader is a remote iterator over those segments. The default partitioner shipped with the connector will create as many partitions as servers in the cluster, and each partition will contain the segments that are associated with that specific server.
==== Not only map reduce
Although the InfinispanInputFormat and InfinispanOutputformat can be used to run traditional Hadoop map reduce jobs over Infinispan data, it is not coupled to the Hadoop map reduce runtime. It is possible to leverage the connector to integrate Infinispan with other tools that, besides supporting Hadoop I/O interfaces, are able to read and write data more efficiently. One of those tools is Apache Flink, that has a dataflow engine capable of doing batch and stream data processing that supersedes the classic two stage map reduce approach.
==== Apache Flink example
Apache Flink supports Hadoop’s InputFormat as a data source to execute batch jobs, so to integrate with Infinispan it’s straightforward:
Please refer to the complete sample that has docker images for both Apache Flink and Infinispan server, and detailed instructions on how to execute and customise job.
More details about the connector, maven coordinates, configuration options, sources and samples can be found at the project repository
Tags: yarn hadoop server flink