Saturday, April 11, 2015

Apache Spark platform

I've not been here on the blog for a while :-( ; many things to do, many things to learn...
In these two years, many things have changed deeply (BigData, IoT, API-oriented architectures, machine learning, Docker...) and some have been confirmed (TypeScript, Go, ...).

Two years ago, the emerging BigData technology was the Hadoop ecosystem, with plenty of technologies working together quite anarchically (HDFS, HSQLDB, Cassandra, Hive, Pig, Storm, and so on...). It was very hard for a beginner to dive into all those technologies and make the right choices.

A few months ago a new Apache project started: the Spark project, or better the Spark platform... The primary goal was to boost BigData performance with a memory-oriented approach... But when we look deeper into Spark, we discover that it has been designed from the beginning as a complete, problem-oriented platform, i.e. one addressing all the main problems we expect to encounter, such as:
- standardized access to "tabular" data through Spark SQL (thus potentially replacing Hive, Pig, ad-hoc SQL tooling, ...), and to graph processing through GraphX (see the Spark SQL sketch after this list)
- an ETL-like approach to pipelining data processes through Spark Streaming (also sketched below)
- standardized access to machine learning algorithms (MLlib), moreover exposed through an R-dataframe-like API (a short example follows)
- efficient big data processing in memory (illustrated by the caching sketch below)
- compatibility with the existing Hadoop ecosystem
- native access from mainstream programming languages (Java, Scala, Python), plus an API for R and others
- quite simple installation and getting started

Spark is still in its infancy (though already at version 1.3) and evolving fast; nevertheless, the Apache project is mature enough for production, and very promising. No competing solution so far offers developers a single entry point to so many techniques covering the whole landscape of data processing.
