Apache Spark provides data analytics for big data. It is particularly good at data mining across multiple disparate data sources.

Apache Spark was initially written by Matei Zaharia at UC Berkeley’s AMPLab in 2009. Spark was developed to address many of the limitations of MapReduce. Specifically, to reduce the number of steps, Spark keeps data in memory rather than writing it to disk between stages. As a result, Apache Spark is often substantially faster than MapReduce, because there is far less disk I/O contention. Spark is also much easier to code in than MapReduce.

Beyond the core Spark engine, Spark has four main modules:

  • Spark SQL and DataFrames
    • SQL access to data from different data sources, including Hive, JDBC, Parquet, and DataStax
    • Data from these sources can be joined in place, with no separate ETL step.
  • Spark Streaming – Kafka
  • Machine Learning – MLlib
  • GraphX with GraphFrames
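
To make the "no ETL" point concrete, here is a minimal PySpark sketch that joins a Parquet dataset with a JDBC table using plain SQL. The view names (events, users), the column names (user_id, id, name), and the function names are hypothetical and chosen only for illustration. A real JDBC read would also need the database driver on the classpath and usually credentials.

```python
# Sketch only: view names, column names, and function names are assumptions.

JOIN_SQL = """
SELECT u.name, COUNT(*) AS event_count
FROM events e
JOIN users u ON e.user_id = u.id
GROUP BY u.name
"""


def register_views(spark, parquet_path, jdbc_url, jdbc_table):
    """Expose two different sources as temporary SQL views."""
    # Parquet files are read into a DataFrame and registered as a view.
    spark.read.parquet(parquet_path).createOrReplaceTempView("events")

    # A relational table is read through Spark's built-in "jdbc" source.
    (
        spark.read.format("jdbc")
        .option("url", jdbc_url)
        .option("dbtable", jdbc_table)
        .load()
        .createOrReplaceTempView("users")
    )


def events_per_user(spark, parquet_path, jdbc_url, jdbc_table):
    """Join the two sources directly in Spark SQL, with no ETL step."""
    register_views(spark, parquet_path, jdbc_url, jdbc_table)
    return spark.sql(JOIN_SQL)
```

With a session obtained from SparkSession.builder.getOrCreate(), calling events_per_user(spark, path, url, table).show() would print the joined result. Neither source has to be copied into a common store first.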

We are big fans of the Apache Spark that is built into DataStax Enterprise. We particularly like using the Spark Cassandra Connector to run reporting against DataStax. The key is to remember to join on the partition keys, which can let the connector fetch only the matching partitions rather than scan entire tables.
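
As a rough sketch of what joining on the partition key looks like in PySpark, the helpers below load Cassandra tables through the Spark Cassandra Connector's data source (org.apache.spark.sql.cassandra) and join them on a shared key. The keyspace, table, and column names in the usage note are hypothetical, and we assume the chosen column is the partition key of the tables involved. Whether the connector can avoid a full scan also depends on the connector version and its settings.

```python
# Sketch only: function names and all keyspace/table/column names are assumptions.

CASSANDRA_FORMAT = "org.apache.spark.sql.cassandra"


def cassandra_table(spark, keyspace, table):
    """Load one Cassandra table as a DataFrame via the connector."""
    return (
        spark.read.format(CASSANDRA_FORMAT)
        .options(keyspace=keyspace, table=table)
        .load()
    )


def join_on_partition_key(left, right, partition_key):
    """Inner-join two DataFrames on a column assumed to be the partition key."""
    return left.join(right, on=partition_key, how="inner")
```

For example, with hypothetical tables shop.orders and shop.customers that are both partitioned by customer_id, join_on_partition_key(cassandra_table(spark, "shop", "orders"), cassandra_table(spark, "shop", "customers"), "customer_id") would produce the reporting join.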