Lecture notes: Introduction to Apache Spark – OnData.blog

In Lecture 7 of Big Data in 30 hours lecture series, we introduce Apache Spark. The purpose of this memo is to serve to the students as a reference of some of the concepts learned.

About Spark

Spark, managed by Apache Software Foundation, is an open source framework that facilitates distributed computations. It is lately quite popular. With Spark, one can set up a set of machines to run tasks in a coordinated way. Spark builds on the success of its predecessor Hadoop, and often uses Hadoop’s HDFS as the underlying storage.

Main components

Spark SQL – for SQL-like constructs
Spark Streaming – for real-time processing
MLib – machine learning library
GraphX – for graph processing
Spark core are the libraries providing core abstractions such as RDD
Schedulers: YARN and Mesos

Shells:
Python (pyspark)
Scala (spark-shell)
R (sparkR)

Lecture notes: Introduction to Apache Spark

Tagged on: Big Data in 30 hours

Leave a Reply Cancel reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.