1. Intro to Apache Spark

What is Spark?

Spark is a general-purpose, in-memory computing engine. It is also called a plug-and-play compute engine because it can be plugged into any resource manager (such as YARN, Kubernetes, or Mesos) and any storage system (such as HDFS, Amazon S3, or a local file system).
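
As a rough illustration of this plug-and-play idea, here is a minimal PySpark sketch (assuming PySpark is installed; the app name and file paths are placeholders, not real datasets). Swapping the master URL or the path prefix is all it takes to target a different resource manager or storage system.

```python
from pyspark.sql import SparkSession

# Build a SparkSession. The master URL decides which resource manager runs
# the job: "local[*]" uses the local machine, "yarn" would hand the job to
# a YARN cluster, "k8s://..." to Kubernetes, and so on.
spark = (
    SparkSession.builder
    .appName("intro-to-spark")
    .master("local[*]")          # swap for "yarn" or "k8s://..." on a cluster
    .getOrCreate()
)

# The storage layer is equally pluggable: the same read API works for a
# local file, HDFS, or S3 -- only the path prefix changes.
# (The paths below are placeholders.)
df_local = spark.read.csv("file:///tmp/sales.csv", header=True)
# df_hdfs = spark.read.csv("hdfs:///data/sales.csv", header=True)
# df_s3   = spark.read.csv("s3a://my-bucket/sales.csv", header=True)

df_local.show(5)

spark.stop()
```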

Have you heard of Hadoop?
If yes, what does it provide to us for data processing?

Hadoop Provides Us
  • HDFS - Storage
  • MapReduce - Computation/Processing
  • YARN - Resource Management

What is a Spark Cluster?
A Spark cluster is a collection of one master node and multiple worker (slave) nodes.

What is a Node?
A node can be thought of as a single machine that contributes its own compute resources (CPU cores and memory) to the cluster.
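
To make "a node is a resource with computing capabilities" concrete, here is a hedged sketch of how a Spark application asks for a slice of those resources when it starts. The memory, core, and instance numbers are arbitrary example values, and in local mode they are only illustrative; real values depend on the cluster's hardware.

```python
from pyspark.sql import SparkSession

# Each executor is a process that runs on one of the cluster's nodes and
# uses a slice of that node's CPU cores and memory. The values below are
# just example numbers.
spark = (
    SparkSession.builder
    .appName("cluster-resources-demo")
    .master("local[*]")                      # stand-in for a real cluster master
    .config("spark.executor.memory", "4g")   # memory per executor
    .config("spark.executor.cores", "2")     # CPU cores per executor
    .config("spark.executor.instances", "3") # number of executors requested
    .getOrCreate()
)

# Inspect the effective settings that were applied.
print(spark.sparkContext.getConf().getAll())

spark.stop()
```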

Can Spark be Considered a Replacement for Hadoop?
Spark is not a replacement for Hadoop; rather, it is an alternative to MapReduce. It still relies on a storage layer (such as HDFS) and a resource manager (such as YARN), both of which Hadoop can provide.

Why is Spark faster than MapReduce?
MapReduce writes intermediate results to disk between stages, so it involves far more disk read and write operations than Spark.
Spark provides low latency because it keeps intermediate results in memory and performs fewer disk reads and writes.

If a job has to perform 5 iterations on the same data, MapReduce will perform about 10 disk operations (a read and a write per iteration), whereas Spark reads the data from disk once, keeps it in memory across the iterations, and writes only the final result back, roughly 2 disk operations in total.
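
A minimal PySpark sketch of the same idea, assuming a hypothetical numbers.csv file with a "value" column (both are made-up for illustration): after cache(), the five iterations below reuse the in-memory copy instead of re-reading from disk each time.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("iterative-demo").master("local[*]").getOrCreate()

# Hypothetical input. In MapReduce, every iteration over this data would
# re-read it from disk and write its result back to disk.
df = spark.read.csv("file:///tmp/numbers.csv", header=True, inferSchema=True)

# cache() asks Spark to keep the dataset in memory after the first read,
# so the loop below hits disk only once (the initial read).
df.cache()

threshold = 0
for i in range(5):
    # Each "iteration" is a pass over the cached data, not a fresh disk scan.
    count = df.filter(F.col("value") > threshold).count()
    print(f"iteration {i}: {count} rows above {threshold}")
    threshold += 10

# One final write back to storage -- the second (and last) disk operation.
df.filter(F.col("value") > threshold).write.mode("overwrite").csv("file:///tmp/filtered")

spark.stop()
```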