The following operations can be performed on an RDD.
- Create an RDD from a data source
- Transform an existing RDD into a new RDD
- Run an action on an RDD to compute a result
Ways to create an RDD
- By loading data from an external source, for example a text file.
data = spark.sparkContext.textFile("file path")
- By using the parallelize method on the Spark context object.
numbers = [1, 2, 3, 4, 5]
data = spark.sparkContext.parallelize(numbers)
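- By using the makeRDD method on the Spark context object. makeRDD is part of the Scala API and is not available in PySpark, so this example is written in Scala.
val list = List(1, 2, 3, 4, 5)
val data = spark.sparkContext.makeRDD(list)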
Features of RDD
- Immutable - Once an RDD is created, it cannot be changed. Because an RDD is immutable, Spark can break a bigger task into multiple smaller parts, distribute them among different worker nodes, and finally combine the output produced by those parts without worrying about the underlying data getting changed.
- Distributed - When Spark runs in distributed mode, it breaks down a big task into smaller tasks, distributes them across multiple worker nodes, and combines the processed results to provide the final result.
- Resilient - Spark provides a mechanism to recover from faults and different error types. When a worker node processing a portion of an RDD dies, the driver program can recreate that portion and assign the task of processing it to another node, so the data processing job still completes successfully.
- Lazy Evaluation - Spark does not compute an RDD until an action is called on it. Until then, transformations are only recorded.
- In Memory - Spark processes data mostly in memory, which lets it process data at very high speed. When sufficient memory is not available, the data is spilled to disk.
- Recomputed - Spark recomputes an RDD every time a new action is called. When the same RDD needs to be used in multiple places, we can persist or cache it to avoid the recomputation.
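The lazy evaluation, recomputation, and caching behaviour described above can be illustrated with a small plain-Python sketch. This is only an analogy: ToyRDD and its methods are hypothetical names invented for this example and are not part of Spark's API. The sketch counts how often the mapped function actually runs.

```python
class ToyRDD:
    """Hypothetical plain-Python stand-in for an RDD (not Spark's API)."""

    def __init__(self, compute):
        self._compute = compute    # zero-argument function that produces the data
        self._use_cache = False    # switched on by cache()
        self._cached = None        # filled by the first action after cache()

    def map(self, fn):
        # Transformation: returns a new ToyRDD and does no work yet.
        return ToyRDD(lambda: [fn(x) for x in self._materialize()])

    def cache(self):
        # Ask for the computed data to be kept after the first action.
        self._use_cache = True
        return self

    def _materialize(self):
        if not self._use_cache:
            return self._compute()          # recomputed on every call
        if self._cached is None:
            self._cached = self._compute()  # computed once, then reused
        return self._cached

    def collect(self):
        # Action: triggers the computation and returns a plain list.
        return list(self._materialize())

    def count(self):
        # Action: triggers the computation and returns the number of elements.
        return len(self._materialize())


calls = []

def double(x):
    calls.append(x)  # record every time the function really runs
    return x * 2

source = ToyRDD(lambda: [1, 2, 3, 4, 5])

doubled = source.map(double)
calls_before_action = len(calls)   # no action yet, so nothing has run

first = doubled.collect()          # first action: double runs for each element
total = doubled.count()            # second action: double runs again
calls_without_cache = len(calls)

calls.clear()
cached = source.map(double).cache()
cached.collect()                   # computes and stores the result
cached.count()                     # reuses the stored result
calls_with_cache = len(calls)

print(calls_before_action, calls_without_cache, calls_with_cache)
```

In this toy model the function runs zero times before any action, ten times across two actions without caching, and five times when the result is cached. Real Spark additionally handles partitioning, storage levels, and eviction, which this sketch ignores.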
Operations Performed On RDD
We can perform the following operations on an RDD.
- Transformation - Transformations are operations performed on an RDD that result in a new RDD.
Example: applying map or flatMap to an RDD returns a new RDD.
Some transformations: map, flatMap, glom, filter, distinct, union
- Action - Actions are operations that trigger evaluation of an RDD and its transformations and return the result. An action can either save the data to some location or return a value that is not an RDD.
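To make the listed transformations more concrete, here is a rough plain-Python sketch of what each one does to partitioned data. It does not use Spark: an "RDD" is modelled as a list of partitions, and every function name below (rdd_map, rdd_glom, collect, and so on) is a hypothetical helper made up for illustration. Unlike Spark, these helpers run eagerly, and the distinct helper keeps first-seen order in a single partition, whereas Spark's distinct shuffles data and gives no ordering guarantee.

```python
from itertools import chain

# Model an RDD as a list of partitions, each partition being a list of elements.
partitions = [[1, 2, 3], [3, 4, 5]]

def rdd_map(parts, fn):
    # map: apply fn to every element, one output element per input element.
    return [[fn(x) for x in p] for p in parts]

def rdd_flat_map(parts, fn):
    # flatMap: fn returns a sequence per element; the sequences are flattened.
    return [[y for x in p for y in fn(x)] for p in parts]

def rdd_filter(parts, pred):
    # filter: keep only the elements for which pred is true.
    return [[x for x in p if pred(x)] for p in parts]

def rdd_glom(parts):
    # glom: each partition becomes a single element holding its items as a list.
    return [[p] for p in parts]

def rdd_distinct(parts):
    # distinct: drop duplicates (simplified to one output partition).
    return [list(dict.fromkeys(chain.from_iterable(parts)))]

def rdd_union(a, b):
    # union: concatenate the partitions of both inputs; duplicates are kept.
    return a + b

def collect(parts):
    # Action: return all elements as one plain list (not an "RDD").
    return list(chain.from_iterable(parts))

def count(parts):
    # Action: return the number of elements.
    return sum(len(p) for p in parts)

mapped = collect(rdd_map(partitions, lambda x: x * 10))
flat = collect(rdd_flat_map(partitions, lambda x: [x, x]))
evens = collect(rdd_filter(partitions, lambda x: x % 2 == 0))
glommed = collect(rdd_glom(partitions))
unique = collect(rdd_distinct(partitions))
union_count = count(rdd_union(partitions, [[6]]))

print(mapped, flat, evens, glommed, unique, union_count)
```

Each transformation helper returns another list of partitions, mirroring how a transformation returns a new RDD, while collect and count return ordinary Python values, mirroring actions.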