Apache Spark
• Every RDD is an immutable dataset. Each RDD keeps track of the lineage of
the deterministic operations that were applied to a fault-tolerant input
dataset to create it (see the sketch after this list).
• If any partition of an RDD is lost due to a worker node failure,
then that partition can be re-computed from the original fault-
tolerant dataset using the lineage of operations.
• Assuming that all of the RDD transformations are deterministic,
the data in the final transformed RDD will always be the same
irrespective of failures in the Spark cluster.
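The following is a minimal sketch of lineage-based recovery, assuming a local Spark session and written in Scala; the object name LineageSketch, the sample data, and the transformations are illustrative and not taken from the text above. The point is that Spark records the chain of deterministic transformations rather than the transformed data, so a lost partition can be rebuilt by re-running that chain against the fault-tolerant source.

    import org.apache.spark.sql.SparkSession

    object LineageSketch {
      def main(args: Array[String]): Unit = {
        // Local session for illustration; in a real cluster the master URL differs.
        val spark = SparkSession.builder()
          .appName("rdd-lineage-sketch")
          .master("local[*]")
          .getOrCreate()
        val sc = spark.sparkContext

        // The source data is fault tolerant (here an in-memory collection;
        // in practice it would typically live in HDFS, S3, etc.).
        val source = sc.parallelize(1 to 1000, numSlices = 4)

        // Deterministic transformations: Spark records the lineage
        // (parallelize -> map -> filter), not the resulting partitions.
        val squared = source.map(x => x * x)
        val evens   = squared.filter(_ % 2 == 0)

        // toDebugString prints the recorded lineage; if a partition of `evens`
        // is lost, Spark re-runs these transformations for that partition only.
        println(evens.toDebugString)
        println(evens.count())

        spark.stop()
      }
    }

Because every step in the chain is deterministic, re-executing it on the same input partitions yields the same result, which is why the final RDD is identical regardless of intervening failures.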
To achieve fault tolerance for all the generated RDDs, the received data is
replicated among multiple Spark executors on worker nodes in the cluster.
This results in two types of data that should be recovered in the event of a failure: