What are Spark RDD graph, lineage graph, DAG of Spark tasks? what are their relations

Rui picture Rui · Jul 29, 2015 · Viewed 7.2k times · Source

When we talk about RDD graphs, does it mean lineage graph or DAG (direct acyclic graph) or both? and when is the lineage graph generated? is it generated before the DAG of Spark tasks?

Answer

Daniel Darabos picture Daniel Darabos · Jul 30, 2015

An RDD can depend on zero or more other RDDs. For example when you say x = y.map(...), x will depend on y. These dependency relationships can be thought of as a graph.

You can call this graph a lineage graph, as it represents the derivation of each RDD. It is also necessarily a DAG, since a loop is impossible to be present in it.

Narrow dependencies, where a shuffle is not required (think map and filter) can be collapsed into a single stage. Stages are a unit of execution, and they are generated by the DAGScheduler from the graph of RDD dependencies. Stages also depend on each other. The DAGScheduler builds and uses this dependency graph (which is also necessarily a DAG) to schedule the stages.