When we talk about RDD graphs, does it mean lineage graph or DAG (direct acyclic graph) or both? and when is the lineage graph generated? is it generated before the DAG of Spark tasks?
An RDD can depend on zero or more other RDDs. For example when you say x = y.map(...)
, x
will depend on y
. These dependency relationships can be thought of as a graph.
You can call this graph a lineage graph, as it represents the derivation of each RDD. It is also necessarily a DAG, since a loop is impossible to be present in it.
Narrow dependencies, where a shuffle is not required (think map
and filter
) can be collapsed into a single stage. Stages are a unit of execution, and they are generated by the DAGScheduler
from the graph of RDD dependencies. Stages also depend on each other. The DAGScheduler
builds and uses this dependency graph (which is also necessarily a DAG) to schedule the stages.