What's the meaning of the DStream.foreachRDD function?

Guo · Apr 5, 2016 · Viewed 16.6k times

In Spark Streaming, each batch interval of data always generates one and only one RDD, so why do we use foreachRDD() to iterate over RDDs? There is only one RDD, so there is nothing to iterate over. In my testing, I have never seen more than one RDD.

Answer

maasg · Apr 6, 2016

A DStream or "discretized stream" is an abstraction that breaks a continuous stream of data into small chunks. This is called "microbatching". Each microbatch becomes an RDD that is given to Spark for further processing. There's one and only one RDD produced for each DStream at each batch interval.
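As a rough illustration of that discretization step, here is a plain Scala model with no Spark dependency. The names Event, MicroBatchSketch and discretize are made up for this sketch, and it treats each event's timestamp as its arrival time, which is a simplification of what Spark Streaming does internally.

```scala
// Plain Scala model of micro-batching; this is NOT the Spark implementation.
// Event, MicroBatchSketch and discretize are hypothetical names for illustration.
final case class Event(timestampMs: Long, payload: String)

object MicroBatchSketch {
  /** Splits events into consecutive batches of `batchIntervalMs` milliseconds.
    * Returns (batchStartMs, events) pairs in time order. Intervals without
    * events still yield an (empty) batch, so there is exactly one batch per
    * interval. Assumes non-negative timestamps.
    */
  def discretize(events: Seq[Event], batchIntervalMs: Long): Seq[(Long, Seq[Event])] = {
    require(batchIntervalMs > 0, "batch interval must be positive")
    if (events.isEmpty) Seq.empty
    else {
      // Key each event by the start time of the interval it falls into.
      val byStart = events.groupBy(e => e.timestampMs / batchIntervalMs * batchIntervalMs)
      val first = byStart.keys.min
      val last = byStart.keys.max
      // Walk every interval between the first and last one, filling gaps with empty batches.
      (first to last by batchIntervalMs).map { start =>
        start -> byStart.getOrElse(start, Seq.empty[Event])
      }
    }
  }
}
```

With events at 100, 900, 1200 and 3100 ms and a 1000 ms interval, this yields four batches, the third of which is empty. Each element of the result plays the role of the single RDD that a DStream produces for one batch interval.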

An RDD is a distributed collection of data. Think of it as a set of pointers to where the actual data is in a cluster.

DStream.foreachRDD is an "output operator" in Spark Streaming. It allows you to access the underlying RDDs of the DStream to execute actions that do something practical with the data. For example, using foreachRDD you could write data to a database.
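A minimal sketch of the database case might look like the following. The Connection trait and the connect factory are hypothetical stand-ins for a real database client, not part of Spark. The sketch opens one connection per partition on the executor, so the connection itself never has to be serialized and shipped from the driver.

```scala
import org.apache.spark.streaming.dstream.DStream

// Hypothetical stand-in for a database client; not part of Spark.
trait Connection {
  def send(record: String): Unit
  def close(): Unit
}

object SaveToDatabaseSketch {
  /** Writes all records of one partition through a single connection. */
  def writePartition(records: Iterator[String], connect: () => Connection): Unit = {
    val connection = connect() // opened where the data lives
    try records.foreach(connection.send)
    finally connection.close() // always release the connection
  }

  /** Registers the output operation on the stream.
    * `connect` is an assumption of this sketch: it must be serializable,
    * because it is shipped to the executors inside the closure.
    */
  def save(lines: DStream[String], connect: () => Connection): Unit =
    lines.foreachRDD { rdd =>
      // This function runs on the driver, once per batch interval,
      // and receives that interval's RDD.
      rdd.foreachPartition { records =>
        // This function runs on an executor, once per partition of the RDD.
        writePartition(records, connect)
      }
    }
}
```

Nothing is written until the StreamingContext is started; save only registers what should happen to each batch's RDD.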

The little mind twist here is to understand that a DStream is a time-bound collection. Let me contrast this with a classical collection. Take a list of users and apply a foreach to it:

val userList: List[User] = ???
userList.foreach{user => doSomeSideEffect(user)}

This will apply the side-effecting function doSomeSideEffect to each element of the userList collection.

Now, let's say that we don't know all the users yet, so we cannot build a list of them. Instead, we have a stream of users, like people arriving at a coffee shop during the morning rush:

val userDStream: DStream[User] = ???
userDStream.foreachRDD{usersRDD =>
    usersRDD.foreach{user => serveCoffee(user)}
}

Note that:

  • DStream.foreachRDD gives you an RDD[User], not a single user. Going back to our coffee example, that is the collection of users who arrived during some interval of time.
  • to access single elements of the collection, we need to operate further on the RDD. In this case, I'm using usersRDD.foreach to serve coffee to each user.

To think about execution: We might have a cluster of baristas making coffee. Those are our executors. Spark Streaming takes care of making a small batch of users (or orders) and Spark will distribute the work across the baristas, so that we can parallelize the coffee making and speed up the coffee serving.
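To see this division of labor on a single machine, the coffee shop can be run as a small local program. This is only a sketch under some assumptions: the User case class, the names, and the local[2] master (two worker threads acting as two baristas) are made up for illustration, and queueStream is used as a convenient test input where each queued RDD stands for the users who arrived during one batch interval.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext, Time}
import scala.collection.mutable

// Hypothetical record type for this sketch.
final case class User(name: String)

object CoffeeShopSketch {
  // Called once per user, on whichever worker thread holds that user's partition.
  def serveCoffee(user: User): Unit =
    println(s"Serving coffee to ${user.name}")

  /** Runs the stream locally for `runForMs` milliseconds and returns the
    * number of users seen in each batch, in batch order.
    */
  def run(runForMs: Long): List[Long] = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("coffee-shop-sketch")
    val ssc = new StreamingContext(conf, Seconds(1)) // 1-second batch interval
    val sc = ssc.sparkContext

    // Two prepared batches; later intervals will produce empty RDDs.
    val arrivals = mutable.Queue[RDD[User]](
      sc.parallelize(Seq(User("ann"), User("bob"), User("cat")), numSlices = 2),
      sc.parallelize(Seq(User("dan")), numSlices = 2)
    )
    val userDStream = ssc.queueStream(arrivals, oneAtATime = true)

    val batchSizes = mutable.ListBuffer.empty[Long] // only touched on the driver

    userDStream.foreachRDD { (usersRDD: RDD[User], time: Time) =>
      // Driver side: invoked once per batch interval with that interval's RDD.
      val n = usersRDD.count()
      batchSizes.synchronized { batchSizes += n }
      println(s"Batch at $time: $n users in ${usersRDD.getNumPartitions} partitions")
      // Worker side: each partition of users is processed by one task.
      usersRDD.foreach(user => serveCoffee(user))
    }

    ssc.start()
    ssc.awaitTerminationOrTimeout(runForMs) // let a few batches run
    ssc.stop(stopSparkContext = true, stopGracefully = false)
    batchSizes.synchronized { batchSizes.toList }
  }

  def main(args: Array[String]): Unit =
    println(s"Users per batch: ${run(5000L)}")
}
```

The two-argument form of foreachRDD also passes the batch time, which makes the time-bound nature of the collection visible: the function is called once per interval, each time with a different RDD.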