How do I skip a header from CSV files in Spark?

Hafiz Mujadid · Jan 9, 2015 · Viewed 117.9k times

Suppose I pass three file paths to a Spark context to read, and each file has a header row (the schema) as its first line. How can we skip these header lines?

val rdd = sc.textFile("file1,file2,file3")

Now, how can we skip the header lines in this RDD?

Answer

Jimmy · Jul 3, 2015
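# PySpark; in Scala the predicate would be written as row => row != header
data = sc.textFile('path_to_data')
header = data.first()  # extract the header (the first line of the RDD)
data = data.filter(lambda row: row != header)  # keep every row that is not the header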
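
One caveat for the multi-file case in the question: the filter compares every row with the first line of the first file, so it removes all the headers only if the files share an identical header line, and it also drops any data row that happens to equal the header. If the headers differ, one option is to strip the first line of each file separately. Below is a minimal sketch of that idea, assuming PySpark and an already created SparkContext passed in as `sc`; `skip_header` and `read_without_headers` are hypothetical helper names, not part of Spark.

```python
from itertools import islice


def skip_header(index, lines):
    """Drop the first line of partition 0; pass other partitions through.

    Assumes the RDD was built from ONE file, so that partition 0
    begins with that file's header line.
    """
    return islice(lines, 1, None) if index == 0 else lines


def read_without_headers(sc, paths):
    """Read each path on its own, strip its header, and union the results.

    `sc` is an existing SparkContext; `paths` is a list of single-file paths
    (an assumption: directories or globs would break the partition-0 trick).
    """
    rdds = [sc.textFile(p).mapPartitionsWithIndex(skip_header) for p in paths]
    return sc.union(rdds)
```

With a running context this could be called as `rdd = read_without_headers(sc, ["file1", "file2", "file3"])`. On Spark 2.0 or later, the DataFrame reader can handle headers itself via `spark.read.csv(paths, header=True)`, though that returns a DataFrame rather than an RDD.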