I use fileStream to read files from an HDFS directory in Spark (streaming context). If my Spark application shuts down and starts again after some time, I would like to read only the new files in the directory. I don't want to read old files that were already read and processed by Spark. I am trying to avoid duplicates here.
val lines = ssc.fileStream[LongWritable, Text, TextInputFormat]("/home/File")
Any code snippets to help?
You can use the FileSystem API:
import org.apache.hadoop.fs.{FileSystem, Path}

// sc is the SparkContext (use ssc.sparkContext if you only have the StreamingContext)
val fs = FileSystem.get(sc.hadoopConfiguration)
val outPutPath = new Path("/abc")

// Delete the path recursively if it already exists
if (fs.exists(outPutPath))
  fs.delete(outPutPath, true)
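
To show how this could fit together with the file stream from the question, here is a minimal sketch that also enables checkpointing, which is the standard way to let Spark Streaming recover its progress (including which recent files it has already seen) after a restart. The checkpoint directory, batch interval, app name, and the print() action are assumptions for illustration, not from the original post:

import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Assumed checkpoint location; adjust to your cluster
val checkpointDir = "/home/checkpoint"

def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("FileStreamExample")
  val ssc = new StreamingContext(conf, Seconds(30))
  ssc.checkpoint(checkpointDir)

  // newFilesOnly is true by default, so files already present when the
  // stream starts fresh are ignored
  val lines = ssc.fileStream[LongWritable, Text, TextInputFormat]("/home/File")
  lines.map(_._2.toString).print()
  ssc
}

// A clean start builds a new context; after a restart the context is
// recovered from the checkpoint, so already-processed files are not re-read
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)

// Clear the output path before starting, as in the snippet above
val fs = FileSystem.get(ssc.sparkContext.hadoopConfiguration)
val outPutPath = new Path("/abc")
if (fs.exists(outPutPath))
  fs.delete(outPutPath, true)

ssc.start()
ssc.awaitTermination()

Note that code outside createContext (such as the FileSystem cleanup) runs on every launch, whether the context is created fresh or recovered from the checkpoint.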