How do I delete files in an HDFS directory after reading them using Scala?

user1125829 · Jul 14, 2017 · Viewed 22.4k times

I use fileStream to read files in an HDFS directory from Spark (streaming context). If my Spark application shuts down and restarts after some time, I would like to read only the new files in the directory. I don't want to re-read old files that were already read and processed by Spark; I am trying to avoid duplicates here.

val lines = ssc.fileStream[LongWritable, Text, TextInputFormat]("/home/File")

Any code snippets to help?

Answer

Ishan Kumar · Jul 14, 2017

You can use the Hadoop FileSystem API to delete a path once you are done with it:

import org.apache.hadoop.fs.{FileSystem, Path}

// Get a handle on HDFS via the Hadoop configuration held by the
// Spark context (ssc.sparkContext when you only have a StreamingContext).
val fs = FileSystem.get(ssc.sparkContext.hadoopConfiguration)

val outputPath = new Path("/abc")

// Delete the path if it exists; the second argument requests
// recursive deletion, so this also works for non-empty directories.
if (fs.exists(outputPath))
  fs.delete(outputPath, true)
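
To tie this into the streaming job from the question, one option is to clear the input directory after each batch has been processed. The following is a minimal sketch, assuming the /home/File directory from the question; processBatch is a hypothetical placeholder for your own batch logic:

import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

val inputDir = new Path("/home/File")
val lines = ssc.fileStream[LongWritable, Text, TextInputFormat](inputDir.toString)

lines.foreachRDD { rdd =>
  // Run your own processing first (processBatch is hypothetical).
  processBatch(rdd)

  // Then remove everything currently sitting in the input directory.
  // Caveat: a file that lands while the batch is still running could be
  // deleted before Spark ever reads it.
  val fs = FileSystem.get(rdd.sparkContext.hadoopConfiguration)
  fs.listStatus(inputDir).foreach(status => fs.delete(status.getPath, false))
}

If you want the option of replaying data, moving processed files to an archive directory with fs.rename is a safer alternative to deleting them outright.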