I know how to find the file size in scala.But how to find a RDD/dataframe size in spark?
Scala:
object Main extends App {
val file = new java.io.File("hdfs://localhost:9000/samplefile.txt").toString()
println(file.length)
}
Spark:
val distFile = sc.textFile(file)
println(distFile.length)
but if i process it not getting file size. How to find the RDD size?
If you are simply looking to count the number of rows in the rdd
, do:
val distFile = sc.textFile(file)
println(distFile.count)
If you are interested in the bytes, you can use the SizeEstimator
:
import org.apache.spark.util.SizeEstimator
println(SizeEstimator.estimate(distFile))
https://spark.apache.org/docs/latest/api/java/org/apache/spark/util/SizeEstimator.html