I have a big distributed file on HDFS and each time I use sqlContext with spark-csv package, it first loads the entire file which takes quite some time.
df = sqlContext.read.format('com.databricks.spark.csv').options(header='true', inferschema='true').load("file_path")
now as I just want to do some quick check at times, all I need is few/ any n rows of the entire file.
df_n = sqlContext.read.format('com.databricks.spark.csv').options(header='true', inferschema='true').load("file_path").take(n)
df_n = sqlContext.read.format('com.databricks.spark.csv').options(header='true', inferschema='true').load("file_path").head(n)
but all these run after the file load is done. Can't I just restrict the number of rows while reading the file itself ? I am referring to n_rows equivalent of pandas in spark-csv, like:
pd_df = pandas.read_csv("file_path", nrows=20)
Or it might be the case that spark does not actually load the file, the first step, but in this case, why is my file load step taking too much time then?
I want
df.count()
to give me only n
and not all rows, is it possible ?
You can use limit(n)
.
sqlContext.format('com.databricks.spark.csv') \
.options(header='true', inferschema='true').load("file_path").limit(20)
This will just load 20 rows.