Pyspark: display a spark data frame in a table format

Edamame · Aug 21, 2016

I am using PySpark to read a Parquet file, as shown below:

my_df = sqlContext.read.parquet('hdfs://myPath/myDB.db/myTable/**')

When I call my_df.take(5), it returns [Row(...)] rather than a table like the one a pandas data frame displays.

Is it possible to display the data frame in a table format, like a pandas data frame? Thanks!

Answer

eddies · Feb 23, 2017

The show method does what you're looking for: it prints the first rows of the data frame as a text table (20 rows by default).

For example, given the following dataframe of 3 rows, I can print just the first two rows like this:

df = sqlContext.createDataFrame([("foo", 1), ("bar", 2), ("baz", 3)], ('k', 'v'))
df.show(n=2)

which yields:

+---+---+
|  k|  v|
+---+---+
|foo|  1|
|bar|  2|
+---+---+
only showing top 2 rows
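
If you have already collected rows with take(), the same kind of table can be rendered by hand. The following is a minimal sketch in plain Python, not Spark's own implementation: format_table is a hypothetical helper, the minimum column width of 3 is an assumption chosen to match the output above, and plain tuples stand in for Row objects (which are tuples themselves).

```python
def format_table(columns, rows, min_width=3):
    """Render rows as a text table with right-aligned cells.

    Hypothetical helper, not part of PySpark; min_width=3 is an assumption
    chosen to match the layout of the output shown above.
    """
    cells = [[str(value) for value in row] for row in rows]
    header = [str(name) for name in columns]
    # Each column is as wide as its longest cell or header, but at least min_width.
    widths = [
        max([min_width, len(name)] + [len(row[i]) for row in cells])
        for i, name in enumerate(header)
    ]
    border = "+" + "+".join("-" * w for w in widths) + "+"

    def line(values):
        # Right-align every value within its column width.
        return "|" + "|".join(v.rjust(w) for v, w in zip(values, widths)) + "|"

    return "\n".join([border, line(header), border] + [line(r) for r in cells] + [border])


# Plain tuples stand in for the Row objects that take() returns.
rows = [("foo", 1), ("bar", 2)]
table = format_table(("k", "v"), rows)
print(table)
```

In a real session, something like format_table(my_df.columns, my_df.take(5)) should work the same way, though for everyday use show is the simpler choice.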