How do I read a Parquet file in PySpark that was written from Spark?

Ross Lewis · Mar 24, 2017 · Viewed 74k times

I am using two Jupyter notebooks to do different things in an analysis. In my Scala notebook, I write some of my cleaned data to parquet:

partitionedDF.select("noStopWords","lowerText","prediction").write.save("swift2d://xxxx.keystone/commentClusters.parquet")
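For comparison, the same write expressed in PySpark would look roughly like the sketch below (partitionedDF is the DataFrame from the notebook and the path is the placeholder above); it names the Parquet format explicitly instead of relying on the session's default format:

# hedged sketch: PySpark equivalent of the Scala write, with the format made explicit
(partitionedDF
    .select("noStopWords", "lowerText", "prediction")
    .write
    .parquet("swift2d://xxxx.keystone/commentClusters.parquet"))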

I then go to my Python notebook to read in the data:

df = spark.read.load("swift2d://xxxx.keystone/commentClusters.parquet")

and I get the following error:

AnalysisException: u'Unable to infer schema for ParquetFormat at swift2d://RedditTextAnalysis.keystone/commentClusters.parquet. It must be specified manually;'

I have looked at the Spark documentation and I don't think I should be required to specify a schema. Has anyone run into something like this? Should I be doing something else when I save/load? The data is landing in Object Storage.

edit: I'm using Spark 2.0 for both the read and the write.

edit2: This was done in a project in Data Science Experience.

Answer

Jeril · Nov 9, 2017

I read the Parquet file in the following way:

from pyspark.sql import SparkSession
# initialise a SparkSession (the SparkContext is obtained from it below)
spark = SparkSession.builder \
    .master('local') \
    .appName('myAppName') \
    .config('spark.executor.memory', '5gb') \
    .config("spark.cores.max", "6") \
    .getOrCreate()

sc = spark.sparkContext

# wrap the SparkContext in an SQLContext (the pre-2.0 entry point)
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)

# read the Parquet file into a DataFrame
df = sqlContext.read.parquet('path-to-file/commentClusters.parquet')
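
Since both notebooks run Spark 2.0, the SQLContext detour isn't strictly needed; the same read works directly through the SparkSession. A minimal sketch, reusing the placeholder path from the question:

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName('myAppName') \
    .getOrCreate()

# Spark 2.x: the SparkSession exposes the DataFrameReader directly
df = spark.read.parquet('swift2d://xxxx.keystone/commentClusters.parquet')
df.printSchema()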