I am using two Jupyter notebooks to do different things in an analysis. In my Scala notebook, I write some of my cleaned data to parquet:
partitionedDF.select("noStopWords","lowerText","prediction").write.save("swift2d://xxxx.keystone/commentClusters.parquet")
I then go to my Python notebook to read in the data:
df = spark.read.load("swift2d://xxxx.keystone/commentClusters.parquet")
and I get the following error:
AnalysisException: u'Unable to infer schema for ParquetFormat at swift2d://RedditTextAnalysis.keystone/commentClusters.parquet. It must be specified manually;'
I have looked at the Spark documentation and I don't think I should be required to specify a schema. Has anyone run into something like this? Should I be doing something else when I save/load? The data is landing in Object Storage.
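For what it's worth, I think manually specifying the schema at read time would look roughly like the sketch below; the column names come from the select above, but the types are just my guesses:
from pyspark.sql.types import StructType, StructField, ArrayType, StringType, IntegerType
# guessed types for the three columns written out above
schema = StructType([
    StructField("noStopWords", ArrayType(StringType())),
    StructField("lowerText", StringType()),
    StructField("prediction", IntegerType()),
])
df = spark.read.schema(schema).load("swift2d://xxxx.keystone/commentClusters.parquet")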
Edit: I'm using Spark 2.0 for both the read and the write.
Edit 2: This was done in a project in Data Science Experience.
I read the parquet file in the following way:
from pyspark.sql import SparkSession

# initialise the SparkSession (this also gives us a SparkContext)
spark = SparkSession.builder \
    .master('local') \
    .appName('myAppName') \
    .config('spark.executor.memory', '5gb') \
    .config("spark.cores.max", "6") \
    .getOrCreate()
sc = spark.sparkContext

# use an SQLContext to read the parquet file
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)

# read the parquet file into a DataFrame
df = sqlContext.read.parquet('path-to-file/commentClusters.parquet')
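Note that with Spark 2.x the SparkSession can read parquet directly, so the SQLContext step is optional; the equivalent one-liner would be:
# read parquet straight from the SparkSession (Spark 2.x)
df = spark.read.parquet('path-to-file/commentClusters.parquet')
Similarly, df.write.parquet('path-to-file/commentClusters.parquet') saves a DataFrame as parquet on the writing side.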