I am trying to read Json file using Spark v2.0.0. In case of simple data code works really well. In case of little bit complex data, when i print df.show() the data is not showing in correct way.
here is my code:
SparkSession session = SparkSession.builder().master("local").appName("jsonreader").getOrCreate();
Dataset<Row> list = session.read().json("/Users/hadoop/Desktop/sample.json");
list.show();
Here is my sample data:
{
"glossary": {
"title": "example glossary",
"GlossDiv": {
"title": "S",
"GlossList": {
"GlossEntry": {
"ID": "SGML",
"SortAs": "SGML",
"GlossTerm": "Standard Generalized Markup Language",
"Acronym": "SGML",
"Abbrev": "ISO 8879:1986",
"GlossDef": {
"para": "A meta-markup language, used to create markup languages such as DocBook.",
"GlossSeeAlso": ["GML", "XML"]
},
"GlossSee": "markup"
}
}
}
}
}
And my output is like:
+--------------------+
| _corrupt_record|
+--------------------+
| {|
| "glossary": {|
| "title": ...|
| "GlossDiv": {|
| "titl...|
| "GlossList": {|
| "...|
| ...|
| "SortAs": "S...|
| "GlossTerm":...|
| "Acronym": "...|
| "Abbrev": "I...|
| "GlossDef": {|
| ...|
| "GlossSeeAl...|
| ...|
| "GlossSee": ...|
| }|
| }|
| }|
+--------------------+
only showing top 20 rows
You will need to format the JSON to one line if you have to read this JSON. This is a multi line JSON and hence is not being read and loaded properly (One Object one Row)
Quoting the JSON API :
Loads a JSON file (one object per line) and returns the result as a DataFrame.
{"glossary":{"title":"example glossary","GlossDiv":{"title":"S","GlossList":{"GlossEntry":{"ID":"SGML","SortAs":"SGML","GlossTerm":"Standard Generalized Markup Language","Acronym":"SGML","Abbrev":"ISO 8879:1986","GlossDef":{"para":"A meta-markup language, used to create markup languages such as DocBook.","GlossSeeAlso":["GML","XML"]},"GlossSee":"markup"}}}}}
I just tried it on the shell , it should work from the code as well the same way (I had the same corrupted record error when i read a multi line JSON)
scala> val df = spark.read.json("C:/DevelopmentTools/data.json")
df: org.apache.spark.sql.DataFrame = [glossary: struct<GlossDiv: struct<GlossList: struct<GlossEntry: struct<Abbrev: string, Acronym: string ... 5 more fields>>, title: string>, title: string>]
scala>
Edits :
You can get the values out from that data frame using any action , for example
scala> df.select(df("glossary.GlossDiv.GlossList.GlossEntry.GlossTerm")).show()
+--------------------+
| GlossTerm|
+--------------------+
|Standard Generali...|
+--------------------+
scala>
You should be able to do it from your code as well