Spark SQL: How to consume json data from a REST service as DataFrame

Kiran · May 9, 2016 · Viewed 16.6k times · Source

I need to read some JSON data from a web service that provides REST interfaces, and query that data from my Spark SQL code for analysis. I am able to read a JSON file stored in the blob store and use it.

I was wondering what the best way is to read the data from a REST service and use it like any other DataFrame.

BTW, I am using Spark 1.6 on a Linux cluster on HDInsight, if that helps. I would also appreciate it if someone could share code snippets for this, as I am still very new to the Spark environment.

Answer

aggFTW · May 10, 2016

On Spark 1.6:

If you are on Python, use the requests library to fetch the data and then just create an RDD from it. There should be a similar HTTP library for Scala (relevant thread). Then just do:

json_str = '{"executorCores": 2, "kind": "pyspark", "driverMemory": 1000}'
rdd = sc.parallelize([json_str])    # RDD with one JSON string per element
json_df = sqlContext.jsonRDD(rdd)   # deprecated in 1.6; sqlContext.read.json(rdd) also works
json_df.show()                      # inspect the resulting DataFrame
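Putting the pieces together, a minimal sketch of the REST-to-DataFrame flow might look like this. The URL and the shape of the response are assumptions for illustration; the helper names (`to_json_lines`, `fetch_json_lines`) are hypothetical, not part of any Spark API:

```python
import json

def to_json_lines(payload):
    # Normalize a decoded JSON payload (a dict or a list of dicts) into
    # one JSON string per record, the shape Spark's JSON reader expects.
    records = payload if isinstance(payload, list) else [payload]
    return [json.dumps(r) for r in records]

def fetch_json_lines(url):
    # Hypothetical REST call; requests must be installed on the driver.
    import requests
    resp = requests.get(url)
    resp.raise_for_status()
    return to_json_lines(resp.json())

# Inside a pyspark session on Spark 1.6 (sc and sqlContext already exist):
# rdd = sc.parallelize(fetch_json_lines("http://example.com/api/people"))
# json_df = sqlContext.read.json(rdd)  # read.json accepts an RDD of JSON strings
# json_df.printSchema()
```

Note that the fetch happens once on the driver; this pattern suits small-to-moderate payloads, not streaming a large API through the cluster.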

The equivalent in Scala:

val anotherPeopleRDD = sc.parallelize(
  """{"name":"Yin","address":{"city":"Columbus","state":"Ohio"}}""" :: Nil)
val anotherPeople = sqlContext.read.json(anotherPeopleRDD)

This is from: http://spark.apache.org/docs/latest/sql-programming-guide.html#json-datasets