printSchema() in Apache Spark

rushikesh jachak · Apr 30, 2018 · Viewed 43.7k times
Dataset<Tweet> ds = sc.read().json("/path").as(Encoders.bean(Tweet.class));



Tweet class:
long id;
String user;
String text;


ds.printSchema();

Output:

root
  |-- id: string (nullable = true)
  |-- text: string (nullable = true)  
  |-- user: string (nullable = true)

The JSON file stores every field as a string.

My question: I read the input and encode it as Tweet.class. The type declared for id in the class is long, but the printed schema shows it as string.

Does printSchema() report the schema according to how the file is read, or according to the encoder applied (here Tweet.class)?

Answer

ROOT · Apr 30, 2018

I don't know the exact reason why your code is not working, but if you want to change the field type you can write your own custom schema.

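import org.apache.spark.sql.types._

// Explicit schema: id is read as a long, text and user as strings
val schema = StructType(List(
  StructField("id", LongType, nullable = true),
  StructField("text", StringType, nullable = true),
  StructField("user", StringType, nullable = true)
))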
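
The schema above is written in Scala, while the code in the question is Java. A rough Java equivalent could look like the sketch below. The class name TweetSchema and the method name build are hypothetical, chosen only for illustration; the field names and types are the ones from the Scala snippet.

```java
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

// Hypothetical helper class; the name is an assumption made for this sketch.
public class TweetSchema {

    // Builds the same three-field schema as the Scala snippet:
    // id as a long, text and user as strings, all nullable.
    public static StructType build() {
        return DataTypes.createStructType(new StructField[] {
            DataTypes.createStructField("id", DataTypes.LongType, true),
            DataTypes.createStructField("text", DataTypes.StringType, true),
            DataTypes.createStructField("user", DataTypes.StringType, true)
        });
    }
}
```

The result of TweetSchema.build() can then be passed to read().schema(...) in the Java code.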

You can apply the schema when reading the data as follows:

Dataset<Tweet> ds = sc.read().schema(schema).json("/path").as(Encoders.bean(Tweet.class));

ds.printSchema();
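
On the original question: as far as the Spark documentation for Dataset.as describes it, as() only changes the typed view of the data and does not rewrite the columns, so printSchema() reports the schema of the data as it was read (here, inferred from the JSON), not the field types of Tweet.class. If the JSON really stores id as a quoted string, another option is to cast the column after reading instead of supplying a schema. The following is a minimal sketch under that assumption; the class name TweetCastSketch and the method readTweets are hypothetical, and the Tweet bean is a guess at the class from the question with getters and setters added, since Encoders.bean needs them.

```java
import java.io.Serializable;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.functions;
import org.apache.spark.sql.types.DataTypes;

// Hypothetical wrapper class for this sketch.
public class TweetCastSketch {

    // Assumed shape of the Tweet bean from the question.
    public static class Tweet implements Serializable {
        private long id;
        private String user;
        private String text;

        public long getId() { return id; }
        public void setId(long id) { this.id = id; }
        public String getUser() { return user; }
        public void setUser(String user) { this.user = user; }
        public String getText() { return text; }
        public void setText(String text) { this.text = text; }
    }

    // Reads the JSON with the inferred (all-string) schema, casts id to long,
    // and only then applies the bean encoder.
    public static Dataset<Tweet> readTweets(SparkSession spark, String path) {
        Dataset<Row> raw = spark.read().json(path);
        Dataset<Row> typed =
            raw.withColumn("id", functions.col("id").cast(DataTypes.LongType));
        return typed.as(Encoders.bean(Tweet.class));
    }
}
```

Calling printSchema() on the returned Dataset should then show id as long, because the cast changes the underlying column rather than just the typed view.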