I have a dataframe in PySpark. Some of its numerical columns contain 'nan', so when I read the data and check the schema, those columns come out as string type. How can I change them to int type? I replaced the 'nan' values with 0 and checked the schema again, but it still shows string type for those columns. I am using the following code:
data_df = sqlContext.read.format("csv").load('data.csv', header=True, inferSchema="true")
data_df.printSchema()
data_df = data_df.fillna(0)
data_df.printSchema()
Here the columns 'Plays' and 'drafts' contain integer values, but because of the 'nan' entries in these columns they are inferred as string type.
from pyspark.sql.types import IntegerType

# Cast each string column to integer; values that cannot be parsed
# (such as the literal string 'nan') become null.
data_df = data_df.withColumn("Plays", data_df["Plays"].cast(IntegerType()))
data_df = data_df.withColumn("drafts", data_df["drafts"].cast(IntegerType()))
You could run a loop over the columns, but this is the simplest way to convert a string column to integer type.
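If you do want the loop version, here is a minimal sketch, assuming you know which columns should be integers (int_cols is a hypothetical list holding those names):

from pyspark.sql.types import IntegerType

# Hypothetical list of string columns that should be integers
int_cols = ["Plays", "drafts"]
for c in int_cols:
    data_df = data_df.withColumn(c, data_df[c].cast(IntegerType()))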