How to drop malformed rows while reading csv with schema Spark?

HouZhe · Apr 9, 2018 · Viewed 9.4k times

I am using a Spark Dataset to load a CSV file, and I prefer to specify the schema explicitly. However, a few rows do not comply with my schema: a column should be double, but some rows hold non-numeric values. Is there an easy way to filter all non-compliant rows out of the Dataset?

import org.apache.spark.sql.types.{DataTypes, StructField, StructType}

val schema = StructType(StructField("col", DataTypes.DoubleType) :: Nil)
val ds = spark.read.format("csv").option("delimiter", "\t").schema(schema).load("f.csv")

f.csv:

a
1.0

I would like rows such as "a" to be filtered out of my Dataset easily. Thanks!

Answer

koiralo · Apr 9, 2018

If you are reading a CSV file and want to drop the rows that do not match the schema, you can do this by setting the mode option to DROPMALFORMED.

Input data

a,1.0
b,2.2
c,xyz
d,4.5
e,asfsdfsdf
f,3.1

Schema

import org.apache.spark.sql.types.{DoubleType, StringType, StructField, StructType}

val schema = StructType(Seq(
  StructField("key", StringType, false),
  StructField("value", DoubleType, false)
))

Read the CSV file with the schema and the mode option:

val df = spark.read.schema(schema)
  .option("mode", "DROPMALFORMED")
  .csv("/path/to/csv/file")

Output (the malformed rows c and e are dropped):

+---+-----+
|key|value|
+---+-----+
|a  |1.0  |
|b  |2.2  |
|d  |4.5  |
|f  |3.1  |
+---+-----+
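
Applied to the question's tab-delimited single-column file, the same option drops the non-numeric row. A minimal sketch (the SparkSession setup is assumed; the schema and file f.csv are from the question):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{DataTypes, StructField, StructType}

val spark = SparkSession.builder().appName("drop-malformed").getOrCreate()

// Schema from the question: a single double column.
val schema = StructType(StructField("col", DataTypes.DoubleType) :: Nil)

// DROPMALFORMED silently discards any row that cannot be parsed
// against the schema, so the line "a" is removed and "1.0" is kept.
val ds = spark.read
  .option("delimiter", "\t")
  .option("mode", "DROPMALFORMED")
  .schema(schema)
  .csv("f.csv")

ds.show()
// +---+
// |col|
// +---+
// |1.0|
// +---+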

You can find more details on these options in the spark-csv documentation.
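
If you would rather inspect the bad rows than drop them silently, an alternative sketch uses the default PERMISSIVE mode, which parses unconvertible fields as null; the nulls can then be filtered or dropped. (Note this is an assumption-labeled alternative, not the answer's method: na.drop() also removes rows with genuinely missing values, Spark does not enforce nullable = false on read, and the path below is hypothetical.)

import org.apache.spark.sql.functions.col

// PERMISSIVE (the default mode) keeps every row and sets fields
// that fail to parse against the schema to null.
val permissive = spark.read
  .schema(schema)
  .option("mode", "PERMISSIVE")
  .csv("/path/to/csv/file")  // hypothetical path

permissive.filter(col("value").isNull).show()  // inspect the bad rows
val clean = permissive.na.drop()               // or drop them explicitly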

Hope this helps!