how to filter out a null value from spark dataframe

Steven Li picture Steven Li · Sep 27, 2016 · Viewed 181.4k times · Source

I created a dataframe in spark with the following schema:

root
 |-- user_id: long (nullable = false)
 |-- event_id: long (nullable = false)
 |-- invited: integer (nullable = false)
 |-- day_diff: long (nullable = true)
 |-- interested: integer (nullable = false)
 |-- event_owner: long (nullable = false)
 |-- friend_id: long (nullable = false)

And the data is shown below:

+----------+----------+-------+--------+----------+-----------+---------+
|   user_id|  event_id|invited|day_diff|interested|event_owner|friend_id|
+----------+----------+-------+--------+----------+-----------+---------+
|   4236494| 110357109|      0|      -1|         0|  937597069|     null|
|  78065188| 498404626|      0|       0|         0| 2904922087|     null|
| 282487230|2520855981|      0|      28|         0| 3749735525|     null|
| 335269852|1641491432|      0|       2|         0| 1490350911|     null|
| 437050836|1238456614|      0|       2|         0|  991277599|     null|
| 447244169|2095085551|      0|      -1|         0| 1579858878|     null|
| 516353916|1076364848|      0|       3|         1| 3597645735|     null|
| 528218683|1151525474|      0|       1|         0| 3433080956|     null|
| 531967718|3632072502|      0|       1|         0| 3863085861|     null|
| 627948360|2823119321|      0|       0|         0| 4092665803|     null|
| 811791433|3513954032|      0|       2|         0|  415464198|     null|
| 830686203|  99027353|      0|       0|         0| 3549822604|     null|
|1008893291|1115453150|      0|       2|         0| 2245155244|     null|
|1239364869|2824096896|      0|       2|         1| 2579294650|     null|
|1287950172|1076364848|      0|       0|         0| 3597645735|     null|
|1345896548|2658555390|      0|       1|         0| 2025118823|     null|
|1354205322|2564682277|      0|       3|         0| 2563033185|     null|
|1408344828|1255629030|      0|      -1|         1|  804901063|     null|
|1452633375|1334001859|      0|       4|         0| 1488588320|     null|
|1625052108|3297535757|      0|       3|         0| 1972598895|     null|
+----------+----------+-------+--------+----------+-----------+---------+

I want to filter out the rows have null values in the field of "friend_id".

scala> val aaa = test.filter("friend_id is null")

scala> aaa.count

I got :res52: Long = 0 which is obvious not right. What is the right way to get it?

One more question, I want to replace the values in the friend_id field. I want to replace null with 0 and 1 for any other value except null. The code I can figure out is:

val aaa = train_friend_join.select($"user_id", $"event_id", $"invited", $"day_diff", $"interested", $"event_owner", ($"friend_id" != null)?1:0)

This code also doesn't work. Can anyone tell me how can I fix it? Thanks

Answer

Sachin Tyagi picture Sachin Tyagi · Sep 27, 2016

Let's say you have this data setup (so that results are reproducible):

// declaring data types
case class Company(cName: String, cId: String, details: String)
case class Employee(name: String, id: String, email: String, company: Company)

// setting up example data
val e1 = Employee("n1", null, "[email protected]", Company("c1", "1", "d1"))
val e2 = Employee("n2", "2", "[email protected]", Company("c1", "1", "d1"))
val e3 = Employee("n3", "3", "[email protected]", Company("c1", "1", "d1"))
val e4 = Employee("n4", "4", "[email protected]", Company("c2", "2", "d2"))
val e5 = Employee("n5", null, "[email protected]", Company("c2", "2", "d2"))
val e6 = Employee("n6", "6", "[email protected]", Company("c2", "2", "d2"))
val e7 = Employee("n7", "7", "[email protected]", Company("c3", "3", "d3"))
val e8 = Employee("n8", "8", "[email protected]", Company("c3", "3", "d3"))
val employees = Seq(e1, e2, e3, e4, e5, e6, e7, e8)
val df = sc.parallelize(employees).toDF

Data is:

+----+----+---------+---------+
|name|  id|    email|  company|
+----+----+---------+---------+
|  n1|null|[email protected]|[c1,1,d1]|
|  n2|   2|[email protected]|[c1,1,d1]|
|  n3|   3|[email protected]|[c1,1,d1]|
|  n4|   4|[email protected]|[c2,2,d2]|
|  n5|null|[email protected]|[c2,2,d2]|
|  n6|   6|[email protected]|[c2,2,d2]|
|  n7|   7|[email protected]|[c3,3,d3]|
|  n8|   8|[email protected]|[c3,3,d3]|
+----+----+---------+---------+

Now to filter employees with null ids, you will do --

df.filter("id is null").show

which will correctly show you following:

+----+----+---------+---------+
|name|  id|    email|  company|
+----+----+---------+---------+
|  n1|null|[email protected]|[c1,1,d1]|
|  n5|null|[email protected]|[c2,2,d2]|
+----+----+---------+---------+

Coming to the second part of your question, you can replace the null ids with 0 and other values with 1 with this --

df.withColumn("id", when($"id".isNull, 0).otherwise(1)).show

This results in:

+----+---+---------+---------+
|name| id|    email|  company|
+----+---+---------+---------+
|  n1|  0|[email protected]|[c1,1,d1]|
|  n2|  1|[email protected]|[c1,1,d1]|
|  n3|  1|[email protected]|[c1,1,d1]|
|  n4|  1|[email protected]|[c2,2,d2]|
|  n5|  0|[email protected]|[c2,2,d2]|
|  n6|  1|[email protected]|[c2,2,d2]|
|  n7|  1|[email protected]|[c3,3,d3]|
|  n8|  1|[email protected]|[c3,3,d3]|
+----+---+---------+---------+