Removing NULL, NaN, and empty strings from a PySpark DataFrame

Sumit · Jan 24, 2018

I have a DataFrame in PySpark that contains empty strings, NULLs, and NaNs. I want to remove any row that contains one of these. I tried the commands below, but nothing seems to work.

myDF.na.drop().show()
myDF.na.drop(how='any').show()

Below is the DataFrame:

+---+----------+----------+-----+-----+
|age|  category|      date|empId| name|
+---+----------+----------+-----+-----+
| 25|electronic|17-01-2018|  101|  abc|
| 24|    sports|16-01-2018|  102|  def|
| 23|electronic|17-01-2018|  103|  hhh|
| 23|electronic|16-01-2018|  104|  yyy|
| 29|       men|12-01-2018|  105| ajay|
| 31|      kids|17-01-2018|  106|vijay|
|   |       Men|       nan|  107|Sumit|
+---+----------+----------+-----+-----+

What am I missing? What is the best way to handle NULL, NaN, or empty strings so that they don't cause problems in later calculations?

Answer

Alper t. Turker · Jan 24, 2018

NaN (not a number) has a different meaning than NULL, and an empty string is just a normal value (a CSV reader can convert it to NULL automatically), so na.drop won't match these.
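
You can see this on a tiny all-string frame (a hypothetical minimal reproduction, not from the question): na.drop() keeps the row with the empty string because nothing in it is an actual NULL.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Every value is a string, so there is no real NULL for na.drop() to find.
demo = spark.createDataFrame([("25", "abc"), ("", "Sumit")], ["age", "name"])
demo.na.drop(how="any").show()
# +---+-----+
# |age| name|
# +---+-----+
# | 25|  abc|
# |   |Sumit|
# +---+-----+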

You can convert all of them to NULL and then drop:

from pyspark.sql.functions import col, isnan, when, trim

df = spark.createDataFrame([
    ("", 1, 2.0), ("foo", None, 3.0), ("bar", 1, float("NaN")),
    ("good", 42, 42.0)])

def to_null(c):
    # Keep the value only when it is not NULL, not NaN, and not blank;
    # otherwise when() falls through and yields NULL.
    return when(~(col(c).isNull() | isnan(col(c)) | (trim(col(c)) == "")), col(c))

df.select([to_null(c).alias(c) for c in df.columns]).na.drop().show()
# +----+---+----+
# |  _1| _2|  _3|
# +----+---+----+
# |good| 42|42.0|
# +----+---+----+
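
Applied to the question's DataFrame, where every column is a string and "nan" appears as a literal string rather than a real NaN, a variant that also matches that literal might look like this (a sketch, assuming the column names from the question):

from pyspark.sql.functions import col, lower, trim, when

def to_null_str(c):
    # Treat NULL, blank/whitespace-only strings, and the literal string
    # "nan" (case-insensitive) as missing.
    return when(
        ~(col(c).isNull() | (trim(col(c)) == "") | (lower(trim(col(c))) == "nan")),
        col(c),
    )

myDF.select([to_null_str(c).alias(c) for c in myDF.columns]).na.drop().show()

This would drop the last row of the question's DataFrame, since its age is blank and its date is the string "nan".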