Removing NULL , NAN, empty space from PySpark DataFrame

Question 1

Removing NULL , NAN, empty space from PySpark DataFrame

apache-spark pyspark apache-spark-1.6

Sumit · Jan 24, 2018 · Viewed 9.4k times · Source

Answer

Answer

NaN (not a number) has different meaning that NULL and empty string is just a normal value (can be converted to NULL automatically with csv reader) so na.drop won't match these.

You can convert all to null and drop

from pyspark.sql.functions import col, isnan, when, trim

df = spark.createDataFrame([
    ("", 1, 2.0), ("foo", None, 3.0), ("bar", 1, float("NaN")), 
    ("good", 42, 42.0)])

def to_null(c):
    return when(~(col(c).isNull() | isnan(col(c)) | (trim(col(c)) == "")), col(c))


df.select([to_null(c).alias(c) for c in df.columns]).na.drop().show()
# +----+---+----+
# |  _1| _2|  _3|
# +----+---+----+
# |good| 42|42.0|
# +----+---+----+

Question 2

I have a dataframe in PySpark which contains empty space, Null, and Nan. I want to remove rows which have any of those. I tried below commands, but, nothing seems to work.

myDF.na.drop().show()
myDF.na.drop(how='any').show()

Below is the dataframe:

+---+----------+----------+-----+-----+
|age|  category|      date|empId| name|
+---+----------+----------+-----+-----+
| 25|electronic|17-01-2018|  101|  abc|
| 24|    sports|16-01-2018|  102|  def|
| 23|electronic|17-01-2018|  103|  hhh|
| 23|electronic|16-01-2018|  104|  yyy|
| 29|       men|12-01-2018|  105| ajay|
| 31|      kids|17-01-2018|  106|vijay|
|   |       Men|       nan|  107|Sumit|
+---+----------+----------+-----+-----+

What am I missing? What is the best way to tackle NULL, Nan or empty spaces so that there is no problem in the actual calculation?

Removing NULL , NAN, empty space from PySpark DataFrame

Answer

Related questions