I have a dataframe in PySpark which contains empty space, Null, and Nan. I want to remove rows which have any of those. I tried below commands, but, nothing seems to work.
myDF.na.drop().show()
myDF.na.drop(how='any').show()
Below is the dataframe:
+---+----------+----------+-----+-----+
|age| category| date|empId| name|
+---+----------+----------+-----+-----+
| 25|electronic|17-01-2018| 101| abc|
| 24| sports|16-01-2018| 102| def|
| 23|electronic|17-01-2018| 103| hhh|
| 23|electronic|16-01-2018| 104| yyy|
| 29| men|12-01-2018| 105| ajay|
| 31| kids|17-01-2018| 106|vijay|
| | Men| nan| 107|Sumit|
+---+----------+----------+-----+-----+
What am I missing? What is the best way to tackle NULL, Nan or empty spaces so that there is no problem in the actual calculation?
NaN
(not a number) has different meaning that NULL
and empty string is just a normal value (can be converted to NULL automatically with csv reader) so na.drop
won't match these.
You can convert all to null and drop
from pyspark.sql.functions import col, isnan, when, trim
df = spark.createDataFrame([
("", 1, 2.0), ("foo", None, 3.0), ("bar", 1, float("NaN")),
("good", 42, 42.0)])
def to_null(c):
return when(~(col(c).isNull() | isnan(col(c)) | (trim(col(c)) == "")), col(c))
df.select([to_null(c).alias(c) for c in df.columns]).na.drop().show()
# +----+---+----+
# | _1| _2| _3|
# +----+---+----+
# |good| 42|42.0|
# +----+---+----+