I'm still relatively new to PySpark (I use version 2.1.0) and I'm trying to clean some data in a much larger data set. I've successfully used several techniques, such as dropDuplicates, along with subsets and SQL functions (distinct, count, etc.).
I then came across dropna, which I thought might simplify matters, but I don't understand why rows 3 and 6 are still present after using it. For example:
df = spark.createDataFrame([(1, 'Peter', 1.79, 28, 'M', 'Tiler'),
                            (2, 'Fritz', 1.78, 45, 'M', None),
                            (3, 'Florence', 1.75, None, None, None),
                            (4, 'Nicola', 1.6, 33, 'F', 'Dancer'),
                            (5, 'Gregory', 1.8, 54, 'M', 'Teacher'),
                            (6, 'Steven', 1.82, None, 'M', None),
                            (7, 'Dagmar', 1.7, 42, 'F', 'Nurse')],
                           ['id', 'Name', 'Height', 'Age', 'Gender', 'Occupation'])
df.show()
df.dropna(thresh=2)
df.show()
Output:
+---+--------+------+----+------+----------+
| id| Name|Height| Age|Gender|Occupation|
+---+--------+------+----+------+----------+
| 1| Peter| 1.79| 28| M| Tiler|
| 2| Fritz| 1.78| 45| M| null|
| 3|Florence| 1.75|null| null| null|
| 4| Nicola| 1.6| 33| F| Dancer|
| 5| Gregory| 1.8| 54| M| Teacher|
| 6| Steven| 1.82|null| M| null|
| 7| Dagmar| 1.7| 42| F| Nurse|
+---+--------+------+----+------+----------+
+---+--------+------+----+------+----------+
| id| Name|Height| Age|Gender|Occupation|
+---+--------+------+----+------+----------+
| 1| Peter| 1.79| 28| M| Tiler|
| 2| Fritz| 1.78| 45| M| null|
| 3|Florence| 1.75|null| null| null|
| 4| Nicola| 1.6| 33| F| Dancer|
| 5| Gregory| 1.8| 54| M| Teacher|
| 6| Steven| 1.82|null| M| null|
| 7| Dagmar| 1.7| 42| F| Nurse|
+---+--------+------+----+------+----------+
Could somebody suggest why the rows aren't removed?
The PySpark test examples show counts that match the usage I assumed:
# threshold
self.assertEqual(self.spark.createDataFrame(
    [(u'Alice', None, 80.1)], schema).dropna(thresh=2).count(),
    1)
self.assertEqual(self.spark.createDataFrame(
    [(u'Alice', None, None)], schema).dropna(thresh=2).count(),
    0)
First, dropna returns a new DataFrame (it does not modify df in place), so assign the result to a new variable. Second, specify a subset to control which columns are checked for null values:
df2 = df.dropna(thresh=2, subset=('Age', 'Gender', 'Occupation'))
df2.show()
Output:
+---+-------+------+---+------+----------+
| id| Name|Height|Age|Gender|Occupation|
+---+-------+------+---+------+----------+
| 1| Peter| 1.79| 28| M| Tiler|
| 2| Fritz| 1.78| 45| M| null|
| 4| Nicola| 1.6| 33| F| Dancer|
| 5|Gregory| 1.8| 54| M| Teacher|
| 7| Dagmar| 1.7| 42| F| Nurse|
+---+-------+------+---+------+----------+
Edit: by the way, thresh=2
alone doesn't work because thresh means "drop rows that have fewer than thresh (i.e. 2 in this case) non-null values". But row 3 has three non-null values (id, Name, Height) and row 6 has four, so neither falls below the thresh=2
criterion. You can try thresh=5 instead.
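For example, applied to the question's df (a quick sketch; df3 is just an illustrative name), thresh=5 counts non-null values across all six columns, so rows 3 and 6, with only three and four non-nulls respectively, fall below the threshold and are dropped:

df3 = df.dropna(thresh=5)  # keep only rows with at least 5 non-null values
df3.show()

This yields the same five rows (1, 2, 4, 5 and 7) as the subset approach above.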