Filter df when value matches part of a string in pyspark

python apache-spark pyspark apache-spark-sql

gaatjeniksaan · Jan 27, 2017 · Viewed 98k times · Source

I have a large pyspark.sql.dataframe.DataFrame and I want to keep (i.e. filter for) all rows where the URL stored in the location column contains a predetermined string, e.g. 'google.com'.

I have tried:

import pyspark.sql.functions as sf
df.filter(sf.col('location').contains('google.com')).show(5)

but this throws a

TypeError: 'Column' object is not callable

How can I filter my df properly? Many thanks in advance!

Answer

Spark 2.2 onwards

df.filter(df.location.contains('google.com'))
Spark 2.2 documentation link
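
For completeness, here is a minimal sketch of the contains approach wrapped in a small helper. The names rows_containing and demo, the sample rows, and the local[1] session are illustrative assumptions, not part of the original question. As far as I can tell, on Spark versions before 2.2 Column has no contains method, so the attribute lookup resolves to a nested-field Column instead, which would explain the "'Column' object is not callable" error in the question.

```python
def rows_containing(df, column, needle):
    """Keep rows of `df` whose `column` contains `needle` as a literal substring.

    Relies on Column.contains, so it assumes Spark 2.2 or later.
    """
    return df.filter(df[column].contains(needle))


def demo():
    # Hypothetical demo: needs pyspark and a local Java runtime.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder.master("local[1]")
        .appName("contains-demo")
        .getOrCreate()
    )
    # Made-up sample data, only for illustration.
    df = spark.createDataFrame(
        [(1, "https://www.google.com/maps"), (2, "https://example.org/")],
        ["id", "location"],
    )
    rows_containing(df, "location", "google.com").show(truncate=False)
    spark.stop()
```

Calling demo() on a machine with pyspark installed should show only the row whose location mentions google.com. The match is a plain substring test, so the dot in 'google.com' is not treated as a wildcard.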

Spark 2.1 and before

You can use a plain SQL expression in filter:
df.filter("location like '%google.com%'")
or use the DataFrame column method like:
df.filter(df.location.like('%google.com%'))
Spark 2.1 documentation link
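
One caveat with like: in a LIKE pattern, % and _ are wildcards, so a search string that contains them (underscores are common in URLs) can match more than intended. Below is a rough sketch of a helper that builds the pattern for a literal substring. The name like_contains_pattern is made up for illustration, and it assumes backslash is the LIKE escape character, which I believe is Spark SQL's default; please verify against your Spark version.

```python
def like_contains_pattern(needle, escape="\\"):
    """Build a LIKE pattern matching `needle` anywhere, treated literally.

    Assumption: `escape` (backslash by default) is the LIKE escape character.
    """
    escaped = (
        needle.replace(escape, escape + escape)  # escape the escape char first
        .replace("%", escape + "%")              # literal percent sign
        .replace("_", escape + "_")              # literal underscore
    )
    return "%" + escaped + "%"


# No wildcards in the needle, so this is the same pattern as in the answer.
print(like_contains_pattern("google.com"))
```

It would then be used as df.filter(df.location.like(like_contains_pattern('google.com'))). If you embed the pattern in a SQL string passed to filter instead, backslashes may need extra escaping inside the string literal.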
