Looking at PySpark, I see translate and regexp_replace, which help me replace a single character that exists in a DataFrame column.
I was wondering if there is a way to supply multiple strings to regexp_replace or translate so that it would parse them and replace them with something else.
Use case: remove all $, #, and commas (,) in a column A.
You can use pyspark.sql.functions.translate() to make multiple replacements. Pass in a string of characters to replace and another string of equal length that represents the replacement values.
For example, let's say you had the following DataFrame:
import pyspark.sql.functions as f
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("$100,00",), ("#foobar",), ("foo, bar, #, and $",)], ["A"])
df.show()
#+------------------+
#| A|
#+------------------+
#| $100,00|
#| #foobar|
#|foo, bar, #, and $|
#+------------------+
and wanted to replace ('$', '#', ',') with ('X', 'Y', 'Z') respectively. Simply use translate like:
df.select("A", f.translate(f.col("A"), "$#,", "XYZ").alias("replaced")).show()
#+------------------+------------------+
#| A| replaced|
#+------------------+------------------+
#| $100,00| X100Z00|
#| #foobar| Yfoobar|
#|foo, bar, #, and $|fooZ barZ YZ and X|
#+------------------+------------------+
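As an aside: if I recall Spark's translate semantics correctly (it follows the usual SQL TRANSLATE behavior), characters in the matching string that have no counterpart in the replacement string are simply dropped, so translate alone may also cover the removal case. A minimal sketch, assuming the df from above and that this deletion behavior holds on your Spark version:

# Sketch: with an empty replacement string, every matched character
# should be deleted rather than substituted (assumed SQL TRANSLATE
# semantics; verify on your Spark version before relying on it).
df.select("A", f.translate(f.col("A"), "$#,", "").alias("removed")).show()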
If instead you wanted to remove all instances of ('$', '#', ','), you could do this with pyspark.sql.functions.regexp_replace().
df.select("A", f.regexp_replace(f.col("A"), "[\$#,]", "").alias("replaced")).show()
#+------------------+-------------+
#| A| replaced|
#+------------------+-------------+
#| $100,00| 10000|
#| #foobar| foobar|
#|foo, bar, #, and $|foo bar and |
#+------------------+-------------+
The pattern r"[\$#,]" means match any one of the characters inside the brackets. The $ is escaped because it has a special meaning in regex (strictly, $ is treated literally inside a character class, so the escape is optional, but it is harmless and makes the intent explicit). Writing the pattern as a raw string keeps Python from trying to interpret the backslash itself.
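If the characters to strip come from a list, one way (a sketch, not the only approach; chars_to_remove is a hypothetical name) is to build the character class with Python's re.escape, which backslash-escapes anything that could act as a regex metacharacter:

import re

chars_to_remove = ["$", "#", ","]  # hypothetical input list
# Build a character class like [\$#,]; re.escape targets Python's
# regex dialect, but these escapes are also valid in the Java regex
# engine that Spark uses.
pattern = "[" + "".join(re.escape(c) for c in chars_to_remove) + "]"
df.select("A", f.regexp_replace(f.col("A"), pattern, "").alias("replaced")).show()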