Keep only alphanumeric characters and spaces in a string using gsub

lilipunk · Apr 8, 2017 · Viewed 16.2k times

I have a string that contains alphanumeric characters, special characters and non-UTF-8 characters. I want to strip the special and non-UTF-8 characters.

Here's what I've tried:

gsub('[^0-9a-z\\s]','',"�+ Sample string here =�{�>E�BH�P<]�{�>")

However, this removes the special characters (punctuation and non-UTF-8), but the output has no spaces.

gsub('/[^0-9a-z\\s]/i','',"�+ Sample string here =�{�>E�BH�P<]�{�>")

The result has spaces, but the non-UTF-8 characters are still present.

Is there a workaround?

For the sample string above, output should be: Sample string here

Answer

Joris Meys · Apr 8, 2017

You could use the classes [:alnum:] and [:space:] for this:

sample_string <- "�+ Sample 2 string here =�{�>E�BH�P<]�{�>"
gsub("[^[:alnum:][:space:]]","",sample_string)
#> [1] "ï Sample 2 string here ïïEïBHïPïï"

Alternatively you can use PCRE codes to refer to specific character sets:

gsub("[^\\p{L}0-9\\s]","",sample_string, perl = TRUE)
#> [1] "ï Sample 2 string here ïïEïBHïPïï"
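
As a small illustration of what \\p{L} accepts, here is a sketch on a string that is known to be valid UTF-8. The input `accented` is a made-up example, not taken from the question; `\u00e9` is the letter é.

```r
# Hypothetical input: valid UTF-8 containing one accented letter (é).
accented <- "Caf\u00e9, 2 go!"

# \p{L} matches any Unicode letter, so é is kept; only the comma and
# the exclamation mark are removed.
unicode_kept <- gsub("[^\\p{L}0-9\\s]", "", accented, perl = TRUE)
print(unicode_kept)
```

The accented letter survives because, as far as the pattern is concerned, it is a letter like any other.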

Both the class-based and the \\p{L} pattern leave the same characters behind, which shows that those characters are considered letters. The E, B, H and P in between are letters too, so the condition you are replacing on is not correct. You don't want to keep all letters; you just want to keep A-Z, a-z and 0-9:

gsub("[^A-Za-z0-9 ]","",sample_string)
#> [1] " Sample 2 string here EBHP"
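
If a tidy result is wanted rather than just a filtered one, the same whitelist can be wrapped in a small helper that also collapses the leftover spaces. This is only a sketch under my own assumptions: `keep_ascii_alnum` is a hypothetical name, the test input is an ASCII-only stand-in for the sample string, and `perl = TRUE` is added so that the ranges follow PCRE's code-point ordering.

```r
# Hypothetical helper: keep ASCII letters, digits and spaces, then tidy up.
keep_ascii_alnum <- function(x) {
  kept <- gsub("[^A-Za-z0-9 ]", "", x, perl = TRUE)  # drop everything else
  kept <- gsub(" +", " ", kept)                      # collapse runs of spaces
  trimws(kept)                                       # strip leading/trailing whitespace
}

# ASCII-only stand-in for the sample string (the stray bytes are omitted).
cleaned <- keep_ascii_alnum("+ Sample 2 string here ={>EBHP<]{>")
print(cleaned)
#> [1] "Sample 2 string here EBHP"
```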

This still contains the EBHP. If you really just want to keep a section that contains only letters and numbers, you should use the reverse logic: select what you want and replace everything but that using backreferences:

gsub(".*?([A-Za-z0-9 ]+)\\s.*","\\1", sample_string)
#> [1] " Sample 2 string here "

Or, if you want to find the string even when it is not bounded by spaces, use the word boundary \\b instead:

gsub(".*?(\\b[A-Za-z0-9 ]+\\b).*","\\1", sample_string)
#> [1] "Sample 2 string here"

What happens here:

  • .*? matches any character (.) zero or more times (*), but non-greedily (?). This means that gsub will match as little as possible with this piece.
  • Everything between () is captured and can be referred to in the replacement as \\1.
  • \\b indicates a word boundary.
  • The boundary is followed by one or more (+) characters that are A-Z, a-z, 0-9 or a space. The ranges have to be spelled out separately: A-z would also cover whatever sits between the uppercase and lowercase letters in the code table (in ASCII, the punctuation characters [ \ ] ^ _ and `), and R's interpretation of ranges is additionally locale-dependent, so special letters can slip in. (Those special letters are valid UTF-8, by the way!)
  • After that sequence, match anything zero or more times to consume the rest of the string.
  • The backreference \\1, in combination with the .*? and .* around the group, makes sure only the required part remains in the output.
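
The pieces listed above can be tried out on a simpler input. The sketch below is mine rather than part of the original answer: `x` is an ASCII-only stand-in for the sample string, and `perl = TRUE` is an added assumption so that the lazy quantifier and \\b follow PCRE semantics. It also shows `regmatches()` with `regexpr()` as an alternative technique that extracts the first run directly instead of deleting everything around it.

```r
# ASCII-only stand-in for the sample string (stray bytes omitted).
x <- "+= Sample 2 string here ={>EBHP<]"

# Capture-and-replace, as in the answer, but run with PCRE:
# .*? skips the leading junk, the group captures the run, .* eats the rest.
captured <- gsub(".*?(\\b[A-Za-z0-9 ]+\\b).*", "\\1", x, perl = TRUE)

# Alternative: match the first run of space-separated words and extract it.
first_run <- regmatches(
  x,
  regexpr("[A-Za-z0-9]+( [A-Za-z0-9]+)*", x, perl = TRUE)
)

print(captured)
print(first_run)
```

Both calls should give the same string here. The extraction variant avoids having to describe the unwanted parts of the input at all, which can be easier to reason about when the surrounding junk is unpredictable.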