In R, how do I replace a string that contains a certain pattern with another string?

Alan · Mar 14, 2011 · Viewed 20.5k times

I'm working on a project that involves cleaning a list of data on college majors. A lot of the entries are misspelled, so I'm looking to use gsub() to replace the misspelled ones with their correct spelling. For example, say 'biolgy' is a misspelling in a vector of majors called Major. How can I get R to detect the misspelling and replace it with the correct spelling? I've tried gsub('biol', 'Biology', Major), but that only replaces the first four letters of 'biolgy'. If I do gsub('biolgy', 'Biology', Major), it works for that case alone, but it doesn't catch other misspellings of 'biology'.

Thank you!

Answer

aL3xa · Mar 14, 2011

You should either define some nifty regular expression or use agrep from the base package. The stringr package is another option; I know that people use it, but I'm a huge fan of regular expressions, so it's a no-no for me.

Anyway, agrep should do the trick:

agrep("biol", "biology")
[1] 1
agrep("biolgy", "biology")
[1] 1
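
To actually carry out the replacement, you can use the indices that agrep returns to overwrite the matching elements. Here is a minimal sketch; the sample vector is made up for illustration, and max.distance = 1 (at most one insertion, deletion, or substitution) is an assumption you would tune for your own data:

```r
# Hypothetical sample data; 'Major' stands in for the asker's vector.
Major <- c("biology", "biolgy", "bilogy", "chemistry", "history")

# agrep() returns the indices of the elements that approximately match
# the pattern. max.distance = 1 allows a single edit.
hits <- agrep("biology", Major, max.distance = 1)

# Overwrite the matching elements with the canonical spelling.
Major[hits] <- "Biology"
Major
```

Keep in mind that agrep looks for approximate matches anywhere within each element, so a longer entry such as "microbiology" would also be picked up by this pattern. It is worth inspecting Major[hits] before overwriting anything.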

EDIT:

You should also use ignore.case = TRUE, but be prepared to do some bookkeeping "by hand"...
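
One way the case-insensitive matching and the bookkeeping might look, as a rough sketch: the sample data, the canonical vector, and the choice of max.distance = 1 are all assumptions made for illustration, not something taken from the question.

```r
# Hypothetical sample data and list of canonical spellings.
Major <- c("Biolgy", "BIOLOGY", "chemstry", "Chemistry", "histry", "Physics")
canonical <- c("Biology", "Chemistry", "History")

cleaned <- Major
for (correct in canonical) {
  # Case-insensitive approximate match, allowing a single edit.
  hits <- agrep(correct, Major, max.distance = 1, ignore.case = TRUE)
  cleaned[hits] <- correct
}

# The bookkeeping: list every value that changed so it can be checked by eye.
changed <- Major != cleaned
data.frame(original = Major[changed], cleaned = cleaned[changed])
```

The review table is the "by hand" part. With a looser max.distance, unrelated majors can start matching each other (for instance, "History" and "Chemistry" share the fragment "istry"), so it pays to look over the changes before trusting them.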