remove multiple patterns from text vector r

vagabond picture vagabond · Mar 13, 2015 · Viewed 8.8k times · Source

I want to remove multiple patterns from multiple character vectors. Currently I am going:

a.vector <- gsub("@\\w+", "", a.vector)
a.vector <- gsub("http\\w+", "", a.vector)
a.vector <- gsub("[[:punct:]], "", a.vector)

etc etc.

This is painful. I was looking at this question & answer: R: gsub, pattern = vector and replacement = vector but it's not solving the problem.

Neither the mapply nor the mgsub are working. I made these vectors

remove <- c("@\\w+", "http\\w+", "[[:punct:]]")
substitute <- c("")

Neither mapply(gsub, remove, substitute, a.vector) nor mgsub(remove, substitute, a.vector) worked.

a.vector looks like this:

[4951] "@karakamen: Suicide amongst successful men is becoming rampant. Kudos for staing the conversation. #mental"                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  
[4952] "@stiphan: you are phenomenal.. #mental #Writing. httptxjwufmfg"   

I want:

[4951] "Suicide amongst successful men is becoming rampant Kudos for staing the conversation #mental"                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  
[4952] "you are phenomenal #mental #Writing"   `

Answer

kdopen picture kdopen · Mar 13, 2015

Try combining your subpatterns using |. For example

>s<-"@karakamen: Suicide amongst successful men is becoming rampant. Kudos for staing the conversation. #mental"
> gsub("@\\w+|http\\w+|[[:punct:]]", "", s)
[1] " Suicide amongst successful men is becoming rampant Kudos for staing the conversation #mental"

But this could become problematic if you have a large number of patterns, or if the result of applying one pattern creates matches to others.

Consider creating your remove vector as you suggested, then applying it in a loop

> s1 <- s
> remove<-c("@\\w+","http\\w+","[[:punct:]]")
> for (p in remove) s1 <- gsub(p, "", s1)
> s1
[1] " Suicide amongst successful men is becoming rampant Kudos for staing the conversation #mental"

This approach will need to be expanded to apply it to the entire table or vector, of course. But if you put it into a function which returns the final string, you should be able to pass that to one of the apply variants