How do I extract hashtags from tweets in R?

jackStinger · Dec 7, 2012 · Viewed 7.2k times

I know this question has been asked here and here, but I ran into a small problem when I tried it out:

x<- str_extract("Hello peopllz! My new home is #crazy gr8! #wow", "#\S+")
Error: '\S' is an unrecognized escape in character string starting "#\S"

I changed the regex to "#(.+) ?" and then to "#\\s", but neither extracted the hashtags.

I then tried the gsub way:

x<- gsub("[^#(.+) ?]","","Hello! #London is gr8. #Wow")

It gave: " # . #"

Any ideas where I am going wrong? I'd like my output as a vector/list of all the hashtags in the tweet (without the hashes!).

Edit: I would prefer not to tokenize the tweet, because: 1. I am not tokenizing the tweets for the rest of my program, and 2. it would become a very expensive step if I were to scale it to handle large volumes of tweets.

Answer

Backlin · Dec 7, 2012

Use "#\\S+" instead of "#\S+".

str_extract_all("Hello peopllz! My new home is #crazy gr8! #wow", "#\\S+")
# [[1]]
# [1] "#crazy" "#wow"  
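
The question asked for the tags without the leading hash, which the call above keeps. A minimal sketch of two ways to drop it, assuming base R only (so stringr is not required here); the variable names are illustrative and not from the original post:

```r
tweet <- "Hello peopllz! My new home is #crazy gr8! #wow"

# Option 1: extract the "#tag" matches, then strip the leading "#".
tags_with_hash <- regmatches(tweet, gregexpr("#\\S+", tweet))[[1]]
tags <- sub("^#", "", tags_with_hash)

# Option 2: a PCRE lookbehind matches the text after "#" without including it.
tags_lookbehind <- regmatches(
  tweet,
  gregexpr("(?<=#)\\S+", tweet, perl = TRUE)
)[[1]]

print(tags)
print(tags_lookbehind)
```

Note that \S+ also captures trailing punctuation (for example "#wow!" would yield "wow!"); a pattern such as "#\\w+" may be a better fit if that matters.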

There are two levels of parsing going on here. Before the low-level regexp function inside str_extract receives the pattern you want to search for (i.e. "#\S+"), the string literal is first parsed by R. R does not recognize \S as a valid escape sequence and throws an error. By escaping the backslash as \\, you tell R to pass \ and S as two ordinary characters to the regexp function, instead of interpreting them as one escape sequence.
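
One way to see the two levels is to compare what is typed in the source with what R actually stores in the string. The sketch below uses only base R; the raw string literal on the last lines assumes R 4.0.0 or later.

```r
pattern <- "#\\S+"

# nchar() counts the characters actually stored in the string: #, \, S, +
print(nchar(pattern))

# writeLines() prints the stored contents, i.e. what the regex engine receives.
writeLines(pattern)

# Assumption: R >= 4.0.0. Raw string literals need no doubled backslashes.
raw_pattern <- r"(#\S+)"
print(identical(pattern, raw_pattern))
```

Printing a pattern with writeLines() before using it is a quick check that the regex engine will see what you intended.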

Side track

This can produce rather bizarre expressions. Imagine that you have a list of addresses of computers on a Windows network, of the form "\\computer". To search for them you would need to type str_extract(adr, "\\\\\\w+"), which reaches the regexp function as "\\\w+": an escaped literal backslash followed by \w+. That pattern matches a single backslash followed by the computer name.
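
The backslash counting can be checked directly. This is a small sketch in base R with a hypothetical address (the value of adr is made up for illustration); it also shows the longer pattern needed if both leading backslashes should be part of the match.

```r
# Hypothetical address: this literal stores \\computer (two backslashes).
adr <- "\\\\computer"
writeLines(adr)

# Six backslashes in source become three in the stored pattern: \\\w+
pattern <- "\\\\\\w+"
writeLines(pattern)

# The regex \\\w+ means: one literal backslash, then word characters.
match_one <- regmatches(adr, regexpr(pattern, adr))
writeLines(match_one)

# To match both leading backslashes the regex must be \\\\\w+,
# which takes ten backslashes in R source.
pattern_two <- "\\\\\\\\\\w+"
match_two <- regmatches(adr, regexpr(pattern_two, adr))
writeLines(match_two)
```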