I know this question has been asked here and here, but I ran into a small problem when I tried it out:
x<- str_extract("Hello peopllz! My new home is #crazy gr8! #wow", "#\S+")
Error: '\S' is an unrecognized escape in character string starting "#\S"
I changed the regex to "#(.+) ?" and "#\\s", but they did not extract the hashtags.
I then tried the gsub way:
x<- gsub("[^#(.+) ?]","","Hello! #London is gr8. #Wow")
It gave: " # . #"
Any ideas where I am going wrong? I'd like my output to be a vector/list of all the hashtags in the tweet (without the hashes!).
Edit: I would prefer not to tokenize the tweet, because: 1. I am not tokenizing the tweets for the rest of my program, and 2. it would become a very expensive step were I to scale it to handle large volumes of tweets.
Use "#\\S+" instead of "#\S+".
str_extract_all("Hello peopllz! My new home is #crazy gr8! #wow", "#\\S+")
# [[1]]
# [1] "#crazy" "#wow"
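The question asks for the tags without the leading #. A minimal sketch of one way to get that, using base R's regmatches and gregexpr rather than stringr so that it runs without extra packages (the names tweet, tags and bare are only illustrative):

```r
tweet <- "Hello peopllz! My new home is #crazy gr8! #wow"

# Find every "#" followed by one or more non-whitespace characters.
tags <- regmatches(tweet, gregexpr("#\\S+", tweet))[[1]]

# Drop the leading "#" from each match.
bare <- sub("^#", "", tags)
bare
# [1] "crazy" "wow"
```

With stringr loaded, applying the same sub() call to the first element of the str_extract_all result should give the same vector.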
Two levels of parsing are involved. Before the low-level regexp function inside str_extract receives the pattern you want to search for (i.e. "#\S+"), the string literal is first parsed by R. R does not recognize \S as a valid escape sequence and throws an error. By escaping the backslash as \\, you tell R to pass \ and S to the regexp function as two ordinary characters, instead of interpreting them as a single escape sequence.
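To see the two levels at work, it can help to look at what R actually stores after parsing the literal. The sketch below assumes R 4.0.0 or later for the raw-string form r"(...)"; the names pattern and raw_pattern are illustrative.

```r
pattern <- "#\\S+"

# After R parses the literal, only four characters remain: # \ S +
nchar(pattern)
# [1] 4

# writeLines shows the string as the regexp function receives it.
writeLines(pattern)
# #\S+

# A raw string skips R's escape processing, so a single backslash is enough.
raw_pattern <- r"(#\S+)"
identical(pattern, raw_pattern)
# [1] TRUE
```

Note that print(pattern) would show "#\\S+" again, because print re-escapes the backslash for display; writeLines or cat shows the stored text.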
This can produce rather bizarre expressions. Imagine that you have a list of addresses of computers on a Windows network, of the form "\\computer". To search for them you would need to type str_extract(adr, "\\\\\\w+"), which turns into "\\\w+" internally, and that is the pattern the regexp function then searches for.
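The backslash counting can be checked directly. This is only a sketch in base R (regmatches and regexpr instead of stringr), and it assumes a hypothetical adr whose stored text begins with two literal backslashes. Each literal backslash to be matched costs two in the regex and therefore four in the R source.

```r
# Hypothetical address: the literal "\\\\computer" stores the text \\computer.
adr <- "\\\\computer"
nchar(adr)
# [1] 10

# Six backslashes in the source become three in the pattern (\\\w+):
# one escaped backslash followed by \w+.
pat_one <- "\\\\\\w+"
nchar(pat_one)
# [1] 5
one <- regmatches(adr, regexpr(pat_one, adr))
writeLines(one)
# \computer

# Ten backslashes in the source become five in the pattern (\\\\\w+):
# two escaped backslashes followed by \w+.
pat_two <- "\\\\\\\\\\w+"
two <- regmatches(adr, regexpr(pat_two, adr))
writeLines(two)
# \\computer
```

So the six-backslash pattern matches a single backslash plus the name; if both leading backslashes of the address should be part of the match, the ten-backslash form is needed.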