I have a vector of strings, myStrings, in R that looks something like:
[1] download file from `http://example.com`
[2] this is the link to my website `another url`
[3] go to `another url` from more info.
where `another url` is a valid http URL, but Stack Overflow will not let me insert more than one URL, which is why I'm writing `another url` instead. I want to remove all the URLs from myStrings so it looks like:
[1] download file from
[2] this is the link to my website
[3] go to from more info.
I've tried many functions in the stringr package, but nothing works.
You can use gsub with a regular expression to match URLs.
Set up a vector:
x <- c(
"download file from http://example.com",
"this is the link to my website http://example.com",
"go to http://example.com from more info.",
"Another url ftp://www.example.com",
"And https://www.example.net"
)
Remove all the URLs from each string:
gsub(" ?(f|ht)tp(s?)://(.*)[.][a-z]+", "", x)
# [1] "download file from" "this is the link to my website"
# [3] "go to from more info." "Another url"
# [5] "And"
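Since you mentioned stringr, the same idea also works with str_remove_all. This is just a sketch that assumes a URL contains no spaces, so \S+ is used to consume the rest of each URL:

library(stringr)

# Assumes each URL runs until the next space, so \S+ consumes the rest of it
str_remove_all(x, " ?(f|ht)tps?://\\S+")
# [1] "download file from" "this is the link to my website"
# [3] "go to from more info." "Another url"
# [5] "And"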
Update: It would be best if you could post a few different URLs so we know what we're working with. But I think this regular expression will work for the URLs you mentioned in the comments:
" ?(f|ht)(tp)(s?)(://)(.*)[.|/](.*)"
The above expression explained:

 ?       optional space
(f|ht)   match "f" or "ht"
tp       match "tp"
(s?)     optionally match "s" if it's there
(://)    match "://"
(.*)     match every character (everything) up to
[.|/]    a period or a forward-slash
(.*)     then everything after that

I'm not an expert with regular expressions, but I think I explained that correctly.
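As a quick check, here is that updated pattern applied to a couple of made-up URLs that include a path. Note that it relies on the URL sitting at the end of the string, since the trailing (.*) consumes everything after the last period or slash:

# Hypothetical examples with paths, just for illustration
y <- c(
  "Another url ftp://www.example.com/files",
  "And https://www.example.net/index.html"
)

gsub(" ?(f|ht)(tp)(s?)(://)(.*)[.|/](.*)", "", y)
# [1] "Another url" "And"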