Remove URLs from string

Tavi · Aug 17, 2014 · Viewed 14.4k times

I have a vector of strings—myStrings—in R that look something like:

[1] download file from `http://example.com`
[2] this is the link to my website `another url`
[3] go to `another url` from more info.

where `another url` stands for a valid http URL (Stack Overflow will not let me insert more than one URL, so I'm writing another url instead). I want to remove all the URLs from myStrings so that they look like:

[1] download file from
[2] this is the link to my website
[3] go to from more info.

I've tried many functions in the stringr package but nothing works.

Answer

Rich Scriven · Aug 17, 2014

You can use gsub with a regular expression that matches URLs.

Set up a vector:

x <- c(
    "download file from http://example.com", 
    "this is the link to my website http://example.com", 
    "go to http://example.com from more info.",
    "Another url ftp://www.example.com",
    "And https://www.example.net"
)

Remove all the URLs from each string:

gsub(" ?(f|ht)tp(s?)://(.*)[.][a-z]+", "", x)
# [1] "download file from"             "this is the link to my website"
# [3] "go to from more info."          "Another url"                   
# [5] "And"   
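
A possible variation, offered only as a sketch: the pattern above relies on a greedy (.*), so it can run past the end of a URL when a later word in the string contains a period followed by letters. Assuming URLs never contain whitespace, matching non-whitespace characters with \\S+ keeps the match inside a single URL. The input vector below is hypothetical; the second element (a URL with a path and query string) is not from the original question.

```r
# Hypothetical input: strings from the answer plus one URL with a path and query.
x <- c(
    "download file from http://example.com",
    "go to http://example.com/a/b?id=1 from more info.",
    "Another url ftp://www.example.com",
    "And https://www.example.net"
)

# " ?" optionally consumes the space before the URL;
# "(f|ht)tps?://" matches the ftp, ftps, http and https schemes;
# "\\S+" consumes every non-whitespace character that follows.
cleaned <- gsub(" ?(f|ht)tps?://\\S+", "", x)
print(cleaned)
```

One trade-off of this sketch: punctuation written directly after a URL (such as a sentence-ending period) is not whitespace, so it is removed along with the URL.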

Update: It would be best if you could post a few different URLs so we know what we're working with. But I think this regular expression will work for the URLs you mentioned in the comments:

" ?(f|ht)(tp)(s?)(://)(.*)[.|/](.*)"

The above expression explained:

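  • " ?" an optional space before the URL
  • (f|ht) match "f" or "ht"
  • tp match "tp"
  • (s?) optionally match "s" if it's there
  • (://) match "://"
  • (.*) greedily match any characters (everything) up to the last
  • [.|/] period, pipe, or forward slash (inside a character class, | is a literal character, not alternation)
  • (.*) then everything after that, to the end of the string, including any text that follows the URL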

I'm not an expert with regular expressions, but I think I explained that correctly.
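
To see how the pieces listed above behave together, here is a small check. The test string is a hypothetical example, not one of the URLs from the comments. It illustrates two consequences of the pattern: the final (.*) also removes any text after the URL, and the character class [.|/] accepts a literal pipe character.

```r
pattern <- " ?(f|ht)(tp)(s?)(://)(.*)[.|/](.*)"

# Hypothetical test string with text following the URL.
s <- "go to http://example.com/page from more info."

# The greedy groups extend the match to the end of the string,
# so the words after the URL are removed as well.
stripped <- gsub(pattern, "", s)
print(stripped)

# Inside a character class, "|" is an ordinary character:
# "[.|/]" matches a pipe, while "[./]" does not.
pipe_matches_with_bar <- grepl("[.|/]", "a|b")
pipe_matches_without_bar <- grepl("[./]", "a|b")
print(c(pipe_matches_with_bar, pipe_matches_without_bar))
```

If the text after each URL needs to be kept, this pattern is probably too broad, and a pattern that stops at the first whitespace may be a better fit.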

Note: url shorteners are no longer allowed in SO answers, so I was forced to remove a section while making my most recent edit. See edit history for that part.