crawl links of sitemap.xml through wget command

dohomi · Jun 27, 2013 · Viewed 7.8k times

I'm trying to crawl all links in a sitemap.xml to re-cache a website. But the recursive option of wget doesn't work; all I get in response is:

Remote file exists but does not contain any link -- not retrieving.

But the sitemap.xml is definitely full of "http://..." links.

I tried almost every option wget has, but nothing worked for me:

wget -r --mirror http://mysite.com/sitemap.xml

Does anyone know how to open all links inside a website's sitemap.xml?

Thanks, Dominic

Answer

user440788 · Jan 2, 2014

It seems that wget can't parse XML, so you'll have to extract the links yourself. You could do something like this:

wget --quiet http://www.mysite.com/sitemap.xml --output-document - | egrep -o "https?://[^<]+" | wget -i -

I learned this trick here.
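One caveat with the pattern above: `https?://[^<]+` also matches the sitemap's `xmlns` namespace URL, so one stray request may be issued. A slightly stricter variant extracts only the contents of the `<loc>` elements with sed. This is a sketch, demonstrated against a made-up sample sitemap (the file path and URLs are hypothetical), and it assumes each `<loc>` element sits on a single line:

```shell
# Write a sample sitemap to demonstrate against (hypothetical URLs).
cat > /tmp/sitemap-demo.xml <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://mysite.com/</loc></url>
  <url><loc>http://mysite.com/about</loc></url>
</urlset>
EOF

# Print only the text inside <loc>...</loc>, one URL per line.
# Piping this into `wget -i -`, as in the answer, would fetch each URL.
sed -n 's:.*<loc>\(.*\)</loc>.*:\1:p' /tmp/sitemap-demo.xml
```

Unlike the bare regex, this never emits the namespace URL, at the cost of relying on the one-`<loc>`-per-line layout most sitemap generators produce.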