I'm trying to crawl all the links in a sitemap.xml to re-cache a website, but wget's recursive option doesn't work. The only response I get is:
Remote file exists but does not contain any link -- not retrieving.
Yet the sitemap.xml is definitely full of "http://..." links.
I tried almost every option of wget, but nothing worked for me:
wget -r --mirror http://mysite.com/sitemap.xml
Does anyone know how to fetch all the links inside a website's sitemap.xml?
Thanks, Dominic
It seems that wget can't parse XML, so you'll have to extract the links manually. You could do something like this:
wget --quiet http://www.mysite.com/sitemap.xml --output-document - | grep -E -o "https?://[^<]+" | wget -i -
I learned this trick here.
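The pattern above grabs anything that looks like a URL, which can pick up stray attribute values too. A slightly stricter variant (a sketch, assuming a standard sitemap where each URL sits in a `<loc>` element) pulls only the `<loc>` entries before handing them to wget. The inline here-doc stands in for the real sitemap; for the actual site you would replace it with `wget --quiet -O - http://www.mysite.com/sitemap.xml`:

```shell
# Extract only the <loc> URLs from a sitemap.
# The here-doc below is sample data; pipe the real sitemap in instead.
cat <<'EOF' | grep -oE '<loc>[^<]+</loc>' | sed 's/<[^>]*>//g'
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://www.mysite.com/</loc></url>
  <url><loc>http://www.mysite.com/about</loc></url>
</urlset>
EOF
```

This prints one URL per line, so you can append `| wget -i -` to re-request each page, exactly as in the one-liner above.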