I am using Nutch 1.3 to crawl a website. I want to get a list of the URLs that were crawled, and the URLs that a given page links to.
I can get the list of crawled URLs using the readdb command:
bin/nutch readdb crawl/crawldb -dump file
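For reference, a minimal sketch of how I pull just the URLs out of that dump. The helper name `list_crawled_urls` is hypothetical, and it assumes the dump is written as plain-text `part-*` files inside the output directory, with each record starting with a line of the form `<url><TAB>Version: N`; the exact layout may differ between Nutch versions.

```shell
# Hypothetical helper: print only the URLs from a plain-text CrawlDb dump.
# Assumption: each record begins with a line "<url><TAB>Version: N",
# followed by detail lines (Status, Fetch time, ...) that do not start with http.
list_crawled_urls() {
  # $1 is the dump output directory (the last argument given to readdb -dump)
  awk -F'\t' '/^https?:\/\// { print $1 }' "$1"/part-*
}
```

With the command above, this would be called as `list_crawled_urls file`.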
Is there a way to find the URLs that appear on a page by reading the crawldb or linkdb?
In org.apache.nutch.parse.html.HtmlParser I see an outlinks array, and I am wondering if there is a quick way to access it from the command line.