Get outlinks from Nutch

web-crawler nutch

surajz · Sep 15, 2011 · Viewed 7.3k times · Source

I am using Nutch 1.3 to crawl a website. I want to get a list of the URLs that were crawled, and the URLs linked from each page (its outlinks).

I can get the list of crawled URLs using the readdb command:

bin/nutch readdb crawl/crawldb -dump file

Is there a way to find out which URLs are on a page by reading the crawldb or linkdb?

In org.apache.nutch.parse.html.HtmlParser I see an outlinks array, and I am wondering if there is a quick way to access it from the command line.

Answer

From the command line, you can see the outlinks by using readseg with the -dump or -get option. For example:

bin/nutch readseg -dump crawl/segments/20110919084424/ outputdir2 -nocontent -nofetch -nogenerate -noparse -noparsetext

less outputdir2/dump
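
If you only want source/target pairs rather than paging through the whole dump, a small awk filter may help. This is a minimal sketch, not part of Nutch: the function name extract_outlinks is hypothetical, and it assumes the dump uses the text layout where each record starts with a "URL:: <page>" line and each outlink appears as "outlink: toUrl: <url> anchor: <text>". Check a few lines of your own dump first, since the layout may differ between Nutch versions.

```shell
# Hypothetical helper: print "source<TAB>target" for every outlink in a readseg dump.
# Assumes "URL:: <page>" record headers and "outlink: toUrl: <url> anchor: <text>" lines.
extract_outlinks() {
  awk '
    # Remember which page the following ParseData block belongs to.
    /^URL:: / { src = $2 }

    # One line per outlink; the target is the field right after "toUrl:".
    /^[ \t]*outlink: toUrl: / {
      for (i = 1; i < NF; i++) {
        if ($i == "toUrl:") { print src "\t" $(i + 1); break }
      }
    }
  ' "$1"
}
```

For example, `extract_outlinks outputdir2/dump | sort -u` would list each distinct (page, outlink) pair once.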