I am running Nutch v. 1.6 and it is crawling specific sites correctly, but I can't seem to get the syntax correct for the file NUTCH_ROOT/conf/regex-urlfilter.txt.
The site I want to crawl has a URL similar to this:
http://www.example.com/foo.cfm
On that page there are numerous links that match the following pattern:
http://www.example.com/foo.cfm/Bar_-_Foo/Extra/EX/20817/ID=6976
I want to crawl links that match the second example above as well. In my regex-urlfilter.txt I have the following:
+^http://www.example.com/foo.cfm$
+^http://www.example.com/foo.cfm/(.+)*$
Nutch matches on the first one and crawls it correctly, but does not seem to pick up links using the other filter. How can I get Nutch to crawl URLs like the second one above?
I have tried the following with no luck:
+^http://www.example.com/foo.cfm/(.+)*$
+^http://www.example.com/foo.cfm/(.)*$
+^http://www.example.com/foo.cfm/.+$
+^http://www.example.com/foo.cfm/(.*)*$
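One way to rule out the regex syntax itself is to run the patterns outside Nutch. The sketch below is a hypothetical helper (UrlFilterCheck is not a Nutch class); it assumes the filter reads rules top to bottom, lets the first matching rule decide, and tests each pattern with java.util.regex find(). The -[?*!@=] line is an assumption about a rule that may sit earlier in the file, so check your own regex-urlfilter.txt before drawing conclusions.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch: mimics a first-match-wins rule list.
class UrlFilterCheck {

    // Returns true if the first rule whose regex is found in the URL starts with '+'.
    // A URL that matches no rule at all is rejected.
    static boolean accepts(List<String> rules, String url) {
        for (String rule : rules) {
            boolean allow = rule.charAt(0) == '+';
            Pattern pattern = Pattern.compile(rule.substring(1));
            if (pattern.matcher(url).find()) {
                return allow;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String deep = "http://www.example.com/foo.cfm/Bar_-_Foo/Extra/EX/20817/ID=6976";
        String[] tried = {
            "+^http://www.example.com/foo.cfm/(.+)*$",
            "+^http://www.example.com/foo.cfm/(.)*$",
            "+^http://www.example.com/foo.cfm/.+$",
            "+^http://www.example.com/foo.cfm/(.*)*$"
        };

        // Each attempted rule on its own.
        for (String rule : tried) {
            System.out.println(rule + " -> " + accepts(Arrays.asList(rule), deep));
        }

        // Assumed earlier rule that rejects URLs containing ?, *, !, @ or =.
        List<String> withEarlierRule = Arrays.asList("-[?*!@=]", tried[2]);
        System.out.println("with earlier reject rule -> " + accepts(withEarlierRule, deep));
    }
}
```

Under these assumptions every attempted pattern accepts the long URL on its own, while the list with the reject rule placed first refuses it because of the = in ID=6976. That would point at rule order rather than at the syntax of the + lines.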
In my NUTCH_ROOT/urls/nutch I have:
http://www.example.com/foo.cfm/
According to http://wiki.apache.org/nutch/FAQ#What_happens_if_I_inject_urls_several_times.3F you can't have multiple URLs (they will be ignored). What about using only:
+^http://www.example.com/foo.cfm/(.+)*$
which should cover your first line (+^http://www.example.com/foo.cfm$) as well, or, if there are problems with /, try:
+^http://www.example.com/foo.cfm//?(.+)*$
Where //? should stand for character / or
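To see which URLs each variant actually covers, here is a small check with java.util.regex. SlashCheck is a hypothetical class name, and the third pattern (escaped dots plus an optional group) is an illustrative alternative rather than something taken from the posts above.

```java
import java.util.regex.Pattern;

// Hypothetical helper for experimenting with the patterns; not part of Nutch.
class SlashCheck {

    // True if the regex is found anywhere in the URL (the patterns are anchored with ^ and $).
    static boolean found(String regex, String url) {
        return Pattern.compile(regex).matcher(url).find();
    }

    public static void main(String[] args) {
        String[] urls = {
            "http://www.example.com/foo.cfm",
            "http://www.example.com/foo.cfm/",
            "http://www.example.com/foo.cfm/Bar_-_Foo/Extra/EX/20817/ID=6976"
        };
        String[] patterns = {
            "^http://www.example.com/foo.cfm/(.+)*$",
            "^http://www.example.com/foo.cfm//?(.+)*$",
            // Illustrative alternative: literal dots, and the slash plus remainder made optional.
            "^http://www\\.example\\.com/foo\\.cfm(/.*)?$"
        };
        for (String pattern : patterns) {
            for (String url : urls) {
                System.out.println(pattern + "  " + url + "  " + found(pattern, url));
            }
        }
    }
}
```

In this check the first two patterns still require at least one slash after foo.cfm, so they match the seed URL with the trailing slash and the long URL but not the bare http://www.example.com/foo.cfm. The optional-group variant matches all three.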