Nutch regex-urlfilter syntax

user670595 · Dec 14, 2012

I am running Nutch v. 1.6 and it is crawling specific sites correctly, but I can't seem to get the syntax correct for the file NUTCH_ROOT/conf/regex-urlfilter.txt.

The site I want to crawl has a URL similar to this:

http://www.example.com/foo.cfm

On that page there are numerous links that match the following pattern:

http://www.example.com/foo.cfm/Bar_-_Foo/Extra/EX/20817/ID=6976

I want to crawl links that match the second example above as well. In my regex-urlfilter.txt I have the following:

+^http://www.example.com/foo.cfm$
+^http://www.example.com/foo.cfm/(.+)*$

Nutch matches the first rule and crawls that page correctly, but does not seem to pick up links that should match the second rule. How can I get Nutch to crawl URLs like the second one above?

I have tried the following with no luck:

+^http://www.example.com/foo.cfm/(.+)*$
+^http://www.example.com/foo.cfm/(.)*$
+^http://www.example.com/foo.cfm/.+$
+^http://www.example.com/foo.cfm/(.*)*$

In my NUTCH_ROOT/urls/nutch I have:

http://www.example.com/foo.cfm/
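
One way to tell a regex problem from a rule-ordering problem is to test the rules outside Nutch. The sketch below is only an approximation: it uses Python's re module as a stand-in for the Java regex engine Nutch runs on, and a hypothetical helper, first_match, that mimics the behaviour described in the comments of the stock regex-urlfilter.txt (the first matching rule decides, and a URL matching no rule is ignored). The -[?*!@=] rule is the one shipped in the stock file; that it is still present above the + rules in this particular file is an assumption, as is applying each rule as a search rather than a full match.

```python
import re

# Rules in file order as (sign, pattern).
# ASSUMPTION: the first rule comes from the stock regex-urlfilter.txt and is
# still present in the asker's file; the other two are from the question.
RULES = [
    ("-", r"[?*!@=]"),
    ("+", r"^http://www.example.com/foo.cfm$"),
    ("+", r"^http://www.example.com/foo.cfm/(.+)*$"),
]


def first_match(url, rules):
    """Hypothetical helper: return the first (sign, pattern) found in url, else None."""
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign, pattern
    return None


def accepted(url, rules):
    """A URL is kept only when the first matching rule is a '+' rule."""
    hit = first_match(url, rules)
    return hit is not None and hit[0] == "+"


START = "http://www.example.com/foo.cfm"
DEEP = "http://www.example.com/foo.cfm/Bar_-_Foo/Extra/EX/20817/ID=6976"

print(accepted(START, RULES))     # start page
print(accepted(DEEP, RULES))      # deep link, stock '-' rule listed first
print(accepted(DEEP, RULES[1:]))  # deep link, only the two '+' rules
```

If the deep link is rejected only while the - rule is listed first, the + patterns themselves are fine and the = in ID=6976 is the likely trigger. In that case, moving the + rules above that line, or removing = from the character class, would be the thing to try.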

Answer

xhudik · Dec 18, 2012

According to http://wiki.apache.org/nutch/FAQ#What_happens_if_I_inject_urls_several_times.3F you can't have multiple URLs (they will be ignored). How about using only:

+^http://www.example.com/foo.cfm/(.+)*$

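which also covers the seed URL http://www.example.com/foo.cfm/ (the pattern requires the slash after foo.cfm, so keep +^http://www.example.com/foo.cfm$ if the slash-less form must match too), or, if there are problems with /, try: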

+^http://www.example.com/foo.cfm//?(.+)*$

Where //? should stand for character / or
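
The two suggested patterns can be checked against the URLs from the question. This is a rough sketch using Python's re module in place of the Java regex engine Nutch runs on; the helper name matches is hypothetical and nothing here is Nutch API.

```python
import re

# The two patterns suggested in the answer, without the leading '+'.
PATTERNS = [
    r"^http://www.example.com/foo.cfm/(.+)*$",
    r"^http://www.example.com/foo.cfm//?(.+)*$",
]

# URLs taken from the question: start page, seed with slash, deep link.
URLS = [
    "http://www.example.com/foo.cfm",
    "http://www.example.com/foo.cfm/",
    "http://www.example.com/foo.cfm/Bar_-_Foo/Extra/EX/20817/ID=6976",
]


def matches(pattern, url):
    """Hypothetical helper: True when the pattern is found in the URL."""
    return re.search(pattern, url) is not None


for pattern in PATTERNS:
    for url in URLS:
        print(pattern, url, matches(pattern, url))
```

Both patterns require at least one slash after foo.cfm, so they find the seed URL and the deep link but not the slash-less start page.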