On some of the sites I want to index with Nutch, only specific types of pages should be searchable. I need a way to crawl these sites in full but index only the pages whose URLs match a certain regular expression.
For example:
www.example.com/browse/ contains links of the form www.example.com/items/1234.html and www.example.com/items/browse_by_xyz.html. I need to index only the www.example.com/items/1234.html style pages, while still crawling the browse_by_xyz.html style pages so that the links on them are followed.
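To make the distinction concrete, here is a rough sketch of the rule I have in mind, written with plain java.util.regex rather than any Nutch API. The class name UrlRules, the method names, and the exact patterns are made up for illustration, and I am assuming item pages always look like /items/<digits>.html.

```java
import java.util.regex.Pattern;

public class UrlRules {
    // Assumption: item pages are /items/<digits>.html; only these get indexed.
    private static final Pattern ITEM_PAGE =
        Pattern.compile("^(https?://)?www\\.example\\.com/items/\\d+\\.html$");

    // Any page on the site may be fetched so that its links can be followed.
    private static final Pattern CRAWLABLE =
        Pattern.compile("^(https?://)?www\\.example\\.com/.*$");

    // True if the URL should be fetched and parsed for outlinks.
    static boolean shouldCrawl(String url) {
        return CRAWLABLE.matcher(url).matches();
    }

    // True if the URL should end up in the search index.
    static boolean shouldIndex(String url) {
        return ITEM_PAGE.matcher(url).matches();
    }

    public static void main(String[] args) {
        String[] urls = {
            "www.example.com/browse/",
            "www.example.com/items/1234.html",
            "www.example.com/items/browse_by_xyz.html"
        };
        for (String url : urls) {
            System.out.println(url
                + " crawl=" + shouldCrawl(url)
                + " index=" + shouldIndex(url));
        }
    }
}
```

So every URL on the site passes the crawl rule, but only the numbered item pages pass the index rule.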
From my searching, I thought I could use crawl-urlfilter.txt to restrict where Nutch crawled and regex-urlfilter.txt to restrict what was actually indexed. This did not seem to work, so either I was misinformed or I implemented it incorrectly.
Does Nutch have the capability I am looking for?