views:

69

answers:

1

Hi, I want Nutch to crawl abc.com, but I want to index only car.abc.com. Links to car.abc.com can appear at any level within abc.com. So, basically, I want Nutch to keep crawling abc.com normally, but index only pages under car.abc.com, e.g. car.abc.com/toyota...car.abc.com/honda...

I set regex-urlfilter.txt to include only car.abc.com and ran the command "generate crawl/crawldb crawl/segments", but it just says "Generator: 0 records selected for fetching, exiting ...". I guess the car.abc.com links only appear several levels deep.
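For reference, a regex-urlfilter.txt restricted to car.abc.com probably looks something like this (a sketch, not my exact file). Since the generator applies these filters too, every abc.com page is rejected before it can be fetched, which would explain the "0 records selected" message:

```
# accept only pages on car.abc.com
+^https?://car\.abc\.com/

# reject everything else (including the rest of abc.com)
-.
```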

How can I do this? Thanks.

+1  A: 

One way is to use the -filter switch of the mergedb command. The command takes a crawl db as input and creates a new crawl db with some URLs filtered out. Just use that filtered crawl db for indexing.
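Assuming a standard Nutch layout (the paths here are illustrative, not from the question), the invocation would look roughly like:

```shell
# create a filtered copy of the crawl db, applying the configured URL filters;
# paths are hypothetical -- adjust to your own setup
bin/nutch mergedb crawl/crawldb-filtered crawl/crawldb -filter
```

You would then point the indexing step at crawl/crawldb-filtered instead of crawl/crawldb.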

The only drawback is that I have not found a way to make the mergedb command use a file other than regex-urlfilter.txt, which is also the file used by the generator. You will have to maintain two filter files: one for the generator that accepts abc.com, and another for the mergedb command that excludes URLs not under car.abc.com. And since both commands try to load the same file, you will have to rename the appropriate one to regex-urlfilter.txt before calling either command.
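The file swap described above could be scripted; a rough sketch, with hypothetical file names (regex-urlfilter-crawl.txt and regex-urlfilter-index.txt are not standard Nutch files, just the two variants you would maintain):

```shell
# use the permissive filter (accepts abc.com) while generating/fetching
cp conf/regex-urlfilter-crawl.txt conf/regex-urlfilter.txt
bin/nutch generate crawl/crawldb crawl/segments

# swap in the strict filter (car.abc.com only) before filtering the crawl db
cp conf/regex-urlfilter-index.txt conf/regex-urlfilter.txt
bin/nutch mergedb crawl/crawldb-filtered crawl/crawldb -filter
```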

If someone knows a way to configure the mergedb command to use another file, I'd be happy to hear it!

Pascal Dimassimo