Hi,

I am using Nutch to crawl websites, and strangely, for one of my websites the crawl returns only two URLs: the home page (http://mysite.com/) and one other.

The URLs on my website basically follow this format:

http://mysite.com/index.php?main%5Fpage=index&params=12

http://mysite.com/index.php?main%5Fpage=index&category=tub&param=17

That is, the URLs differ only in the parameters appended to the URL (the part "http://mysite.com/index.php?" is common to all of them).

Is Nutch unable to crawl such websites?

What Nutch settings do I need to change in order to crawl such websites?

A: 

I got the issue fixed. It had everything to do with the URL filter, which was set to:

# skip URLs containing certain characters as probable queries, etc.

-[?*!@=]

I commented out this filter and Nutch crawled all the URLs :)
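
To illustrate why that one rule blocked the parameterised URLs, here is a minimal Python sketch of how such a rule list behaves. It is not Nutch code: the `accepts` function is a hypothetical stand-in, and it assumes that rules are tried in order, that the first matching pattern decides, that a URL matching no rule is dropped, and that the rule file ends with a catch-all `+.` line.

```python
import re

def accepts(rules, url):
    """Hypothetical stand-in for a first-match-wins URL filter.

    Each rule is '+' or '-' followed by a regular expression.
    The first rule whose pattern is found in the URL decides.
    """
    for rule in rules:
        sign, pattern = rule[0], rule[1:]
        if re.search(pattern, url):
            return sign == "+"
    return False  # assumption: a URL that matches no rule is ignored

# Rule list with the query-character filter active (catch-all "+." assumed).
default_rules = ["-[?*!@=]", "+."]
# Same list with the "-[?*!@=]" line commented out.
relaxed_rules = ["+."]

url = "http://mysite.com/index.php?main%5Fpage=index&params=12"

print(accepts(default_rules, url))                   # False: '?' and '=' match the skip rule
print(accepts(relaxed_rules, url))                   # True: only the catch-all applies
print(accepts(default_rules, "http://mysite.com/"))  # True: no query characters
```

If dropping the rule entirely lets in too much, a narrower variant such as `-[*!@]`, which leaves only `?` and `=` out of the character class, might be enough. That is a suggestion to try, not something tested here.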

Annibigi