Has anyone experienced a problem with the way the standard HTML parser plugin handles relative URLs? There is a site, http://xxxx/asp/list_books.asp?id_f=11327, and when you follow a link whose href is set to '?id_r=442&id=41&order=', a browser will naturally take you to http://xxxx/asp/list_books.asp?id_r=442&id=41&order=

However, in Nutch, when the outlinks are parsed from the page, the link ends up being http://xxxx/asp/?id_r=442&id=41&order=

which of course is broken. So why is the list_books.asp gone?

+3  A: 

A bug has already been logged for this. Take a look.
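
For reference, RFC 3986 (section 5.2.2) says that a relative reference consisting only of a query keeps the base URL's path and replaces just its query, which is the behaviour the browser shows. If you need a stopgap, something along these lines might work. This is only a minimal sketch: the class `RelativeUrlSketch` and its `resolve` method are hypothetical names, not part of Nutch and not the fix attached to the bug report, and it assumes both inputs are well-formed URL strings.

```java
import java.net.URI;

public class RelativeUrlSketch {

    // Hypothetical helper (not a Nutch API): resolves ref against base,
    // special-casing query-only references such as "?id_r=442".
    static String resolve(String base, String ref) {
        if (ref.startsWith("?")) {
            String stem = base;
            // Drop the fragment of the base, if any.
            int hash = stem.indexOf('#');
            if (hash >= 0) {
                stem = stem.substring(0, hash);
            }
            // Drop the old query but keep the full path, including the last segment.
            int query = stem.indexOf('?');
            if (query >= 0) {
                stem = stem.substring(0, query);
            }
            return stem + ref;
        }
        // Every other kind of reference is left to the standard library.
        return URI.create(base).resolve(ref).toString();
    }

    public static void main(String[] args) {
        String base = "http://xxxx/asp/list_books.asp?id_f=11327";
        // Prints http://xxxx/asp/list_books.asp?id_r=442&id=41&order=
        System.out.println(resolve(base, "?id_r=442&id=41&order="));
    }
}
```

The special case is handled with plain string operations so that the result does not depend on how a given library version treats query-only references.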

dogbane
This patch helped me: https://issues.apache.org/jira/browse/NUTCH-797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel