ansaurus

Question

Answer 1

+2 A:

What version of Nutch are you using? I'm not familiar with Nutch, but the default download of Nutch 1.0 already contains a rule in regex-normalize.xml that seems to handle this problem:

<!-- removes session ids from urls (such as jsessionid and PHPSESSID) -->
<regex>
  <pattern>([;_]?((?i)l|j|bv_)?((?i)sid|phpsessid|sessionid)=.*?)(\?|&amp;|#|$)</pattern>
  <substitution>$4</substitution>
</regex>
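
To see what this rule does to a URL, here is a minimal Java sketch that applies the same pattern and substitution with java.util.regex. The class name, the helper stripSessionId and the example.com URLs are made up for illustration and are not part of Nutch. The only change to the pattern is that the XML entity &amp; is written as a plain &.

```java
import java.util.regex.Pattern;

class SessionIdRuleSketch {
    // Pattern copied from the regex-normalize.xml rule above;
    // the XML entity &amp; is decoded to a plain '&' here.
    private static final Pattern SESSION_ID = Pattern.compile(
        "([;_]?((?i)l|j|bv_)?((?i)sid|phpsessid|sessionid)=.*?)(\\?|&|#|$)");

    // Hypothetical helper (not a Nutch API): replaces every match with
    // group 4, i.e. only the delimiter that followed the session id is kept.
    static String stripSessionId(String url) {
        return SESSION_ID.matcher(url).replaceAll("$4");
    }

    public static void main(String[] args) {
        // Session id followed by a query string
        System.out.println(stripSessionId("http://example.com/page;jsessionid=ABC123?x=1"));
        // Session id at the very end of the URL
        System.out.println(stripSessionId("http://example.com/page;jsessionid=ABC123"));
    }
}
```

Group 4 captures whatever terminated the session id (?, &, # or the end of the URL), so substituting $4 drops the id itself but leaves the rest of the URL in place.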

By the way, regex-urlfilter.txt seems to contain something relevant too:

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]
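
As a rough illustration of why this line matters here, the Java sketch below checks URLs against the same character class. It assumes the rule is an exclusion (the leading -) and that the expression may match anywhere in the URL. The class name, the helper isSkipped and the URLs are hypothetical.

```java
import java.util.regex.Pattern;

class QueryCharFilterSketch {
    // Character class taken from the "-[?*!@=]" line; the leading '-'
    // (not part of the regex) is assumed to mark the rule as an exclusion.
    private static final Pattern PROBABLE_QUERY = Pattern.compile("[?*!@=]");

    // Hypothetical helper (not a Nutch API): true when the exclusion
    // rule would match somewhere in the URL.
    static boolean isSkipped(String url) {
        return PROBABLE_QUERY.matcher(url).find();
    }

    public static void main(String[] args) {
        // Contains '=' because of the session id
        System.out.println(isSkipped("http://example.com/page;jsessionid=ABC123"));
        // Contains none of the listed characters
        System.out.println(isSkipped("http://example.com/docs/index.html"));
    }
}
```

Under these assumptions, a URL that carries a session id would be skipped by this rule simply because it contains an = sign.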

Then there are some settings in nutch-default.xml that you might want to check:

urlnormalizer.order
urlnormalizer.regex.file
plugin.includes

If none of that helps, maybe this question does: "How can I force fetcher to use custom nutch-config?"

jitter 2009-11-17 23:19:15

I am using Nutch version 0.8.1. This version has the following setting in nutch-default.xml: urlnormalizer.class ... instead of urlnormalizer.order. I changed the value from org.apache.nutch.net.BasicUrlNormalizer to org.apache.nutch.net.RegexUrlNormalizer. This is what causes the regex-normalize.xml file to actually be engaged when crawling. Also, I added the following plugin to the plugin.includes value: urlnormalizer-(pass|regex|basic). This is not included by default in version 0.8.1. Thanks so much for pointing me in the right direction!

Anand Krishnan 2009-11-20 20:42:47

No problem. Now just consider up-voting and accepting my answer.

jitter 2009-11-20 22:02:33

configuring nutch regex-normalize.xml

related questions