The use cases you cite are somewhat out of date and refer to Heritrix 1.x (filters have been replaced with decide rules, very different configuration framework). Still the basic concept is the same.
The cxml file is basically a Spring configuration file. You need to configure the property shouldProcessRule
on the ARCWriter bean to be the ContentTypeMatchesRegexDecideRule
A possible ARCWriter configuration:
<bean id="warcWriter" class="org.archive.modules.writer.ARCWriterProcessor">
<property name="shouldProcessRule">
<bean class="org.archive.modules.deciderules.ContentTypeMatchesRegexDecideRule">
<property name="decision" value="ACCEPT" />
<property name="regex" value="^text/html.*">
</bean>
</property>
<!-- Other properties that need to be set ... -->
</bean>
This will cause the Processor to only process those items that match the DecideRule, which in turn only passes those whose content type (mime type) matches the provided regular expression.
Be careful about the 'decision' setting. Are you ruling things in our out? (My example rules things in, anything not matching is ruled out).
As shouldProcessRule
is inherited from Processor, this can be applied to any processor.
More information about configuring Heritrix 3 can be found on the Heritrix 3 Wiki (the user guide on crawler.archive.org is about Heritrix 1)