Hi,

I want to know where the Heritrix web crawler stores the files it has crawled.

Thanks in advance.

A: 

From the developer manual:

By default, Heritrix writes all crawled content to disk using ARCWriterProcessor. This processor writes the crawled content as Internet Archive ARC files. The ARC file format is described here: Arc File Format. Heritrix writes version 1 ARC files [1].

The ARC files are located in the arcs/ folder of your crawl instance. You can change the location in the settings of the Heritrix web GUI.
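If you want to check what a crawl has written so far, a small script can list the ARC files in that folder. This is only a sketch: the job directory path is a placeholder you would replace with your own, `list_arc_files` is a helper name made up for this example, and it assumes the files end in `.arc` or `.arc.gz` (the compressed form).

```python
from pathlib import Path


def list_arc_files(job_dir):
    """Return the ARC files found in <job_dir>/arcs, sorted by name.

    Assumption: ARC files end in ".arc" or ".arc.gz". Anything else in
    the folder is ignored.
    """
    arcs_dir = Path(job_dir) / "arcs"
    if not arcs_dir.is_dir():
        # No arcs/ folder: the crawl has not written anything here,
        # or the output location was changed in the settings.
        return []
    return sorted(
        p for p in arcs_dir.iterdir()
        if p.is_file() and p.name.endswith((".arc", ".arc.gz"))
    )


if __name__ == "__main__":
    # Hypothetical path; point this at your own crawl-instance directory.
    for arc in list_arc_files("/path/to/your/crawl-instance"):
        print(arc.name, arc.stat().st_size, "bytes")
```

If the list comes back empty, check whether the output location was changed in the settings.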

Instead of the default ARCWriterProcessor, you can set it to WARCWriterProcessor (WARC files), to MirrorWriterProcessor (no container at all) or to a Kw3WriterProcessor. AFAIK, you could even set multiple writers. Note that when choosing the MirrorWriterProcessor, not all files may be written to disk, depending on the file system you're writing the files to.

[1] Internet Archive ARC files

Bart Kiers