I'm working with ARC files that were generated by a Heritrix crawl. When I view these pages in the Wayback Machine, it looks like most of the graphics are being loaded from my local machine, so I'm assuming that those graphics are stored inside the ARC files. Is that correct? If so, what is the best way to extract the images?
A:
I found one solution, a Perl script called arc_extractor: https://wiki.lib.umn.edu/wupl/DI2.HowToCrawl/arc_extractor.txt
It extracts every file stored in the ARC file into folders named after the site each file was retrieved from. And yes, that includes the image files.
The script isn't very elegant, so if anyone has other suggestions I'd be interested in hearing about them.
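For comparison, here is a rough Python sketch of the same idea, limited to images. It is not a port of arc_extractor. It assumes ARC version 1 records (a header line of the form "URL IP-address Archive-date Content-type Archive-length" followed by that many bytes), files that are either plain or gzip-compressed, and payloads that are raw HTTP responses. The names iter_arc_records, safe_name and extract_images are made up for this example.

```python
import gzip
import os
import re
from urllib.parse import urlsplit


def iter_arc_records(stream):
    """Yield (url, content_type, payload) for every record in an ARC v1 stream."""
    while True:
        line = stream.readline()
        if not line:
            break  # end of file
        if not line.strip():
            continue  # blank line separating two records
        # Assumed v1 header: URL IP-address Archive-date Content-type Archive-length
        fields = line.decode("latin-1").split()
        url, content_type, length = fields[0], fields[3], int(fields[-1])
        yield url, content_type, stream.read(length)


def safe_name(text, default):
    """Reduce text to characters that are safe in a file or folder name."""
    return re.sub(r"[^A-Za-z0-9._-]", "_", text).lstrip(".") or default


def extract_images(arc_path, out_dir):
    """Write each image record to out_dir/<host>/<file name>; return the paths."""
    opener = gzip.open if arc_path.endswith(".gz") else open
    written = []
    with opener(arc_path, "rb") as stream:
        for url, content_type, payload in iter_arc_records(stream):
            if not url.startswith("http") or not content_type.startswith("image/"):
                continue  # skips the filedesc:// record, DNS records, HTML, etc.
            # The payload is the raw HTTP response: headers, blank line, body.
            _headers, sep, body = payload.partition(b"\r\n\r\n")
            if not sep:
                continue  # no header/body boundary found; skip rather than guess
            parts = urlsplit(url)
            host = safe_name(parts.hostname or "", "unknown-host")
            name = safe_name(os.path.basename(parts.path), "index")
            folder = os.path.join(out_dir, host)
            os.makedirs(folder, exist_ok=True)
            path = os.path.join(folder, name)
            with open(path, "wb") as out:
                out.write(body)
            written.append(path)
    return written
```

Calling extract_images("crawl.arc.gz", "extracted") (both names are placeholders) would leave one folder per host under "extracted". Two images with the same file name on the same host overwrite each other in this sketch, so a real tool would need to keep more of the URL path.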
rayan
2010-06-21 15:41:42