views: 260

answers: 6
What would be a good tool, or set of tools, to download a list of URLs and extract only the text content? Spidering is not required, but control over the downloaded file names and threading would be a bonus.

The platform is Linux.

+4  A: 

wget | html2ascii

Note: html2ascii can also be called html2a or html2text (and I wasn't able to find a proper man page on the net for it).

See also: lynx.

dsm
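A minimal sketch of that pipeline applied to a list of URLs (assumptions: GNU wget and html2text are installed, and a hypothetical urls.txt holds one URL per line; wget needs -O - to write the page to stdout, and the output file name is derived from the URL to give some control over naming):

# fetch each URL and keep only the text content
while read -r url; do
    # derive an output file name by replacing ':' and '/' with '_'
    name=$(echo "$url" | tr ':/' '__')
    wget -q -O - "$url" | html2text > "$name.txt"
done < urls.txt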
Does html2text have an option to strip whitespace? I couldn't find one.
Cammel
Not that I am aware of, but you can use awk/sed/perl, etc., to strip the whitespace.
dsm
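A small follow-up sketch of that clean-up using sed (GNU sed assumed), squeezing runs of whitespace and dropping blank lines from the html2text output; "$url" and "$name" are the placeholders from the sketch above:

# squeeze whitespace runs to a single space and delete empty lines
wget -q -O - "$url" | html2text | sed -e 's/[[:space:]]\+/ /g' -e '/^ *$/d' > "$name.txt"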
A: 

I know that w3m can be used to render an HTML document and put the text content into a text file: w3m www.google.com > file.txt, for example.

For the downloading part, I'm sure that wget can be used.

Jean Azzopardi
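To apply the same idea to a whole list of URLs, a rough sketch (assuming w3m is installed and a hypothetical urls.txt with one URL per line; -dump prints the rendered page to stdout):

i=0
while read -r url; do
    i=$((i + 1))
    # number the output files so each URL gets its own text file
    w3m -dump "$url" > "page_$i.txt"
done < urls.txt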
A: 

Look for the Simple HTML DOM parser for PHP on SourceForge. Use it to parse HTML that you have downloaded with cURL. Each DOM element will have a "plaintext" attribute that should give you only the text. I have used this combination successfully in a lot of applications for quite some time.

Robert Elwell
A: 

Perl (Practical Extraction and Report Language) is a scripting language that is excellent for this type of work. http://search.cpan.org/ contains a lot of modules that have the required functionality.

olle
A: 

Use wget to download the required HTML and then run html2text on the output files.

Krishna Gopalakrishnan
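A rough sketch of that two-step approach, using xargs -P to cover the parallel-download bonus (assumptions: GNU wget, html2text, and xargs are installed, and a hypothetical urls.txt has one URL per line):

mkdir -p pages text
# download up to four pages at a time into ./pages, keeping wget's file names
xargs -n 1 -P 4 wget -q --directory-prefix=pages < urls.txt
# convert each downloaded page to plain text
for f in pages/*; do
    html2text "$f" > "text/$(basename "$f").txt"
done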
+2  A: 

Python's Beautiful Soup allows you to build a nice extractor.

S.Lott