What would be a good tool, or set of tools, to download a list of URLs and extract only the text content? Spidering is not required, but control over the downloaded file names and threading would be a bonus.
The platform is Linux.
Note: html2ascii can also be called html2a or html2text (and I wasn't able to find a proper man page on the net for it). See also: lynx.
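As a rough illustration (example.com and the file names are just placeholders, and this assumes the classic command-line html2text is installed), both tools can be driven straight from the shell:

    # Render a page directly to plain text with lynx (-nolist drops the trailing link list).
    lynx -dump -nolist http://example.com/ > example.txt

    # Convert an HTML file you already have on disk with html2text.
    html2text page.html > page.txt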
I know that w3m can be used to render an HTML document and write the text content to a text file, e.g. w3m www.google.com > file.txt. For the rest (downloading the files), I'm sure wget can be used.
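A minimal sketch of that idea, assuming the URLs sit one per line in a file called urls.txt and that simple numbered output names are acceptable; here w3m fetches the pages itself, but you could just as well wget them first and run w3m -dump on the local copies:

    #!/bin/sh
    # Feed each URL to w3m -dump and save the rendered text under a numbered name.
    n=0
    while read -r url; do
        n=$((n + 1))
        w3m -dump "$url" > "page-$n.txt"
    done < urls.txt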
Look for the Simple HTML DOM Parser for PHP on SourceForge. Use it to parse the HTML you have downloaded with cURL. Each DOM element has a "plaintext" attribute, which should give you only the text. I have used this combination very successfully in a lot of applications for quite some time.
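If you drive cURL from the shell for the download half, something like the following could work (urls.txt and the page-N.html naming are assumptions, and the crude wait gives you roughly four downloads at a time); the Simple HTML DOM parsing step would then run over the saved files in PHP:

    #!/bin/sh
    # Fetch each URL with curl, keeping at most four downloads running at once.
    n=0
    while read -r url; do
        n=$((n + 1))
        curl -sL "$url" -o "page-$n.html" &
        [ $((n % 4)) -eq 0 ] && wait    # crude throttle: wait after every 4th job
    done < urls.txt
    wait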
Perl (Practical Extraction and Report Language) is a scripting language that is excellent for this type of work. http://search.cpan.org/ hosts a lot of modules with the required functionality, for example LWP::Simple for downloading pages and HTML::FormatText for extracting the plain text.
Use wget to download the required html and then run html2text on the output files.
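A sketch of that two-step pipeline, again assuming a urls.txt list and that the downloaded files end up with an .html extension (wget picks the names from the URLs):

    # Step 1: download everything in urls.txt.
    wget -q -i urls.txt

    # Step 2: convert each downloaded HTML file to plain text.
    for f in *.html; do
        html2text "$f" > "${f%.html}.txt"
    done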