What would be a good tool, or set of tools, to download a list of URLs and extract only the text content? Spidering is not required, but control over the downloaded file names and threading would be a bonus.
The platform is Linux.
Note: html2ascii can also be called html2a or html2text (and I wasn't able to find a proper man page on the net for it). See also: lynx.
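As a rough illustration (example.com and the file names are just placeholders, and this assumes the classic command-line html2text is installed), both tools can be driven straight from the shell:

    # Render a page directly to plain text with lynx (-nolist drops the trailing link list).
    lynx -dump -nolist http://example.com/ > example.txt

    # Convert an HTML file you already have on disk with html2text.
    html2text page.html > page.txt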
I know that w3m can be used to render an HTML document and write the text content to a text file, e.g. w3m www.google.com > file.txt. For the rest (downloading the files), I'm sure wget can be used.
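A minimal sketch of that idea, assuming the URLs sit one per line in a file called urls.txt and that simple numbered output names are acceptable; here w3m fetches the pages itself, but you could just as well wget them first and run w3m -dump on the local copies:

    #!/bin/sh
    # Feed each URL to w3m -dump and save the rendered text under a numbered name.
    n=0
    while read -r url; do
        n=$((n + 1))
        w3m -dump "$url" > "page-$n.txt"
    done < urls.txt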
Look for the Simple HTML DOM Parser for PHP on SourceForge. Use it to parse the HTML you have downloaded with cURL. Each DOM element has a "plaintext" attribute, which should give you only the text. I have used this combination very successfully in a lot of applications for quite some time.
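If you drive cURL from the shell for the download half, something like the following could work (urls.txt and the page-N.html naming are assumptions, and the crude wait gives you roughly four downloads at a time); the Simple HTML DOM parsing step would then run over the saved files in PHP:

    #!/bin/sh
    # Fetch each URL with curl, keeping at most four downloads running at once.
    n=0
    while read -r url; do
        n=$((n + 1))
        curl -sL "$url" -o "page-$n.html" &
        [ $((n % 4)) -eq 0 ] && wait    # crude throttle: wait after every 4th job
    done < urls.txt
    wait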
Perl (Practical Extraction and Report Language) is a scripting language that is excellent for this type of work. http://search.cpan.org/ hosts a lot of modules with the required functionality, for example LWP::Simple for downloading pages and HTML::FormatText for extracting the plain text.
Use wget to download the required html and then run html2text on the output files.
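A sketch of that two-step pipeline, again assuming a urls.txt list and that the downloaded files end up with an .html extension (wget picks the names from the URLs):

    # Step 1: download everything in urls.txt.
    wget -q -i urls.txt

    # Step 2: convert each downloaded HTML file to plain text.
    for f in *.html; do
        html2text "$f" > "${f%.html}.txt"
    done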