Possible Duplicate:
Which CPAN module would you recommend for turning HTML into plain text?

Question:

  • Is there a module that renders HTML to text while honoring font-style tags such as <tt>, <b>, and <i>, as well as the line break <br>, similar to Lynx?

For example:

# cat test.html

<body>  
<div id="foo" class="blah">  
<tt>test<br>
<b>test</b><br>
whatever<br>
test</tt>
</div>
</body>

# lynx.exe --dump test.html

test
test
whatever
test

Note: the second line should be bold.

+9  A: 

Lynx is a big program, and its HTML rendering is non-trivial to reproduce.

How about this:

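my $lynx = '/path/to/lynx';
my $html = '[ html here ]';  # placeholder: the HTML source as a string
my $txt = `$lynx --dump --width 9999 -stdin <<'EOF'\n$html\nEOF\n`;  # quoted 'EOF' keeps the shell from expanding $ or backticks in the HTML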
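
The backtick call goes through the shell, so the HTML has to survive shell parsing. A minimal sketch of a shell-free variant using the core IPC::Open2 module follows; the helper name render_with is made up for illustration, and the lynx path and options are simply the ones used above.

```perl
use strict;
use warnings;
use IPC::Open2;

# Hypothetical helper: send $html to a command's stdin, return its stdout.
# The command is passed as a list, so no shell is involved and nothing
# in the HTML can be interpreted as shell syntax.
sub render_with {
    my ($html, @cmd) = @_;
    my $pid = open2(my $out, my $in, @cmd);
    print {$in} $html;
    close $in;                            # signal end of input
    my $text = do { local $/; <$out> };   # slurp everything the command printed
    close $out;
    waitpid $pid, 0;                      # reap the child process
    return $text;
}

# Usage, assuming lynx lives at this path:
# my $txt = render_with($html, '/path/to/lynx', '--dump', '--width', '9999', '-stdin');
```

This writes all input before reading any output, which suits a program that reads its whole input first. A command that streams output while still reading could block on very large documents.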
singingfish
+1. Wanted to suggest the same answer.
Boldewyn
This might just have to do, unfortunately.
Aaron
Shouldn't line 3 contain $lynx instead of lynx? Otherwise, /path/to/lynx is ignored.
Tristan Havelick
+6  A: 

Go to search.cpan.org and search for "HTML text", which will give you lots of options to suit your particular needs. HTML::FormatText is a good baseline; from there you can branch out to specific variations of it, for example HTML::FormatText::WithLinks if you want to preserve links as footnotes.
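
As a rough sketch of the baseline, assuming HTML::FormatText (from the HTML-Format distribution) and its HTML::TreeBuilder dependency are installed, the test document from the question can be formatted like this. The margin values are arbitrary choices for illustration.

```perl
use strict;
use warnings;
use HTML::FormatText;   # CPAN module, not part of core Perl

# The sample document from the question
my $html = <<'HTML';
<body>
<div id="foo" class="blah">
<tt>test<br>
<b>test</b><br>
whatever<br>
test</tt>
</div>
</body>
HTML

# format_string parses the HTML and returns the formatted text
my $text = HTML::FormatText->format_string(
    $html,
    leftmargin  => 0,
    rightmargin => 72,
);
print $text;
```

As far as I know this emits plain text only, so the line breaks are kept but the bold on the second line is not marked in any way.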

Schwern
+2  A: 

I am on Windows, so I cannot fully test this, but you can adapt htext, which comes with HTML::Parser:

#!/usr/bin/perl

use strict; use warnings;

use HTML::Parser 3.00 ();
use Term::ANSIColor;

my %inside;

# Track nesting depth per tag: +1 on a start tag, -1 on an end tag.
sub tag {
   my($tag, $num) = @_;
   $inside{$tag} += $num;
   print " ";  # separator; ideally not printed for every tag
}

sub text {
    return if $inside{script} || $inside{style};
    my $esc = 1;
    if ( $inside{b} or $inside{strong} ) {
        print color 'blue';
    }
    elsif ( $inside{i} or $inside{em} ) {
        print color 'yellow';
    }
    else {
        $esc = 0;
    }
    print $_[0];
    print color 'reset' if $esc;
}

HTML::Parser->new(api_version => 3,
    handlers => [
        start => [\&tag, "tagname, '+1'"],
        end   => [\&tag, "tagname, '-1'"],
        text  => [\&text, "dtext"],
    ],
    marked_sections => 1,
)->parse_file(shift) || die "Can't open file: $!\n";
Sinan Ünür