I have some HTML and I need to extract the actual written text from the page.

So far I have tried rendering the page in a web browser control and reading the text from its document property. This works, but only where the browser control (the IE COM object) is available. I also want this to run under Wine, so I need a solution that doesn't use IE COM.

There must be a reasonable programmatic way to do this.

+3  A: 

I'm not sure what the recommended way of parsing HTML in Delphi is, but if it were me, I'd be tempted to just bundle a copy of html2text (either the older C++ program by that name or the newer Python program) and spawn a call to one of those.

You can turn the Python html2text into an executable using py2exe. Both html2text programs are licensed under the GPL, but as long as you merely bundle their executable with your app and make their source available according to the GPL's restrictions, then you ought to be okay.

Josh Kelley
One of the text-mode browsers (lynx, links, w3m) might also do this; IIRC w3m has a -dump parameter for it. They probably have MinGW builds somewhere, or are at least available in Cygwin.
Marco van de Voort
A: 

Instead of using a TWebBrowser, you can use a TIdHTTP directly and call its Get method.
It returns the HTML as a string.
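
A minimal sketch of that idea, assuming Indy 10 is installed. FetchHtml is just a name picked for this example and the URL is a placeholder. Note that what comes back is the raw markup, so the tags still have to be removed afterwards.

```pascal
program FetchHtmlDemo;
{$APPTYPE CONSOLE}

uses
  SysUtils, IdHTTP;

{ Downloads the page at URL and returns the raw HTML markup.
  FetchHtml is a hypothetical helper name, not part of Indy. }
function FetchHtml(const URL: string): string;
var
  Http: TIdHTTP;
begin
  Http := TIdHTTP.Create(nil);
  try
    Result := Http.Get(URL);
  finally
    Http.Free;
  end;
end;

begin
  try
    { example.com is only a placeholder address }
    Writeln(FetchHtml('http://example.com/'));
  except
    on E: Exception do
      Writeln('Download failed: ', E.Message);
  end;
end.
```

For https URLs, TIdHTTP additionally needs an SSL IOHandler and the OpenSSL libraries, which this sketch leaves out.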

François
That's the underlying HTML, not the rendered text. "Rendered" means the text a human being would read if he or she looked at the Web browser displaying the page on the screen.
Rob Kennedy
Oh OK. I thought the OP wanted to get the html without needing IE. That would be the 1st step though. ... and yes I should read more carefully ;-)
François
Combined with StripHTMLTags (@lkessler), this becomes a nice option.
Chris Thornton
+1  A: 

Here's a nice simple routine, copied from Scalabium:

function StripHTMLTags(const strHTML: string): string;
var
  P: PChar;
  InTag: Boolean;
begin
  Result := '';
  InTag := False;
  P := PChar(strHTML);

  { while..do rather than repeat..until, so that an empty string
    is never read past its terminating #0 }
  while P^ <> #0 do
  begin
    case P^ of
      '<': InTag := True;
      '>': InTag := False;
      #13, #10: ; { drop line breaks }
    else
      { keep text outside tags, but skip a space or tab that is
        followed by more whitespace or by the start of a tag }
      if not InTag and
         not ((P^ in [#9, #32]) and ((P + 1)^ in [#10, #13, #32, #9, '<'])) then
        Result := Result + P^;
    end;
    Inc(P);
  end;

  { decode the basic character entities; &amp; is handled last so that
    text such as &amp;lt; ends up as the literal &lt; and not as < }
  Result := StringReplace(Result, '&quot;', '"',  [rfReplaceAll]);
  Result := StringReplace(Result, '&apos;', '''', [rfReplaceAll]);
  Result := StringReplace(Result, '&gt;',   '>',  [rfReplaceAll]);
  Result := StringReplace(Result, '&lt;',   '<',  [rfReplaceAll]);
  Result := StringReplace(Result, '&amp;',  '&',  [rfReplaceAll]);
  { add further entities here (e.g. &nbsp;) if you need them }
end;

You can then easily modify this to do exactly what you want.
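
As one example of such a modification: the routine only skips what sits inside angle brackets, so the contents of script and style elements would still end up in the output. A rough sketch of a pre-processing step that cuts those elements out first is shown here. RemoveElement is a hypothetical helper written for this example, not part of the Scalabium code, and it uses a plain prefix match on the tag name, so it is only meant for names like script and style.

```pascal
program RemoveElementDemo;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  SysUtils, StrUtils;

{ Removes every <TagName ...>...</TagName> block, including its contents.
  Matching is case-insensitive. An unterminated block is cut to the end
  of the string. Hypothetical helper, for illustration only. }
function RemoveElement(const HTML, TagName: string): string;
var
  Lower, OpenTag, CloseTag: string;
  StartPos, EndPos: Integer;
begin
  Result := HTML;
  OpenTag := '<' + LowerCase(TagName);
  CloseTag := '</' + LowerCase(TagName) + '>';
  repeat
    { search a lower-cased copy; positions are the same as in Result }
    Lower := LowerCase(Result);
    StartPos := Pos(OpenTag, Lower);
    if StartPos = 0 then
      Break;
    EndPos := PosEx(CloseTag, Lower, StartPos);
    if EndPos = 0 then
      Delete(Result, StartPos, MaxInt)
    else
      Delete(Result, StartPos, EndPos - StartPos + Length(CloseTag));
  until False;
end;

begin
  { prints <p>Hi</p> }
  Writeln(RemoveElement(RemoveElement(
    '<p>Hi</p><script>var a = "<b>";</script><style>p{}</style>',
    'script'), 'style'));
end.
```

The result of the two RemoveElement calls can then be passed to StripHTMLTags as before.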

lkessler