Hi, I'm using WebBrowser to get the source of HTML pages. The page source contains some text and some HTML tags, like this:

FONT></P><P align=center><FONT color=#ccffcc size=3>**Hello There , This is a text in our html page** </FONT></P><P align=center> </P>

The HTML tags are random and we cannot predict them. Is there any way to get only the text and separate it from the HTML tags?

+1  A: 

You should look at using the Delphi DOM HTML parser

irishbuzz
Thank you. I need a more detailed description (the question has also been edited).
Kermia
+1  A: 

If the asterisks are always present, you can simply take every character between the two ** markers. If they are not, you can rewrite the string and erase all tags (everything that starts with < and ends with >). Alternatively, you can use a DOM parser library.
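The tag-erasing idea can be sketched as a small console program. This is only an illustration: StripHtmlTags and the program name are hypothetical, and the routine does not decode entities such as &amp;nbsp;, skip script or style blocks, or cope with a > inside an attribute value. A fragment whose opening < is missing, like the leading FONT> in the question's sample, is left in the output.

```delphi
program StripTagsDemo;

{$APPTYPE CONSOLE}

// Hypothetical helper: copies every character that is not between '<' and '>'.
function StripHtmlTags(const Html: string): string;
var
  I: Integer;
  InTag: Boolean;
begin
  Result := '';
  InTag := False;
  for I := 1 to Length(Html) do
  begin
    if Html[I] = '<' then
      InTag := True              // a tag starts: stop copying
    else if (Html[I] = '>') and InTag then
      InTag := False             // the tag ends: resume copying after it
    else if not InTag then
      Result := Result + Html[I]; // ordinary text outside any tag
  end;
end;

begin
  // Prints: Hello
  Writeln(StripHtmlTags('<P align=center><FONT size=3>Hello</FONT></P>'));
end.
```

For anything beyond simple, well-formed fragments, a real parser is likely the safer choice.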

Svisstack
The tags and the text are random!
Kermia
+1  A: 

You can use a TWebBrowser instance to parse the HTML code and extract the plain text from it.

See this sample:

uses
  Variants, // VarArrayCreate
  MSHTML,
  SHDocVw,
  ActiveX;

function GetPlainText(const Html: string): string;
var
  DummyWebBrowser: TWebBrowser;
  Document       : IHtmlDocument2;
  DummyVar       : Variant;
begin
  Result := '';
  DummyWebBrowser := TWebBrowser.Create(nil);
  try
    // Open a blank page to create an IHtmlDocument2 instance
    DummyWebBrowser.Navigate('about:blank');
    Document := DummyWebBrowser.Document as IHtmlDocument2;
    if Assigned(Document) then // Check that the document is available
    begin
      // Create a one-element variant array to pass the HTML to IHtmlDocument2.Write
      DummyVar    := VarArrayCreate([0, 0], varVariant);
      DummyVar[0] := Html; // Assign the HTML code to the variant array
      Document.Write(PSafeArray(TVarData(DummyVar).VArray)); // Write the HTML into the document
      Document.Close;
      // Read the rendered body back as plain text
      Result := (Document.body as IHTMLBodyElement).createTextRange.text;
    end;
  finally
    DummyWebBrowser.Free;
  end;
end;
RRUZ
Thank you, but with this function the result is: "FONT></P><P align=center><FONT color=#ccffcc size=3>**Hello There , This is a text in our html page** </FONT></P><P align=center> </P>". The HTML tags are still there.
Kermia
It was solved by nesting the calls: GetPlainText(GetPlainText(MyString)); :D Thank you, Mr Piruz.
Kermia
A: 

In essence: in general you can't.

HTML is a markup language with such wide use, and such mind-boggling possibilities for changing content dynamically, that it is virtually impossible to do this in general (just look at how hard the web browser vendors need to work to pass, for instance, the Acid tests). So you can only handle a subset.

For specific, well-defined subsets of HTML, you have a better chance:

First you need to get the HTML in a string, then parse that HTML.

Getting the HTML can be done, for instance, using Indy (see answers to this question).
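As a rough sketch of the Indy route, assuming Indy's TIdHTTP component is available: DownloadHtml is a hypothetical helper name, and plain HTTP is assumed (HTTPS would additionally need an SSL IO handler, which is not shown here).

```delphi
uses
  IdHTTP;

// Hypothetical helper: downloads the page at Url and returns its source as a string.
function DownloadHtml(const Url: string): string;
var
  Http: TIdHTTP;
begin
  Http := TIdHTTP.Create(nil);
  try
    // TIdHTTP.Get raises an exception on HTTP errors; callers may want to handle that
    Result := Http.Get(Url);
  finally
    Http.Free;
  end;
end;
```

The returned string can then be handed to whichever parsing approach fits your HTML.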

Parsing depends heavily on your HTML and can be quite complex; you can try this question or this search.

You could use TWebBrowser as RRUZ suggests, but it depends on Internet Explorer.
Modern Windows systems no longer guarantee that Internet Explorer is installed...

--jeroen

Jeroen Pluimers
Hi Jeroen, I'm using the EmbeddedWebBrowser component and there is no problem :)
Kermia
Until you run your software on a computer that has no Internet Explorer installed; then it will fail. That might not be a problem, but it is something you need to be aware of.
Jeroen Pluimers