I wish to download a web page, which may be in any text encoding, and save it as UTF-16LE. Assuming I can determine the text's encoding (by examining the HTTP header, HTML header, and/or BOM), how do I convert the text?

I am using Delphi 2009. Unfortunately, the help files do not explain how to get from an arbitrary encoding to a Unicode (UTF-16LE) string. Specific questions:

  • Can I complete the conversion simply by setting the correct encoding on an AnsiString and assigning that to a UnicodeString?
  • If so, how do I translate the various "charset" descriptions that may label the web page (Big5, Shift-JIS, UTF-32, etc) into the right format to initialize the AnsiString?

Thanks for your suggestions.

I have a preference for straight Win32 and VCL, but answers involving ActiveX controls may also be helpful.

+2  A: 

Hi,

How are you going to access the page? Embedded Internet Explorer, Indy, a third-party tool, ...? That might influence the answer, because it determines the format of the input string.

Part 1: Getting the page

If you use the embedded Internet Explorer (TWebBrowser) to access the page, things are pretty straightforward:

// Requires MSHTML in the uses clause (IHTMLElement, IHTMLDocument3).
var htmlElement: IHTMLElement;
    myText: String;
begin
  // Get access to the HTML element of the document:
  htmlElement := (WebBrowserControl.DefaultInterface.Document as IHTMLDocument3).documentElement;
  // Receive the full HTML of the web page:
  myText := htmlElement.outerHTML;
end;

The encoding of the web page should be handled properly by IE and by Delphi, and you end up with a UnicodeString containing the result (myText in the examples).

Part 2: Saving in UTF-16LE

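Regardless of where your string came from, you can save it in the desired encoding like this:

var s: TStringStream;
begin
  // False: the stream does not take ownership of the shared TEncoding.Unicode instance.
  s := TStringStream.Create(myText, TEncoding.Unicode, False);
  try
    s.SaveToFile('yourFileToSaveTo.txt');
  finally
    FreeAndNil(s); // release the stream even if SaveToFile raises
  end;
end;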

TEncoding.Unicode is UTF-16LE, but you could also use any other encoding.

Hope this helps.

Heinrich Ulbricht
A: 

In D2009 and later, Indy 10's TIdHTTP component automatically decodes a received webpage to UTF-16 for you.
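
For illustration, a minimal sketch of that approach, assuming Indy 10 is installed. FetchPage is a hypothetical helper name and the URL is supplied by the caller; error handling is left out.

```delphi
uses
  IdHTTP;

// Hypothetical helper: downloads a page and returns it as a UnicodeString.
// In D2009 and later, Get() returns a (Unicode) string, so the charset
// decoding has already been done by Indy when this function returns.
function FetchPage(const URL: string): string;
var
  Http: TIdHTTP;
begin
  Http := TIdHTTP.Create(nil);
  try
    Result := Http.Get(URL);
  finally
    Http.Free;
  end;
end;
```

The returned string can then be written out with TEncoding.Unicode to get a UTF-16LE file.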

Doing a charset-to-Unicode conversion on Windows requires code pages (unless you use the ICONV library), so you first have to convert the charset name to a suitable code page number. Then you can either use TEncoding.GetEncoding() and TEncoding.GetString(), or call SetCodePage() on a RawByteString (not an AnsiString) and assign that to a UnicodeString. Internally, Indy uses TEncoding and has its own charset-to-codepage lookup tables.
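
The two routes described above could be sketched roughly as follows. CharsetToCodePage, DecodeBytes and DecodeRaw are hypothetical names, and the lookup table is deliberately tiny; a real one would need many more entries and aliases. UTF-32 is left out on purpose: as far as I know, the Windows code pages for it (12000/12001) are documented as available to managed applications only, so it would need separate handling.

```delphi
uses
  SysUtils;

// Hypothetical lookup: maps a few charset labels to Windows code page numbers.
// Returns 0 for anything it does not know.
function CharsetToCodePage(const Charset: string): Integer;
begin
  if SameText(Charset, 'utf-8') then
    Result := 65001
  else if SameText(Charset, 'big5') then
    Result := 950
  else if SameText(Charset, 'shift_jis') or SameText(Charset, 'shift-jis') then
    Result := 932
  else if SameText(Charset, 'iso-8859-1') then
    Result := 28591
  else if SameText(Charset, 'windows-1252') then
    Result := 1252
  else
    Result := 0;
end;

// Route 1: TEncoding.GetEncoding() + GetString() on the raw bytes.
function DecodeBytes(const Raw: TBytes; const Charset: string): UnicodeString;
var
  CP: Integer;
  Enc: TEncoding;
begin
  CP := CharsetToCodePage(Charset);
  if CP = 0 then
    raise Exception.CreateFmt('Unsupported charset: %s', [Charset]);
  Enc := TEncoding.GetEncoding(CP); // new instance, caller must free it
  try
    Result := Enc.GetString(Raw);
  finally
    Enc.Free;
  end;
end;

// Route 2: label a RawByteString with the code page, then let the RTL convert.
function DecodeRaw(const Raw: RawByteString; CodePage: Word): UnicodeString;
var
  Tmp: RawByteString;
begin
  Tmp := Raw;
  // Convert = False: only relabel the bytes, do not transcode them.
  SetCodePage(Tmp, CodePage, False);
  Result := UnicodeString(Tmp);
end;
```

Either result can then be saved through TEncoding.Unicode to produce the UTF-16LE file.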

Remy Lebeau - TeamB