I'm slowly converting my existing code to Delphi 2010 and have read several of the articles on the Embarcadero web site as well as Marco Cantù's white paper.

There are still some things I haven't understood, so here are two functions to exemplify my question:

function RemoveSpace(InStr: string): string;
var
  Ans     : string;
  I       : Word;
  L       : Word;
  TestChar: string[1];
begin
  Ans := '';
  L := Length(InStr);
  if L > 0 then
  begin
    for I := 1 to L do
    begin
      TestChar := Copy(InStr, I, 1);
      if TestChar <> ' ' then Ans := Ans + TestChar;
    end;
  end;
  RemoveSpace := Ans;
end;

function ReplaceStr(const S, Srch, Replace: string): string;
var
  I: Integer;
  Source: string;
begin
  Source := S;
  Result := '';
  repeat
    I := Pos(Srch, Source);
    if I > 0 then begin
      Result := Result + Copy(Source, 1, I - 1) + Replace;
      Source := Copy(Source, I + Length(Srch), MaxInt);
    end
    else Result := Result + Source;
  until I <= 0;
end;

For the RemoveSpace function, if no Unicode character is passed ('aa bb', for example), all is well. But if I pass the text 'ab cd', the function doesn't work as expected (I get ab??cd as the output).

How can I account for possible Unicode characters in a string? Using Length(InStr) is obviously incorrect, as is Copy(InStr, I, 1).

What's the best way of converting this code so that it accounts for Unicode characters?

Thanks!

A: 

Guessing from your problem description, you seem to be processing UTF-8-encoded strings. That's almost always a bad idea. Decode them into a saner representation first, and then operate on them. When you're done, you can encode everything as UTF-8 again.

I think the datatype for wide-character strings is "WString" in Delphi; can't look it up right now.
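
The decode, operate, re-encode idea can be sketched roughly as follows. This assumes Delphi 2009 or later, where UTF8ToString and UTF8Encode live in the System unit and string is a UTF-16 UnicodeString. RemoveSpaceUtf8 is a hypothetical helper name, not something from the question.

```pascal
program Utf8RoundTripDemo;
{$APPTYPE CONSOLE}

uses
  SysUtils; // StringReplace

// Hypothetical helper: strips spaces from UTF-8 text by working on the
// decoded UTF-16 string instead of on the raw bytes.
function RemoveSpaceUtf8(const Utf8Text: UTF8String): UTF8String;
var
  Decoded: string; // UnicodeString in Delphi 2009 and later
begin
  Decoded := UTF8ToString(Utf8Text);                          // decode
  Decoded := StringReplace(Decoded, ' ', '', [rfReplaceAll]); // operate
  Result := UTF8Encode(Decoded);                              // re-encode
end;

var
  Source: string;
  Input: UTF8String;
begin
  Source := 'ab'#$00E9' cd';   // 'ab', U+00E9, space, 'cd'
  Input := UTF8Encode(Source); // 7 bytes: U+00E9 takes two bytes in UTF-8
  // Length of a UTF8String counts bytes; 6 remain once the space is gone.
  Writeln(Length(RemoveSpaceUtf8(Input)));
end.
```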

Ringding
+12  A: 

If those were your REAL functions and you're just trying to get 'em working, then:

function RemoveSpace(const InStr: string): string;
begin
  Result := StringReplace(InStr, ' ', '', [rfReplaceAll]); 
end;

function ReplaceStr(const S, Srch, Replace: string): string;
begin
  Result := StringReplace(S, Srch, Replace, [rfReplaceAll, rfIgnoreCase]); 
end;
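
One thing to watch: the ReplaceStr in the question searches with Pos, which is case-sensitive, whereas rfIgnoreCase makes StringReplace match regardless of case. A small sketch of the difference (the sample text is made up for illustration); leave out rfIgnoreCase if the old behaviour should be preserved:

```pascal
program StringReplaceFlagsDemo;
{$APPTYPE CONSOLE}

uses
  SysUtils; // StringReplace, rfReplaceAll, rfIgnoreCase

begin
  // Case-sensitive: only the lowercase 'foo' is replaced.
  Writeln(StringReplace('Foo foo', 'foo', 'bar', [rfReplaceAll]));
  // prints: Foo bar

  // Case-insensitive: both occurrences are replaced.
  Writeln(StringReplace('Foo foo', 'foo', 'bar', [rfReplaceAll, rfIgnoreCase]));
  // prints: bar bar
end.
```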
Aldo
Thanks, these are indeed real functions dating back to D3 or D4 that I never got around to upgrading.
smartins
+1  A: 

(We do not use D2010 at the moment, so beware!)

The problem in Delphi is with string literals that contain characters outside the basic ASCII range. When they are passed to string routines, the non-ASCII characters are replaced with question marks.

To avoid this, cast the text literals to WideString before passing them as parameters to the function.

I do not know whether it applies to the StringReplace routine, but Delphi's search routines Pos and PosEx do not handle Unicode correctly. We had to replace them with our own variant. For that improved routine, it is important that the parameters are of type WideString, not the normal string type.

We did this in D7 when handling Unicode, and all works well.

I don't think this advice about WideStrings is correct (at least, not for D2010). A WideString is a non-refcounted wide (Unicode-capable) string, mostly used for COM, I think. Prior to 2009 it was the only Unicode-capable string you could use, which is no longer the case. Also, nothing about passing (Unicode) strings to string functions causes "the non-ascii-characters are replaced with question marks" (a conversion to ANSI) unless you are downcasting to an AnsiString. The simplest solution is not to do that: use "string" as your string type throughout your application.
David M
I explicitly mentioned that we observed this behavior in D7! The problem manifests itself when non-ASCII literals are concatenated with the '+' operator. Then they are implicitly converted to ANSI and non-ASCII characters are replaced with question marks. This behavior was not caused by us explicitly casting to ANSI! The solution is simple: explicitly cast to WideString before concatenation. The unit tests that show this problem are right in front of me, so I am not making this up. We use String as the only string type in our code base.
+1  A: 

Although string is a Unicode type now, when you specify a length, you still get the non-Unicode ShortString type. The TestChar variable in your RemoveSpace function is a non-Unicode one-character string. What you should have been using all along is a real Char variable. I expect you came from the VB world, where one-character strings were the same as single characters. In Delphi, a string isn't the same as a character, so when you call Copy, you get a string.

In Unicode Delphi, that one-character string gets reduced to a non-Unicode string, and if there's no representation for that character in the current code page, you get a question mark instead. Fix it like this:

function RemoveSpace(const InStr: string): string;
var
  I: Integer;
  TestChar: Char;
begin
  Result := '';
  for I := 1 to Length(InStr) do
  begin
    TestChar := InStr[I];
    if TestChar <> ' ' then
      Result := Result + TestChar;
  end;
end;

I got rid of Ans. As of Turbo Pascal 7, you can use the implicitly declared Result variable instead of declaring your own and then assigning it to the function name. Result is readable and writable. Also, you don't need to worry about zero-length input. When the upper bound of a "for-to" loop is less than the lower bound, the loop simply doesn't run, so you don't need to check beforehand. Finally, I used the bracket operators on InStr to extract the character at the given index instead of getting a one-character-long string.
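
A quick way to check the points above is to call the function with an empty string and with text containing non-ASCII characters. The sketch below uses a slightly condensed copy of the function so that it compiles on its own, assumes Delphi 2009 or later, and uses made-up sample inputs. The characters are written as #$ escapes so the source file's encoding doesn't matter.

```pascal
program RemoveSpaceUsage;
{$APPTYPE CONSOLE}

function RemoveSpace(const InStr: string): string;
var
  I: Integer;
begin
  Result := '';
  // For an empty InStr the upper bound is 0, so the loop body never runs.
  for I := 1 to Length(InStr) do
    if InStr[I] <> ' ' then
      Result := Result + InStr[I];
end;

begin
  Writeln(Length(RemoveSpace('')));  // prints: 0
  Writeln(RemoveSpace('aa bb'));     // prints: aabb
  // 'ab ', U+4E2D, U+6587, ' cd': 8 characters, 6 after removing the spaces
  Writeln(Length(RemoveSpace('ab '#$4E2D#$6587' cd'))); // prints: 6
end.
```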

You say that your uses of Length and Copy are obviously incorrect, but you're wrong. Those functions continue to work just fine in Unicode. They know that Char is two bytes wide now, so if you call them on UnicodeString variables, you'll get the right characters. They also continue to work on AnsiString variables. In fact, they also work fine on WideString variables, even in older Delphi versions.
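
As a rough illustration, assuming Delphi 2009 or later and a made-up sample string, Length and Copy count in Char units rather than bytes. One caveat: those units are UTF-16 code units, so a character outside the Basic Multilingual Plane occupies two of them.

```pascal
program LengthCopyDemo;
{$APPTYPE CONSOLE}

var
  S, Piece: string;
begin
  S := 'ab'#$4E2D'cd';            // 'ab', U+4E2D, 'cd'
  Writeln(SizeOf(Char));          // prints: 2
  Writeln(Length(S));             // prints: 5 (characters, not bytes)
  Piece := Copy(S, 3, 1);         // one-character UnicodeString
  Writeln(Length(Piece));         // prints: 1
  Writeln(Ord(Piece[1]) = $4E2D); // prints: TRUE
end.
```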

The primary problem in your code was where you stored a Unicode character into a non-Unicode string type.
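
The lossy step can be reproduced in isolation. This is only a sketch with a made-up sample string: the value stored in the ShortString depends on the system's active ANSI code page, so no single output is guaranteed. Recent compilers also warn about the implicit cast.

```pascal
program ShortStringLossDemo;
{$APPTYPE CONSOLE}

var
  S: string;
  Wide: Char;          // UTF-16 code unit, no conversion needed
  Narrow: string[1];   // ShortString: one ANSI byte plus a length byte
begin
  S := 'ab'#$4E2D'cd';
  Wide := S[3];
  Narrow := Copy(S, 3, 1); // implicit UnicodeString -> ShortString conversion

  Writeln(Ord(Wide) = $4E2D); // prints: TRUE
  // Code-page dependent: 63 ('?') when the active ANSI code page has no
  // representation for U+4E2D.
  Writeln(Ord(Narrow[1]));
end.
```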

Rob Kennedy
I'd love an explanation for the downvotes for my answer and the question.
Rob Kennedy
+1 for explaining why string[1] was the problem. "Teach a man to fish" and all that.
Incredulous Monk
A: 

String[1] does not have a Unicode version.

Try Char instead.

cst_zf