We are upgrading our project from Delphi 2006 to Delphi 2010. The old code was:

InputText: string;
InputText := SomeTEditComponent.Text;
...
for i := 1 to length(InputText) do
if InputText[i] in ['0'..'9', 'a'..'z', 'Ř' { and more special characters } ] then ...

The trouble is with accented letters: the comparison fails.

I tried switching the source file encoding from ANSI to UTF-8 and to UCS-2 LE, but without luck. Only a cast to AnsiChar works:

if CharInSet(AnsiChar(InputText[i]), ['0'..'9', 'a'..'z', 'Ř']) then

The funny thing is how Delphi handles these letters. Try this in Evaluate while debugging:

Ord('Ř') = Ord('Ø')

(yes, Delphi says True, on Windows 7 Czech)


The question is: how can I store and compare simple strings without forcing them to be AnsiStrings? If this does not work, why should we use Unicode at all?

Thanks to all for the replies.

Right now, in some places we simply use CharInSet(AnsiChar(...

+4  A: 

The declaration of CharInSet is

function CharInSet(C: AnsiChar; const CharSet: TSysCharSet): Boolean; overload; inline;
function CharInSet(C: WideChar; const CharSet: TSysCharSet): Boolean; overload; inline;

while TSysCharSet is

TSysCharSet = set of AnsiChar;

Thus CharInSet can only compare to a set of AnsiChar. That is why your accented character is converted to AnsiChar.

There is no equivalent to a set of WideChar as sets are limited to 256 elements. You have to implement some other means to check the character.

Something like

const
  specials: string = 'Ř';

if CharInSet(InputText[i], ['0'..'9', 'a'..'z']) or (Pos(InputText[I], specials) > 0) then 

might be worth a try. You can add more characters to specials as needed.
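For what it's worth, here is a minimal sketch of that idea wrapped into a small console program. It assumes Delphi 2009 or later (where Char is WideChar). IsAllowedChar is a hypothetical helper name, and the special character is written as the code point #$0158 ('Ř') so the result does not depend on the source file encoding.

```pascal
program SpecialsDemo;
{$APPTYPE CONSOLE}

uses
  SysUtils;

const
  // U+0158 is 'Ř'; append further code points as needed.
  Specials: string = #$0158;

// Hypothetical helper, not part of the RTL.
// The ASCII part still goes through CharInSet; everything that
// does not fit into a set of AnsiChar is looked up with Pos.
function IsAllowedChar(C: Char): Boolean;
begin
  Result := CharInSet(C, ['0'..'9', 'a'..'z']) or (Pos(C, Specials) > 0);
end;

var
  InputText: string;
  i: Integer;
begin
  InputText := 'a1' + #$0158 + '?';
  for i := 1 to Length(InputText) do
    Writeln(i, ': ', IsAllowedChar(InputText[i]));
end.
```

Pos does a linear search, so this is fine for a handful of special characters but not meant for large character lists.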

Uwe Raabe
Sadly you can not keep using inline sets [...] and then stick unicode literals in there.
Warren P
+1  A: 

You should either use IFs instead of IN or find a WideCharSet implementation. This might help if you have a lot of sets: http://code.google.com/p/delphilhlplib/source/browse/trunk/Library/src/Extensions/DeHL.WideCharSet.pas.

alex
If (blah) or (blah) or (blah) gets pretty crazy when you have to check six or twelve characters.
Warren P
+1  A: 

As mentioned by Uwe Raabe, the problem with Unicode chars is that there are a lot of them. If Delphi allowed you to create a "set of Char" it would be 8 KB in size! A "set of AnsiChar" is only 32 bytes in size, which is pretty manageable.
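The arithmetic behind those two sizes can be checked with a tiny console program. This is just a sketch: a set needs one bit per possible element, so 256 elements take 32 bytes and 65536 elements would take 8192 bytes.

```pascal
program SetSizeDemo;
{$APPTYPE CONSOLE}

uses
  SysUtils;

begin
  // TSysCharSet is declared as "set of AnsiChar": 256 bits.
  Writeln(SizeOf(TSysCharSet));              // prints 32
  // A hypothetical "set of WideChar" would need one bit per code unit.
  Writeln((Ord(High(WideChar)) + 1) div 8);  // prints 8192
end.
```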

I'd like to offer some alternatives. First is a sort of drop-in replacement for the CharInSet function, one that uses an array of Char to do the tests. Its only merit is that it can be called immediately from almost anywhere, but its benefits stop there. I'd avoid this if I can:

function UnicodeCharInSet(UniChr: Char; const CharArray: array of Char): Boolean;
var
  i: Integer;
begin
  // Linear scan: the cost grows with the number of candidate characters.
  for i := 0 to High(CharArray) do
    if CharArray[i] = UniChr then
    begin
      Result := True;
      Exit;
    end;
  Result := False;
end;

The trouble with this function is that it doesn't handle range syntax such as x in ['a'..'z'], and it's slow! The alternatives are faster, but aren't as close to a drop-in replacement as one might want. The first alternatives to investigate are the string functions from Microsoft. Among them are IsCharAlpha and IsCharAlphaNumeric, which might fix lots of issues. The problem with those is that all "alpha" chars are treated the same: you might end up accepting alpha chars from languages other than English or Czech. Alternatively you can use the TCharacter class from Embarcadero. Its implementation is all in the Character.pas unit, and it looks effective; I have no idea how effective Microsoft's implementation is.
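A rough sketch of both options follows, assuming Delphi 2009/2010 with the Character and Windows units available. The two wrapper names are hypothetical; TCharacter.IsLetterOrDigit and IsCharAlphaNumeric are the actual calls being illustrated.

```pascal
program CharClassDemo;
{$APPTYPE CONSOLE}

uses
  Windows, Character;

const
  RCaron = #$0158; // U+0158, 'Ř'

// Hypothetical wrapper around the RTL classification (Character.pas).
function IsWordCharRtl(C: Char): Boolean;
begin
  Result := TCharacter.IsLetterOrDigit(C);
end;

// Hypothetical wrapper around the Windows API classification.
function IsWordCharWinApi(C: Char): Boolean;
begin
  Result := IsCharAlphaNumeric(C);
end;

begin
  Writeln('RTL,    R with caron: ', IsWordCharRtl(RCaron));
  Writeln('WinAPI, R with caron: ', IsWordCharWinApi(RCaron));
  Writeln('RTL,    question mark: ', IsWordCharRtl('?'));
end.
```

Keep in mind the caveat above: both calls accept letters from any script, so they are only a replacement for the set test if "any letter or digit" is really what you want.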

Another alternative is to write your own functions, using a "case" statement to get things to work. Here's an example:

function UnicodeCharIs(UniChr: Char): Boolean;
begin
  // One branch per character of interest; anything else is rejected.
  case UniChr of
    'ă': Result := True;
    'ş': Result := False;
    'Ă': Result := True;
    'Ş': Result := False;
    else Result := False;
  end;
end;

I inspected the assembler generated for this function. While Delphi has to implement a series of "if" conditions for this, it does it very effectively, way better than writing the series of IF statements by hand. But it could use a lot of improvement.

For tests that are used a lot you might want to look for some bit-mask based implementation.
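One possible bit-mask implementation, sketched here with the TBits class from the Classes unit: one bit per UTF-16 code unit, which is the 8 KB table mentioned earlier. The names Allowed, AllowRange and IsAllowed are made up for the example, and #$0158 stands for 'Ř'.

```pascal
program BitMaskDemo;
{$APPTYPE CONSOLE}

uses
  Classes;

var
  Allowed: TBits; // hypothetical lookup table, one bit per Char value

// Marks every character from First to Last (inclusive) as allowed.
procedure AllowRange(First, Last: Char);
var
  C: Char;
begin
  for C := First to Last do
    Allowed[Ord(C)] := True;
end;

// Constant-time membership test, independent of how many chars are allowed.
function IsAllowed(C: Char): Boolean;
begin
  Result := Allowed[Ord(C)];
end;

begin
  Allowed := TBits.Create;
  try
    Allowed.Size := 65536;   // covers every possible Char value
    AllowRange('0', '9');
    AllowRange('a', 'z');
    Allowed[$0158] := True;  // 'Ř'
    Writeln('q: ', IsAllowed('q'));
    Writeln('R with caron: ', IsAllowed(#$0158));
    Writeln('!: ', IsAllowed('!'));
  finally
    Allowed.Free;
  end;
end.
```

The table has to be filled once up front, but after that the test itself is a single indexed lookup, which is about as close to "Char in Set" as this gets.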

Cosmin Prund
UnicodeCharInSet as shown above is going to be VERY slow. Try TBits instead.
Warren P
@Warren P: Yes it's slow, it's so slow I'd never use it for production code (I'd use it for code that only runs on my machine). Unfortunately all the alternatives require preparing functions or data structures in advance. What used to be a "Char in Set" one-liner turns into half a page of code. That's the only merit of the function: it keeps a one-liner on one line.
Cosmin Prund
TBits is a one-liner.
Warren P
I actually like your function idea though. Probably very, very fast, and it keeps your code more readable: IsSpecialAccent(foo)
Warren P
@Warren P: How's TBits a one-liner? Can you provide a link or sample?
Cosmin Prund
Picked just because it is the nearest to the Char in [set] language construct.
DiGi
+2  A: 

Don't rely on the encoding of your Delphi source code files.

It might be mangled when using any non-Unicode tool to work on your text files (or even buggy Unicode aware tools).

The best way is to specify your characters as a 4-digit Unicode code point.

const MyEuroSign = #$20AC;

See also my blog posting about this.

--jeroen

Jeroen Pluimers
Much better idea!
Warren P
This is a bit unrealistic if you need many letters that are common in your language, and put them into message boxes and exceptions.
DiGi
@DiGi: In your situation, you must then ensure that you save all your files as Unicode, and that you only use Unicode-safe tools. I have seen too many occasions where just a slight oversight caused a character set mismatch and created a lot of havoc. So you need to choose between two evils: less readable but guaranteed to work, or readable but with a chance of failure.
Jeroen Pluimers
+1  A: 

You have stumbled onto a case where an idiom from pre-Unicode Pascal should not be translated directly into the most visually similar idiom in Unicode-era Pascal.

First, let's deal with Unicode string literals. If you can be sure that nobody will ever use your source code with any tool that could mess up your encodings, then you could use Unicode literals. Personally, I would not like to see Unicode codepoints in string literals in any of my code, for various reasons. The strongest is that my code may need to be reviewed for internationalization at some point, and having literals that belong to your local language peppered through your code is even more of a problem when that language goes beyond the simple ASCII/ANSI code page symbols. Your source code will be more readable if your accented characters, and even non-accented character literals, are declared as Jeroen says to declare them: in the const section, away from the place in the code where you use them.

Consider the case where you use the same string literal thirty three times throughout your code. Why should it be repeated instead of a constant? And even when it is used only once, isn't the code more readable if you declare a sane constant name?

So, first you should declare constants like he shows.

Second, the CharInSet function should not be used for anything other than its intended purpose, which is code where you must continue to use "set of AnsiChar" types. For Unicode characters this is no longer a recommended approach in Delphi 2009/2010; arrays of literal Unicode characters, declared in your const section, would be more readable and more up to date.

I suggest you use the JCL StrContainsChars function and avoid character sets, since you can not declare an inline set of Unicode characters at all: the language does not allow it. Instead use this, and be sure to comment it:

implementation

uses
  JclStrings;

const
  myChar1 = #$2001;
  myChar2 = #$2002;
  myChar3 = #$2003;
  myMatchList1: array[0..2] of Char = (myChar1, myChar2, myChar3);

function Match(s: String): Boolean;
begin
  Result := StrContainsChars(s, myMatchList1, False);
end;

String and character literals peppered through your code, especially character or numeric literals, are called "magic values" and are to be avoided.

P.S. Your debug evaluation shows that Ord('Ř') is quietly downcasting the Unicode character to a byte-sized AnsiChar in the debugger. This behaviour is unexpected and should probably be logged in QC.

Warren P
thx for extending my answer.
Jeroen Pluimers