Hi,

Delphi 2009 has changed its string type to use 2 bytes per character, which allows support for Unicode character sets. The size of a string's data in bytes is now Length(s) * SizeOf(Char), with SizeOf(Char) currently being 2.

What I am interested in is whether anyone knows a way to find out, on a character-by-character basis, whether a character would fit in a single byte, e.g. whether a Char is ASCII or Unicode.

Thanks

+1  A: 

You could check the value of the character:

if ord(c) < 128 then
    // is an ascii character
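
A minimal sketch of how that check could be applied across a whole string, counting the characters that fall outside the 7-bit ASCII range. The function name CountNonAscii is made up for illustration and is not part of the RTL:

```pascal
// Hypothetical helper: counts the Char elements of S that are not 7-bit ASCII.
function CountNonAscii(const S: string): Integer;
var
  I: Integer;
begin
  Result := 0;
  for I := 1 to Length(S) do
    if Ord(S[I]) >= 128 then  // same test as above, inverted
      Inc(Result);
end;
```

A result of 0 means every character in the string is plain ASCII.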
Greg Hewgill
Thanks Greg, I should have thought about it a bit longer.
Toby Allen
Since you are using D2009 anyway, look at the new TCharacter class, e.g.: if TCharacter.IsLatin1(c) then
Remy Lebeau - TeamB
A: 

An ASCII character always fits in one byte. You can't say the same for a Unicode character, since that depends on how it is encoded. You can't tell from a single byte whether it is an ASCII or Unicode character, or whether it is a character at all for that matter. So what is your question again? And why do you need to know? My guess is you misunderstood Unicode or I misunderstood your question.

Lars Truijens
Probably the latter :)
Toby Allen
+2  A: 

If you don't want to use Unicode in Delphi 2009, you can use the AnsiString type. But why would you?

A cumbersome, but valid test could be:

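function IsAnsi(const AString: string): Boolean;
var
  tempansi : AnsiString;
  temp : string;
begin
  // Convert to the default ANSI code page; unmappable characters get replaced
  tempansi := AnsiString(AString);
  // Convert back to Unicode (explicit cast avoids an implicit-conversion warning)
  temp := string(tempansi);
  // Only a string that survives the round trip unchanged is ANSI-representable
  Result := temp = AString;
end;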
Gamecat
I'm thinking that the AnsiString should be forced to a specific codepage also such as AnsiString(CP_UTF8).
skamradt
@skamradt Wouldn't AnsiString(CP_UTF8) defeat the whole purpose of the function? All unicode strings can be represented in UTF-8 also, so the check will always return true.
Otherside
A: 

What I'm primarily interested in is knowing, before my string goes to a database (Oracle, Documentum), how many bytes the string will use up.

We need to be able to enforce limits beforehand and ideally (as we have a large installed base) without having to change the database. If a string field allows 12 bytes, in Delphi 2009 a string of length 7 would always show as using 14 bytes, even though once it got to the db it would only use 7 if ASCII, 14 if double byte, or somewhere in between if a mixture.

So my interest really is in being able to make that calculation beforehand. Greg's answer goes some of the way towards making that possible.

Toby Allen
Then see Michael Madsen's answer (http://stackoverflow.com/questions/190598/is-there-a-way-to-see-if-a-character-is-using-1-or-2-bytes-in-delphi-2009#191586)
Lars Truijens
Expanding your question via an answer really is not the best way. It's much better to edit your question and add this info, so that everyone gets all the info in the question at the top.
Otherside
+4  A: 

First of all, keep in mind that your database lengths may really be in characters, not bytes - you'll have to check the documentation for the datatype. I'm going to assume it really is the latter for the purpose of the question.

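The number of bytes your string will use depends entirely on the character encoding it will be stored with. If it's UTF-16, the encoding of the default string type in Delphi 2009, then it will always be 2 bytes per character, except for characters outside the Basic Multilingual Plane, which are stored as surrogate pairs and take 4 bytes.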

The most likely encoding, assuming the database uses a Unicode charset, however, is UTF-8. This is a variable length encoding: characters can require anywhere between 1 and 4 bytes, depending on the character. You can see a chart on Wikipedia of how the ranges are mapped.
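
As a rough sketch of how those ranges translate into a byte count, the following walks the UTF-16 code units of a Delphi 2009 string and adds up the UTF-8 length of each character. The name Utf8ByteCount is hypothetical, and treating an unpaired surrogate as 3 bytes is an assumption; a real encoder may handle that case differently:

```pascal
// Hypothetical helper: number of bytes S would need if encoded as UTF-8.
function Utf8ByteCount(const S: string): Integer;
var
  I: Integer;
  W: Word;
begin
  Result := 0;
  I := 1;
  while I <= Length(S) do
  begin
    W := Ord(S[I]);
    if W < $80 then
      Inc(Result, 1)                       // U+0000..U+007F
    else if W < $800 then
      Inc(Result, 2)                       // U+0080..U+07FF
    else if (W >= $D800) and (W <= $DBFF) and (I < Length(S)) and
            (Ord(S[I + 1]) >= $DC00) and (Ord(S[I + 1]) <= $DFFF) then
    begin
      Inc(Result, 4);                      // surrogate pair: U+10000 and above
      Inc(I);                              // skip the low surrogate
    end
    else
      Inc(Result, 3);                      // U+0800..U+FFFF (assumption: also unpaired surrogates)
    Inc(I);
  end;
end;
```

This relies on Delphi's default short-circuit Boolean evaluation, so S[I + 1] is only read when I < Length(S).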

However, if you're not changing the database schema at all, then that must mean one of three things:

  1. You currently store everything in a binary way, instead of a textual way (not usually a good choice)
  2. The database already stores Unicode and counts characters, not bytes (otherwise, you'd already have the problem now, especially with accented letters)
  3. The database stores in a single-byte codepage, such as Windows-1252, preventing you from storing Unicode data at all (making it a non-issue, because characters will be stored the same way as before, although you can't make use of Unicode)

I'm not familiar with Oracle, but if you look at MSSQL, they have two different datatypes: varchar and nvarchar. Varchar counts in bytes, while nvarchar counts in characters, therefore being suitable for Unicode. MySQL, on the other hand, only has varchar, and it always counts in characters (as of 4.1). You should therefore check the Oracle documentation and your database schema to get a decisive answer on whether or not it's a problem at all.

Michael Madsen
A: 

Hi. Since with AnsiString 1 char = 1 byte and with UnicodeString 1 char = 2 bytes, the simple test to perform is: IsAnsiString := SizeOf(aString) = Length(aString);

Unless I'm mistaken, SizeOf(String) will return 4 in all 32-bit versions of Delphi, because String (either AnsiString or UnicodeString) is a pointer-type. Thus SizeOf() will return the size of the pointer. Length(String) returns the number of characters, so this check of yours won't work.
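
A small sketch, for a console program, that prints the quantities being discussed side by side. The procedure name ShowStringSizes is made up for illustration:

```pascal
// Hypothetical demo: prints the different "sizes" of a Delphi 2009 string.
procedure ShowStringSizes;
var
  S: string;
begin
  S := 'Delphi';
  Writeln(SizeOf(S));                 // size of the string reference (a pointer), not of the text
  Writeln(Length(S));                 // number of Char elements
  Writeln(Length(S) * SizeOf(Char));  // bytes of character data held in memory
  Writeln(StringElementSize(S));      // bytes per element of this string
end;
```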
PatrickvL
+1  A: 

You replied that you really want to find out how many bytes your string will take up.

How about converting to UTF8String? ASCII characters will take up 1 byte. Keep in mind that in UTF-8, other Unicode characters take 2 or more bytes, and may take more than 2.
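
A minimal sketch of that idea, assuming the database really does store the text as UTF-8 and counts the column limit in bytes. The function name FitsInUtf8Bytes and the MaxBytes parameter are hypothetical:

```pascal
// Hypothetical helper: True if S needs at most MaxBytes bytes once encoded as UTF-8.
function FitsInUtf8Bytes(const S: string; MaxBytes: Integer): Boolean;
var
  Encoded: UTF8String;
begin
  Encoded := UTF8String(S);               // converts from UTF-16 to UTF-8
  Result := Length(Encoded) <= MaxBytes;  // Length of a UTF8String is its byte count
end;
```

For the 12-byte field mentioned in the question, the check would be FitsInUtf8Bytes(Value, 12) before sending the value to the database.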

Bruce McGee
+2  A: 

You can use StringElementSize function to find out if a string is Unicode or ANSI. To check if a character is ANSI, use TCharacter.IsAnsi class function in Character.pas unit.

vcldeveloper