What's a good estimate, conversion, or formula for working out how many bytes (Y) a given number of characters (X) will take up?

A: 

For ANSI, I would expect 1 byte per character, but for Unicode I would expect 2 bytes per character, although there are probably multi-byte patterns too.

Curtis White
+5  A: 

It entirely depends on the encoding and potentially the data.

For UTF-16, if you know that all the characters are in the Basic Multilingual Plane, the answer will be bytes = 2 * characters.

For UTF-8, if everything is in the ASCII range, then bytes = characters. If there are lots of Far Eastern characters, though, it could be as much as bytes = 3 * characters (and that's still assuming the Basic Multilingual Plane).
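
If you want an exact figure rather than an estimate, the simplest approach is usually to encode the text and count the bytes. Here is a minimal Python sketch of that idea. The helper name `encoded_size` and the sample strings are made up for illustration, "character" here means a Unicode code point (which is what Python's `len` counts), and `utf-16-le` is used so that no byte order mark is added to the total.

```python
def encoded_size(text: str, encoding: str) -> int:
    """Return the number of bytes `text` occupies in `encoding`."""
    return len(text.encode(encoding))


# Hypothetical sample strings, both entirely within the BMP.
samples = {
    "ascii": "hello",                # 5 characters, all ASCII
    "cjk": "\u65e5\u672c\u8a9e",     # 3 characters (Japanese for "Japanese")
}

for label, text in samples.items():
    for enc in ("utf-8", "utf-16-le"):
        # Prints label, encoding, character count, byte count.
        print(label, enc, len(text), encoded_size(text, enc))
```

For these samples the ASCII string takes 5 bytes in UTF-8 and 10 in UTF-16, while the three CJK characters take 9 bytes in UTF-8 and 6 in UTF-16, which matches the 1x, 2x, and 3x multipliers described above.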

Other encodings obviously have different scenarios. Could you give more details about your situation (and your platform)? Do you want an accurate calculated value based on actual characters? Do you know anything about the text you're going to encode?

Jon Skeet
UTF-8 can use up to 4 bytes for a single character.
Christoffer Hammarström
@Christoffer: Even within the BMP? Not according to http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8
Jon Skeet
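
Both comments above are consistent with each other: within the BMP, UTF-8 needs at most 3 bytes per character, and the 4-byte form only appears for code points above U+FFFF. A small Python sketch makes the boundaries visible. The helper `utf8_len` and the choice of sample characters are assumptions for illustration only.

```python
def utf8_len(ch: str) -> int:
    """Return the number of UTF-8 bytes needed for a single code point."""
    return len(ch.encode("utf-8"))


# One sample code point from each UTF-8 length class.
examples = [
    "A",            # U+0041, ASCII
    "\u00e9",       # U+00E9, Latin-1 Supplement
    "\u20ac",       # U+20AC, euro sign, still in the BMP
    "\U0001F600",   # U+1F600, outside the BMP
]

for ch in examples:
    # Prints the code point and its UTF-8 byte length.
    print(f"U+{ord(ch):04X}", utf8_len(ch))
```

These four code points encode to 1, 2, 3, and 4 bytes respectively. The last one also takes 4 bytes in UTF-16, because it is stored as a surrogate pair, so the "2 bytes per character" rule for UTF-16 fails outside the BMP as well.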