views: 169

answers: 4

If I were creating a videogame level editor in AS3 or .NET with a string-based level format that can be copied, pasted and emailed, how much data could I encode into each character? What is important is getting the maximum amount of data for the minimum number of characters displayed on the screen, regardless of how many bytes the computer actually uses to store those characters.

For example, if I wanted to store the horizontal position of an object in 1 string character, how many possible values could that have? Are there any characters that can't be sent over the internet, or that can't be copied and pasted? What difference would things like UTF-8 make? Answers please for either AS3 or C#/.NET, or both.

2nd update: OK, so Flash uses UTF-16 for its String class. There are lots of control characters that I cannot use. How could I manage which characters are OK to use? Just a big lookup table? And can operating systems and browsers handle UTF-16 to the extent that you can safely copy and paste a UTF-16 string into an email, Notepad, etc.?

+1  A: 

The number of different states a variable can hold is two to the power of the number of bits it has. How many bits a variable has is something that is likely to vary according to the compiler and machine used. But in most cases a char will have eight bits, and two to the power eight is two hundred and fifty-six.

Modern screen resolutions being what they are, you will most likely need more than one char for the horizontal position of anything.

Brian Hooper
ok, except most modern systems store more than 256 characters per char, right?
Iain
Not usually. It depends. Some character sets work by having 16 bits per character. Others work by having eight bits for the most common characters, and storing the rest in 16, 24 or even 32 bits. So a 'character' isn't a really well-defined data type, and doesn't always correspond to a C `char`.
Brian Hooper
OK, I have clarified that I'm talking about an AS3 or C# printable string character.
Iain
I'll bail out here as I don't know anything about C# or AS3. Sorry to have taken up your time. Although I might observe in passing that a string consists of many characters, and can be as long as you please.
Brian Hooper
`char` isn't the data type for a character. It's the data type for how big a character *was* when the language was designed. C predates Unicode, so it (usually) has an 8-bit `char`. Java was released when Unicode was a 16-bit encoding, so it has a 16-bit `char`.
dan04
+2  A: 

Updated: "update 1", "update 2"

You can store 8 bits in a single character with ANSI, ASCII or UTF-8 encoding.

But, for example, if you want to use ASCII encoding you shouldn't use the first 32 characters (0x00 to 0x1F) or the character 0x7F; these are control characters (escape, null, start of text, end of text, ...) that cannot reliably be copied and pasted. So you could store 223 (= 256 - 33) different values in one single character.

If you use UTF-16 you have 2 bytes = 16 bits, minus the control characters, to store your information.

'A' in UTF-8 encoding: 0x41 (one byte; for ASCII characters the leading hex digits would always be 0)
'A' in UTF-16 encoding: 0x0041 (two bytes; the first two hex digits can be higher than 0 for other characters)
'A' in ASCII encoding: 0x41
'A' in ANSI encoding: 0x41
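
For example, a quick C# check of those byte values (a minimal sketch using System.Text.Encoding; the comments show the output for 'A'):

using System;
using System.Text;

class EncodingDemo
{
    static void Main()
    {
        string s = "A";
        // UTF-8 and ASCII both encode 'A' as the single byte 0x41.
        Console.WriteLine(BitConverter.ToString(Encoding.UTF8.GetBytes(s)));    // 41
        Console.WriteLine(BitConverter.ToString(Encoding.ASCII.GetBytes(s)));   // 41
        // UTF-16 (Encoding.Unicode, little-endian) always uses two bytes per BMP character.
        Console.WriteLine(BitConverter.ToString(Encoding.Unicode.GetBytes(s))); // 41-00
        // Encoding.Default is the system ANSI code page on the .NET Framework.
        Console.WriteLine(BitConverter.ToString(Encoding.Default.GetBytes(s))); // 41
    }
}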

See the images at the end of this post!

update 1:

If you don't need to modify the values without a tool (a C# tool, a JavaScript-based web page, ...), you can alternatively Base64-encode, or zip and then Base64-encode, your information. This solution avoids the problem you describe in your 2nd update: "There are lots of control characters that I cannot use. How could I manage which characters are ok to use?"
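
A rough C# sketch of the zip+Base64 idea (the class and method names here are just illustrative, not a standard API):

using System;
using System.IO;
using System.IO.Compression;
using System.Text;

static class LevelCodec
{
    // Compress a level string and wrap it in Base64 so it survives copy/paste and email.
    public static string Pack(string levelData)
    {
        byte[] raw = Encoding.UTF8.GetBytes(levelData);
        using (var buffer = new MemoryStream())
        {
            using (var gzip = new GZipStream(buffer, CompressionMode.Compress))
                gzip.Write(raw, 0, raw.Length);
            // ToArray still works after the GZipStream has closed the MemoryStream.
            return Convert.ToBase64String(buffer.ToArray());
        }
    }

    public static string Unpack(string packed)
    {
        byte[] compressed = Convert.FromBase64String(packed);
        using (var gzip = new GZipStream(new MemoryStream(compressed), CompressionMode.Decompress))
        using (var result = new MemoryStream())
        {
            gzip.CopyTo(result); // Stream.CopyTo needs .NET 4; copy in a loop on older frameworks
            return Encoding.UTF8.GetString(result.ToArray());
        }
    }
}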

If this is not an option, you cannot avoid using some kind of lookup table. The shortest form of a lookup table is:

// the control characters 0x00 to 0x1f plus 0x7f (DEL) are not safe to use
var illegalCharCodes = new byte[]{ 0x00, 0x01, 0x02, /* ... */ 0x1f, 0x7f };

or you can code it like this:

// This example is based on ANSI encoding, but in principle it is the same with UTF-16.
var value = 0;
if (charcode > 0x7f)
    value = charcode - 0x1f - 1; // -1 because 0x7f is the first illegal char code higher than 0x1f
else
    value = charcode - 0x1f;
value -= 1; // because you want the values to start at 0
// charcode: 0x20 (' ') -> value: 0
// charcode: 0x21 ('!') -> value: 1
// charcode: 0x22 ('"') -> value: 2
// charcode: 0x7e ('~') -> value: 94
// charcode: 0x80 ('€') -> value: 95
// charcode: 0x81 (unassigned in ANSI/Windows-1252) -> value: 96
// ...

update 2:

For Unicode (UTF-16) you can use this table: http://www.tamasoft.co.jp/en/general-info/unicode.html Any character that is shown there only as a placeholder symbol, or that is empty, you should not use. So you cannot store 50,000 possible values in one UTF-16 character if you want to be able to copy and paste them. You need some kind of special encoder, and you must use 2 UTF-16 characters, like:

//charcode: 0x0020 + 0x0020 ('  ') -> value: 0
//charcode: 0x0020 + 0x0021 (' !') -> value: 1
//charcode: 0x0021 + 0x0041 ('!A') -> value: something higher than 40,000; I don't know exactly because I haven't counted the illegal characters in UTF-16 :D
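
A minimal C# sketch of that two-character idea, using only the 95 printable ASCII characters 0x20 to 0x7E as the "safe" alphabet (extending it to more of UTF-16 would need a table of allowed code points):

// Encode a value in the range 0..(95*95 - 1) = 0..9024 into two printable ASCII characters.
static string EncodePair(int value)
{
    const int Base = 95;  // printable ASCII: 0x20 (' ') .. 0x7E ('~')
    if (value < 0 || value >= Base * Base)
        throw new System.ArgumentOutOfRangeException("value");
    char high = (char)(0x20 + value / Base);
    char low  = (char)(0x20 + value % Base);
    return new string(new[] { high, low });
}

static int DecodePair(string pair)
{
    const int Base = 95;
    return (pair[0] - 0x20) * Base + (pair[1] - 0x20);
}

// EncodePair(0)    -> "  "  (two spaces)
// EncodePair(1)    -> " !"
// EncodePair(9024) -> "~~"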

[Image: ASCII table] [Image: extended ASCII table]

Floyd
Correction: a UTF-8 character can be anywhere from 1 to 4 bytes.
Piskvor
UTF-8 (8-bit Unicode Transformation Format): UTF-8 encodes each character (code point) in 1 to 4 octets (8-bit bytes). Source: http://en.wikipedia.org/wiki/UTF-8
Floyd
You're right... my English isn't the best :D
Floyd
+2  A: 

In C, a char is a type of integer, and it's most typically one byte wide. One byte is 8 bits so that's 2 to the power 8, or 256, possible values (as noted in another answer).

In other languages, a 'character' is a completely different thing from an integer (as it should be), and has to be explicitly encoded to turn it into a byte. Java, for example, makes this relatively simple by storing characters internally in a UTF-16 encoding (forgive me some details), so they take up 16 bits, but that's just implementation detail. Different encodings such as UTF-8 mean that a character, when encoded for transmission, could occupy anything from one to four bytes.

Thus your question is slightly malformed (which is to say it's actually several distinct questions in one).

How many values can a byte have? 256.

What characters can be sent in emails? Mostly those ASCII characters from space (32) to tilde (126).

What bytes can be sent over the internet? Any you like, as long as you encode them for transmission.

What can be cut-and-pasted? If your platform can do Unicode, then all of Unicode; if not, not.

Does UTF-8 make a difference? UTF-8 is a standard way of encoding a string of characters into a string of bytes, and probably not much to do with your question (Joel Spolsky has a very good account of The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)).

So pick a question!

Edit, following edit to question Aha! If the question is: 'how do I encode data in such a way that it can be mailed?', then the answer is probably 'Use base64'. That is, if you have some purely binary format for your levels, then base64 is the 'standard' (very much quotes-standard) way of encoding that binary blob in a way that will make it through mail. The things you want to google for are 'serialization' and 'deserialization'. Base64 is probably close to the practical maximum of information-per-mailable-character.

(Another answer is 'use XML', but the question seems to imply some preference for compactness, and that a basically binary format is desirable.)
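
For instance, a rough C# sketch of the serialize-then-Base64 idea (the Level type and its fields are invented purely for illustration):

using System;
using System.IO;

class Level
{
    public short[] ObjectX;  // hypothetical horizontal positions
    public short[] ObjectY;  // hypothetical vertical positions
}

static class LevelMail
{
    public static string ToMailableString(Level level)
    {
        using (var buffer = new MemoryStream())
        using (var writer = new BinaryWriter(buffer))
        {
            writer.Write(level.ObjectX.Length);
            for (int i = 0; i < level.ObjectX.Length; i++)
            {
                writer.Write(level.ObjectX[i]);
                writer.Write(level.ObjectY[i]);
            }
            writer.Flush();
            // Base64 output uses only characters that survive mail and copy/paste: A-Z a-z 0-9 + / =
            return Convert.ToBase64String(buffer.ToArray());
        }
    }
}

Deserialization is the same steps in reverse: Convert.FromBase64String, then read the fields back with a BinaryReader.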

Norman Gray
Great answer, thanks. I have edited my question slightly to try and be more clear.
Iain
+3  A: 

Confusingly, a char is not the same thing as a character. In C and C++, a char is virtually always an 8-bit type. In Java and C#, a char is a UTF-16 code unit and thus a 16-bit type.

But in Unicode, a character is represented by a "code point" that ranges from 0 to 0x10FFFF, for which a 16-bit type is inadequate. So a character must either be represented by a 21-bit type (in practice, a 32-bit type), or use multiple "code units". Specifically (a short byte-count sketch follows the list below),

  • In UTF-32, all characters require 32 bits.
  • In UTF-16, characters U+0000 to U+FFFF (the "basic multilingual plane"), except for U+D800 to U+DFFF which cannot be represented, require 16 bits, and all other characters require 32 bits.
  • In UTF-8, characters U+0000 to U+007F (the ASCII repertoire) require 8 bits, U+0080 to U+07FF require 16 bits, U+0800 to U+FFFF require 24 bits, and all other characters require 32 bits.
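
A short C# sketch of those byte counts (the example characters are arbitrary):

using System;
using System.Text;

class CodeUnitDemo
{
    static void Main()
    {
        string a = "A";                                 // U+0041, in the ASCII range
        string euro = "\u20AC";                         // U+20AC, in the basic multilingual plane
        string smiley = char.ConvertFromUtf32(0x1F600); // U+1F600, outside the BMP

        Console.WriteLine(Encoding.UTF8.GetByteCount(a));       // 1
        Console.WriteLine(Encoding.UTF8.GetByteCount(euro));    // 3
        Console.WriteLine(Encoding.UTF8.GetByteCount(smiley));  // 4

        Console.WriteLine(Encoding.Unicode.GetByteCount(a));      // 2
        Console.WriteLine(Encoding.Unicode.GetByteCount(euro));   // 2
        Console.WriteLine(Encoding.Unicode.GetByteCount(smiley)); // 4 (a surrogate pair)

        Console.WriteLine(Encoding.UTF32.GetByteCount(a));       // 4
    }
}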

If I were creating a videogame level editor with a string-based level format, how much data could I encode into each char? For example if I wanted to store the horizontal position of an object in 1 char, how many possible values could that have?

Since you wrote char rather than "character", the answer is 256 for C and 65,536 for C#.

But char isn't designed to be a binary data type. byte or short would be more appropriate.

Are there any characters that can't be sent over the internet, or that can't be copied and pasted?

There aren't any characters that can't be sent over the Internet, but you have to be careful using "control characters" or non-ASCII characters.

Many Internet protocols (especially SMTP) are designed for text rather than binary data. If you want to send binary data, you can Base64 encode it. That gives you 6 bits of information for each byte of the message.
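
A quick C# check of that ratio (the input bytes are arbitrary):

// 3 bytes of input become 4 Base64 characters, i.e. 6 bits of payload per character.
byte[] threeBytes = { 0x01, 0x02, 0x03 };
string encoded = System.Convert.ToBase64String(threeBytes);
System.Console.WriteLine(encoded.Length); // 4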

dan04
Thanks, great answer - I have edited my question to be more clear.
Iain
In C# a char represents a Unicode char code; Unicode has 2x8 = 16 bits for each char, so 65,536 values.
Floyd
@floyddotnet: Unicode *used* to be 16 bits, but it has since been expanded to 21 bits. A `char` in C# represents a UTF-16 code unit, *not* a Unicode code point.
dan04