Do certain characters take more bytes than others?


I'm not very experienced with lower-level things such as how many bytes a character takes. I tried to find out whether one character equals one byte, but without success.

I need to set a delimiter used for socket connections between a server and clients. This delimiter has to be as small (in bytes) as possible, to minimize bandwidth.

The current delimiter is "#". Would switching to another delimiter decrease my bandwidth?

No, all characters are 1 byte, unless you're using Unicode or wide characters (for accents and other symbols for example).

A character is 1 byte, or 8 bits, long, which gives 256 possible combinations to form characters with. 1 byte characters are called ASCII characters. They only use 7 bits (even though 8 are available, but you can't use this 8th bit) to form the standard alphabet and various symbols used when teletypes and typewriters were still common.

You can find an ASCII chart and what numbers correspond to what characters here.

samoz 2009-06-26 13:32:16

Almost everything in this response is wrong.

Michael Borgwardt 2009-06-26 13:37:36

@Michael Such as what?

samoz 2009-06-26 13:39:08

Such as the equation of characters and bytes, "1 byte characters are called ASCII characters", "you can't use this 8th bit". I suggest you read http://www.joelonsoftware.com/articles/Unicode.html very carefully.

Michael Borgwardt 2009-06-26 13:46:29

I just read the article you sent me and I still don't see how I'm glaringly wrong. He can still send ASCII characters (even if they are UTF-8) in 1 byte. And after thinking about it, the "can't use 8th bit" comment was wrong; it would just need some extra processing to strip out the 8th-bit signal that he was sending.

samoz 2009-06-26 14:00:13

The most important thing that's wrong is that characters aren't bytes, and it also makes no sense to say that characters "are UTF-8" or "are Unicode or wide". Nor do characters have a length. You need an ENCODING to translate characters to bytes, and only then can you talk about length and which characters the encoding supports. And there certainly are encodings in which the characters supported by ASCII take more than 1 byte.

Michael Borgwardt 2009-06-26 14:12:58

I'm talking about when you type: char c, you get 1 byte allocated to you. The OP asked if he can use something smaller, to which the answer is no, because a byte is the smallest thing you can allocate. By character, I'm talking about the char type, not an actual letter. By larger characters, I'm talking about the wchar type.

samoz 2009-06-26 14:25:56

The OP didn't say what language he uses; C-specific answers that aren't even recognizable as such are not what he needs. BTW, your answer is wrong for C as well; the C standard indeed mandates that 1 char == 1 byte (and oh how much suffering that idiocy has caused), but it does NOT mandate 8-bit bytes and there are in fact architectures where bytes have more or fewer bits.

Michael Borgwardt 2009-06-26 15:09:37

+8 A:

This could help: www.joelonsoftware.com/articles/Unicode.html

Igor Drincic 2009-06-26 13:34:40

+4 A:

It depends on the encoding. In single-byte character sets such as ANSI and the various ISO 8859 character sets, it is one byte per character. Some encodings, such as UTF-8, are variable-width: the number of bytes needed to encode a character depends on the character being encoded.
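
To make the variable-width point concrete, here is a minimal Python sketch. The sample characters and the helper name `utf8_length` are chosen purely for illustration and are not from the original question.

```python
# Sample characters, written as escapes so the source file's own
# encoding cannot change what is being measured.
SAMPLES = ["#", "\u00e9", "\u20ac", "\U0001F600"]  # '#', e-acute, euro sign, emoji


def utf8_length(ch: str) -> int:
    """Return how many bytes UTF-8 needs to encode the given character."""
    return len(ch.encode("utf-8"))


for ch in SAMPLES:
    print(f"U+{ord(ch):04X} takes {utf8_length(ch)} byte(s) in UTF-8")
```

Under UTF-8 these four samples need 1, 2, 3 and 4 bytes respectively, so only the ASCII-range character fits in a single byte.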

ConcernedOfTunbridgeWells 2009-06-26 13:35:21

+4 A:

The answer of course is that it depends. If you are in a pure ASCII env, then yes, every char takes 1 byte, but if you are in a Unicode env (all of Windows for example), then chars can range from 1 to 4 bytes in size.

If you choose a char from the ASCII set, then yes, your delimiter is as small as possible.
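
As a rough sketch of how a one-byte delimiter might be used on the receiving side, assuming the messages themselves never contain "#": the function name `split_messages` and the sample buffer are hypothetical, not taken from the asker's code.

```python
DELIMITER = b"#"  # a single byte in ASCII, ISO 8859-x and UTF-8


def split_messages(buffer: bytes):
    """Split received bytes into complete messages plus an unfinished tail.

    Everything before a delimiter is a complete message; whatever follows
    the last delimiter is kept until more data arrives.
    """
    *complete, rest = buffer.split(DELIMITER)
    return complete, rest


messages, leftover = split_messages(b"hello#world#par")
print(messages)   # complete messages received so far
print(leftover)   # partial message still waiting for its delimiter
```

The overhead here is exactly one byte per message, which is the smallest a delimiter can be on a byte-oriented socket.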

Scott Weinstein 2009-06-26 13:38:41

+7 A:

It depends on what character encoding you use to translate between characters and bytes (which are not at all the same thing):

In ASCII or ISO 8859, each character is represented by one byte
In UTF-32, each character is represented by 4 bytes
In UTF-8, each character uses between 1 and 4 bytes
In ISO 2022, it's much more complicated

US-ASCII characters (of which # is one) will take only 1 byte in UTF-8, which is the most popular encoding that allows multibyte characters.
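
A quick way to check this for a given delimiter is to encode it under each candidate encoding and compare lengths. This is only a sketch: the list of encodings and the name `encoded_sizes` are picked for illustration, and the `-le` variants are used so that Python does not prepend a byte order mark to the UTF-16 and UTF-32 results.

```python
# Encodings to compare; the "-le" codecs emit no byte order mark.
ENCODINGS = ["ascii", "iso-8859-1", "utf-8", "utf-16-le", "utf-32-le"]


def encoded_sizes(ch: str) -> dict:
    """Map each encoding name to the number of bytes it uses for ch."""
    return {name: len(ch.encode(name)) for name in ENCODINGS}


sizes = encoded_sizes("#")
for name, size in sizes.items():
    print(f"{name}: {size} byte(s)")
```

For "#" this gives 1 byte in ASCII, ISO 8859-1 and UTF-8, 2 bytes in UTF-16 and 4 bytes in UTF-32.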

Michael Borgwardt 2009-06-26 13:43:19

US-ASCII characters take 1 byte in pretty much *any* encoding except for UTF-16 and UTF-32.

dan04 2010-08-21 03:54:19
