views: 2488
answers: 6

Exactly that: does a string's length equal its byte size? Does it depend on the language?

I think it does, but I just want to make sure.

Additional Info: I'm just wondering in general. My specific situation was PHP with MySQL.
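
(For reference, in PHP strlen() counts bytes while mb_strlen() counts characters, so the two can disagree; a minimal sketch, assuming a UTF-8 source file and the mbstring extension:)

    <?php
    $s = "héllo";                    // 5 characters, but é is 2 bytes in UTF-8
    echo strlen($s);                 // 6  (byte count)
    echo mb_strlen($s, 'UTF-8');     // 5  (character count)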

As the answer is no, that's all I need to know.

+25  A: 

Nope. A zero-terminated string has one extra byte. A Pascal string (the Delphi ShortString) has an extra byte for the length. And Unicode strings have more than one byte per character.

With Unicode it depends on the encoding: it could be 2 or 4 bytes per character, or even a mix of 1, 2, and 4 bytes.
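
To make that concrete in the asker's PHP, here is a minimal sketch (assuming the mbstring extension; the encodings are just illustrative):

    <?php
    $s = "héllo";                                               // 5 characters
    echo strlen($s);                                            // 6 bytes as UTF-8
    echo strlen(mb_convert_encoding($s, 'UTF-16BE', 'UTF-8'));  // 10 bytes as UTF-16
    echo strlen(mb_convert_encoding($s, 'UTF-32BE', 'UTF-8'));  // 20 bytes as UTF-32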

Gamecat
In Delphi a ShortString has one extra byte, but the other string types have four extra bytes.
inzKulozik
I know, but the ShortStrings are called Pascal strings ;-).
Gamecat
Very nice answer, short and sweet, straight to the point, and includes the most common real-world examples.
Robert Gamble
+2  A: 

Not always, it depends on the encoding.

Malfist
+3  A: 

It depends on what you mean by "length". If you mean "number of characters" then, no, many languages/encoding methods use more than one byte per character.

Steven Robbins
+13  A: 

It entirely depends on the platform and representation.

For example, in .NET a string takes two bytes in memory per UTF-16 code unit. However, surrogate pairs require two UTF-16 code units for a full Unicode character in the range U+10000 to U+10FFFF. The in-memory form also has an overhead for the length of the string and possibly some padding, as well as the normal object overhead of a type pointer etc.

Now, when you write a string out to disk (or the network, etc) from .NET, you specify the encoding (with most classes defaulting to UTF-8). At that point, the size depends very much on the encoding. ASCII always takes a single byte per character, but is very limited (no accents etc); UTF-8 gives the full Unicode range with a variable encoding (all ASCII characters are represented in a single byte, but others take up more). UTF-32 always uses exactly 4 bytes for any Unicode character - the list goes on.

As you can see, it's not a simple topic. To work out how much space a string is going to take up you'll need to specify exactly what the situation is - whether it's an object in memory on some platform (and if so, which platform - potentially even down to the implementation and operating system settings), or whether it's a raw encoded form such as a text file, and if so using which encoding.
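
As a quick illustration in the asker's PHP (a minimal sketch, assuming PHP 7+ and the mbstring extension), a single character outside the Basic Multilingual Plane counts as one character, four UTF-8 bytes, and a surrogate pair of two UTF-16 code units:

    <?php
    $clef = "\u{1D11E}";                                           // U+1D11E, outside the BMP
    echo mb_strlen($clef, 'UTF-8');                                // 1 character
    echo strlen($clef);                                            // 4 bytes as UTF-8
    echo strlen(mb_convert_encoding($clef, 'UTF-16BE', 'UTF-8'));  // 4 bytes: a surrogate pair (two 16-bit code units)
    echo strlen(mb_convert_encoding($clef, 'UTF-32BE', 'UTF-8'));  // 4 bytes as UTF-32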

Jon Skeet
My what a mess we have!
Malfist
And of course the size on disk changes with/without a BOM. Just for extra fun ;-p
Marc Gravell
A: 

There's no single answer; it depends on language and implementation (remember that some languages have multiple implementations!)

Zero-terminated ASCII strings occupy at least one more byte than the "content" of the string. (More may be allocated, depending on how the string was created.)

Non-zero-terminated strings use a descriptor (or similar structure) to record length, which takes extra memory somewhere.

Unicode strings (in various languages) use two or more bytes per character, depending on the encoding.

Strings in an object store may be referenced via handles, which adds a layer of indirection (and more data) in order to simplify memory management.

joel.neely
+1  A: 

You are correct. If you encode as ASCII, there is one byte per character. Otherwise, it is one or more bytes per character.

In particular, it is important to know how this affects substring operations. If you don't have one byte per character, does s[n] get the nth byte or the nth char? Getting the nth char will be inefficient for large n, taking linear rather than constant time as it would with one byte per character.
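
In the asker's PHP, for example, the [] operator indexes bytes while mb_substr() indexes characters; a minimal sketch (assuming a UTF-8 string and the mbstring extension):

    <?php
    $s = "héllo";                         // é is 2 bytes in UTF-8
    echo $s[1];                           // one raw byte: the first half of é, not a valid character on its own
    echo mb_substr($s, 1, 1, 'UTF-8');    // "é", the 2nd character
    echo mb_substr($s, 2, 1, 'UTF-8');    // "l", the 3rd character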

theschmitzer