Today I noticed that C#'s String class returns the length of a string as an int. Since an int is always 32 bits, no matter what the architecture, does this mean that a string can only be 2GB or less in length?

A 2GB string would be very unusual, and would present many problems along with it. However, most .NET APIs seem to use 'int' to convey values such as length and count. Does this mean we are forever limited to collection sizes that fit in 32 bits?

Seems like a fundamental problem with the .NET APIs. I would have expected things like count and length to be returned via the equivalent of 'size_t'.

+5  A: 

Correct, the maximum length would be Int32.MaxValue, but you'll likely run into other memory issues if you're dealing with strings anywhere near that size anyway.
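For illustration, a trivial snippet showing the API-level ceiling (in practice the cap is lower still, because of the CLR's 2GB object size limit discussed in other answers):

string s = "hello";
int len = s.Length;   // Length is declared as int, not long or size_t
// int.MaxValue == 2,147,483,647 -- the most Length could ever report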

Evan Trimboli
This applies to more than string, though. It applies to almost all collections.
Andrew
@Andrew - The answer covers that statement too. If you have a collection approaching 2 GB you are going to have other issues as well.
David Basarab
Suppose it's the year 2060 and I'm working on an application on my ultra-modern PC which requires collections with more than an int's worth of items. What problems might I have?
Andrew
@Andrew, first of all using .NET in 2060 is a problem.
Jim Schubert
@Jim Schubert, I bet someone said the same thing about using COBOL in 2010 :)
Giovanni Galbo
@Giovanni: by 2060, I hope IT managers will have learned from their mistakes. Dijkstra knew it in the '70s: "The use of COBOL cripples the mind; its teaching should, therefore, be regarded as a criminal offense." I'm sure COBOL will still be used in 2060, since most IT departments are slower to make decisions than Congress.
Jim Schubert
+14  A: 

Seems like a fundamental problem with the .NET APIs...

I don't know if I'd go that far.

Consider almost any collection class in .NET. Chances are it has a Count property that returns an int. So this suggests the class is bounded at a size of int.MaxValue (2147483647). That's not really a problem; it's a limitation -- and a perfectly reasonable one, in the vast majority of scenarios.

Anyway, what would the alternative be? There's uint -- but that's not CLS-compliant. Then there's long...
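As an aside, the uint objection is concrete. A minimal sketch (the type name is hypothetical) of what happens if you expose uint from a CLS-compliant assembly:

using System;

[assembly: CLSCompliant(true)]

public class MyCollection
{
    // warning CS3003: Type of 'MyCollection.Count' is not CLS-compliant
    public uint Count { get; set; }
}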

What if Length returned a long?

  1. An additional 32 bits of memory would be required anywhere you wanted to know the length of a string.
  2. The benefit would be: we could have strings taking up billions of gigabytes of RAM. Hooray.

Try to imagine the mind-boggling cost of some code like this:

// Lord knows how many characters
string ulysses = GetUlyssesText();

// allocate an entirely new string of roughly equivalent size
string schmulysses = ulysses.Replace("Ulysses", "Schmulysses");

Basically, if you're thinking of string as a data structure meant to store an unlimited quantity of text, you've got unrealistic expectations. When it comes to objects of this size, it becomes questionable whether you have any need to hold them in memory at all (as opposed to on disk).
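If you do need to process text of that size, streaming it is usually the better approach. A minimal sketch (the file name is a placeholder):

using System.IO;

// Process a huge text file without ever materializing it as a single string.
long totalChars = 0;
foreach (string line in File.ReadLines("ulysses.txt"))
{
    totalChars += line.Length;
}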

Dan Tao
I don't see how it's reasonable. Since .NET defines an int to be 32 bits, that means 50 years from now... no matter what my computer can handle, .NET will be restricting me to 32-bit-sized collections. Sounds like a modern variation of '640Kb is enough for anyone'.
Andrew
@Andrew, in 50 years, you won't be programming in .NET. And in 50 years, int.MaxValue would still be a large number of objects to hold in a collection.
Anthony Pegram
@Andrew then create a wrapper around a multidimensional `List<>`/`Array` and have it return an `Int64` for `Count`
Earlz
Seems like a stupid arbitrary limitation. C handles this much better.
Andrew
The problem with "640Kb" is that it was obsolete in a very short time. In contrast, 50 years is a very long time in this industry. Vast majority of languages and technologies in use today did not exist 50 years ago, and most technologies in use back then did not survive to see this day (indeed, C, ancient as it is among its peers today, is only 38 years old). I don't think .NET string length limits will be a concern by that time.
Pavel Minaev
@Dan Tao's edit: This isn't true. C handles these scenarios very well with the 'size_t' type.
Andrew
@Andrew: You have to evaluate this particular fact in the context of the CLS as a whole, though. Maybe in 50 years it will seem absurd to cap strings at ~2 billion characters because we'll be absolutely swimming in memory; I don't know. But what seems far *more* relevant is whether or not 2 billion (or even 9 quintillion) will seem a reasonable cap on an integral data type. If those limits are no longer practical, then the CLS as it exists today will not be around anymore.
Dan Tao
+1 this, and @Pavel's expansion as well.
Dean Harding
It will be far less than 50 years before this assumption is obsolete. My computer's RAM grew more than 5 orders of magnitude in the past 20 years. I remember when I couldn't imagine how I'd ever use 64MB of RAM, yet today I don't think twice about loading a mere 64MB text file into a string for processing.
Ken
+1  A: 

Beyond some value of String.Length, probably about 5MB, it's not really practical to use String anymore. String is optimised for short bits of text.

Think about what happens when you do

myString += " more chars";

Something like:

  1. The system calculates the length of myString plus the length of " more chars".
  2. The system allocates that amount of memory.
  3. The system copies myString to the new memory location.
  4. The system copies " more chars" into the new memory, after the last copied myString character.
  5. The original myString is left to the mercy of the garbage collector.

While this is nice and neat for small bits of text, it's a nightmare for large strings; just finding 2GB of contiguous memory is probably a showstopper.

So if you know you are handling more than a very few MB of characters, use something like the StringBuilder class instead.
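For example, a minimal sketch of the difference (the iteration counts are arbitrary):

using System.Text;

// Naive concatenation: each += allocates a new string and copies both halves,
// so total work grows quadratically with the final length.
string slow = "";
for (int i = 0; i < 10000; i++)
    slow += " more chars";

// StringBuilder appends into an internal buffer that grows geometrically,
// so the same work is roughly linear.
var sb = new StringBuilder();
for (int i = 0; i < 10000; i++)
    sb.Append(" more chars");
string fast = sb.ToString();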

James Anderson
Even StringBuilder returns an int for things like Length, though.
Andrew
A: 

Even in x64 versions of Windows I got hit by .NET limiting each object to 2GB.

2GB is pretty small for a medical image. 2GB is even small for a Visual Studio download image.

Windows programmer
This is my concern. It seems like most of the APIs .NET provides use an int for things like 'count' or 'length'.
Andrew
@Michael - I don't care so much about strings in particular, it was just an example to get people attention.
Andrew
Seems like someone hit that problem with `Array` early on, since it has a 64-bit `LongLength` property.
devstuff
@devstuff: In Microsoft's implementation, `LongLength` just returns the 32-bit `Length` cast to a `long`! Besides, the CLR's 2GB object size restriction means that the only arrays that could get anywhere near having `int.MaxValue` elements would be `bool[]` or `byte[]`. (I'm not sure if Mono is subject to the same restrictions.)
LukeH
+1  A: 

It's pretty unlikely that you'll need to store more than two billion objects in a single collection. You're going to incur some pretty serious performance penalties when doing enumerations and lookups, which are the two primary purposes of collections. If you're dealing with a data set that large, there is almost assuredly some other route you can take, such as splitting up your single collection into many smaller collections that contain portions of the entire set of data you're working with.

Heeeey, wait a sec.... we already have this concept -- it's called a dictionary!

If you need to store, say, 5 billion English strings, use this type:

var bigStringContainer = new Dictionary<string, List<string>>();

Let's make the key string represent, say, the first two characters of the string. Then write an extension method like this:

// Extension methods must live in a static class; assumes s has at least two characters.
public static class StringExtensions
{
    public static string BigStringIndex(this string s)
    {
        return String.Concat(s[0], s[1]);
    }
}

and then add items to bigStringContainer like this:

if (!bigStringContainer.TryGetValue(item.BigStringIndex(), out var bucket))
    bigStringContainer[item.BigStringIndex()] = bucket = new List<string>();
bucket.Add(item);

and call it a day. (There are obviously more efficient ways you could do that, but this is just an example)

Oh, and if you really really really do need to be able to look up any arbitrary object by absolute index, use an Array instead of a collection. Okay yeah, you lose some type safety, but you can index array elements with a long.

Warren
Even if you could index into an array with a `long` it would currently be pretty useless: The CLR has a max object size limit of 2GB, so it's impossible for an array to have more than `int.MaxValue` elements anyway (and it could only get near that limit if it was a `bool[]` or `byte[]` array with single-byte elements). *This restriction applies to Microsoft's current implementation, I'm not sure about Mono.*
LukeH
A: 

If you are working with a file that is 2GB, loading it all at once means you're likely going to be using a lot of RAM and seeing very slow performance.

Instead, for very large files, consider using a MemoryMappedFile (see: http://msdn.microsoft.com/en-us/library/system.io.memorymappedfiles.memorymappedfile.aspx). Using this method, you can work with a file of nearly unlimited size, without having to load the whole thing in memory.
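A minimal sketch of the idea (the file name is a placeholder):

using System.IO.MemoryMappedFiles;

// Map a large file and read one window of it, without loading the whole file.
using (var mmf = MemoryMappedFile.CreateFromFile("huge.dat"))
using (var accessor = mmf.CreateViewAccessor(0, 4096)) // just the first 4KB
{
    var buffer = new byte[4096];
    accessor.ReadArray(0, buffer, 0, buffer.Length);
    // ... process buffer, then map the next window ...
}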

Robert Seder
Please put a comment if you mark an answer down. In what way was this not useful or correct, I wonder???
Robert Seder
+1  A: 

The fact that the framework uses Int32 for Count/Length properties, indexers etc. is a bit of a red herring. The real problem is that the CLR currently has a max object size restriction of 2GB.

So a string -- or any other single object -- can never be larger than 2GB.

Changing the Length property of the string type to return long, ulong or even BigInteger would be pointless since you could never have more than approx 2^30 characters anyway (2GB max size at 2 bytes per character).

Similarly, because of the 2GB limit, the only arrays that could even approach having 2^31 elements would be bool[] or byte[] arrays that only use 1 byte per element.

Of course, there's nothing to stop you creating your own composite types to work around the 2GB restriction.
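For example, a minimal sketch of such a composite type, in the spirit of the BigArray<T> described in the blog article linked in the comments below (hypothetical, with no bounds checking or error handling):

// Spreads elements across many small arrays, so no single object
// ever approaches the 2GB limit.
public class BigArray<T>
{
    private const int ChunkSize = 1 << 20; // ~1M elements per chunk
    private readonly T[][] chunks;

    public BigArray(long length)
    {
        Length = length;
        chunks = new T[(length + ChunkSize - 1) / ChunkSize][];
        for (int i = 0; i < chunks.Length; i++)
            chunks[i] = new T[ChunkSize];
    }

    public long Length { get; private set; }

    public T this[long index]
    {
        get { return chunks[index / ChunkSize][index % ChunkSize]; }
        set { chunks[index / ChunkSize][index % ChunkSize] = value; }
    }
}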

(Note that the above observations apply to Microsoft's current implementation, and could very well change in future releases. I'm not sure whether Mono has similar limits.)

LukeH
@Luke - do you have any references for this?
Russell
@Russell: "As with 32-bit Windows operating systems, there is a 2GB limit on the size of an object you can create while running a 64-bit managed application on a 64-bit Windows operating system." http://msdn.microsoft.com/en-us/library/ms241064.aspx
LukeH
@Russell: There's also an interesting blog article here, with an example of a workaround composite object: http://blogs.msdn.com/b/joshwil/archive/2005/08/10/450202.aspx
LukeH
@Russell: And a couple of interesting SO discussions: http://stackoverflow.com/questions/1087982/single-objects-still-limited-to-2-gb-in-size-in-clr-4-0 and http://stackoverflow.com/questions/573692/is-the-size-of-an-array-constrained-by-the-upper-limit-of-int-2147483647
LukeH