views:

316

answers:

6

Is this possible? Given that C# uses immutable strings, one could expect that there would be a method along the lines of:

var expensive = ReadHugeStringFromAFile();
var cheap = expensive.SharedSubstring(1);

If there is no such function, why bother with making strings immutable? Or, alternatively, if strings are already immutable for other reasons, why not provide this method?

The specific reason I'm looking into this is doing some file parsing. Simple recursive descent parsers (such as the one generated by TinyPG, or ones easily written by hand) use Substring all over the place. This means if you give them a large file to parse, memory churn is unbelievable. Sure there are workarounds - basically roll your own SubString class, and then of course forget about being able to use String methods such as StartsWith or String libraries such as Regex, so you need to roll your own version of these as well. I assume parser generators such as ANTLR basically do that, but my format is simple enough not to justify using such a monster tool. Even TinyPG is probably an overkill.

Somebody please tell me I am missing some obvious or not-so-obvious standard C# method call somewhere...

+5  A: 

No, there's nothing like that.

.NET strings contain their text data directly, unlike Java strings which have a reference to a char array, an offset and a length.

Both solutions have "wins" in some situations, and losses in others.

If you're absolutely sure this will be a killer for you, you could implement a Java-style string for use in your own internal APIs.

Jon Skeet
That's what I was afraid of. Sigh. Thanks...
Oren Ben-Kiki
Ah, I was just wondering about this earlier today.
Michael Myers
+1  A: 

The .NET framework supports string interning. This is a partial solution but does not offer the posibility to reuse parts of a string. I think reusing substring will cause some problems not that obviouse at a first look. If you have to do a lot of string manipulation using the StringBuilder is the way to go.

Daniel Brückner
String interning (in both Java and C#) is brain-dead in that it requires you to construct a string object before you call Intern. For something with the stated purpose of saving memory usage, you'd expect to be able to pass it a StringBuilder or a (slice of a) character array or something like that. But, no, "that would be too easy".
Oren Ben-Kiki
That's true ... but the GC ... :D
Daniel Brückner
Since I can't apply a Regex to the (prefix of) a StringBuilder without constructing a string (and hence copying the characters), you still run into the problem of copying 5G of text around when parsing a 1M file. I could just say that none of my tokens are longer than X characters and repeatedly generate prefix sub-strings from the StringBuilder, that may just work... simple it won't be. Sigh.
Oren Ben-Kiki
+2  A: 

As far as I know, all larger parsers use streams to parse from. Isn't that suitable for your situation?

Paco
Nope, because I can't apply a Regex to a stream, or even do a simple non-destructively test whether it starts with a literal string. So directly parsing a stream requires re-implementing the whole complexity of a Regex (at least), which is what ANTLR does, which is why I called it a "monster" compared to something trivial like TinyPG. However, TinyPG repeatedly calls Substring to get the rest of the input. Give it a 1MB file which is split to 100,000 tokens and you end up copying 5GB of memory. Yikes!
Oren Ben-Kiki
A: 

Nothing in C# provides you the out-of-the-box functionality you're looking for.

What want is a Rope data structure, an immutable data structure which supports O(1) concats and O(log n) substrings. I can't find any C# implementations of a rope, but here a Java one.

Barring that, there's nothing wrong with using TinyPG or ANTLR if that's the easiest way to get things done.

Juliet
A: 

Well you could use "unsafe" to do the memory management yourself, which might allow you to do what you are looking for. Also the StringBuilder class is great for situations where a string needs to be manipulated numerous times, since it doesn't make a new string with each manipulation.

Zensar
I'd love to do that except that you can't apply a Regex to a StringBuffer :-(
Oren Ben-Kiki
A: 

You could easily write a trivial class to represent "cheap". It would just hold the index of the start of the substring and the length of the substring. A couple of methods would allow you to read the substring out when needed - a string cast operator would be ideal as you could use

string text = myCheapObject;

and it would work seamlessly as if it were an actual string. Adding support for a few handy methods like StartsWith would be quick and easy (they'd all be one liners).

The other option is to write a regular parser and store your tokens in a Dictionary from which you share references to the tokens rather than keeping multiple copies.

Jason Williams
If you create a CheapString class it wouldn't work with regex
AlbertEin