views:

149

answers:

5

I have a dataset. This dataset will serve a lookup table. Given a number, I should be able to lookup a corresponding value for that number.

The dataset (let's say its CSV) has a few caveats though. Instead of:

1,ABC
2,XYZ
3,LMN

The numbers are ranges (- being "through", not minus):

1-3,ABC     // 1, 2, and 3 = ABC
4-8,XYZ     // 4, 5, 6, 7, 8 = XYZ
11-11,LMN   // 11 = LMN

All the numbers are signed ints. No ranges overlap with another ranges. There are some gaps; there are ranges that aren't defined in the dataset (like 9 and 10 in the last snippet above). `

How might I model this dataset in C# so that I have the most-performant lookup while keeping my in-memory footprint low?

The only option I've come up with suffers from overconsumption of memory. Let's say my dataset is:

1-2,ABC
4-6,XYZ

Then I create a Dictionary<int,string>() whose key/values are:

1/ABC
2/ABC
4/XYZ
5/XYZ
6/XYZ

Now I have hash performance-lookup, but tons of wasted space in the hash table.

Any ideas? Maybe just use PLINQ instead and hope for good performance? ;)

+4  A: 
arootbeer
+1 Very interesting, doubling the constant which still leaves it O(1)
Davy8
Well actually you would get O(2) since you're doing a lookup in two hash tables, but yeah, nifty.
SnickersAreMyFave
@SnickersAreMyFave: The constant is removed out in Big-O, so it's still O(1). O(constant) is always O(1), no matter what the constant.
Reed Copsey
So how do you express that it takes twice as long, then? Would you say the complexity for this algorithm above is: O(1) + O(1)?
SnickersAreMyFave
How large do you think this data structure would need need to be to store the two range keys: {1,100000}, {100000,80000000}? This can rapidly become a more expensive options than just performing a binary-search solution.
LBushkin
@Snickers there is no such thing as O(1) + O(1) or O(2). There is no way to represent that something takes twice as long in Big-O notation. It doesn't measure things on that "small" of a scale.
Davy8
If something takes 99n^2 +36n + 158 time, it has the same Big-O as 0.0000n^2. They're both O(n^2).
Davy8
I must be missing something. How would this save any memory? It seems to me it would just cost more.
Dan Tao
It really only saves memory in the situation described by the question: each value in the original dictionary is a unique copy of the string referenced, rather than having each value contain a reference to the same string (as other answers have pointed out).
arootbeer
@arootbeer: Where would these unique copies come from? I don't mean to be argumentative, but I really can't imagine *any* scenario where this would be the case... presumably, if the OP wants to define a value for a specified range of keys, he/she must *have* a `string` representing that value and can use this instance for each of the `int` keys within that range. If the string is known at compile-time, of course, it is interned; but even if it is only determined at run-time (e.g., if supplied by user input), it seems the OP would have to *deliberately* make a bunch of copies to waste memory.
Dan Tao
To propose a scenario, imagine that each string was produced by calling `ToString` on an object (using an override that doesn't `Intern`), rather than by assigning an intermediate variable. It's not something you *should* do, but it's something you easily *could* do.
arootbeer
@arootbeer: I see what you mean. But this would essentially amount to a mistake on the developer's part. It seems (to me) that to use a `Dictionary<int, int>` which merely provides a key with which to perform a lookup on a `Dictionary<int, string>` would be quite a lot of effort simply to avoid creating duplicate string instances, which the developer could do far more easily by simply calling `ToString` *once* and using the resulting value for each key within a range.
Dan Tao
@Dan Tao: That's true. But if you desired to go to that effort, this seems to me to be a straightforward way to go about it :)
arootbeer
@arootbeer: All right, I'll let it go... ;)
Dan Tao
A: 

arootbeer has a good solution, but one you may find confusing to work with.

Another choice is to use a reference type instead of a string, so that you point to the same reference

class StringContainer { 
    public string Value { get; set; }
}

Dictionary<int, StringContainer> values;

var value1 = new StringContainer { Value = "ABC" };
values.Add(1, value1);
values.Add(2, value1);

They will both point to the same instance of StringContainer

EDIT: Thanks for the comments everyone. This method handles value types other than string, so it might be useful for more than the given example. Also, it is my understanding that strings don't always behave in the manner you would expect from reference values, but I could be wrong.

Luke Schafer
Strings _are_ reference types in C# / .NET
Wesley Hill
Yep, strings are actually references types. I had thought that the CLR actually caches and reuses string instances. Since they are immutable in .NET this is a safe thing to do. I can't find anything to confirm this though and don't have access to .NET on Windows right now.
Sean Copenhaver
You could also call `string.Intern` for the same effect without creating a new class. Also if the value of the string is known at compile time it is automatically interned. http://msdn.microsoft.com/en-us/library/system.string.intern.aspx
Davy8
I think I read somewhere that strings are actually mutable in IL, but that functionality isn't used in any (popular?) .NET languages. I can't find the reference though unfortunately.
Davy8
@Davy8 ah, thanks for the link. I'm pretty sure that's what I was thinking about.
Sean Copenhaver
@Davy8: Strings are mutable within mscorlib - that is, there are `internal` methods on the `String` class that mutate the current instance. Of course, there's nothing to stop you calling those methods from your own code using reflection, but it's inadvisable.
LukeH
A: 

Use a balanced ordered tree (or something similar) mapping start-of-range to end-of-range and data. This will be easy to implement for non-overlapping ranges.

Rafe
In fact, if your data is fixed, you can just use an array plus binary search.
Rafe
+3  A: 

If your dictionary is going to truly store a wide range of key values, an approach that expands all possible ranges into explicit keys will rapidly consume more memory than you likely have available.

You're best option is to use a data structure that supports some variation of binary search (or other O(log N) lookup technique). Here's a link to a generic RangeDictionary for .NET that uses an OrderedList internally, and has O(log N) performance.

Achieving constant-time O(1) lookup requires that you expand all ranges into explicit keys. This requires both a lot of memory, and can actually degrade performance when you need to split or insert a new range. This probably isn't what you want.

LBushkin
+1  A: 

As arootbeer has mentioned in his answer, the following code does not create multiple instances of the string "ABC"; rather, it interns a single instance and assigns a reference to that instance to each KeyValuePair<int, string> in dictionary:

var dictionary = new Dictionary<int, string>();
dictionary[0] = "ABC";
dictionary[1] = "ABC";
dictionary[2] = "ABC";

// etc.

OK, so in the case of string literals, you're only using one string instance per range of keys. Is there a scenario where this wouldn't be the case--that is, where you would be using a separate string instance for each key within the range (this is what I assume you're concerned about when you speak of "overconsumption of memory")?

Honestly, I don't think so. There are scenarios where multiple equivalent string instances may be created without the benefit of interning, yes. But I can't imagine these scenarios would affect what you're trying to do here.

My reasoning is this: you want to assign certain values to different ranges of keys, right? So any time you are defining a key-range-value pairing of this sort, you have a single value and several keys. The single part is what leads me to doubt that you'll ever have multiple instances of the same string, unless it is defined as the value for more than one range.

To illustrate: yes, the following code will instantiate two identical strings:

string x = "ABC";

Console.Write("Type 'ABC' and press Enter: ");
string y = Console.ReadLine();

Console.WriteLine(Equals(x, y));
Console.WriteLine(ReferenceEquals(x, y));

The above program, assuming the user follows instructions and types "ABC," outputs True, then False. So you might think, "Ah, so when a string is only provided at run-time, it isn't interned! So this could be where my values could be duplicated!"

But... again: I don't think so. It all comes back to the fact that you are going to be assigning a single value to a range of keys. So let's say your values come from user input; then your code would look something like this:

var dictionary = new Dictionary<int, string>();

int start, count;
GetRange(out start, out count);
string value = GetValue();

foreach (int key in Enumerable.Range(start, count))
{
    // Look, you're using the same string instance to assign
    // to each key... how could it be otherwise?
    dictionary[key] = value;
}

Now, if you were actually thinking more along the lines of what LBushkin mentions in his answer--that you may potentially have huge ranges, making it impractical to define a KeyValuePair<int, string> for each key within that range (e.g., if you have a range of 1-1000000)--then I would agree that you're best off with some sort of data structure that bases its lookup on a binary search. If that's more your scenario, say so and I will be happy to offer more ideas on that front. (Or you could just take a look at the link LBushkin already posted.)

Dan Tao