views:

69

answers:

2

I'm working on a project where Arrays are the default data structure for everything, and every query is a linear search in the form of:

  • Need a customer with a particular name? customer.Find(x => x.Name == name)
  • Need a customer with a particular unique id? customer.Find(x => x.Id == id)
  • Need a customer of a particular type and age? customer.Find(x => x is PreferredCustomer && x.Age >= age)
  • Need a customer of a particular name and age? customer.Find(x => x.Name == name && x.Age == age)

In almost all instances, the criteria for lookups is well-defined. For example, we only search for customers by one or more of the properties Id, Type, Name, or Age. We rarely search by anything else.

Is a good data structure to support arbitrary queries of these types with lookup better than O(n)? Any out-of-the-box implementations for .NET?

+4  A: 

For in memory, you have a few options.

Most options will be O(n). That being said, Dictionary lookups can approach O(1).

One option is to store your customers in multiple Dictionaries, each with a key set to the Name, Id, and Age. If you use the same object references in the dictionaries, you can make any single lookup O(1), without a huge amount of overhead.

Granted, this becomes less practical as your criteria count rises, but with 3, it's not too bad.

If you want more flexibility, then a database is an appropriate option. Many databases have the option of working as a completely in-memory database, including SQLite, which would allow arbitrary queries at much better than O(n) speed.

Reed Copsey
Problem with multiple dictionaries is that they don't support composite lookups: an age Dictionary<int, List<Customer>> returns a range of customers for a given age in constant time, but what if I also want to search by name too? I'm right back to linear search.
Juliet
If you combine the results (via ANDing/ORing) I'd think you'd get better than linear.
micahtan
@Juliet: You're back to linear search, but in a very restricted space. Each additional search is restricted to just the valid elements from the first, so the search space is much smaller. The second search is O(n), but N becomes very small. That being said, the other option is still to use an in-memory DB, which provides you fast querying with arbitrary criteria...
Reed Copsey
@Juliet assuming `IDictionary<int, IEnumerable<Customer>> AgeIndex` and `IDictionary<String, IEnumerable<Customer>> NameIndex`, you get better than linear time by using `var results = AgeIndex[age].Intersect(NameIndex[name])`. It's O(m) where m is the sum of the size of the two result sets, but without indexing on ALL possible search criteria (e.g., `IDictionary<AgeNamePair, IEnumerable<Customer>> AgeNameIndex`), you can't do much better. Class time/space tradeoff.
Dathan
@Dathan: results of `AgeIndex[age].Insersect(NameIndex[Name])` is not better than a linear search`. Consider I have a few guys named Steve with ages 20, 21, 22, 23, and 25. When I search for 20 year olds, I get back the name Steve -- so we pass it to the name index and I get back 5 Steve's who all have different ages, or I get back a list of 20 year olds who all have different names. So we're back to a linear search.
Juliet
@Juliet: Take your case. Say you have 1000 customers, and 5 "Steve"s, and 150 twenty year olds. You do AgeIndex[age], which returns the 150 twenty year olds, and intersect that with NameIndex[Name], which returns the 5 steves. Now, you're down to one intersection between a 20 element collection and a 5 element collection. It's basically linear at that point, but you're dealing with a very small set (150+5) vs. 1000... Also, intersect uses a set internally, so it's not a "linear" search in the same fashion. Profile this, it's MUCH faster than linear, especially with real data.
Reed Copsey
@Juliet What Reed said. (c:
Dathan
@Reed: ok, that makes sense :) Is there a good generic solution which doesn't require making and populating dictionaries for every type I want to index? Additionally, is there a good solution that supports range searches like "age >= 20"?
Juliet
+1, +answer: I was skeptical at first, thought I would need to reach out to an exotic data structure like a kd-tree, but this really worked and helped to improve wicked slow spots 5x. With a little more optimization, I think I can squeeze out a lot more speed -- in the mean time, thank you! :)
Juliet
A: 

Why can't you put your data into several dictionaries? One dictionary maps name to the data, another maps age, etc.

Vlad
For a start its not a very generic solution. We have these types of searches on more than just customers, and it would be a little insane to create a 1000s of dictionaries for the 100s of collection types we regularly search. Second, multiple dictionaries don't support searches on composite keys in reasonable time, for example I can't search for people with a given name AND age without a linear search through one of my dictionaries.
Juliet