views: 192
answers: 3
I have a large collection of custom objects that I have retrieved from a query in my system. Let's say these objects all have 5 different properties - FirstName, LastName, Gender, ZipCode and Birthday. For each property I would like to get a list of all of the unique values with their counts, sorted in descending order. It is essentially a faceted navigation system. So if my initial query returns 5000 results, I would like to be able to display the top 10 FirstNames from most popular to least popular, with the count next to each, and then the same for the other properties.

Currently I have a routine that goes through each item one at a time, examines the different properties, and keeps a bunch of different hashtables with the information. It works, but it is super slow. I think that going through each item one at a time is not very efficient. Is there some other type of C# structure I could use that would make getting this type of information easier? I know that SQL Server does a great job of this type of thing, but I don't think that is really a possibility here. I'm getting my list of custom objects from the API of a different system, so I would have to take that list of objects and put them into a temp table somehow, which sort of defeats the purpose. Plus, SQL Server temp tables are connection-specific, I think, and my app re-uses connections.

EDIT: What I am trying to avoid is having to iterate through the list and process each individual item. I was wondering if there was some data structure that would allow me to query the whole list at once (like a database) and get the information. The problem is that our front-end web server is getting hammered - we have a lot of traffic, people are hitting these faceted nav pages, and I am looking for a more efficient way of doing it.

Any ideas?

Thanks, Corey

A: 

Keeping one dictionary per property should work fine. How slow is it? Can you show us the code you're using? 5000 items should be processed in the blink of an eye.
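To make the dictionary-per-property idea concrete, here is a minimal .NET 2.0-compatible sketch (no LINQ). `Person` and its properties are taken from the question; the class and method names are illustrative, not from any existing code:

```csharp
using System;
using System.Collections.Generic;

class FacetCounter
{
    // Count occurrences of one property's values in a single pass over the list.
    public static Dictionary<string, int> CountBy(IList<Person> items,
                                                  Converter<Person, string> selector)
    {
        Dictionary<string, int> counts = new Dictionary<string, int>();
        foreach (Person p in items)
        {
            string key = selector(p);
            int current;
            counts.TryGetValue(key, out current); // current is 0 if the key is new
            counts[key] = current + 1;
        }
        return counts;
    }

    // Return the top N (value, count) pairs, most frequent first.
    public static List<KeyValuePair<string, int>> TopN(Dictionary<string, int> counts, int n)
    {
        List<KeyValuePair<string, int>> list = new List<KeyValuePair<string, int>>(counts);
        list.Sort(delegate(KeyValuePair<string, int> a, KeyValuePair<string, int> b)
        {
            return b.Value.CompareTo(a.Value); // descending by count
        });
        if (list.Count > n)
            list.RemoveRange(n, list.Count - n);
        return list;
    }
}
```

Usage would be something like `FacetCounter.TopN(FacetCounter.CountBy(people, delegate(Person p) { return p.FirstName; }), 10)`. To avoid five passes, you could instead update all five dictionaries inside one `foreach` loop - either way, 5000 items should take milliseconds.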

Are you using .NET 3.5? If so, LINQ could help you with a lot of this - in particular, using ToLookup with each property in turn would work pretty well.
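For anyone who *is* on .NET 3.5, the grouping approach Jon mentions might look like this (again assuming a `Person` type as described in the question):

```csharp
using System.Linq;

// Hypothetical: top 10 first names by frequency, descending.
var topFirstNames = people
    .GroupBy(p => p.FirstName)
    .Select(g => new { Name = g.Key, Count = g.Count() })
    .OrderByDescending(x => x.Count)
    .Take(10);
```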

Jon Skeet
No, I am not using .NET 3.5. As I added in my edit above, what I am trying to avoid is iterating through the list and processing each individual item - I was hoping some data structure would let me query the whole list at once, like a database, because our front-end web server is getting hammered by traffic on these faceted nav pages.
Corey Burnett
Hey, Jon, http://stackoverflow.com/questions/2072752/why-doesnt-my-threaded-net-app-scale-linearly-when-allocating-large-amounts-of
Will
@Corey: How would you expect *any* data structure to magically process the elements without iterating through the list at least *once*? Once should be all you require, but you do have to do it once...
Jon Skeet
Well, yes - I realize that something has to iterate through the list. I just figured that there was a better, more efficient way of doing it than the way I was doing it.
Corey Burnett
+1  A: 

i4o - Indexed LINQ (http://www.codeplex.com/i4o) allows you to put indexes on objects.

It basically provides RDBMS-style indexing for the CLR.

Are you using a DBMS for your initial query? If so, why not just write specific SQL queries for each facet?

George Polevoy
No, the initial query is not from a DBMS. It's through a third-party API, so my querying is limited.
Corey Burnett
+1  A: 

Unfortunately, I'm pretty sure the answer to your question is, "No." If the only way you have of getting your data is an unindexed List&lt;MyObject&gt;, then something is going to have to go through those items one by one to compute Top-N lists or build indices. Even if you hand that off to another tool (a temp database or a third-party data structure), you're just moving the processing somewhere else - the CPU will work just as hard. The approach you outline in your original question seems like the most reasonable thing to do.

A few suggestions:

  • Are these Top-N lists the same for all users, or could they be broken into a small number of distinct use cases? You could compute them once and store them in the web cache, perhaps with a background process that refreshes them every M minutes to keep them reasonably up-to-date.
  • Is it just a UI perception problem? Could you calculate and display the most important results first, then calculate the others in the background and deliver them to the page asynchronously?
  • Beg the API provider for a more robust way to get results?? :)
  • Throw more hardware at it?? :)
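The web-cache idea in the first bullet could be sketched like this in an ASP.NET app. `GetTopNames`, `ComputeTopNames`, the cache key format, and the 5-minute window are all illustrative assumptions, not part of the asker's actual code:

```csharp
using System;
using System.Collections.Generic;
using System.Web;
using System.Web.Caching;

// Cache the computed Top-N list per facet selection.
// "facetKey" would encode the user's currently selected facets.
List<KeyValuePair<string, int>> GetTopNames(string facetKey)
{
    string cacheKey = "topNames:" + facetKey;
    List<KeyValuePair<string, int>> cached =
        HttpRuntime.Cache[cacheKey] as List<KeyValuePair<string, int>>;
    if (cached != null)
        return cached; // another user asked for the same facets recently

    // Expensive single pass over the object list (see the original routine).
    List<KeyValuePair<string, int>> result = ComputeTopNames(facetKey);

    // Absolute 5-minute expiration; tune to how stale the counts may be.
    HttpRuntime.Cache.Insert(cacheKey, result, null,
        DateTime.UtcNow.AddMinutes(5), Cache.NoSlidingExpiration);
    return result;
}
```

If the source system can notify you of changes, calling `HttpRuntime.Cache.Remove(cacheKey)` would force the next request to recompute instead of waiting out the expiration.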

Sorry for the non-answer, but I don't think there's a magic bullet here.

Dave
Thanks Dave. That was sort of what I assumed. I may just have to refactor my code a bit and look for ways to speed it up or optimize it. The system I am building is a faceted nav that lets people select facets (such as gender or zip code) and then see a new list of objects matching the selected facets. And of course the Top-N lists are different each time, because they only apply to the results for the currently selected facets. Hope that makes sense.
Corey Burnett
I could use some sort of caching, so that if User A asks for the exact same facets that User B selected 5 minutes ago, I just give them the results out of the cache. But then I need to be able to trigger a cache refresh if the data in the original system changes.
Corey Burnett