I'm working on an application that runs on Windows Mobile 6 and needs to retrieve every row from an item table whose description field contains a given string (provided by the end user). The problem is that there are approximately 170,000 items in the table. Since I need to return all items that contain the string anywhere within the description, I'm forced to use LIKE '%string%', which eliminates any chance of using the index. The data and table structure are originally based on a Progress database, which has a wonderful CONTAINS operator for word-indexed fields. That isn't available in our mobile application, since it uses SQL Server Compact 3.5.
Basically, my DAL runs the query, retrieves a SqlCeDataReader, and then uses an ItemFactory to build a List&lt;Item&gt; containing only the matched items. This obviously lets us keep our domain/business objects separate from the data access layer.
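Roughly, the DAL call looks something like this (simplified; ItemFactory.Create and connectionString stand in for the real names):

    // Simplified sketch of the DAL call; ItemFactory.Create and connectionString
    // stand in for the real names (requires a reference to System.Data.SqlServerCe).
    public List<Item> FindItemsByDescription(string searchText)
    {
        List<Item> items = new List<Item>();
        using (SqlCeConnection conn = new SqlCeConnection(connectionString))
        using (SqlCeCommand cmd = conn.CreateCommand())
        {
            // LIKE '%...%' can't seek on an index, so every row's description gets scanned.
            cmd.CommandText = "SELECT * FROM Item WHERE Description LIKE @pattern";
            cmd.Parameters.AddWithValue("@pattern", "%" + searchText + "%");
            conn.Open();
            using (SqlCeDataReader reader = cmd.ExecuteReader())
            {
                while (reader.Read())
                {
                    items.Add(ItemFactory.Create(reader)); // maps the row to a domain object
                }
            }
        }
        return items;
    }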
Fine and dandy, except for the 8m and 42s that it takes to retrieve items when I search for all items that contain something like "golf" in the description. Obviously this is not an acceptable time frame for the end user.
My first attempt was to instead retrieve all items from the database using SELECT * FROM Item (with an ORDER BY clause on one of the main indexed fields). I then ran an IndexOf check as I iterated through the SqlCeDataReader and had the ItemFactory add items to the List only if they contained the requested description text. This brought the time down to 1m 46s. Not too shabby, but still too slow.
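That attempt looked roughly like this (same placeholder names as above; ItemNumber stands in for the indexed field I ordered by):

    // Sketch of the "pull everything, filter in code" attempt.
    cmd.CommandText = "SELECT * FROM Item ORDER BY ItemNumber";
    using (SqlCeDataReader reader = cmd.ExecuteReader())
    {
        int descriptionOrdinal = reader.GetOrdinal("Description");
        while (reader.Read())
        {
            string description = reader.GetString(descriptionOrdinal);
            // IndexOf in managed code instead of LIKE '%...%' in the query.
            if (description.IndexOf(searchText, StringComparison.OrdinalIgnoreCase) >= 0)
            {
                items.Add(ItemFactory.Create(reader));
            }
        }
    }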
I then tried another approach that showed promise... almost. While the application starts up, I create a List&lt;Item&gt; containing every item in the database (it takes about 2 minutes to run the query and populate the entire list, but at least it only happens once while the app is initializing... still... ugh). Once the list is complete, I can easily run queries against it, doing things such as the following (I hope my syntax is right... I'm not at work right now and don't have Visual Studio on the PC I'm sitting at):
List<Item> specificItems =
AllItems.FindAll(i => i.Description.IndexOf(searchString, StringComparison.OrdinalIgnoreCase) >= 0);
This approach knocked it down to 21s. Very nice (still slow in the grand scheme of things, though). However, the problem is that memory usage is far too high if I load all the items from the database. I actually had to cut off the last 20,000 items during the initial load (so the 21s time frame would probably have been more like 25s), because an OutOfMemoryException was being thrown. According to the memory manager on the emulator, I still had about 20 MB of free RAM, but I've heard that a process can only have 32 MB of RAM associated with it (not sure if that's true for WM 6, but it appears so).
To make sure the problem wasn't the List object holding all the items (I was instantiating it with the needed capacity in its constructor to avoid dynamic resizing, since I've read the implicit EnsureCapacity calls can cause extra memory usage), I tried using an Item[] array (sized ahead of time) instead. It still had the memory issue, and the size difference was negligible.
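In other words, both of these were preallocated up front (itemCount standing in for the known row count):

    // Both preallocated to the known count; neither avoided the OutOfMemoryException.
    List<Item> allItems = new List<Item>(itemCount); // capacity set up front, no resize growth
    Item[] allItemsArray = new Item[itemCount];      // fixed-size array alternative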
OK, enough rambling. I know I'm likely going to have to somehow limit the records returned from the database to the DataReader (through some indexed search on a different type of field) and then use IndexOf on that smaller subset of items to get maximum performance (thus skipping the LIKE operator altogether). This will mean the end user has to enter more than just description text to search on, though (perhaps item hierarchy information to limit which types of items to search within).
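For example, the pre-filter might end up looking something like this (CategoryCode and selectedCategory are hypothetical), with the IndexOf check from earlier then only having to scan the reduced result set:

    // Hypothetical indexed pre-filter; the IndexOf check then only scans the smaller subset.
    cmd.CommandText = "SELECT * FROM Item WHERE CategoryCode = @category";
    cmd.Parameters.AddWithValue("@category", selectedCategory);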
Any ideas? Am I going about this the wrong way?
Thanks for listening (sorry this post is long, I'm kind of thinking out loud).
Oh I should add (just in summary) what I'm using:
- Windows Mobile 6
- SQL Server Compact Edition 3.5
- C# (.NET Compact Framework 3.5)
UPDATE: While the Bloom Filter approach mentioned below seemed interesting, it couldn't fulfill one requirement (which I didn't really specify above): it can't match words that are contained inside other words (e.g. searching for "club" would not return "clubs"), and I need that. Because of this, I was forced to use a different approach altogether (Kent Fredric... thanks for pointing this out). I've marked Kent's answer as the correct one, since his approach filled the most requirements (Mitch, yours had a similar issue to the Bloom filter suggested by Jaunder). However, I've gone a different route than his as well (for now...).
What I've done is pull all items into memory with only their item numbers and descriptions (which keeps it under the memory limitations, although it still causes a longer initialization than I'd like... multithreading and loading that information in the background while the application is running can take care of that, I guess). To perform the searches I've written my own contains routine. The routine is written in unsafe C# code that uses two pointers and a couple of loops to run through the description and the required matching text. If it finds a match anywhere in the description, it adds the item number to an array. Once all items have been searched, a new query goes back to the database and grabs only the matching item numbers (which is very fast thanks to the index on an integer field). Those items are then created in a List&lt;Item&gt; with all of their information (not just the item number and description). The whole operation takes approximately 5-10 seconds (depending on the description), which is good enough for now.
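A rough sketch of the idea (simplified; ContainsText, ItemSummary, and allItemSummaries are illustrative names and the real routine differs in its details; the project has to be compiled with /unsafe, and this stripped-down version compares case-sensitively, so the real routine also needs to normalize case):

    // Simplified version of the unsafe contains routine: true if 'find' occurs anywhere in 'text'.
    public static unsafe bool ContainsText(string text, string find)
    {
        if (find.Length == 0)
            return true;

        fixed (char* pText = text)
        fixed (char* pFind = find)
        {
            for (int i = 0; i < text.Length; i++)
            {
                int j = 0;
                while (j < find.Length && i + j < text.Length && pText[i + j] == pFind[j])
                    j++;
                if (j == find.Length)
                    return true; // full match starting at position i
            }
        }
        return false;
    }

    // Usage: collect matching item numbers from the in-memory number/description pairs,
    // then re-query the database for just those numbers (fast thanks to the integer index).
    List<int> matchingNumbers = new List<int>();
    foreach (ItemSummary summary in allItemSummaries)
    {
        if (ContainsText(summary.Description, searchText))
            matchingNumbers.Add(summary.ItemNumber);
    }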
I'll still look into optimizing this further (I might be able to take the search term's length into account... if fewer characters remain in the item description than the required text needs, the loop could skip straight to the next item).
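Applied to the sketch above, that would just mean changing the outer loop bound (which also makes the inner i + j check unnecessary):

    // Stop as soon as fewer characters remain in the description than the search text needs;
    // this replaces the outer loop in the sketch above.
    for (int i = 0; i + find.Length <= text.Length; i++)
    {
        int j = 0;
        while (j < find.Length && pText[i + j] == pFind[j])
            j++;
        if (j == find.Length)
            return true;
    }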
Any suggestions are still welcome. For now I have marked Kent's answer as "the most correct" for my question.
Props to Dolch for helping me write the contains routine.