views:

219

answers:

4

Here is the situation:

I have a webpage that I have scraped as a string.

I have several fields in a MSSQL database. For example, car model, it has an ID and a Name, such as Mustang or Civic. It is pre-filled with most models of car.

I want to find any match for any row in my models table. So if I have Civic, Mustang and E350 in my Model Table I want to find any occurance of any of the three on the page I have scraped.

What is an efficient way to do this in C#. I am using LINQ to SQL to interface with the db.

Does creating a dictionary of all models, tokenizing the page and iterating through the tokens make sense? Or should I just iterate through the tokens and use a WHERE clause and ask the database if there is a match?

    //Dictionary dic contains all models from the DB, with the name being the key and the id being the value...
    foreach(string pageToken in pageTokens)
    {
         if(dic.ContainsKey(pageToken)) 
         {
              //Do what I need to do
         }
    }

Both of these methods seem terrible to me. Any suggestions on what I should do? Something with set intersection I would imagine might be nice?

Neither of these methods address what happens when a Model name is more than one word..like "F150 Extended Cab". Thoughts on that?

A: 

You should be using Regex, not tokenizing based on space.

With Regex you could use spaces and be just fine, and I believe it would be faster than tokenizing and looping through list of possible values.

How you construct that Regex though I am not sure.

Most simply, you could simply build a Regex with every model like

(Model 1|Model 2|Model 3) 

But I am sure there are more efficient ways to do this in regex.

Aequitarum Custos
If there are many models, the regex pattern will be pretty big... I'm not sure how efficient that would be.
Thomas Levesque
Yeah, interesting question for sure.
Aequitarum Custos
Some regex engines may optimize it, but a traditional NFA/DFA with that many brances would be slow and terribly memory-inefficient. I'd expect a regex to perform *worse* than a naieve search.
Robert Fraser
Yup, here's a comparison of Aho-Corasick, naieve search (IndexOf) and regex: http://www.codeproject.com/KB/recipes/ahocorasick.aspx . Regex performance is godawful.
Robert Fraser
+3  A: 

My first approach would be super-simple:

foreach(string carModel in listOfCarModelsFromDatabase) {
    if(pageText.Contains(carModel) {
        // do something
    }
}

I'd only start worrying about making it faster if the above weren't fast enough. The list of car models just can't possibly be that large (< 10000?) and it's only one page of text.

Jason
This would get rather inefficient with large lists of models, but would work for small lists.
Aequitarum Custos
Yes, but there just aren't that many car models. How many car companies has there been in the history of the world? Fifty? What's the maximum number of car models each has had? Two hundred? So maybe there's been 50 * 200 = 10000 models in the history of the world? This is certainly the right order of magnitude.
Jason
I have about 1500 objects I need to compare to the text.
Blankasaurus
True, I'm just imagining that if he is "indexing" 100000 pages, each with 10 kilobytes of text, that's ~1 gig of data, to foreach loop over even say 500 models.
Aequitarum Custos
752 total models, but I have to compare other stuff too.
Blankasaurus
String matching from a dictionary is a well-understood problem in computer science; this answer is like suggesting bubble sort.
Robert Fraser
This isn't foolproof and will likely return some false hits. What happens when the page text contains a sentence like "I drove to my Civics class today"? Just a thought. I left a comment for the OP to determine whether any post scraping efforts have taken place.
Ahmad Mageed
At a time I will be searching about 500 pages with 1500 objects I need to look for. So this method will probably work. Didn't think to Contains for strings... =D Will still take a long time I would imagine. Ill try it out and let you know.
Blankasaurus
@Aequitarum Custos: "I have a webpage that I have scraped as a string." It's a single page, not 100,000. Don't needlessly generalize a problem.
Jason
@Robert Fraser: Not even close to an accurate simile. See also http://en.wikipedia.org/wiki/KISS_principle and http://en.wikipedia.org/wiki/Overengineering.
Jason
@Casoninabox: The thing is, this approach is just so simple to code, and so easy to get right. If it's fast enough great, move on and create value elsewhere. If not, then look at more sophisticated approaches.
Jason
@Jason: True, but I thought Casoninabox was asking the question precisely because the simple solution was unacceptable.
Robert Fraser
+4  A: 

Searching for multiple strings in a larger text is a well-understood problem, and signifigant research has been made into making it fast. The two most popular and effective methods for this are the Aho-Corasick Algorithm (I'd rcommend this one) and the Rabin-Karp Algorithm. They use a little preprocessing, but are orders of magnitude less complex & faster than the naieve method (the naieve method is worst-case O(m*n^2*p) where m is the length of the long string [the webpage you scraped] and n is the average length of the needles and p is the number of needles). Aho-Corsaik is linear. A C# implementation of it can be found at CodeProject for free.

Edit: Oops, I was wrong about the complexity of Aho-Corasick -- it's linear in the number & length of input strings + the size of the string being analyzed [the scraped text] plus the number of matches. But it's still linear and linear is a lot better than cubic :-).

Robert Fraser
+1 Very cool. Thanks.
Blankasaurus
I decided to do the processing at a later stage, which eliminates the performance issues. This is very useful info though.
Blankasaurus
A: 

For a really simple solution that does substring matches (that should perform reasonably well), you could use a parameterized SQL query like this:

select ModelID, ModelName
from Model
where ? like '%' + ModelName + '%'

where the ? is a parameter that gets replaced with the entire webpage text.

RedFilter
-1: This is the naive, obvious solution. It's morally equivalent to the "awful" solution the author posted. While it might be fast enough for his purposes, he kind of asked for an alternative.
Brian
What database engine is being used? Some database engines will implement a good search method automatically and this would be better (more maintainable, more optimized, less data transfeer, etc.) than a hand-implementation of Aho-Corasick or Rabin-Karp.
Robert Fraser
As I said in my answer, this is a very simple solution. What I think might be interesting to some is the inverted use of `LIKE` - many people are not aware that can be done.
RedFilter