ansaurus

Question

Matching Two Large Sets of Strings in C#

Answer 1

A:

You should be using Regex, not tokenizing based on space.

With Regex you could use spaces and be just fine, and I believe it would be faster than tokenizing and looping through list of possible values.

How you construct that Regex though I am not sure.

Most simply, you could simply build a Regex with every model like

(Model 1|Model 2|Model 3)

But I am sure there are more efficient ways to do this in regex.

Aequitarum Custos 2010-01-19 21:00:54

If there are many models, the regex pattern will be pretty big... I'm not sure how efficient that would be.

Thomas Levesque 2010-01-19 21:05:11

Yeah, interesting question for sure.

Aequitarum Custos 2010-01-19 21:07:34

Some regex engines may optimize it, but a traditional NFA/DFA with that many brances would be slow and terribly memory-inefficient. I'd expect a regex to perform *worse* than a naieve search.

Robert Fraser 2010-01-19 21:13:01

Yup, here's a comparison of Aho-Corasick, naieve search (IndexOf) and regex: http://www.codeproject.com/KB/recipes/ahocorasick.aspx . Regex performance is godawful.

Robert Fraser 2010-01-19 21:25:29

Answer 2

+3 A:

My first approach would be super-simple:

foreach(string carModel in listOfCarModelsFromDatabase) {
    if(pageText.Contains(carModel) {
        // do something
    }
}

I'd only start worrying about making it faster if the above weren't fast enough. The list of car models just can't possibly be that large (< 10000?) and it's only one page of text.

Jason 2010-01-19 21:01:03

This would get rather inefficient with large lists of models, but would work for small lists.

Aequitarum Custos 2010-01-19 21:02:42

Yes, but there just aren't that many car models. How many car companies has there been in the history of the world? Fifty? What's the maximum number of car models each has had? Two hundred? So maybe there's been 50 * 200 = 10000 models in the history of the world? This is certainly the right order of magnitude.

Jason 2010-01-19 21:04:16

I have about 1500 objects I need to compare to the text.

Blankasaurus 2010-01-19 21:09:30

True, I'm just imagining that if he is "indexing" 100000 pages, each with 10 kilobytes of text, that's ~1 gig of data, to foreach loop over even say 500 models.

Aequitarum Custos 2010-01-19 21:09:48

752 total models, but I have to compare other stuff too.

Blankasaurus 2010-01-19 21:11:13

String matching from a dictionary is a well-understood problem in computer science; this answer is like suggesting bubble sort.

Robert Fraser 2010-01-19 21:11:35

This isn't foolproof and will likely return some false hits. What happens when the page text contains a sentence like "I drove to my Civics class today"? Just a thought. I left a comment for the OP to determine whether any post scraping efforts have taken place.

Ahmad Mageed 2010-01-19 21:13:06

At a time I will be searching about 500 pages with 1500 objects I need to look for. So this method will probably work. Didn't think to Contains for strings... =D Will still take a long time I would imagine. Ill try it out and let you know.

Blankasaurus 2010-01-19 21:13:26

@Aequitarum Custos: "I have a webpage that I have scraped as a string." It's a single page, not 100,000. Don't needlessly generalize a problem.

Jason 2010-01-19 21:31:10

@Robert Fraser: Not even close to an accurate simile. See also http://en.wikipedia.org/wiki/KISS_principle and http://en.wikipedia.org/wiki/Overengineering.

Jason 2010-01-19 21:38:54

@Casoninabox: The thing is, this approach is just so simple to code, and so easy to get right. If it's fast enough great, move on and create value elsewhere. If not, then look at more sophisticated approaches.

Jason 2010-01-19 21:51:40

@Jason: True, but I thought Casoninabox was asking the question precisely because the simple solution was unacceptable.

Robert Fraser 2010-01-19 21:56:30

Answer 3

+4 A:

Searching for multiple strings in a larger text is a well-understood problem, and signifigant research has been made into making it fast. The two most popular and effective methods for this are the Aho-Corasick Algorithm (I'd rcommend this one) and the Rabin-Karp Algorithm. They use a little preprocessing, but are orders of magnitude less complex & faster than the naieve method (the naieve method is worst-case O(m*n^2*p) where m is the length of the long string [the webpage you scraped] and n is the average length of the needles and p is the number of needles). Aho-Corsaik is linear. A C# implementation of it can be found at CodeProject for free.

Edit: Oops, I was wrong about the complexity of Aho-Corasick -- it's linear in the number & length of input strings + the size of the string being analyzed [the scraped text] plus the number of matches. But it's still linear and linear is a lot better than cubic :-).

Robert Fraser 2010-01-19 21:01:08

+1 Very cool. Thanks.

Blankasaurus 2010-01-19 21:16:47

I decided to do the processing at a later stage, which eliminates the performance issues. This is very useful info though.

Blankasaurus 2010-01-19 23:02:23

Answer 4

A:

For a really simple solution that does substring matches (that should perform reasonably well), you could use a parameterized SQL query like this:

select ModelID, ModelName
from Model
where ? like '%' + ModelName + '%'

where the ? is a parameter that gets replaced with the entire webpage text.

RedFilter 2010-01-19 21:04:50

-1: This is the naive, obvious solution. It's morally equivalent to the "awful" solution the author posted. While it might be fast enough for his purposes, he kind of asked for an alternative.

Brian 2010-01-19 21:14:22

What database engine is being used? Some database engines will implement a good search method automatically and this would be better (more maintainable, more optimized, less data transfeer, etc.) than a hand-implementation of Aho-Corasick or Rabin-Karp.

Robert Fraser 2010-01-19 21:16:15

As I said in my answer, this is a very simple solution. What I think might be interesting to some is the inverted use of `LIKE` - many people are not aware that can be done.

RedFilter 2010-01-19 21:28:57

ansaurus

tags:

views:

answers:

Matching Two Large Sets of Strings in C#

related questions