views:

59

answers:

1

Hello,

I'm trying to build a search engine that goes through online vehicle classifieds such as Oodle, eBay motors, and craigslist. I also have a large database of standard vehicle names and specifications about them. What I would like to do is for each record that I find through the classified site, be able to determine exactly what vehicle model, style it is (from my database). For example, a standard name for a ford truck in my db is: 2003 Ford F150

however on classified sites, people might refer to is as: "2003 Ford F 150" or "2003 Ford f-150" or "03 Ford truck 150". Is there an effective data mining/text classification algorithm to be able to normalize these texts to the standard name above?

Thanks!

+1  A: 

You could use the Levenshtein distance to match the found string against your database records.

Another (probably better) idea is to tokenize the strings and use a term vector model for the vehicle names. This way you can use cosine similarity to find relevant matches.

Pankrat