Hello all,

I asked a similar question a couple of weeks ago, but I did not phrase it well. I am asking it again here with more detail, and I would like a more AI-oriented answer.

I have a list representing products which are more or less the same. For instance, in the list below, they are all Seagate hard drives.

  1. Seagate Hard Drive 500Go
  2. Seagate Hard Drive 120Go for laptop
  3. Seagate Barracuda 7200.12 ST3500418AS 500GB 7200 RPM SATA 3.0Gb/s Hard Drive
  4. New and shinny 500Go hard drive from Seagate
  5. Seagate Barracuda 7200.12
  6. Seagate FreeAgent Desk 500GB External Hard Drive Silver 7200RPM USB2.0 Retail
  7. GE Spacemaker Laudry
  8. Mazda3 2010
  9. Mazda3 2009 2.3L

To a human, hard drives 3 and 5 are the same product. We could go a little further and treat products 1, 3, 4 and 5 as the same, and put products 2 and 6 in other categories.

In my previous question, someone suggested using feature extraction. It works very well for a small dataset of predefined descriptions (all hard drives), but what about all the other kinds of descriptions? I don't want to write regex-based feature extractors for every description my application could face; that doesn't scale. Is there a machine learning algorithm that could help me achieve this? The range of descriptions I can get is very wide: one line could be a fridge, and the next a hard drive. Should I go down the neural network path? What should my inputs be?

Thank you for the help!

+2  A: 

You should look at both clustering and classification. Your categories seem open-ended, which suggests that clustering may fit the problem better. As for input representation, you can try extracting word and character n-grams. Your similarity measure could be the count of common n-grams, or something more sophisticated. You may need to label the resulting clusters manually.
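To make the n-gram idea concrete, here is a minimal sketch in plain Python. It is only one way to do it: the function names, the choice of character trigrams, the Jaccard ratio (shared n-grams divided by all distinct n-grams, instead of a raw count) and the 0.2 threshold are all my own assumptions, not something your data dictates. It links any two descriptions whose similarity reaches the threshold and returns the connected groups.

```python
from itertools import combinations


def char_ngrams(text, n=3):
    """Return the set of lowercase character n-grams of a description."""
    normalized = " ".join(text.lower().split())  # lowercase, collapse whitespace
    return {normalized[i:i + n] for i in range(len(normalized) - n + 1)}


def jaccard(a, b):
    """Shared n-grams divided by all distinct n-grams, between 0.0 and 1.0."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def cluster(descriptions, threshold=0.2):
    """Single-link clustering: join any pair whose similarity >= threshold.

    The threshold is an assumed starting point and needs tuning on real data.
    """
    grams = [char_ngrams(d) for d in descriptions]
    parent = list(range(len(descriptions)))  # union-find forest

    def find(i):
        # Follow parent links to the root, shortening the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(descriptions)), 2):
        if jaccard(grams[i], grams[j]) >= threshold:
            parent[find(i)] = find(j)

    groups = {}
    for i, description in enumerate(descriptions):
        groups.setdefault(find(i), []).append(description)
    return list(groups.values())
```

Calling `cluster()` on your list gives you groups that you would then inspect and label by hand. Comparing every pair is quadratic, so for a large catalogue you would want an index or a proper clustering library instead of this loop.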

Yuval F
+5  A: 

I would look at some Bayesian classification methods. This involves training the classifier to recognize particular words as indicators of the probability that a product belongs to one of your classes. For example, after being trained, it could recognize that if a product description has "Seagate" in it, there's a 99% chance that it's a hard drive, whereas if it has "Mazda" there's a 97% chance it's a car. A word like "new" would probably end up contributing little to any classification, which is the way you want it to work.

The downside is that it typically requires a fairly large corpus of training data before it starts to work well, but you can set it up so that it keeps adjusting its probabilities while in production (whenever you notice that it classified something incorrectly), and it will eventually become very effective.
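As a rough illustration of what I mean, here is a small word-based naive Bayes classifier in plain Python. Treat it as a sketch under my own assumptions: the class name, the tokenizer regex and the add-one smoothing are choices I made for the example, and the labels are whatever categories you decide on. The `train` method can be called at any time, which is how you would feed corrections back in while the system is running.

```python
import math
import re
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial naive Bayes over words, with add-one smoothing."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.doc_counts = Counter()              # label -> number of examples
        self.vocab = set()                       # every word seen in training

    @staticmethod
    def tokenize(text):
        # Assumed tokenizer: lowercase runs of letters, digits and dots.
        return re.findall(r"[a-z0-9.]+", text.lower())

    def train(self, text, label):
        """Add one labelled example; can also be used for later corrections."""
        words = self.tokenize(text)
        self.word_counts[label].update(words)
        self.doc_counts[label] += 1
        self.vocab.update(words)

    def classify(self, text):
        """Return the label with the highest log-posterior, or None if untrained."""
        total_docs = sum(self.doc_counts.values())
        words = self.tokenize(text)
        best_label, best_score = None, -math.inf
        for label, n_docs in self.doc_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior of the class
            for word in words:
                # Add-one smoothing keeps unseen words from zeroing the score.
                score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label
```

With only a handful of examples the result depends heavily on the smoothing, so this only becomes trustworthy once each category has a reasonable number of labelled descriptions. Note also that it can only pick among labels it has already seen, so a brand new kind of product will be forced into an existing class until you add a corrected example.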

Bayesian techniques have been used quite heavily in recent years for spam filtering, so it might be worth reading about how they have been applied there.

Chad Birch