tags:
views: 206
answers: 5

I have a list representing products which are more or less the same. For instance, in the list below, they are all Seagate hard drives.

  1. Seagate Hard Drive 500Go
  2. Seagate Hard Drive 120Go for laptop
  3. Seagate Barracuda 7200.12 ST3500418AS 500GB 7200 RPM SATA 3.0Gb/s Hard Drive
  4. New and shinny 500Go hard drive from Seagate
  5. Seagate Barracuda 7200.12
  6. Seagate FreeAgent Desk 500GB External Hard Drive Silver 7200RPM USB2.0 Retail

For a human being, hard drives 3 and 5 are the same. We could go a little further and suppose that products 1, 3, 4 and 5 are the same, and put products 2 and 6 in other categories.

We have a huge list of products that I would like to classify. Does anybody have an idea of what the best algorithm for this would be? Any suggestions?

I thought of a Bayesian classifier, but I am not sure it is the best choice. Any help would be appreciated!

Thanks.

+3  A: 

You need at least two components:

First, you need something that does "feature" extraction, i.e. that takes your items and extracts the relevant information. For example, "new and shinny" is not as relevant as "500Go hard drive" and "seagate". A (very) simple approach would be a heuristic that extracts manufacturers, technology names like "USB2.0", and patterns like "GB" and "RPM" from each item.

You then end up with a set of features for each item. Some machine learning people like to put this into a "feature vector", which has one entry per feature, set to 0 or 1 depending on whether the feature is present. This is your data representation. On these vectors you can then do a distance comparison.

Note that you might end up with a vector of thousands of entries. Even then, you still have to cluster your results.
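The extraction-plus-distance idea above can be sketched in a few lines of Python. The pattern table, the crude unit normalization, and the Jaccard distance are all illustrative assumptions, not a complete solution:

```python
import re

# Illustrative heuristics: regexes for a few feature types (assumed, not exhaustive).
PATTERNS = {
    "manufacturer": r"\b(?:seagate|connor|maxtor|quantum)\b",
    "capacity":     r"\b\d{2,}\s?(?:gb|go|mb)\b",
    "speed":        r"\b\d+\s?rpm\b",
    "interface":    r"\b(?:sata|usb2\.0|ide)\b",
}

def extract_features(title):
    """Return the set of (feature, value) pairs found in a product title."""
    feats = set()
    for name, pattern in PATTERNS.items():
        for m in re.finditer(pattern, title.lower()):
            # Crude normalization: drop spaces, map the French "Go" to "gb".
            value = m.group(0).replace(" ", "").replace("go", "gb")
            feats.add((name, value))
    return feats

def jaccard_distance(a, b):
    """Distance between two feature sets: 0.0 = identical, 1.0 = disjoint."""
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)
```

On the example list, items 1 and 3 share the manufacturer and (after normalization) the capacity, so they come out much closer to each other than to item 2. A sparse 0/1 feature vector is the same information in a different shape; with sets, the Jaccard distance plays the role of the vector distance.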

Possibly useful Wikipedia articles:

Manuel
Thank you! Very interesting approach!
Martin
+1  A: 

One of the problems you will encounter is deciding on nearest neighbours for non-linear or non-ordered attributes. I'm building on Manuel's answer here.

One such problem is deciding on the proximity of (1) Seagate 500Go, (2) Seagate Hard Drive 120Go for laptop, and (3) Seagate FreeAgent Desk 500GB External Hard Drive Silver 7200RPM USB2.0 Retail:

Is 1 closer to 2 or to 3? Do the differences justify different categories?

A human would say that 3 is between 1 and 2, since an external HD can be used with both kinds of machines. This means that if somebody searches for a HD for his desktop and broadens the scope of the selection to include alternatives, external HDs will be shown too, but not laptop HDs. As the scope is enlarged further, SSDs, USB memory sticks, and CD/DVD drives will probably even show up before laptop drives.

Possible solution:

Present users with pairs of attribute values and let them weight their proximity. Give them a scale to tell you how close together certain attribute values are. Broadening the scope of a selection will then use this scale as a distance function on that attribute.
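One way to sketch this: the human-supplied ratings become a lookup table that serves as the distance function for that attribute. The form-factor values and the ratings below are made-up examples, and unknown pairs are assumed to be maximally far apart:

```python
# Human-supplied proximity ratings for a hypothetical "form factor" attribute
# (0.0 = identical, 1.0 = unrelated). frozenset keys make the table symmetric.
PROXIMITY = {
    frozenset(["desktop hd", "external hd"]): 0.3,
    frozenset(["laptop hd", "external hd"]):  0.4,
    frozenset(["desktop hd", "laptop hd"]):   0.8,
}

def attribute_distance(a, b):
    """Distance between two attribute values, looked up from human ratings."""
    if a == b:
        return 0.0
    return PROXIMITY.get(frozenset([a, b]), 1.0)  # unknown pair: maximally far

def broaden(query_value, candidates, radius):
    """Broaden the scope of a selection: return all candidate values within
    `radius` of the queried value, sorted nearest-first."""
    hits = [(attribute_distance(query_value, c), c) for c in candidates]
    return [c for d, c in sorted(hits) if d <= radius]
```

With these example ratings, broadening a desktop-HD search with a radius of 0.5 pulls in external HDs but not laptop HDs, matching the human intuition described above.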

Ralph Rickenbach
+1  A: 

To actually classify a product, you could use something like an "enhanced neural network" with a blackboard. (This is just a metaphor to get you thinking in the right direction, not a strict use of the terms.)

Imagine a set of objects connected through listeners or events (just like neurons and synapses). Each object has a set of patterns and tests the input against these patterns.

An example:

  • One object tests for ("seagate"|"connor"|"maxtor"|"quantum"| ...)
  • Another object tests for [:digit:]*(" ")?("gb"|"mb")
  • Another object tests for [:digit:]*(" ")?"rpm"

All these objects connect to another object that, if certain combinations of them fire, categorizes the input as a hard drive. The individual objects themselves would enter certain characterizations onto the blackboard (a common writing area for statements about the input) such as manufacturer, capacity, or speed.

So the neurons do not fire based on a threshold, but on the recognition of a pattern. Many of these neurons can work highly in parallel on the blackboard and even correct categorizations made by other neurons (maybe introducing certainties?).
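A minimal sketch of this blackboard metaphor in plain Python: each "neuron" is a regex pattern that, when it fires, writes its characterization to a shared blackboard dict, and a categorizer then looks at which combinations of slots got filled. The pattern names and the firing rule are illustrative assumptions:

```python
import re

# Each entry is a "neuron": a pattern that fires on part of the input.
NEURONS = {
    "manufacturer": r"\b(?:seagate|connor|maxtor|quantum)\b",
    "capacity":     r"\b\d{2,}\s?(?:gb|mb)\b",
    "speed":        r"\b\d+\s?rpm\b",
}

def run_blackboard(text):
    """Fire every pattern neuron against the input; each hit writes its
    characterization onto the shared blackboard."""
    blackboard = {}
    for slot, pattern in NEURONS.items():
        m = re.search(pattern, text.lower())
        if m:
            blackboard[slot] = m.group(0)
    return blackboard

def categorize(blackboard):
    """The downstream object: a known manufacturer combined with a capacity
    or a speed is taken as evidence of a hard drive (an assumed rule)."""
    if "manufacturer" in blackboard and ("capacity" in blackboard or "speed" in blackboard):
        return "hard drive"
    return "unknown"
```

A real implementation would run the neurons concurrently and let later neurons revise earlier entries (with certainties); the sequential dict here only shows the data flow.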

I used something like this in a prototype for a product that classified products according to UNSPSC, and got 97% correct classification on car parts.

Ralph Rickenbach
Thank you Malach! Super interesting!
Martin
A: 

There's no easy solution for this kind of problem, especially if your list is really large (millions of items). Maybe these two papers can point you in the right direction:

http://www.cs.utexas.edu/users/ml/papers/normalization-icdm-05.pdf
http://www.ismll.uni-hildesheim.de/pub/pdfs/Rendle_SchmidtThieme2006-Object_Identification_with_Constraints.pdf

A: 

MALLET has implementations of CRFs and MaxEnt that can probably do the job well. As someone said earlier, you'll need to extract the features first and then feed them into your classifier.

Thien