Hello all,
I asked a similar question a couple of weeks ago, but I did not phrase it correctly. So I am asking again here with more details, and I would like to get a more AI-oriented answer.
I have a list of product descriptions, many of which refer to more or less the same product. For instance, in the list below, the first six are all Seagate hard drives.
- Seagate Hard Drive 500Go
- Seagate Hard Drive 120Go for laptop
- Seagate Barracuda 7200.12 ST3500418AS 500GB 7200 RPM SATA 3.0Gb/s Hard Drive
- New and shinny 500Go hard drive from Seagate
- Seagate Barracuda 7200.12
- Seagate FreeAgent Desk 500GB External Hard Drive Silver 7200RPM USB2.0 Retail
- GE Spacemaker Laudry
- Mazda3 2010
- Mazda3 2009 2.3L
To a human being, hard drives 3 and 5 are the same. We could go a little further and suppose that products 1, 3, 4 and 5 are the same, and put products 2 and 6 in other categories.
In my previous question, someone suggested that I use feature extraction. It works very well when we have a small dataset of predefined descriptions (all hard drives), but what about all the other kinds of descriptions? I don't want to start writing regex-based feature extractors for every description my application could face; it doesn't scale. Is there any machine learning algorithm that could help me achieve this? The range of descriptions I can get is very wide: line 1 could be a fridge, and the next line a hard drive. Should I take the neural network path? What should my inputs be?
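To make the problem concrete, here is a minimal sketch of a generic baseline that needs no per-category feature extractor. It assumes plain token overlap (Jaccard similarity) and a greedy grouping. The function names `tokens`, `jaccard` and `group`, and the 0.3 threshold, are hypothetical and chosen only for this example; this is not a proposed solution.

```python
import re


def tokens(description):
    """Lowercase a description and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", description.lower()))


def jaccard(a, b):
    """Jaccard similarity of two token sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def group(descriptions, threshold=0.3):
    """Greedy grouping: each description joins the first group whose
    first member is at least `threshold` similar, else starts a new group.
    The threshold value is an arbitrary assumption for this sketch."""
    groups = []
    for description in descriptions:
        current = tokens(description)
        for members in groups:
            if jaccard(current, tokens(members[0])) >= threshold:
                members.append(description)
                break
        else:
            # No existing group was similar enough.
            groups.append([description])
    return groups


products = [
    "Seagate Hard Drive 500Go",
    "Seagate Hard Drive 120Go for laptop",
    "Mazda3 2010",
]
groups = group(products)
```

On these three descriptions the sketch separates the car from the drives, but it merges the 500Go and 120Go drives because they share three of seven tokens, and it treats "500Go" and "500GB" as unrelated tokens. That is the kind of limitation I am hoping a learned approach could get past.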
Thank you for the help!