I have a large database (~2.5M records) of image metadata. Each record represents an image and has a unique ID, a description field, a comma-separated list of keywords (say 20-30 keywords per image), and some other fields. There's no real database schema, and I have no way of knowing which keywords exist in the database without iterating over every image and counting them. Also, the metadata comes from several different suppliers, who each have their own ideas about how to fill out the different fields.
There are some things I would like to do with this metadata, but since I'm totally new to these kinds of algorithms I don't even know where to begin looking.
- Some of these images have certain usage restrictions on them (given as free text), but each supplier phrases them differently, and there is no way to guarantee consistency. I'd like to have a simple test I could apply to an image that indicates whether that image is free from restrictions or not. It doesn't have to be perfect, just 'good enough'. I suspect I could use some kind of Bayesian filter for this, right? I could train the filter with a corpus of images that I know are either restricted or restriction-free, and then the filter would be able to make predictions for the rest of the images? Or are there better ways?
- I would also like to be able to index these images according to 'keyword likeness', so that if I have one image, I could quickly tell which other images it shares the most keywords with. Ideally, the algorithm would also take into account that some keywords are more significant than others and weigh them differently. I don't even know where to start looking here, and would be very glad for any pointers :)
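To make the first idea concrete, here is a rough sketch of the kind of Bayesian filter I have in mind for the restriction text. It is a plain two-class naive Bayes over words with add-one smoothing. The class name, the method names and the training phrases are all made up for illustration, and I'm assuming the restriction text can simply be split into lowercase words:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Hypothetical two-class naive Bayes filter; every name here is illustrative. */
final class RestrictionFilter {
    private final Map<String, Integer> restrictedCounts = new HashMap<>();
    private final Map<String, Integer> freeCounts = new HashMap<>();
    private final Set<String> vocabulary = new HashSet<>();
    private int restrictedTokens, freeTokens, restrictedDocs, freeDocs;

    // Assumption: lowercasing and splitting on non-alphanumerics is good enough.
    private static String[] tokenize(String text) {
        return text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+");
    }

    /** Adds one hand-labelled example to the training counts. */
    void train(String text, boolean restricted) {
        if (restricted) restrictedDocs++; else freeDocs++;
        for (String token : tokenize(text)) {
            if (token.isEmpty()) continue;
            vocabulary.add(token);
            if (restricted) {
                restrictedCounts.merge(token, 1, Integer::sum);
                restrictedTokens++;
            } else {
                freeCounts.merge(token, 1, Integer::sum);
                freeTokens++;
            }
        }
    }

    /** Estimates P(restricted | text) using add-one (Laplace) smoothing. */
    double probabilityRestricted(String text) {
        int totalDocs = restrictedDocs + freeDocs;
        // Work in log space so long texts do not underflow to zero.
        double logRestricted = Math.log((restrictedDocs + 1.0) / (totalDocs + 2.0));
        double logFree = Math.log((freeDocs + 1.0) / (totalDocs + 2.0));
        int v = vocabulary.size();
        for (String token : tokenize(text)) {
            // Words never seen in training are skipped rather than guessed at.
            if (token.isEmpty() || !vocabulary.contains(token)) continue;
            logRestricted += Math.log(
                (restrictedCounts.getOrDefault(token, 0) + 1.0) / (restrictedTokens + v));
            logFree += Math.log(
                (freeCounts.getOrDefault(token, 0) + 1.0) / (freeTokens + v));
        }
        return 1.0 / (1.0 + Math.exp(logFree - logRestricted));
    }

    public static void main(String[] args) {
        RestrictionFilter filter = new RestrictionFilter();
        // Invented training phrases, standing in for a hand-labelled corpus.
        filter.train("Editorial use only", true);
        filter.train("Not for commercial use without permission", true);
        filter.train("Royalty free, no restrictions", false);
        filter.train("Free for any use", false);
        System.out.println(filter.probabilityRestricted("Editorial use only"));
        System.out.println(filter.probabilityRestricted("Royalty free"));
    }
}
```

I would then flag an image as restricted when the returned probability is above some threshold (0.5, say) and tune that threshold against a held-out set of labelled images.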
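For the 'keyword likeness' idea, this is a minimal sketch of what I imagine, not something I know to be the right approach. It keeps an inverted index from keyword to image IDs and scores other images by summing a weight for each shared keyword, where rarer keywords weigh more. The rarity weight log(1 + N/df) is my own assumption about what 'significant' should mean, and the class name, method names and image IDs are hypothetical:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Hypothetical in-memory keyword index; assumes each image ID is added once. */
final class KeywordIndex {
    // Inverted index: keyword -> IDs of the images that carry it.
    private final Map<String, Set<String>> postings = new HashMap<>();
    // Forward index: image ID -> its normalised keyword set.
    private final Map<String, Set<String>> keywordsById = new HashMap<>();

    void add(String imageId, String commaSeparatedKeywords) {
        Set<String> keywords = new HashSet<>();
        for (String raw : commaSeparatedKeywords.split(",")) {
            String keyword = raw.trim().toLowerCase(Locale.ROOT);
            if (!keyword.isEmpty()) keywords.add(keyword);
        }
        keywordsById.put(imageId, keywords);
        for (String keyword : keywords) {
            postings.computeIfAbsent(keyword, k -> new HashSet<>()).add(imageId);
        }
    }

    /** Returns up to 'limit' other image IDs, best match first. */
    List<String> mostSimilar(String imageId, int limit) {
        Set<String> own = keywordsById.getOrDefault(imageId, Collections.emptySet());
        Map<String, Double> scores = new HashMap<>();
        for (String keyword : own) {
            Set<String> ids = postings.get(keyword);
            // Rarity weight: a keyword on few images counts for more.
            double weight = Math.log(1.0 + (double) keywordsById.size() / ids.size());
            for (String other : ids) {
                if (!other.equals(imageId)) scores.merge(other, weight, Double::sum);
            }
        }
        List<String> ranked = new ArrayList<>(scores.keySet());
        ranked.sort((a, b) -> Double.compare(scores.get(b), scores.get(a)));
        return ranked.subList(0, Math.min(limit, ranked.size()));
    }

    public static void main(String[] args) {
        KeywordIndex index = new KeywordIndex();
        // Invented records, standing in for the real metadata.
        index.add("img1", "sky, beach, volcano");
        index.add("img2", "sky, beach");
        index.add("img3", "sky, volcano");
        index.add("img4", "sky, beach, city");
        System.out.println(index.mostSimilar("img1", 3));
    }
}
```

In this toy data img2, img3 and img4 each share exactly two keywords with img1, but img3 should rank first because 'volcano' is rarer than 'beach'. Whether a plain sum like this is good enough, or whether the scores need normalising by keyword count, is exactly the kind of thing I'd like advice on.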
I'm working primarily in Java, but language choice is irrelevant here. I'm more interested in learning what approaches would be best for me to start reading up on. Thanks in advance :)