views: 148
answers: 2

I have a large (~2.5M records) database of image metadata. Each record represents an image and has a unique ID, a description field, a comma-separated list of keywords (say 20-30 keywords per image), and some other fields. There's no real database schema, and I have no way of knowing which keywords exist in the database without iterating over every image and counting them. Also, the metadata comes from several different suppliers, who each have their own ideas about how to fill out the different fields.

There are some things I would like to do with this metadata, but since I'm totally new to these kinds of algorithms I don't even know where to begin looking.

  1. Some of these images have certain usage restrictions on them (given in text), but each supplier phrases them differently, and there is no way to guarantee consistency. I'd like to have a simple test I could apply to an image that gives an indication of whether that image is free from restrictions or not. It doesn't have to be perfect, just 'good enough'. I suspect I could use some kind of Bayesian filter for this, right? I could train the filter with a corpus of images that I know are either restricted or restriction-free, and then the filter would be able to make predictions for the rest of the images? Or are there better ways?
  2. I would also like to be able to index these images according to 'keyword likeness', so that if I have one image, I could quickly tell which other images it shares the most keywords with. Ideally, the algorithm would also take into account that some keywords are more significant than others and weigh them differently. I don't even know where to start looking here, and would be very glad for any pointers :)

I'm working primarily in Java, but language choice is irrelevant here. I'm more interested in learning what approaches would be best for me to start reading up on. Thanks in advance :)

+2  A: 

You definitely have to start by turning your 'list of keywords' field into a real tagging scheme. The easiest one is a table of tags and a many-to-many relationship with the image table (that is, a third table where each record has a foreign key to an image and another foreign key to a keyword). That makes it really fast to find all images with a certain set of keywords.
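
A minimal sketch of that kind of schema, assuming an in-memory H2 database purely for illustration (the table and column names are made up; any relational store would work the same way):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    // Sketch of the tagging scheme described above: an image table, a keyword
    // table, and a junction table linking them (the many-to-many relationship).
    // Requires the H2 driver on the classpath; names are illustrative only.
    public class TagSchemaSketch {
        public static void main(String[] args) throws Exception {
            try (Connection conn = DriverManager.getConnection("jdbc:h2:mem:imagedb");
                 Statement st = conn.createStatement()) {
                st.execute("CREATE TABLE image (id BIGINT PRIMARY KEY, description VARCHAR(2000))");
                st.execute("CREATE TABLE keyword (id BIGINT PRIMARY KEY, term VARCHAR(255) UNIQUE)");
                // Junction table: one row per (image, keyword) pair.
                st.execute("CREATE TABLE image_keyword ("
                        + "image_id BIGINT REFERENCES image(id), "
                        + "keyword_id BIGINT REFERENCES keyword(id), "
                        + "PRIMARY KEY (image_id, keyword_id))");
                // Finding all images carrying a given keyword is then a single join:
                // SELECT i.* FROM image i
                //   JOIN image_keyword ik ON ik.image_id = i.id
                //   JOIN keyword k ON k.id = ik.keyword_id
                //  WHERE k.term = 'sunset';
            }
        }
    }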

The Bayesian filter to detect restriction phrasing is interesting. I'd say go for it, unless you're pressed for time. If that's the case, a few simple pattern matches should pick up more than 90-95% of cases, and the rest could be quickly finished by hand by a couple of operators.
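
A rough sketch of that kind of pattern matching (the phrases below are invented examples; a real list would need tuning against how your suppliers actually phrase restrictions):

    import java.util.List;
    import java.util.regex.Pattern;

    // Heuristic check for restriction phrases in a description field.
    // The pattern list is illustrative only and would need tuning per supplier.
    public class RestrictionHeuristic {
        private static final List<Pattern> RESTRICTION_PATTERNS = List.of(
            Pattern.compile("editorial use only", Pattern.CASE_INSENSITIVE),
            Pattern.compile("not for commercial use", Pattern.CASE_INSENSITIVE),
            Pattern.compile("rights\\s+managed", Pattern.CASE_INSENSITIVE),
            Pattern.compile("no\\s+resale", Pattern.CASE_INSENSITIVE)
        );

        /** Returns true if the description looks like it carries a usage restriction. */
        public static boolean looksRestricted(String description) {
            if (description == null) return false;
            return RESTRICTION_PATTERNS.stream()
                    .anyMatch(p -> p.matcher(description).find());
        }
    }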

Javier
Using a relational database is unfortunately not really feasible for the application I have in mind. Also, when searching for 'keyword likeness' I'm not really looking for images with a given set of keywords, but images with 'a good overlap' (hard to describe when you don't know the terminology).
fred-o
If it's not relational, but you can have more than one table, you can still work the relations yourself. And any 'overlap' algorithm starts with finding images with a given (set of) keyword(s).
Javier
Regardless of whether it's relational or not, once you have a keyword->image mapping you can do several kinds of pre-processing. For instance, you can find which keywords are more discriminative, so that, given an image, you know which of its keywords to use first to reduce the target subset.
Javier
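
A small sketch of the pre-processing suggested in the comment above: count how many images each keyword appears on, then order a given image's keywords from rarest (most discriminative) to most common. The class and method names are made up for illustration.

    import java.util.ArrayList;
    import java.util.Comparator;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.Set;

    // Builds keyword frequencies from an image-id -> keywords mapping and uses
    // them to try an image's most discriminative keywords first.
    public class KeywordStats {
        public static Map<String, Integer> countKeywordFrequencies(Map<Long, Set<String>> imageKeywords) {
            Map<String, Integer> freq = new HashMap<>();
            for (Set<String> keywords : imageKeywords.values()) {
                for (String k : keywords) {
                    freq.merge(k, 1, Integer::sum);
                }
            }
            return freq;
        }

        /** Orders one image's keywords from rarest to most common across the collection. */
        public static List<String> mostDiscriminativeFirst(Set<String> keywords, Map<String, Integer> freq) {
            List<String> ordered = new ArrayList<>(keywords);
            ordered.sort(Comparator.comparingInt((String k) -> freq.getOrDefault(k, 0)));
            return ordered;
        }
    }
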
@fred-o: How do you consider a relational setup to not be feasible? It seems that the relational model is absolutely perfect for what you are trying to do.
monksy
+1  A: 

(1) Looks like a classification problem with words in your text as features, and "Restricted" and "Not Restricted" as your labels. Bayesian filtering or any classification algorithm should do the trick.
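
For point (1), a very small naive Bayes sketch, assuming plain whitespace tokenization of the text and made-up class and method names; in practice an existing Bayesian-filter library would likely be a better starting point than rolling your own:

    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.Map;
    import java.util.Set;

    // Minimal multinomial naive Bayes over words, with labels such as
    // "Restricted" and "Not Restricted". Illustrative sketch only.
    public class NaiveBayesSketch {
        private final Map<String, Map<String, Integer>> wordCounts = new HashMap<>(); // label -> word -> count
        private final Map<String, Integer> labelCounts = new HashMap<>();
        private final Set<String> vocabulary = new HashSet<>();

        public void train(String text, String label) {
            labelCounts.merge(label, 1, Integer::sum);
            Map<String, Integer> counts = wordCounts.computeIfAbsent(label, l -> new HashMap<>());
            for (String word : text.toLowerCase().split("\\s+")) {
                counts.merge(word, 1, Integer::sum);
                vocabulary.add(word);
            }
        }

        public String classify(String text) {
            String best = null;
            double bestScore = Double.NEGATIVE_INFINITY;
            int totalDocs = labelCounts.values().stream().mapToInt(Integer::intValue).sum();
            for (String label : labelCounts.keySet()) {
                Map<String, Integer> counts = wordCounts.get(label);
                int totalWords = counts.values().stream().mapToInt(Integer::intValue).sum();
                double score = Math.log(labelCounts.get(label) / (double) totalDocs); // log prior
                for (String word : text.toLowerCase().split("\\s+")) {
                    int c = counts.getOrDefault(word, 0);
                    // Laplace smoothing so unseen words don't zero out the probability.
                    score += Math.log((c + 1.0) / (totalWords + vocabulary.size()));
                }
                if (score > bestScore) { bestScore = score; best = label; }
            }
            return best;
        }
    }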

(2) Looks like a clustering problem. First you want to come up with a good similarity function that returns a similarity score for two images based on their keywords. Cosine similarity might be a good starting point, since you are comparing keywords. From there you can compute a similarity matrix and just remember a list of 'nearest neighbors' for each image in your dataset, or you can go further and use a clustering algorithm to come up with actual clusters of images.
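
A rough sketch of such a similarity function: cosine similarity over two images' keyword sets, with an optional per-keyword weight (e.g. an IDF-style score) so that rare keywords count for more. The names and the weight map are assumptions for illustration; with all weights equal to 1.0 this reduces to plain keyword overlap.

    import java.util.Map;
    import java.util.Set;

    // Weighted cosine similarity between two keyword sets: each keyword is a
    // vector component whose value is its weight if present and 0 otherwise.
    public class KeywordSimilarity {
        public static double cosine(Set<String> a, Set<String> b, Map<String, Double> weight) {
            double dot = 0.0, normA = 0.0, normB = 0.0;
            for (String k : a) {
                double w = weight.getOrDefault(k, 1.0);
                normA += w * w;
                if (b.contains(k)) dot += w * w; // both vectors have value w for keyword k
            }
            for (String k : b) {
                double w = weight.getOrDefault(k, 1.0);
                normB += w * w;
            }
            if (normA == 0 || normB == 0) return 0.0;
            return dot / (Math.sqrt(normA) * Math.sqrt(normB));
        }
    }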

Since you have so many records, you might want to skip computing the entire similarity matrix, and just compute clusters for a small, random sample of your dataset. You can then add the other data points to the appropriate clusters. If you want to preserve more similarity information you can look into soft clustering.

Hopefully that will get you started.

Imran
Clustering seems to be worth investigating. Thanks!
fred-o