Let's say I've got a set of a million tags and a text that needs to be parsed for these and possibly new tags. The number of tags is just an example to illustrate the problem: too many to loop through linearly, too many to keep in memory, etc.
Somehow I can't think of a solution that has a low footprint and stays fast. I'm aware that one has to expect trade-offs, but I assume I'm overlooking some concepts.
This is especially interesting for intelligent tagging ("Michael Jackson" = "artist", etc.), since the applied tag might not be part of the text itself.
Besides word blacklisting, caching of popular tags, and huge SQL queries, what would be the most effective way of approaching this?
(Funnily enough, I have to tag this question myself :-) )
Since I'm limited in comment space, let me add some thoughts here:
- I agree that using integer hashes improves speed. Good idea.
- Hashes won't solve iteration problems (looping through each hash/tag while checking a word or word combination against the list of tags)
- To refine the problem: Assume a text like "hello world". This text has 3 potential tags ("hello", "world" and "hello world"). The tag list might only contain "hello", but "world" or "hello world" might be added after parsing, which would mean these tags are never applied to the text.
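To make the "hello world" example concrete, here is a minimal Python sketch of how candidates could be generated and checked. The function names `candidate_phrases` and `match_tags` are made up for illustration, whitespace splitting stands in for real tokenization, and the tag list is assumed to fit in an in-memory set, which may well not hold for a million tags.

```python
def candidate_phrases(text, max_words=4):
    """Return every run of 1..max_words consecutive words in text."""
    # Assumption: lowercasing plus whitespace splitting is "good enough"
    # tokenization for this illustration.
    words = text.lower().split()
    phrases = []
    for start in range(len(words)):
        for length in range(1, max_words + 1):
            if start + length > len(words):
                break
            phrases.append(" ".join(words[start:start + length]))
    return phrases


def match_tags(text, known_tags, max_words=4):
    """Return the candidates that are known tags.

    known_tags is a set, so each check is a hash lookup and the tag
    list itself is never iterated.
    """
    return {p for p in candidate_phrases(text, max_words) if p in known_tags}


print(candidate_phrases("hello world"))      # ['hello', 'hello world', 'world']
print(match_tags("hello world", {"hello"}))  # {'hello'}
```

Each candidate costs one hash lookup rather than a pass over the tag list, but "world" and "hello world" are still missed if they only become tags later.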
Problems:
- Assuming a text of book size, iterating through all word combinations (like "Nine Inch Nails", but let's assume the combination limit is 4 words) and comparing them to the tags in the database takes a long time, even with integer hashes.
- The tag list is potentially long, so iterating over stored tags is probably slow as well.
- Tag updates would mean additional full-text searches over all texts. Depending on the number of texts and their length, that's potentially a DB killer and not efficient at all.
- How would one find "relevant" new tags automatically? (Again, "Nine Inch Nails" comes to mind in an article about music, but "released a new song" would not make a good tag.) That's probably a question of its own, though.
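One direction that might avoid re-scanning every text on a tag update is an inverted index from candidate phrase to text IDs, built once at parse time. The sketch below is only an assumption about how that could look: `PhraseIndex` and its methods are hypothetical names, it lives entirely in memory, and it stores up to 4 phrases per word position, so it trades storage for lookup speed.

```python
from collections import defaultdict


class PhraseIndex:
    """Hypothetical inverted index: phrase -> IDs of texts containing it."""

    def __init__(self, max_words=4):
        self.max_words = max_words
        self.texts_by_phrase = defaultdict(set)

    def add_text(self, text_id, text):
        """Record every 1..max_words word phrase of text under text_id."""
        # Assumption: lowercasing plus whitespace splitting as tokenization.
        words = text.lower().split()
        for start in range(len(words)):
            last = min(start + self.max_words, len(words))
            for end in range(start + 1, last + 1):
                phrase = " ".join(words[start:end])
                self.texts_by_phrase[phrase].add(text_id)

    def texts_for_new_tag(self, tag):
        """Return IDs of already-parsed texts that contain the new tag."""
        # .get() avoids creating an empty entry for unknown phrases.
        return set(self.texts_by_phrase.get(tag.lower(), set()))


index = PhraseIndex()
index.add_text(1, "hello world")
index.add_text(2, "Nine Inch Nails released a new song")
print(index.texts_for_new_tag("world"))            # {1}
print(index.texts_for_new_tag("Nine Inch Nails"))  # {2}
```

With something like this, adding the tag "world" after parsing becomes a single lookup instead of a full-text search, at the cost of an index several times larger than the texts themselves. It does not address which phrases are worth promoting to tags.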