views: 88 · answers: 1

Let's say I've got a set of a million tags and a text that needs to be parsed for these and possibly for new tags. The number of tags is just an example to illustrate the problem I'm stuck on: too many to loop through linearly, too many to keep in memory, etc.

Somehow I can't come up with a solution that has a low footprint and stays fast. I'm aware that trade-offs are to be expected, but I assume I'm overlooking some concepts.

This is especially interesting for intelligent tagging ("Michael Jackson" = "artist", etc.) since the applied tag might not be part of the text itself.

Besides word blacklisting, caching popular tags and huge SQL queries, what would be the most effective way of approaching this?

(Funnily enough, I have to tag this question myself :-) )

Since I'm limited in comment space, let me add some thoughts here:

  • I agree that using integer hashes improves speed. Good idea.
  • Hashes won't solve the iteration problem (looping through each hash/tag while checking a word or word combination against the list of tags).
  • To refine the problem: assume a text like "hello world". This text has 3 potential tags ("hello", "world" and "hello world"). The tag list might only contain "hello", but "world" or "hello world" might be added after parsing, which would mean those tags are not applied to the text.
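
To make the "potential tags" idea concrete, here is a rough sketch (the function name and the 4-word limit are mine, nothing official) that treats every run of 1 to 4 consecutive words as a candidate tag:

    import re

    def candidate_phrases(text, max_words=4):
        # every run of 1..max_words consecutive words is a potential tag
        words = re.findall(r"\w+", text.lower())
        for i in range(len(words)):
            for n in range(1, max_words + 1):
                if i + n <= len(words):
                    yield " ".join(words[i:i + n])

    print(list(candidate_phrases("hello world")))
    # ['hello', 'hello world', 'world'], i.e. the 3 potential tags mentioned above

For a book-sized text that is roughly four candidates per word, which is what makes per-candidate lookups painful.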

Problems:

  • Assuming a text of book size, iterating through all word combinations (like "Nine Inch Nails", with the combination limit set to, say, 4 words) and comparing each one against the tags in the database takes a long time, even assuming the use of integer hashes (see the sketch after this list).
  • The tag list is potentially long, so iterating over the stored tags is probably slow as well.
  • Tag updates would mean additional full-text searches over the stored texts; depending on the number of texts and their length, that's potentially a DB killer and not efficient at all?
  • How would one find "relevant" new tags automatically? (Again, "Nine Inch Nails" comes to mind in an article about music, but "released a new song" would not make a good tag.) That's probably a question of its own, though.
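
The best I can come up with so far is the following sketch (made-up names; it assumes a tags table keyed by an indexed integer hash, as suggested in the answer): hash every candidate phrase once and check the hashes in batched IN (...) queries instead of issuing one query per candidate.

    import zlib

    def phrase_hash(phrase):
        # 32-bit hash of a candidate phrase
        return zlib.crc32(phrase.lower().encode("utf-8"))

    def match_tags(conn, candidates, batch_size=500):
        # conn: an sqlite3 connection (used here just as an example store) with a
        # hypothetical tags(hash INTEGER, tag TEXT) table, indexed on hash
        hashes = list({phrase_hash(c) for c in candidates})
        found = set()
        for i in range(0, len(hashes), batch_size):
            chunk = hashes[i:i + batch_size]
            placeholders = ",".join("?" * len(chunk))
            rows = conn.execute(
                "SELECT tag FROM tags WHERE hash IN (%s)" % placeholders, chunk
            ).fetchall()
            # hash collisions are possible, so the returned tags should still
            # be compared against the actual candidate strings
            found.update(tag for (tag,) in rows)
        return found

That would turn millions of single-row queries into a few thousand indexed lookups, but I'm not sure whether it's the right overall approach.
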
+1  A: 

Hash each word in the incoming text and use it to match against the hashes of the tags you want to match. You can use a database to store and look up the hash values so you don't have to do it in memory.
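
A minimal sketch of what I mean, with sqlite standing in for whatever database you use (the table and column names are only illustrative):

    import sqlite3
    import zlib

    def word_hash(word):
        return zlib.crc32(word.lower().encode("utf-8"))

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE tags (hash INTEGER, tag TEXT)")
    conn.execute("CREATE INDEX idx_tags_hash ON tags (hash)")

    # store the (potentially millions of) tags once, hashed
    conn.executemany(
        "INSERT INTO tags (hash, tag) VALUES (?, ?)",
        [(word_hash(t), t) for t in ("hello", "world")],
    )

    def lookup(word):
        # a single indexed lookup; the tag text stays in the row so the rare
        # hash collision can be filtered out by comparing the actual strings
        return [t for (t,) in conn.execute(
            "SELECT tag FROM tags WHERE hash = ?", (word_hash(word),))]

    print(lookup("Hello"))   # ['hello']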

I don't see how that would help efficiency. I'd still have to iterate over all hashes with something like SELECT x FROM table WHERE hash=?, which is practically identical to not using hashes (SELECT x FROM table WHERE tag=?). It doesn't seem to offer any advantage, only the disadvantage that hashes are generally longer than the average keyword, which increases storage requirements and query weight.
Flim
Hashes are not longer than keywords. A hash can be sized according to the number of tags, but a 32-bit number would do the trick and takes no more space than a four-letter word. 16 bits might do if you wanted to save space, at the cost of more collisions. The hash column would be indexed, so there wouldn't be any iteration. That's about as efficient as it gets, unless you store it all in memory.
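
To put rough numbers on those widths for a million tags (back-of-envelope only; either way you keep the tag text in the row to weed out collisions):

    n = 1_000_000

    # 16-bit hash: only 65,536 distinct values, so values are heavily shared
    print(n / 2**16)                 # ~15 tags per 16-bit value on average

    # 32-bit hash: expected colliding pairs is roughly n^2 / (2 * 2^32)
    print(n * (n - 1) / 2 / 2**32)   # ~116 colliding pairs among a million tags
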
Sorry, that's true of course. Somehow I didn't think of hashes that way. Thanks for teaching :-)
Flim
Re the problems: databases are very adept at matching values; it's largely what they do besides storing data. If you store all your text as hashed words, then matching new tags is very efficient, assuming you index the hashed words. The issue with multiple words is not difficult either, since you can search for the three (hashed) words in "Nine Inch Nails" and then check that they are adjacent and in the right order in the text.
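
Roughly, in code (sqlite and all the names here are just for illustration; hash collisions are ignored for brevity): store each text as (text_id, position, word hash) rows, then a multi-word tag such as "Nine Inch Nails" is found by looking up its first word's hash and checking that the remaining words sit at the next positions.

    import re
    import sqlite3
    import zlib

    def word_hash(word):
        return zlib.crc32(word.lower().encode("utf-8"))

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE text_words (text_id INTEGER, pos INTEGER, hash INTEGER)")
    conn.execute("CREATE INDEX idx_text_words_hash ON text_words (hash)")

    def store_text(text_id, text):
        # store the text once as hashed words with their positions
        words = re.findall(r"\w+", text)
        conn.executemany(
            "INSERT INTO text_words (text_id, pos, hash) VALUES (?, ?, ?)",
            [(text_id, i, word_hash(w)) for i, w in enumerate(words)],
        )

    def texts_containing(tag):
        words = tag.split()
        # indexed lookup for the first word of the tag
        first = conn.execute(
            "SELECT text_id, pos FROM text_words WHERE hash = ?",
            (word_hash(words[0]),),
        ).fetchall()
        hits = set()
        for text_id, pos in first:
            # the remaining words must be adjacent and in the right order
            ok = all(
                conn.execute(
                    "SELECT 1 FROM text_words WHERE text_id = ? AND pos = ? AND hash = ?",
                    (text_id, pos + k, word_hash(w)),
                ).fetchone()
                for k, w in enumerate(words[1:], start=1)
            )
            if ok:
                hits.add(text_id)
        return hits

    store_text(1, "They released a new Nine Inch Nails song")
    print(texts_containing("Nine Inch Nails"))   # {1}

So adding a new tag later only means one indexed query per tag, not a re-parse of every stored text.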