I'm in need of some inspiration. For a hobby project I am playing with content analysis. I am basically trying to analyze input to match it to a topic map.

For example:

  • "The way on Iraq" > History, Middle East
  • "Halloumni" > Food, Middle East
  • "BMW" > Germany, Cars
  • "Obama" > USA
  • "Impala" > USA, Cars
  • "The Berlin Wall" > History, Germany
  • "Bratwurst" > Food, Germany
  • "Cheeseburger" > Food, USA
  • ...

I've been reading a lot about taxonomy, and whatever I read concludes that everyone tags differently and therefore the system is bound to fail.

I thought about tokenized input and stop-word lists, but they are of course a lot of work to come up with and build. Building the relevant links between words and topics seems exhausting and never-ending, because whatever language you deal with is very rich, and most languages also rely heavily on context. And that's before even thinking about maintaining it.

I guess I need to come up with something smart and train it with topics I want it to be able to guess. Kind of like an Eliza bot.

Anyway, I don't believe there is something that does this out of the box, but does anyone have any leads or examples of technology I could use to analyze input and extract meaning?

A: 

Sounds like you're looking for a Bayesian Network implementation. You may get by using something like Solr.

Also check out CI-Bayes. Joseph Ottinger wrote an article about it on theserverside.net earlier this year.
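To make the Bayesian suggestion a bit more concrete, here is a minimal sketch of a multinomial naive Bayes classifier in plain Python. Note that this is a simpler relative of a full Bayesian network, and it is not the CI-Bayes or Solr API. The class name `NaiveBayesTopics`, the tiny stop-word list, and the training phrases are all made up for illustration; only the topic names come from the question.

```python
import math
import re
from collections import Counter, defaultdict

# Tiny illustrative stop-word list (an assumption, not a real resource).
STOP_WORDS = {"the", "a", "an", "on", "of", "in", "and"}


def tokenize(text):
    """Lowercase, keep runs of letters, drop stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


class NaiveBayesTopics:
    """Hypothetical helper: multinomial naive Bayes with Laplace smoothing."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # topic -> word -> count
        self.doc_counts = Counter()              # topic -> number of training texts
        self.vocab = set()

    def train(self, text, topics):
        tokens = tokenize(text)
        self.vocab.update(tokens)
        for topic in topics:
            self.doc_counts[topic] += 1
            self.word_counts[topic].update(tokens)

    def scores(self, text):
        tokens = tokenize(text)
        total_docs = sum(self.doc_counts.values())
        result = {}
        for topic, n_docs in self.doc_counts.items():
            log_p = math.log(n_docs / total_docs)  # prior for this topic
            total_words = sum(self.word_counts[topic].values())
            for tok in tokens:
                # Add-one smoothing so unseen words do not zero out the score.
                seen = self.word_counts[topic][tok]
                log_p += math.log((seen + 1) / (total_words + len(self.vocab)))
            result[topic] = log_p
        return result

    def guess(self, text, n=2):
        """Return the n highest-scoring topics."""
        s = self.scores(text)
        return sorted(s, key=s.get, reverse=True)[:n]


# Made-up training phrases, tagged with topics from the question.
clf = NaiveBayesTopics()
clf.train("Bratwurst sausage with mustard", ["Food", "Germany"])
clf.train("Cheeseburger with fries", ["Food", "USA"])
clf.train("BMW builds cars in Munich", ["Germany", "Cars"])
clf.train("Chevrolet Impala sedan", ["USA", "Cars"])

print(clf.guess("grilled bratwurst sausage"))
print(clf.guess("Impala sedan"))
```

With this little training data the guesses are only as good as the word overlap, so in practice you would feed it far more text per topic.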

cwash
Can you point out some documentation on Solr that highlights this feature? I couldn't find anything by searching their documentation.
Till
Solr is really an enterprise search server (somewhat analogous to the Google Search Appliance), but what you were describing sounded to me like its faceted search feature. Check this out for some more info: http://people.apache.org/~hossman/apachecon2006us/faceted-searching-with-solr.pdf
cwash
+1  A: 

Hiya. I'd first look to OpenCalais (from the Reuters guys) for finding entities within texts or input. It's great, and I've used it plenty myself.

After that you can analyze the text further, creating associations between entities and words. I'd probably look them up in something like WordNet and try to typify them, or even auto-generate some ontology that matches the domain you're trying to map.

As to how to pull it all together, there are many things you can do: the above, or two- or three-pass models of trying to figure out what words are and mean. Or, if you control the input, make up a format that is easier to parse, or go down the murky path of NLP (which is a lot of fun).

Or you could look to something like Jena for parsing arbitrary RDF snippets, although I don't like the RDF premise myself (I'm a Topic Mapper). I've written stuff that looks up words or phrases or names in Wikipedia and rates each hit based on the semantics found in the Wikipedia pages, i.e. number of links, number of "See also" entries, amount of text, how big the discussion page is, etc. (I could tell you more of the details if requested, but isn't it more fun to work it out yourself and come up with something better than mine? :)
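As a rough illustration of the Wikipedia rating idea, here is a minimal sketch that scores candidate pages from the four signals listed above. It is not the code described in this answer: the `PageStats` fields, the weights, and the numbers in the example are all invented for illustration, and fetching the real statistics from Wikipedia is left out entirely.

```python
import math
from dataclasses import dataclass


@dataclass
class PageStats:
    """Hypothetical container for per-page signals, filled in by hand here."""
    title: str
    links: int       # number of links on the page
    see_also: int    # number of "See also" entries
    text_chars: int  # amount of article text, in characters
    talk_chars: int  # size of the discussion page, in characters


# Invented weights; tune them against pages you have already judged yourself.
WEIGHTS = {"links": 1.0, "see_also": 2.0, "text_chars": 0.5, "talk_chars": 0.5}


def hit_score(page):
    """Weighted sum of log-scaled signals, so huge pages do not dominate."""
    return sum(w * math.log1p(getattr(page, name)) for name, w in WEIGHTS.items())


def best_match(candidates):
    """Pick the candidate page with the highest score."""
    return max(candidates, key=hit_score)


# Made-up numbers for two pages the word "Impala" might resolve to.
candidates = [
    PageStats("Chevrolet Impala", links=400, see_also=8, text_chars=60000, talk_chars=20000),
    PageStats("Impala (antelope)", links=150, see_also=3, text_chars=25000, talk_chars=4000),
]
print(best_match(candidates).title)
```

The log scaling is just one possible design choice; a plain weighted sum or a rank-based score would fit the same idea.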

I've written tons of stuff over the years (even in PHP and Perl; look to Robert Barta's Topic Maps stuff on CPAN, especially the TM modules, for some kick-ass stuff), from engines to parsers to something weird in the middle: associative arrays which break words and phrases apart, creating cumulative histograms to sort their components out, and so forth. It's all fun stuff, but as to shrink-wrapped tools, I'm not so sure. Everyone's goals and needs seem to be different. It depends on how complex and sophisticated you want to become.
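One possible reading of the associative-array and cumulative-histogram idea, sketched in Python rather than PHP or Perl. This is a guess at the general shape, not the code mentioned above: the function names, the choice of words plus word pairs as "components", and the sample phrases are all assumptions.

```python
from collections import Counter
from itertools import accumulate


def components(phrase):
    """Break a phrase into single words plus adjacent word pairs."""
    words = phrase.lower().split()
    pairs = [" ".join(pair) for pair in zip(words, words[1:])]
    return words + pairs


def cumulative_histogram(phrases):
    """Count components across all phrases, most frequent first.

    Returns (component, count, cumulative share of all counts) tuples.
    """
    counts = Counter()  # the associative array: component -> count
    for phrase in phrases:
        counts.update(components(phrase))
    ranked = counts.most_common()
    total = sum(counts.values())
    running = accumulate(count for _, count in ranked)
    return [(comp, count, cum / total) for (comp, count), cum in zip(ranked, running)]


# Made-up sample input.
hist = cumulative_histogram(["the berlin wall", "berlin wall history", "wall street"])
for comp, count, share in hist[:3]:
    print(f"{comp!r}: {count} ({share:.0%} cumulative)")
```

Components near the top of the histogram are the ones worth linking to topics first; the long tail can wait.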

Anyway, hope this helps a little. Cheers! :)

AlexanderJohannesen
I think I tried to sign up multiple times. They still owe me a password. But I guess I'll try again and let you know how it turns out. Thanks very much!
Till
I know this took a while to accept -- we've been using a lot of OpenCalais so far. Thanks again for all the suggestions. :)
Till
+1  A: 

SemanticHacker does exactly what you want, out-of-the-box, and has a friendly API. It's somewhat inaccurate on short phrases, but just perfect for long texts.

  • “The way on Iraq” > Society/Issues/Warfare and Conflict/Specific Conflicts
  • “Halloumni” > N/A
  • “BMW” > Recreation/Motorcycles/Makes and Models
  • “Obama” > Society/Politics/Conservatism
  • “Impala” > Recreation/Autos/Makes and Models/Chevrolet
  • “The Berlin Wall” > Regional/Europe/Germany/States
  • “Bratwurst” > Home/Cooking/Meat
  • “Cheeseburger” > Home/Cooking/Recipe Collections; Regional/North America/United States/Maryland/Localities
apostlion
This looks the most promising out of all the suggestions on here and from my own research. Thank you very much, I'll keep you posted.
Till