views:

945

answers:

3

Hi,

I was wondering how a semantic service like Open Calais figures out the names of companies, people, tech concepts, keywords, etc. from a piece of text. Is it because they have a large database that they match the text against?

How would a service like Zemanta know what images to suggest to a piece of text for instance?

I was hoping someone could shed some light on this. Thanks a lot, Marco

A: 

Open Calais probably uses language-parsing technology and language statistics to guess which words or phrases are names, places, companies, etc. Then it is just another step to do some kind of search for those entities and return metadata.

Zemanta probably does something similar, but matches the phrases against metadata attached to images in order to find related results.
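To make that concrete, here is a minimal sketch of tag-overlap matching in Python (the image index and tags below are made-up examples, not Zemanta's actual data or API):

    # Score candidate images by the overlap between phrases extracted
    # from the text and the metadata tags attached to each image.
    def suggest_images(phrases, image_index, top_n=3):
        wanted = {p.lower() for p in phrases}
        scored = []
        for image, tags in image_index.items():
            overlap = wanted & {t.lower() for t in tags}
            if overlap:
                scored.append((len(overlap), image))
        scored.sort(reverse=True)
        return [image for _, image in scored[:top_n]]

    index = {"eiffel.jpg": ["Eiffel Tower", "Paris", "landmark"],
             "louvre.jpg": ["Louvre", "Paris", "museum"]}
    print(suggest_images(["Paris", "museum"], index))
    # -> ['louvre.jpg', 'eiffel.jpg']

A real system would presumably weight rare tags more heavily (TF-IDF style) rather than count raw overlap.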

It certainly isn't easy.

EndangeredMassa
+4  A: 

I'm not familiar with the specific services listed, but the field of natural language processing has developed a number of techniques that enable this sort of information extraction from general text. As Sean stated, once you have candidate terms, it's not too difficult to search for those terms together with some of the other entities in context, and then use the results of that search to determine how confident you are that the extracted term is an actual entity of interest.

OpenNLP is a great project if you'd like to play around with natural language processing. The capabilities you've named would probably be best accomplished with Named Entity Recognizers (NER) (algorithms that locate proper nouns, generally, and sometimes dates as well) and/or Word Sense Disambiguation (WSD) (e.g., the word 'bank' has different meanings depending on its context, and that can be very important when extracting information from text: given the sentences "the plane banked left", "the snow bank was high", and "they robbed the bank", you can see how disambiguation plays an important part in language understanding).
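To get a feel for both, here is a small sketch using a current release of NLTK (mentioned below); the commented-out downloads are one-time setup:

    import nltk
    from nltk.wsd import lesk

    # One-time model downloads:
    # nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')
    # nltk.download('maxent_ne_chunker'); nltk.download('words')
    # nltk.download('wordnet')

    # NER: tokenize, POS-tag, then chunk named entities.
    tokens = nltk.word_tokenize("John Smith works for Acme Corporation in London.")
    tree = nltk.ne_chunk(nltk.pos_tag(tokens))
    for subtree in tree.subtrees():
        if subtree.label() != 'S':
            print(subtree.label(), ' '.join(w for w, t in subtree.leaves()))
    # e.g. PERSON John Smith / ORGANIZATION Acme Corporation / GPE London

    # WSD: simplified Lesk picks the WordNet sense of 'bank' that best
    # matches each context (short contexts can fool it, though).
    for sent in ["they robbed the bank", "the snow bank was high"]:
        print(sent, '->', lesk(nltk.word_tokenize(sent), 'bank'))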

Techniques generally build on each other, and NER is one of the more complex tasks. To do NER successfully, you will generally need accurate tokenizers (natural language tokenizers, mind you -- statistical approaches tend to fare best), stemmers (algorithms that conflate similar words to common roots, so that words like 'informant' and 'informer' are treated equally), sentence detection ('Mr. Jones was tall.' is only one sentence, so you can't just check for punctuation), part-of-speech (POS) taggers, and WSD.
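With NLTK again, those lower pipeline stages look roughly like this (a sketch, not production code):

    from nltk import sent_tokenize, word_tokenize, pos_tag
    from nltk.stem import PorterStemmer

    text = "Mr. Jones was tall. He informed the informant."

    # Sentence detection: the statistical Punkt model knows 'Mr.' is an
    # abbreviation, so this yields two sentences, not three.
    sentences = sent_tokenize(text)

    # Tokenization and POS tagging, one sentence at a time.
    tagged = [pos_tag(word_tokenize(s)) for s in sentences]

    # Stemming: 'informed' and 'informant' both conflate to 'inform'.
    stemmer = PorterStemmer()
    print([stemmer.stem(w) for w in word_tokenize(sentences[1])])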

There is a Python toolkit, NLTK (http://nltk.sourceforge.net), that covers much of the same ground, but I don't have much experience with it yet. Most of my work has been with the Java and C# ports, which work well.

All of these algorithms are language-specific, of course, and they can take significant time to run (although it is generally faster than reading the material you are processing). Since the state of the art is largely based on statistical techniques, there is also a considerable error rate to take into account. Furthermore, because the error rate impacts every stage, and something like NER requires numerous stages of processing (tokenize -> sentence detect -> POS tag -> WSD -> NER), the error rates compound.
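To see how quickly that compounds, assume some per-stage accuracies (the numbers here are purely illustrative):

    # Hypothetical accuracies for tokenize -> sentence detect -> POS tag
    # -> WSD -> NER; if the errors are independent, they multiply.
    stages = [0.99, 0.98, 0.96, 0.90, 0.92]
    accuracy = 1.0
    for a in stages:
        accuracy *= a
    print(round(accuracy, 2))  # 0.77 -- far worse than any single stage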

rcreswick
+4  A: 

Hi Marco,

Michal Finkelstein from OpenCalais here.

First, thanks for your interest. I'll reply here, but I also encourage you to read more on the OpenCalais forums; there's a lot of information there, including (but not limited to):

http://opencalais.com/tagging-information
http://opencalais.com/how-does-calais-learn

Also feel free to follow us on Twitter (@OpenCalais) or to email us at [email protected].

Now to the answer:

OpenCalais is based on a decade of research and development in the fields of Natural Language Processing and Text Analytics.

We support the full "NLP Stack" (as we like to call it): from text tokenization, morphological analysis, and POS tagging, to shallow parsing and identifying nominal and verbal phrases.
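As a generic illustration of the shallow-parsing step (an NLTK sketch of noun-phrase chunking, not our actual implementation):

    import nltk

    # Shallow parsing: chunk noun phrases over POS tags with a simple grammar.
    grammar = "NP: {<DT>?<JJ>*<NN.*>+}"
    chunker = nltk.RegexpParser(grammar)
    tagged = nltk.pos_tag(nltk.word_tokenize("The large bank approved the loan."))
    for subtree in chunker.parse(tagged).subtrees(filter=lambda t: t.label() == 'NP'):
        print(' '.join(w for w, t in subtree.leaves()))
    # -> 'The large bank', then 'the loan'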

Semantics come into play when we look for Entities (a.k.a. Entity Extraction, Named Entity Recognition). For that purpose we have a sophisticated rule-based system that combines discovery rules as well as lexicons/dictionaries. This combination allows us to identify names of companies/persons/films, etc., even if they don't exist in any available list.
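A toy sketch of the rules-plus-lexicon idea (vastly simplified compared to a real rule system):

    import re

    # Lexicon lookup: known company names.
    LEXICON = {"Acme Corp", "Globex"}

    # Discovery rule: a capitalized name followed by a corporate suffix is
    # probably a company, even if it appears in no lexicon.
    RULE = re.compile(r'\b([A-Z][a-z]+(?: [A-Z][a-z]+)*) (Inc|Ltd|Corp|LLC)\.?')

    def find_companies(text):
        found = {name for name in LEXICON if name in text}
        found |= {' '.join(m.groups()) for m in RULE.finditer(text)}
        return found

    print(find_companies("Initech Inc. sued Globex over the contract."))
    # -> {'Initech Inc', 'Globex'}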

For the most prominent entities (such as people and companies) we also perform anaphora resolution, cross-referencing, and name canonicalization/normalization at the article level, so we'll know that 'John Smith' and 'Mr. Smith', for example, are likely referring to the same person. So the short answer to your question is: no, it's not just about matching against large databases.
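A toy sketch of that kind of article-level normalization (hypothetical logic, just to show the idea):

    # Link shorter mentions like 'Mr. Smith' or 'Smith' back to the
    # fullest name in the article that shares the same surname.
    def canonicalize(mentions):
        full = [m for m in mentions if len(m.replace('Mr. ', '').split()) > 1]
        return {m: next((f for f in full if f.split()[-1] == m.split()[-1]), m)
                for m in mentions}

    print(canonicalize(["John Smith", "Mr. Smith", "Smith"]))
    # -> every mention maps to 'John Smith'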

Events/Facts are really interesting because they take our discovery rules one level deeper: we find relations between entities and label them with the appropriate type, for example M&As (relations between two or more companies), Employment Changes (relations between companies and people), and so on. Needless to say, Event/Fact extraction is not possible for systems based solely on lexicons. For the most part, our system is tuned to be precision-oriented, but we always try to keep a reasonable balance between precision and coverage.
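The shape of such an event rule, in toy form (again, not our actual rules):

    import re

    # Once entities are typed, event rules fire on patterns over them,
    # e.g. PERSON + employment verb + COMPANY -> EmploymentChange.
    PATTERN = re.compile(r'(?P<person>[A-Z][a-z]+ [A-Z][a-z]+) '
                         r'(?P<verb>joined|left) '
                         r'(?P<company>[A-Z][A-Za-z]+)')

    m = PATTERN.search("Jane Doe joined Initech as CFO.")
    if m:
        print({'type': 'EmploymentChange', **m.groupdict()})
    # -> {'type': 'EmploymentChange', 'person': 'Jane Doe',
    #     'verb': 'joined', 'company': 'Initech'}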

By the way, there are some cool new metadata capabilities coming out later this month, so stay tuned.

Regards,

Michal