Could you recommend a training path to get started with, and become very good at, Information Extraction? I started reading about it for one of my hobby projects and soon realized that I would have to be good at math (algebra, statistics, probability). I have read some introductory books on different math topics (and it's so much fun). I'm looking for some guidance. Please help.

Update: Just to answer one of the comments: I am more interested in Text Information Extraction.

+1  A: 

The Wikipedia Information Extraction article is a quick introduction.

At a more academic level, you might want to skim a paper like Integrating Probabilistic Extraction Models and Data Mining to Discover Relations and Patterns in Text.

Jeff Moser
+3  A: 

I would recommend the excellent book Introduction to Information Retrieval by Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze. It covers a broad area of issues which form a great and up-to-date (2008) basis for Information Extraction and is available online in full text (under the given link).

Fabian Steeg
The book is on information retrieval not extraction...
StackUnderflow
Yes, but as I write in my answer, I believe it covers areas that form a solid basis for Information Extraction. You asked for a place to start.
Fabian Steeg
I was going to recommend the Introduction to Information Retrieval book, but I only have the PDF and not the URL for where I found it. Thanks for posting the link.
John D. Cook
A: 

This is a little off topic, but you might want to read Programming Collective Intelligence from O'Reilly. It deals indirectly with text information extraction, and it doesn't assume much of a math background.

John D. Cook
+5  A: 

Just to answer one of the comments: I am more interested in Text Information Extraction.

Depending on the nature of your project, natural language processing and computational linguistics can both come in handy: they provide tools to measure and extract features from textual information, and to apply training, scoring, or classification. Good introductory books include O'Reilly's Programming Collective Intelligence (the chapters on "Searching and Ranking", "Document Filtering", and maybe decision trees).

Suggested projects utilizing this knowledge: POS (part-of-speech) tagging and named entity recognition (the ability to recognize names, places, and dates in plain text). You can use Wikipedia as a training corpus, since most of the target information is already extracted in infoboxes; this might provide you with a limited amount of measurement feedback.
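To make the "measurement feedback" idea concrete, here is a minimal sketch in Python. It is only an illustration: the regular expression, the function names, and the hand-written "infobox" set are all assumptions, and a real project would parse actual infoboxes and use a far more robust extractor. The sketch pulls dates out of plain text and scores them against the known values with precision and recall.

```python
import re

# Assumption: we only recognize dates written like "14 March 1879".
MONTHS = ("January|February|March|April|May|June|July|"
          "August|September|October|November|December")
DATE_RE = re.compile(r"\b\d{1,2} (?:" + MONTHS + r") \d{4}\b")


def extract_dates(text):
    """Return the set of date strings the pattern finds in the text."""
    return {match.group(0) for match in DATE_RE.finditer(text)}


def precision_recall(predicted, gold):
    """Score extracted items against a gold set (e.g. infobox values)."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Illustrative article text; the second date uses a format the pattern misses.
article = "Albert Einstein was born on 14 March 1879 in Ulm. He died on April 18, 1955."
# Hand-written stand-in for values that would come from an infobox.
infobox_dates = {"14 March 1879", "18 April 1955"}

found = extract_dates(article)
precision, recall = precision_recall(found, infobox_dates)
print(found, precision, recall)
```

Everything the extractor finds here is correct, but it misses the second date because of the different format, so recall drops. That gap is exactly the kind of feedback the infoboxes can give you.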

The other big hammer in IE is search, a field not to be underestimated. Again, the O'Reilly book provides some introduction to basic ranking; once you have a large corpus of indexed text, you can do some real IE tasks with it. Check out Peter Norvig's Theorizing from Data as a starting point and a very good motivator; maybe you could reimplement some of its results as a learning exercise.
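For a feel of what "basic ranking" means, here is a small TF-IDF sketch in plain Python. It is not taken from the book; the whitespace tokenizer, the function names, and the three toy documents are assumptions made for illustration, and real search engines add indexing, normalization, and better weighting.

```python
import math
from collections import Counter


def tokenize(text):
    """Very naive tokenizer: lowercase and split on whitespace."""
    return text.lower().split()


def rank(query, documents):
    """Return document indices ordered by TF-IDF score, best first."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    scores = []
    for index, tokens in enumerate(tokenized):
        term_freq = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term in term_freq:
                tf = term_freq[term] / len(tokens)
                idf = math.log(n_docs / doc_freq[term])
                score += tf * idf
        scores.append((score, index))

    # Highest score first; ties are broken by original position.
    return [index for score, index in sorted(scores, key=lambda s: (-s[0], s[1]))]


docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "information extraction from text",
]
ranking = rank("cat mat", docs)
print(ranking)
```

The first document matches both query terms, and "mat" is rarer than "cat" across the collection, so it gets the highest score.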

As a forewarning, I think I'm obligated to tell you that information extraction is hard. The first 80% of any given task is usually trivial; however, the difficulty of each additional percentage point usually grows exponentially, in both development and research time. The field is also quite underdocumented: most of the high-quality information is currently in obscure white papers (Google Scholar is your friend). Do check them out once you've burned your hand a couple of times. But most importantly, do not let these obstacles throw you off; there are certainly big opportunities to make progress in this area.

Silver Dragon
+2  A: 

I would suggest you take a look at the Natural Language Toolkit (nltk) and the NLTK Book. Both are available for free and are great learning tools.

theycallmemorty
+1  A: 

Take a look here if you need an enterprise-grade NER service. Developing a NER system (and its training sets) is a very time-consuming task that requires a high level of skill.

Mark
+1  A: 

I disagree with the people who recommend reading Programming Collective Intelligence. If you want to do anything of even moderate complexity, you need to be good at applied math, and PCI gives you a false sense of confidence. For example, when it talks about SVMs, it just says that libSVM is a good way of implementing them. Now, libSVM is definitely a good package, but who cares about packages? What you need to know is why SVMs give the terrific results that they do and how they are fundamentally different from the Bayesian way of thinking (and how Vapnik is a legend).

IMHO, there is no one solution to this. You should have a good grip on linear algebra, probability, and Bayesian theory. Bayes, I should add, is as important for this as oxygen is for human beings (it's a little exaggerated, but you get what I mean, right?). Also, get a good grip on machine learning. Just using other people's work is perfectly fine, but the moment you want to know why something was done the way it was, you will have to know something about ML.

Check these two for that:

http://pindancing.blogspot.com/2010/01/learning-about-machine-learniing.html

http://measuringmeasures.com/blog/2010/1/15/learning-about-statistical-learning.html

http://measuringmeasures.com/blog/2010/3/12/learning-about-machine-learning-2nd-ed.html

Okay, now that's three of them :) / Cool
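As a small taste of the Bayesian way of thinking mentioned above, here is a sketch of a naive Bayes text classifier in plain Python. The function names, the add-one smoothing, and the four toy training sentences are assumptions made for illustration; it is nowhere near a production classifier, but it shows Bayes' rule doing the work: pick the label that maximizes P(label) times the product of P(word | label).

```python
import math
from collections import Counter, defaultdict


def train(examples):
    """Count labels and per-label words from (text, label) pairs."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in examples:
        label_counts[label] += 1
        for word in text.lower().split():
            word_counts[label][word] += 1
            vocabulary.add(word)
    return label_counts, word_counts, vocabulary


def predict(text, model):
    """Return the label with the highest posterior log-probability."""
    label_counts, word_counts, vocabulary = model
    total_examples = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in label_counts.items():
        # log P(label) + sum of log P(word | label), with add-one smoothing
        score = math.log(count / total_examples)
        denominator = sum(word_counts[label].values()) + len(vocabulary)
        for word in text.lower().split():
            score += math.log((word_counts[label][word] + 1) / denominator)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy training data, invented for this example.
training = [
    ("born in 1879 in Ulm", "biography"),
    ("died in 1955 in Princeton", "biography"),
    ("add flour and two eggs", "recipe"),
    ("bake the flour mix for an hour", "recipe"),
]
model = train(training)
label = predict("born in Princeton", model)
print(label)
```

Working through why the smoothing term is needed, and what the "naive" independence assumption costs you, is a good first exercise before moving on to SVMs.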

crazyaboutliv