views:

572

answers:

8

Assume you know a student who wants to study Machine Learning and Natural Language Processing.

What introductory subjects would you recommend?

Example: I'm guessing that knowing Prolog and Matlab might help him. He also might want to study Discrete Structures*, Calculus, and Statistics.

*Graphs and trees. Functions: properties, recursive definitions, solving recurrences. Relations: properties, equivalence, partial order. Proof techniques, inductive proof. Counting techniques and discrete probability. Logic: propositional calculus, first-order predicate calculus. Formal reasoning: natural deduction, resolution. Applications to program correctness and automatic reasoning. Introduction to algebraic structures in computing.

+2  A: 

How about Markdown and an Introduction to Parsing Expression Grammars (PEG) posted by cletus on his site cforcoding?

ANTLR seems like a good place to start for natural language processing. I'm no expert though.

Binary Nerd
ANTLR and similar tools are oriented at formal languages, with well-defined, unambiguous grammars. NLP researchers have been trying to use these kinds of tools for decades, only to find that they simply don't work.
larsmans
+5  A: 

I would say probabily & statistics is the most important prerequisite. Especially Gaussian Mixture Models (GMMs) and Hidden Markov Models (HMMs) are very important both in machine learning and natural language processing (of course these subjects may be part of the course if it is introductory).

Then, I would say basic CS knowledge is also helpful, for example Algorithms, Formal Languages and basic Complexity theory.

3lectrologos
+4  A: 

String algorithms, including suffix trees. Calculus and linear algebra. Varying varieties of statistics. Artificial intelligence optimization algorithms. Data clustering techniques... and a million other things. This is a very active field right now, depending on what you intend to do.

It doesn't really matter what language you choose to operate in. Python, for instance has the NLTK, which is a pretty nice free package for tinkering with computational linguistics.

San Jacinto
+5  A: 
Fabian Steeg
+2  A: 

Broad question, but I certainly think that a knowledge of finite state automata and hidden Markov models would be useful. That requires knowledge of statistical learning, Bayesian parameter estimation, and entropy.

Latent semantic indexing is a commonly yet recently used tool in many machine learning problems. Some of the methods are rather easy to understand. There are a bunch of potential basic projects.

  1. Find co-occurrences in text corpora for document/paragraph/sentence clustering.
  2. Classify the mood of a text corpus.
  3. Automatically annotate or summarize a document.
  4. Find relationships among separate documents to automatically generate a "graph" among the documents.

EDIT: Nonnegative matrix factorization (NMF) is a tool that has grown considerably in popularity due to its simplicity and effectiveness. It's easy to understand. I currently research the use of NMF for music information retrieval; NMF has shown to be useful for latent semantic indexing of text corpora, as well. Here is one paper. PDF

Steve
+3  A: 

Jurafsky and Martin's Speech and Language Processing http://www.amazon.com/Speech-Language-Processing-Daniel-Jurafsky/dp/0131873210/ is very good. Unfortunately the draft second edition chapters are no longer free online now that it's been published :(

Also, if you're a decent programmer it's never too early to toy around with NLP programs. NLTK comes to mind (Python). It has a book you can read free online that was published (by OReilly I think).

Yuvi Masory
+13  A: 

This related stackoverflow question has some nice answers: What are good starting points for someone interested in natural language processing?

This is a very big field. The prerequisites mostly consist of probability/statistics, linear algebra, and basic computer science, although Natural Language Processing requires a more intensive computer science background to start with (frequently covering some basic AI). Regarding specific langauges: Lisp was created "as an afterthought" for doing AI research, while Prolog (with it's roots in formal logic) is especially aimed at Natural Language Processing, and many courses will use Prolog, Scheme, Matlab, R, or another functional language (e.g. OCaml is used for this course at Cornell) as they are very suited to this kind of analysis.

Here are some more specific pointers:

For Machine Learning, Stanford CS 229: Machine Learning is great: it includes everything, including full videos of the lectures (also up on iTunes), course notes, problem sets, etc., and it was very well taught by Andrew Ng.

Note the prerequisites:

Students are expected to have the following background: Knowledge of basic computer science principles and skills, at a level sufficient to write a reasonably non-trivial computer program. Familiarity with the basic probability theory. Familiarity with the basic linear algebra.

The course uses Matlab and/or Octave. It also recommends the following readings (although the course notes themselves are very complete):

For Natural Language Processing, the NLP group at Stanford provides many good resources. The introductory course Stanford CS 224: Natural Language Processing includes all the lectures online and has the following prerequisites:

Adequate experience with programming and formal structures. Programming projects will be written in Java 1.5, so knowledge of Java (or a willingness to learn on your own) is required. Knowledge of standard concepts in artificial intelligence and/or computational linguistics. Basic familiarity with logic, vector spaces, and probability.

Some recommended texts are:

The prerequisite computational linguistics course requires basic computer programming and data structures knowledge, and uses the same text books. The required articificial intelligence course is also available online along with all the lecture notes and uses:

This is the standard Artificial Intelligence text and is also worth reading.

I use R for machine learning myself and really recommend it. For this, I would suggest looking at The Elements of Statistical Learning, for which the full text is available online for free. You may want to refer to the Machine Learning and Natural Language Processing views on CRAN for specific functionality.

Shane
+4  A: 

Stanford CS 224: Natural Language Processing course that was mentioned already includes also videos online (in addition to other course materials). The videos aren't linked to on the course website, so many people may not notice them.

michau
+1 Wow, old question, but thanks for the update.
Stephano