views: 93
answers: 4

What would be the best regular expression for tokenizing an English text?

By an English token, I mean an atom consisting of the maximum number of characters that can be meaningfully used for NLP purposes. An analogy is a "token" in any programming language (e.g. in C, '{', '[', 'hello', '&', etc. can be tokens). One restriction: though English punctuation characters can be "meaningful", let's ignore them for the sake of simplicity when they do not appear in the middle of \w+. So "Hello, world." yields 'hello' and 'world'; similarly, "You are good-looking." may yield either [you, are, good-looking] or [you, are, good, looking].
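For concreteness, here is a naive sketch of the behaviour I have in mind (Python; the function name is just for illustration, not a proposed answer):

    import re

    # Naive sketch of the desired output: runs of \w characters,
    # optionally joined by internal hyphens, lowercased.
    def naive_tokens(text):
        return [t.lower() for t in re.findall(r"\w+(?:-\w+)*", text)]

    print(naive_tokens("Hello, world."))          # ['hello', 'world']
    print(naive_tokens("You are good-looking."))  # ['you', 'are', 'good-looking']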

+1  A: 

You probably shouldn't try to use a regular expression for tokenizing English text. In English, some tokens have several different meanings, and you can only know which is right by understanding the context in which they appear; that requires understanding the meaning of the text to some extent. Examples:

  • The character ' could be an apostrophe or it could be used as a single-quote to quote some text.
  • The period could be the end of a sentence or it could signify an abbreviation. Or in some cases it could fulfil both roles simultaneously.

Try a natural language parser instead. For example, you could use the Stanford Parser. It is free to use and will do a much better job than any regular expression at tokenizing English text. That's just one example, though; there are many other NLP libraries you could use.
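For instance, here's a minimal sketch of what using one of these libraries looks like (NLTK's word_tokenize in Python; assumes NLTK and its punkt model are installed, and the Stanford tools are used similarly from Java):

    # Sketch: Treebank-style tokenization from a free NLP library.
    # Note how the contraction is split on linguistic, not character, grounds.
    from nltk.tokenize import word_tokenize

    print(word_tokenize("You can't judge a book by its cover."))
    # ['You', 'ca', "n't", 'judge', 'a', 'book', 'by', 'its', 'cover', '.']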

Mark Byers
tokenizing != parsing. He's talking about lexing (unless I miss my guess).
Paul Nathan
@Nathan you got that right. Byers is referring to a tagger, which is not my focus.
OTZ
@Paul Nathan: You can't *accurately* tokenize English text using a regular expression. If you only want it to work some of the time and don't care about errors then you can probably get away with using a simple regular expression. If you want it to work most of the time then you need something more powerful. You could keep extending the regex to cover more and more special cases, but seeing as the more powerful solutions already exist and are free, why not just use them from the start?
Mark Byers
@Mark: Pain of integration, for one thing. :-) OP hasn't discussed his target corpus. If it's a basic analysis, a regex will work. If it's for a more precise problem, of course you want a more developed system. At a guess, OP wants a basic hack, since an expert would frame the question much more precisely. Also Perl regexes are not true regexes, they are context-sensitive somethings.
Paul Nathan
A: 

You can split on [^\p{L}]+. It will split on each group of characters that doesn't contain letters.
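For illustration, a sketch in Python (note: \p{L} is not supported by Python's built-in re module, so this assumes the third-party regex module; Java's java.util.regex and Perl support \p{L} natively):

    # \p{L} matches any Unicode letter, so this keeps accented words intact.
    import regex

    text = "Hello, world. Ça va?"
    print([t for t in regex.split(r"[^\p{L}]+", text) if t])
    # ['Hello', 'world', 'Ça', 'va']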


Colin Hebert
What's that \p doing? Which language's regex library are you using?
OTZ
A: 

There are some complexities.

A word will typically match [A-Za-z0-9\-]+. But you may have other delimiters besides whitespace around the word: a token can be preceded by something like [(\s] and followed by [),.\-\s?:;!] (escape the hyphen, or move it to the end of the class, so it isn't parsed as a character range). A rough sketch follows.
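Here is that idea sketched in Python (finding runs of the word class is the complement of splitting on the delimiter classes):

    import re

    # Rough sketch: runs of letters, digits and hyphens are words;
    # everything else (parens, whitespace, punctuation) delimits.
    print(re.findall(r"[A-Za-z0-9\-]+", "(You are good-looking, aren't you?)"))
    # ['You', 'are', 'good-looking', 'aren', 't', 'you']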

Paul Nathan
Noooo. Don't do this. Use \b instead. It matches a word boundary. So this would match a word: \b.+?\b
Rohan Singh
`\b` won't work properly if the word contains non-ASCII characters!
Daniel Vandersluis
@Rohan: That won't work for hyphenated words or apostrophe'd words. Also, this is *not* a full Perl regex. This is a *sample regex* meant to demonstrate in a non-Perl syntax a subset of possibility.
Paul Nathan
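(For reference, a quick Python check of the caveats raised in this comment thread:)

    import re

    # \b.+?\b happily "matches" the hyphen, the space and the apostrophe
    # as words, and plain \w+ splits hyphenated and apostrophe'd words apart.
    print(re.findall(r"\b.+?\b", "good-looking isn't"))
    # ['good', '-', 'looking', ' ', 'isn', "'", 't']
    print(re.findall(r"\w+", "good-looking isn't"))
    # ['good', 'looking', 'isn', 't']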
+1  A: 

Treebank Tokenization

Penn Treebank (PTB) tokenization is a reasonably common tokenization scheme used for natural language processing (NLP) work.

You can find a sed script with the appropriate regular expressions to get this tokenization here.

Software Packages

However, most NLP packages provide ready-to-use tokenizers, so you don't really need to write your own. For example, if you're using Python you can just use the TreebankWordTokenizer provided with NLTK. If you're using the Java-based Stanford Parser, it will by default tokenize any sentence you give it using its edu.stanford.nlp.process.PTBTokenizer class.
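For instance, a short sketch with NLTK's Treebank tokenizer (assumes NLTK is installed):

    # Treebank tokenization splits contractions and separates punctuation.
    from nltk.tokenize import TreebankWordTokenizer

    print(TreebankWordTokenizer().tokenize("They'll save and invest more."))
    # ['They', "'ll", 'save', 'and', 'invest', 'more', '.']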

dmcer
@dmcer Thanks for the pointer to the PTB tokenization method. While they don't enumerate what the "subtleties" around hyphens vs. dashes are, and I'm not sure whether "won't --> wo n't" or "gonna --> gon na" is appropriate, it can be a starting point. +1
OTZ