My graduate research is in Arabic speech recognition. My work involves dealing with text a lot, for different kinds of tasks such as:

  • Cleaning up messy transcriptions. I work with diacritized text, and it is very important that the diacritics are put in the right place. I use lots of regular expressions for that.
  • Experimenting with search algorithms, such as the Viterbi Algorithm, stack search, etc.
  • Building lexical trees, and other NLP structures.
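
For the first task, a minimal Python sketch of regex-based diacritic cleanup might look like the following. It assumes the diacritics of interest are the Arabic harakat in the range U+064B to U+0652, and the two cleanup rules (collapse doubled marks, drop marks with no base letter) are illustrative guesses rather than rules taken from the question. The function names are hypothetical.

```python
import re

# Assumption: the diacritics of interest are the Arabic harakat
# (tanween, short vowels, shadda, sukun), U+064B through U+0652.
DIACRITICS = "\u064B-\u0652"

# A run of two or more copies of the same diacritic, e.g. a doubled fatha.
REPEATED = re.compile(f"([{DIACRITICS}])\\1+")
# Diacritics with no base letter before them (start of text or after whitespace).
ORPHANED = re.compile(f"(^|\\s)[{DIACRITICS}]+")
# Any single diacritic.
ANY_DIACRITIC = re.compile(f"[{DIACRITICS}]")

def clean(text):
    """Collapse repeated diacritics, then drop diacritics that have no base letter."""
    text = REPEATED.sub(r"\1", text)
    return ORPHANED.sub(r"\1", text)

def strip_diacritics(text):
    """Remove every diacritic, leaving only the base letters."""
    return ANY_DIACRITIC.sub("", text)

KAF, TA, BA, FATHA = "\u0643", "\u062A", "\u0628", "\u064E"
doubled = KAF + FATHA + FATHA + TA + BA   # kaf carries the fatha twice
print(clean(doubled) == KAF + FATHA + TA + BA)
```

The patterns are compiled once at module level so that a long transcription file does not pay the compilation cost per line.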

So far I have been using PHP as my scripting language of choice, for no reason except that I'm used to it. I don't like using Java or C# for such a task, since I feel my time gets wasted on object-oriented design more than on writing the logic I want, but maybe that's just my lazy brain.

Anyway, what do you suggest? Shall I stick with PHP, or step back to Perl? Or shall I use Python instead? Or is it better that I warm up my skills in .NET and use C#?

+2  A: 

As much as it makes me cringe (personally), Perl is very common in the NLP/CL community, and there are many interesting and helpful packages in the CPAN. If you meet a computational linguist, if they know one programming language, it's usually Perl (or Prolog...).

My personal suggestion would be Python, especially because of the many good (and fast) libraries for number crunching, like NumPy and SciPy.

If you want to go for one of the really mainstream languages, I'd advise against C#. SO gives a somewhat overblown impression of C#'s importance. In academia, Java is the #1 contender. My perspective on that might again be warped, but at least for Germany, I can say that I haven't seen a curriculum in CL/NLP that includes C#. It's all Java and, to a smaller degree, Perl (on the way out), which is being replaced by Python or Ruby.

EDIT: In case it wasn't clear, I strongly advise against sticking to PHP, simply because there are very, very few people that use it where you're going, at least as far as I can tell.

Torsten Marek
Lisp is also fairly common for NLP, but I don't know that I'd recommend it unless you already were familiar with it.
Adam Bellaire
It's common with the "elderly"; I've seen only a few projects using Lisp that were not started in the 90s. But it's definitely a good idea to be able to read some Lisp.
Torsten Marek
+6  A: 

Of course the "best" language is subjective, but Python would be a good choice. There's a (free) book about doing NLP in Python to get you started - see http://www.nltk.org/book.

Rudd Zwolinski
NLTK is very important, but it doesn't provide much for speech recognition and working with speech data, AFAIK.
Torsten Marek
Well for working with speech data I usually use HTK or CMU-Sphinx, both written in C. What I need for now is just text processing. Thanks for pointing to NLTK, I think it will be really helpful.
Mohamed Ali
+1  A: 

The one you're most familiar with.

Although, in your case, I might have to rephrase that. The one you're most familiar with that isn't PHP.

As a Perlite, I object ever-so-slightly to everyone's advocacy of Python over Perl, especially in this particular field, but any language can do it. Hell, you could use C and try to write a grammar in Flex and Yacc, and it would (theoretically) work (until you get to words that have different meanings in different contexts). But I'm trying not to be biased, so I won't recommend one over the other.

Use what you feel most comfortable with. Except PHP.

Chris Lutz
Come on, let's not turn this into a PHP-bashing contest;)
Torsten Marek
I love PHP. I rather like the fact that a variable can store anything. I find it interesting that hashes are arrays (or rather, arrays are hashes). I even like the long, cumbersome, specialized function names. For websites.
Chris Lutz
+6  A: 

PHP: Regex support is decent, but regexes are stored as strings that must be recompiled with every use. It is possible to segfault PHP using a badly written regular expression, something that's supposed to be impossible for an interpreted language. More complex algorithms and parsing libraries don't really exist for PHP. PHP's simplicity can often be a strength. Once you start writing more serious, long-running applications, you'll find its inscrutably poor garbage collection driving you mad and impossible to debug (something that's supposed not to be an issue for interpreted languages). I've written some relatively complex NLP products in PHP and ended up partly regretting it.

Python: Python is a "good" language; its design is relatively well thought out. Its regex object model is more advanced and often convenient (don't forget the re.DEBUG flag to let you see the entire parse tree for a given regex). However, certain common regex features are completely missing or misbehave, such as atomic groupings, or in-line and nested flags being improperly scoped. Supposedly they're fixing these things for 2.7. There are some good NLP libraries for Python. Python's central package repository system(s) are often inconsistent or unreliable.
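
To illustrate the regex object model mentioned above, here is a small sketch using only the standard `re` module. The token pattern and the `tokenize` helper are made up for the example; `re.compile`, `finditer`, `lastgroup` and `re.DEBUG` are the real standard-library pieces.

```python
import re

# Compile once and reuse: the pattern object keeps its compiled form,
# so it is not re-parsed on every call.
# The token pattern itself is only an illustration.
TOKEN = re.compile(r"(?P<word>\w+)|(?P<punct>[^\w\s])")

def tokenize(text):
    """Return (kind, value) pairs for each word or punctuation mark."""
    return [(m.lastgroup, m.group()) for m in TOKEN.finditer(text)]

print(tokenize("Hello, world!"))

# re.DEBUG prints the parsed structure of the pattern at compile time.
re.compile(r"a(b|c)+d", re.DEBUG)
```

Named groups make it easy to tell which alternative matched, which is handy when a cleanup script has to treat words and punctuation differently.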

Perl: Very popular in the NLP world. Its designers somehow got the idea in their heads that more complex syntax equals better and have been working steadily to make it as indecipherable as possible, leading to an equally steady decline in its popularity. Perl has been almost completely supplanted by PHP for web programming and is slowly losing out to Python for everything else. That said, CPAN is an absolute goldmine of useful NLP libraries (often well-documented, even) and almost makes Perl worth it for this domain.

Java: Java's syntax isn't that complicated; there's just a lot of it. Be prepared to wade through millions of pages of API documentation. Java has a reputation for slowness; I don't know how much this situation has improved lately. Relatively popular for NLP purposes.

C#: .NET runtime is virtually unheard of in NLP circles, despite its plurality on this site. I'm sure it would be more or less up to the task, but you may have difficulty sharing with others in the field.

C++w/Boost/flex+bison: Just for completeness. Probably a bad idea if you're used to PHP. But your programs will be FAST.

Conclusion:

Perl or Python. I would personally go with Python, due to its simpler syntax. I share your loathing of the gratuitous OO of Java and C#. Don't go with PHP; I speak from experience here. Its flaws are pretty serious once you start writing something serious.

ʞɔıu
Thanks for the thorough comparison, I really liked your answer.
Mohamed Ali
One thing that I would like to see improved is the ability to define new regexp "classes". For example, if I want to define vowels as one class, I'd like a shortcut symbol, e.g. <V>, to appear in my regular expressions instead of using variables. Does Python or any other language support such a feature?
Mohamed Ali
Actually, flex/lex supports that, but none of the other languages do.
ʞɔıu
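The <V> shorthand asked about above is not built into Python's `re` module, but it can be approximated by expanding the shorthand into a character class before compiling. A rough sketch follows; the `CLASSES` table, the class contents, and the `expand` helper are all hypothetical names invented for this example.

```python
import re

# Hypothetical shorthand table: the names and contents are illustrative only.
CLASSES = {
    "V": "[aeiou]",
    "C": "[b-df-hj-np-tv-z]",
}

def expand(pattern):
    """Replace each <NAME> token with the character class registered for it.

    Limitation: this naive version would also rewrite the <name> part of a
    (?P<name>...) group, so the two syntaxes should not be mixed.
    """
    return re.sub(r"<(\w+)>", lambda m: CLASSES[m.group(1)], pattern)

# Consonant-vowel-consonant words.
cvc = re.compile(expand(r"\b<C><V><C>\b"))
print(cvc.findall("the cat sat on a mat"))
```

This is a preprocessing trick rather than real language support, which is what flex/lex offers through its named definitions.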
Java has a reputation for slowness? What year are you in? 1995? Are you using Java 0.1?
Geo
For C# you do have some tools for NLP, like Antelope http://www.proxem.com/Default.aspx?tabid=55 (made by Proxem), which is similar to SharpNLP (sharpnlp.codeplex.com).
Junior Mayhé
A: 

Since it seems to be getting mentioned a bunch: lex/yacc/flex/bison/whatever aren't appropriate for the kind of parsing work that needs to be done in NLP work.

Python is certainly on top where I'm at for this sort of work. Java/C#/.net-anything gets laughed at. Perl has been on the way out.

You're going to need to make a three-way tradeoff between 1) what you can get in terms of external modules & code, 2) ease of use for you, and 3) (certainly for speech work) performance. Pick something high level that'll let you make use of C stuff conveniently, and that'll probably address 1 and 3.
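
As one illustration of "high level, but able to use C stuff conveniently", Python's standard `ctypes` module can call into a C library directly. The sketch below calls `strlen` from the C standard library; it assumes a POSIX-like system where the C library can be located, and the `c_strlen` wrapper is a hypothetical name.

```python
import ctypes
import ctypes.util

# Locate the C standard library. If the lookup returns None, CDLL(None)
# falls back to the symbols already loaded into the process (POSIX only).
libc = ctypes.CDLL(ctypes.util.find_library("c"))

# Declare the C signature: size_t strlen(const char *s);
libc.strlen.argtypes = [ctypes.c_char_p]
libc.strlen.restype = ctypes.c_size_t

def c_strlen(data):
    """Length of a NUL-terminated byte string, computed by C's strlen."""
    return libc.strlen(data)

print(c_strlen(b"viterbi"))
```

The same approach works for wrapping a performance-critical routine of your own, compiled as a shared library, while the surrounding text handling stays in the scripting language.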

Jay Kominek
how about Python's performance?
Mohamed Ali
+1  A: 

If I were experimenting in the field of NLP, I would gravitate toward a language more associated with AI work, like Lisp, or a rule-based language like Prolog, or one of the many forward-chaining systems based on the Rete algorithm, such as OPS5.

Noel Walters
+1  A: 

I would suggest going the Python route. There are good, active projects that are widely used in NLP work: NLTK, MontyLingua, etc. I may be biased, though, as I work mostly in Python.

That said, I have known some NLP researchers who know nothing but Perl and write cryptic code that cannot be deciphered by anyone else.

Though writing a prototype is very simple in Perl, no one I know uses it in production.

cnu
+2  A: 

Here are SDKs in around half a dozen languages (Java, Perl, Python, C#, etc.) for doing various natural language processing tasks: named entity extraction, text categorization, keyword extraction, etc. Just use whatever language you're most familiar with (and that has the appropriate libraries/tools you need):

http://www.orch8.net/api/tools.html

A: 

I've had to use PHP for some web-based projects, but each time I rediscover how much I hate it.

I taught myself Python for a project in unsupervised parsing. I haven't tried C for anything like that, but I didn't get the impression that Python's performance was that bad. Running 30 EM iterations--chart parsing all possible dependency trees--on the WSJ-10 took a couple of hours on a regular MacBook. The NLTK project for Python is really nice too; reading some pages of their online book taught me the language in no time.

unhammer
A: 

Also, consider Lisp. If your system doesn't have to interact with anyone else's system, and you want to do something interesting, you could give it a try. While falling out of favor in many sects of mainstream development, Lisp has a long, long history of processing all sorts of data. Natural language processing is one of the big fields of AI study, and most folks who do AI consider Lisp to be the top language for their field.

Robert P
A: 

Wolfram|Alpha seems to think that Mathematica is the best language for NLP. It wouldn't be my first choice, but the key to NLP is probably being able to analyse huge amounts of existing language.

I used to parse sentences just fine in Applesoft Basic. I'm sure a good regex library helps get you started, but it won't help your program infer meaning.

brianegge