We are working on an Arabic Natural Language Processing project and have narrowed our choices to writing the code in either Python or C++ (with the Boost library). We are considering these points:

  • Python

    • Slower than C++ (there is ongoing work to make Python faster)
    • Better UTF-8 support
    • Faster for writing tests and trying different algorithms
  • C++

    • Faster than Python
    • Familiar code, every programmer knows C or C-like code
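
To make the UTF-8 point concrete, here is a minimal Python 3 sketch. The helper name `words_with_suffix` and the sample words are made up for illustration and are not part of the project. In Python 3, `str` holds Unicode code points, so a suffix test works on Arabic letters rather than on raw bytes:

```python
def words_with_suffix(text, suffix):
    """Return the whitespace-separated words in `text` that end with `suffix`."""
    return [word for word in text.split() if word.endswith(suffix)]

# Escapes are used so the example does not depend on the editor's encoding.
kitab = "\u0643\u062a\u0627\u0628"   # Arabic word for "book": 4 letters
qalam = "\u0642\u0644\u0645"         # Arabic word for "pen": 3 letters
text = kitab + " " + qalam

encoded = kitab.encode("utf-8")      # each of these letters takes 2 bytes in UTF-8
print(len(kitab), len(encoded))      # characters vs. bytes
print(words_with_suffix(text, "\u0627\u0628") == [kitab])
```

The same check on a byte-oriented `std::string` in C++ would compare UTF-8 byte sequences, which works for exact suffixes but gives lengths in bytes rather than letters.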

After the project is done, it should not be very hard to port it to other programming languages.

Which do you think is the better fit for the project?

+2  A: 

Familiar code, every programmer knows C or C-like code

Many devs are familiar with C or C-like code, but that doesn't make them competent in C++. Inexperienced C++ devs can do a lot of harm to a project this complex, and you would have to take extra care.

I can't speak for Python, but I've heard it's more beginner-friendly.

I'd say, once again, you should go for the language you (as a team) know best.

f4
Your first point is really interesting, but the second point isn't important for us: we have a good group of programmers in both languages.
Khaled Al Hourani
+5  A: 

Write it in Python, profile it, and if you need to speed parts of it up, write them in C++. Python and C++ are similar enough that the "familiar" advantage with C++ will be irrelevant pretty quick.
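
A rough sketch of the "profile it" step, using the standard library's `cProfile` and `pstats`. The functions `count_suffix` and `run` and the word list are stand-ins invented for illustration, not real NLP code:

```python
import cProfile
import io
import pstats

def count_suffix(words, suffix):
    """Count the words that end with `suffix` (stand-in for real work)."""
    return sum(1 for word in words if word.endswith(suffix))

def run():
    # Hypothetical workload: a repeated toy word list.
    words = ["testing", "word", "parsing", "token"] * 20000
    return count_suffix(words, "ing")

profiler = cProfile.Profile()
profiler.enable()
result = run()
profiler.disable()

# Collect the five most expensive entries by cumulative time.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
report = stream.getvalue()
print(report)
```

Whatever dominates the report is the candidate for a C++ rewrite; everything else can stay in Python.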

I say this as someone who has developed primarily in C++ and has recently gotten serious with Python. I like them both, but I can get Python code working a lot faster than C++. Seriously, dict beats std::map in usability.

P.S. Here's some information on how to call C code from Python.
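
A minimal sketch of one such route, using the standard library's `ctypes` to call `strlen` from the C runtime. It assumes a POSIX system (Linux or macOS); for your own C++ code you would instead build a shared library that exposes `extern "C"` functions and load that:

```python
import ctypes
import ctypes.util

# Assumption: POSIX system. If find_library returns None, CDLL(None)
# falls back to the symbols already loaded into the current process.
libc = ctypes.CDLL(ctypes.util.find_library("c"))

# Declare the C signature: size_t strlen(const char *s);
libc.strlen.argtypes = [ctypes.c_char_p]
libc.strlen.restype = ctypes.c_size_t

word = "parsing".encode("utf-8")   # C sees bytes, not a Python str
length = libc.strlen(word)
print(length)
```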

Mike DeSimone
+1. For algorithm development and prototyping, Python wins; it's easy to then move computationally intensive parts into a C/C++ module if need be.
Autopulated
I did that already. C++ has amazing execution time, but we may be willing to overlook that given Python's advantages. Your idea is really cool and pragmatic.
Khaled Al Hourani
@Khaled: That's been our experience. Sure, I can get FFTs going insanely fast (using fftw or MKL) in C++, but >95% of my code isn't `fft()`, it's decision making, initialization, and management. Also, that's the code that gets changed most of the time, not the inner-loop stuff. And when I did that part in Python, I was impressed with how it was measurably slower but not practically slower in my application, while being far faster to develop.
Mike DeSimone
+9  A: 

Although this is subjective and argumentative, there is evidence that you can write a successful NLP project in Python, such as NLTK. They also have a comparison of NLP functionality in different languages:


(Quoting from the comparison)

Many programming languages have been used for NLP. As explained in the Preface, we have chosen Python because we believe it is well-suited to the special requirements of NLP. Here we present a brief survey of several programming languages, for the simple task of reading a text and printing the words that end with ing. We begin with the Python version, which we believe is readily interpretable, even by non Python programmers:

import sys
for line in sys.stdin:
    for word in line.split():
        if word.endswith('ing'):
            print(word)

[...]

The C programming language is a highly-efficient low-level language that is popular for operating system and networking software:

#include <stdio.h>
#include <string.h>

int main(int argc, char **argv) {
   int i = 0;
   int c = 1;
   char buffer[1024];

   while (c != EOF) {
       c = fgetc(stdin);
       if ( (c >= '0' && c <= '9') || (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') ) {
           buffer[i++] = (char) c;
           continue;
       } else {
           if (i > 2 && (strncmp(buffer+i-3, "ing", 3) == 0 || strncmp(buffer+i-3, "ING", 3) == 0 ) ) {
               buffer[i] = 0;
               puts(buffer);
           }
           i = 0;
       }
   }
   return 0;
}

Edit: I didn't include comparable code in C++/Boost, so here is a sample from the Boost documentation that does something similar, although not identical. Note that this isn't the cleanest version.

// char_sep_example_1.cpp
#include <cstdlib>   // EXIT_SUCCESS
#include <iostream>
#include <string>
#include <boost/tokenizer.hpp>

int main()
{
  std::string str = ";;Hello|world||-foo--bar;yow;baz|";
  typedef boost::tokenizer<boost::char_separator<char> >
    tokenizer;
  // Split on any of '-', ';' or '|'; empty tokens are dropped by default.
  boost::char_separator<char> sep("-;|");
  tokenizer tokens(str, sep);
  for (tokenizer::iterator tok_iter = tokens.begin();
       tok_iter != tokens.end(); ++tok_iter)
    std::cout << "<" << *tok_iter << "> ";
  std::cout << "\n";
  return EXIT_SUCCESS;
}
Otto Allmendinger
+1 Above is directly applicable to the question and provides additional info.
Dana the Sane
thanks, it would be nice to know who -1ed me without comment though
Otto Allmendinger
The question is about Python and C++/Boost, but this answer compares Python and C. You can write a much cleaner equivalent in C++ here
f4
I really appreciate your work, and if I could accept a second answer, I'd give it to you :)
Khaled Al Hourani
@Otto: I was an early upvote for you; your answer covers a lot I didn't. Sometimes people just hit you with a downvote and don't say why. It's annoying, but no big deal in the long run. Also, I'm no Boost expert, but the C++ solution doesn't seem to do what the Python or C solutions do... nothing in there involving `"ing"`...
Mike DeSimone
@Khaled no problem, doesn't matter that much. @Mike I haven't found analogous code for Boost; the sample I found at least demonstrates word iteration. I'll try to find a more representative sample.
Otto Allmendinger
+1  A: 

This is more or less a reply/supplement to Otto Allmendinger's answer. If you honestly wanted to implement something (roughly) similar to his Python example in C++, I think something like this would be closer:

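#include <string>
#include <iostream>

int main() {
    std::string temp;
    // operator>> splits on whitespace, like line.split() in the Python version
    while (std::cin >> temp)
        if (temp.size() > 2 && temp.substr(temp.size() - 3, 3) == "ing")
            std::cout << temp << '\n';  // one word per line, as Python's print does
}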

This does essentially the same thing as the Python does, and is about the same length as well -- the C++ has more syntactic "fluff", but they have exactly the same number of lines of code that really do anything (though there's no question that the individual lines in the C++ version are longer).

Don't get me wrong: I'm certainly not trying to claim that development with C++ will be as quick or easy as with Python. I do think the margin might be a tad smaller than some of the code presented here might imply though.

Edit: If you did want to claim C++ would be faster and easier, you could present code like:

for (std::string temp; std::cin>>temp; )
    temp.size()>2 && temp.substr(temp.size()-3, 3)=="ing" && std::cout << temp << '\n';

...along with a factually accurate (though grossly misleading) claim like: "The C++ code has only half as many statements as the Python implementation."

Jerry Coffin