Duplicate:

Best Language for String Manipulation?

I have to parse hundreds of text files per second, each containing text on multiple subjects (email text, for example). I need to find various patterns in them (keywords, sentences, the most important words, and so on). Which programming language is fastest for this?

Note: I tried PHP and Perl to find keywords using regular expressions. Is there a faster way to do that? And what should I use to extract the most important words and analyse the semantics of sentences?

I have a list of keywords stored in a text file (it will probably move to an LDAP directory later). Example: "You've just registered in facebook with these stats: username: user1 password: passwdofuser1"

I have to tag this text as containing words like "password" and "username", and retrieve the username and password values to process later.

Sentence example: "Let's meet tomorrow at 5pm in st. john's restaurant." I have to extract important information like "tomorrow", "5pm" and "st. john's restaurant" to process later.

Thanks.

+5  A: 

You won't get much faster than C. You may want to consider a Bayesian filter for the pattern recognition or, failing that, a proper LR(1) parser.

dsm
A: 

I've read somewhere that regular expressions in Tcl are faster than in Perl. Of course, C should be the fastest to run (but the slowest to write). ;-)

Kouber Saparev
+2  A: 

Have a look at the Ragel State Machine Compiler. It can generate parsers in many different languages, e.g. Java, C/C++ and Ruby. Ragel is really good if you want to perform different actions during parsing. You can find many good examples on its website.

sris
+3  A: 

Before you consider the programming language, I suggest you consider your algorithm. It sounds like you want to detect words from a given (maybe dynamic) list. I believe regular expressions are more powerful than you need here. I suggest you tokenize the incoming text using something like GNU flex or another lexical analyzer, then compare your tokens to the list and continue processing. You can do this in C, C++ or another language. Comparing pretokenized text should be faster than matching the whole text against regular expressions.
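As a rough illustration of the tokenize-then-compare idea, here is a minimal Python sketch. It is not flex; it uses plain whitespace splitting as a stand-in tokenizer, and the names `KEYWORDS`, `tokenize` and `find_keywords` are made up for this example. The keyword list is hard-coded here, whereas the question says the real one lives in a text file.

```python
import string

# Hypothetical keyword list; in the question it would be loaded from a file.
KEYWORDS = frozenset({"password", "username"})

def tokenize(text):
    """Split on whitespace, lowercase, and trim punctuation from token edges."""
    tokens = (word.strip(string.punctuation) for word in text.lower().split())
    return [tok for tok in tokens if tok]

def find_keywords(text, keywords=KEYWORDS):
    """Return the keywords present in text, using one set lookup per token."""
    return {tok for tok in tokenize(text) if tok in keywords}

sample = ("You've just registered in facebook with these stats: "
          "username: user1 password: passwdofuser1")
print(sorted(find_keywords(sample)))  # ['password', 'username']
```

Each token costs one hash lookup, so the work per message does not grow with the number of keywords, which is the point of comparing tokens against a list instead of scanning the text once per pattern.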

Yuval F
A: 

When I learned Python, the instructor was pretty convinced that Python would roughly match the speed of C in text processing. I haven't verified this, but according to him, the code Python uses for text processing is C code. As such, the initial startup of a Python script would be slower, but the script would then run (almost) as fast as a C program. The re module is a C extension for Python, so it doesn't get much faster than that. Well, at least not in Python terms.
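For reference, a small sketch of what using the re module for the registration example from the question could look like. It says nothing about speed relative to C. The message layout ("username: ... password: ...") is assumed from the single sample in the question, and `CREDENTIALS_RE` and `extract_credentials` are names invented for this example.

```python
import re

# Assumed layout, taken from the one sample message in the question:
# "username: <value> password: <value>". Compiled once and reused.
CREDENTIALS_RE = re.compile(
    r"username:\s*(?P<user>\S+)\s+password:\s*(?P<password>\S+)",
    re.IGNORECASE,
)

def extract_credentials(text):
    """Return a dict with 'user' and 'password', or None if nothing matches."""
    match = CREDENTIALS_RE.search(text)
    return match.groupdict() if match else None

sample = ("You've just registered in facebook with these stats: "
          "username: user1 password: passwdofuser1")
print(extract_credentials(sample))
# {'user': 'user1', 'password': 'passwdofuser1'}
```

Compiling the pattern at module level means the compilation cost is paid once, not once per message.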

wzzrd
In 1982 the InStr() function of Microsoft BASIC on my TRS-80 was entirely written in Z80 assembler, hand coded by Bill Gates. It was indeed a lot faster than my compiled Pascal. Interestingly, BASIC's string manipulation has always been surprisingly good.
Peter Wone
+8  A: 

Perl should be enough for you. "What is fastest?" is the wrong question. The right question is what is fast enough at runtime and also fast enough to develop in. Parallelize your work if you can, and run it on a cluster if you need to, until you reach the point where that no longer helps (synchronization and complexity kill performance). What is fastest? Highly optimized assembly, if you are expert enough to write that sort of code. But if that were the right answer for you, you would not be asking.

For example, in my job there are tasks where Perl is faster than the MySQL engine, even though they look like the kind of task an RDBMS is normally used for (full outer join, ETL, cleansing). C, Haskell, OCaml or CL can be faster, but Perl is fast enough.
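The "parallelize your work" advice could look roughly like this on a single machine. This is a sketch in Python, not the Perl setup described above, and `KEYWORDS`, `tag_text` and `tag_all` are hypothetical names. It assumes each message can be tagged independently of the others, so no synchronization is needed between workers.

```python
from concurrent.futures import ProcessPoolExecutor

# Hypothetical keyword list for illustration.
KEYWORDS = ("password", "username")

def tag_text(text):
    """Return the keywords that occur in one message (simple substring test)."""
    lowered = text.lower()
    return [kw for kw in KEYWORDS if kw in lowered]

def tag_all(texts, workers=4):
    """Tag many messages using a pool of worker processes, preserving order."""
    with ProcessPoolExecutor(max_workers=workers) as pool:
        # chunksize batches messages to reduce inter-process overhead.
        return list(pool.map(tag_text, texts, chunksize=16))
```

Call `tag_all(messages)` from under an `if __name__ == "__main__":` guard so worker processes can import the module safely. Whether the pool pays off depends on message size: for very short texts the cost of shipping data between processes can outweigh the gain.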

Hynek -Pichi- Vychodil