I'm about to embark on a project that will need to:

  • Process XML
  • Heavy text parsing of non-xml documents
  • Insertion of data from xml and non-xml documents into a relational DB.
  • Present processed data to user from db using webpages.
  • Must handle load very well.

The website will be subject to short periods of very heavy loads to pages (300+ visitors a minute for several minutes), but most of the time will be idle (a dozen or so visitors a minute).

I have a very strong background in Java and web services, but I do not want to use Java for this project as I'd like to diversify my skill set.

I'm not looking for your opinion on which language you think is best. What are some pros and cons of using these languages that you recognize from your own experience?

A: 

Depending on your needs, you may want to consider a framework that already supports caching; Drupal is one example, but there are many others. Most frameworks are extensible, so you can add plugins to handle all the parsing and presentation.

I think language is less important than the framework you choose. I would personally choose PHP over Perl, because I think it is more applicable in the real world. Python is another beautiful scripting language, but PHP has the most traction in the web world. If your goal is to make your skill set more marketable, go with PHP.

vfilby
What's "PERL"? http://faq.perl.org/perlfaq1.html#What_s_the_differenc
David Dorward
Seriously, you are that picky about nomenclature?
vfilby
@vfilby Yes, we are that picky about referring to the language by its correct name.
Sinan Ünür
When anyone refers to 'PERL', there's a good chance that they are unfamiliar with modern Perl. The same goes for when people always refer to it only as a 'scripting language'.
plusplus
I'll admit that I haven't used *Perl* in 5 years, but that doesn't change my argument. If you want more experience then the one with the most market penetration is the one you should go with. It makes you more marketable as a developer.
vfilby
+8  A: 

Since I'm a PHP guy, here is what I can offer about PHP

So the requirements to a language from your question are met by PHP.

However, Perl, Python or Ruby or even ServerSide JavaScript (...) should all be capable of doing what you are asking for either. PHP has its quirks, and so do the other languages. If you are a Java guy, you might like Ruby for its syntax, but then again, only you can decide.

Gordon
PHP 5 has the SimpleXML Class which makes working with XML very easy.
Xeoncross
Gordon, thank you for these excellent references. While they are definite pros for PHP, can you outline any cons I might encounter? Another answer mentions problems with UTF8, can you confirm or deny such problems exist?
Clinton
@Clinton Supporting Unicode *can* be troublesome. There are a number of extensions available for working with multibyte strings and various character encodings, though. See http://de.php.net/manual/en/refs.international.php and this IBM article http://www.ibm.com/developerworks/library/os-php-unicode/index.html - some criticisms I am aware of are the inconsistent naming of functions, the inconsistent parameter order, and the verbosity of the syntax. Have a look at http://de.php.net/manual/en/langref.php. If you're from a Java background, the OOP chapter will be of special interest to see what PHP offers here.
Gordon
@Sinan I really don't understand your angry tone nor your definition of "Subjective" here. Gordon is offering hard info why PHP could, among a broad variety of other languages, fulfill the requirements stated by the OP. He is not saying "it will work better than (Perl|Ruby|any other language)", nor "I would go with PHP"
Pekka
All right, both of you, no fighting in the hallways! :)
DVK
@Sinan - I must admit that while I generally find PHP vs. Perl fanboyism as distasteful as you do, this answer is in fact VERY inoffensive for me due to the explicit "other languages should all be capable of doing what you are asking for either" from the get-go. Although that makes it slightly less useful for the OP's problem, since it does not, in fact, provide any marginal reasons for choosing one or the other :)
DVK
A: 

All the languages mentioned should be usable for your purpose. But as far as I know, PHP can be a little tricky with UTF8 strings (e.g. getting the right string length when a UTF8 character consists of multiple bytes). But I'm sure some guys will provide good solutions for PHP via comments soon :-)

My personal favorite is Ruby. It provides easy-to-use, powerful libraries (so-called gems) for all your needs.

Achim Tromm
Some of the non-xml data being posted by users will be in German or Russian, and therefore I need the parsing to properly handle such cases. Is UTF8 character handling a known problem with PHP?
Clinton
UTF8 is not supported by native strings in PHP5, so you might run into trouble if you use them (e.g. strpos() returns a byte offset, not a character offset). You would have to use the dedicated UTF8 string functions instead. Or you can wait for PHP6, which is expected to support UTF8 in native strings; we will see.
Achim Tromm
PHP 5 does not have native support for Unicode or multibyte strings, unlike Perl and Python, but there is the mbstring module. This problem will be fixed in PHP 6, but that hasn't been released yet.
Leon Timmermans
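To make the byte-versus-character point in the comments above concrete, here is a minimal sketch. It uses Python purely for illustration (it is not PHP or Perl code), and the sample word is an assumption chosen because it contains two multibyte characters. It shows how the same string has different lengths and search offsets depending on whether you count characters or UTF-8 bytes, which is the mismatch that byte-oriented string functions expose.

```python
# Illustration only: character counts vs. UTF-8 byte counts.
# "Gr\u00fc\u00dfe" is the German word "Gruesse" written with u-umlaut and sharp s;
# each of those two characters occupies 2 bytes in UTF-8.
text = "Gr\u00fc\u00dfe"
encoded = text.encode("utf-8")

char_count = len(text)      # number of Unicode code points
byte_count = len(encoded)   # number of bytes after UTF-8 encoding

# Search offsets diverge in the same way once a multibyte
# character appears before the match.
char_offset = text.find("\u00df")
byte_offset = encoded.find("\u00df".encode("utf-8"))

print(char_count, byte_count, char_offset, byte_offset)
```

Any code that slices or truncates user-supplied German or Russian text by byte offset risks cutting a character in half, so it is worth checking which kind of offset each string function in your chosen language returns.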
+9  A: 

I'd go with Perl. The LibXML series of modules gives a variety of interfaces (DOM, XPath, XSLT, etc.) backed by a fast C parser.

Perl's regex support for slicing and dicing text is pretty much unmatched by any other language. If you expect to do lots of arbitrary text processing, and are at least a little familiar with regex, you will thank yourself.

There are also a series of great web frameworks for Perl, including the simple but powerful Mojolicious framework, and the comprehensive Catalyst framework. There's always the ancient and stable CGI library, but Mojolicious or Catalyst would probably be better choices.

David Dorward
Just to be crystal clear if you don't already know this: whether you use Perl or PHP or something else, NEVER EVER use a DOM XML parser for large XML documents unless your server has unlimited memory :)
DVK
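The warning above about DOM parsers can be sketched with a streaming (event-based) parse, which handles elements as they are read instead of loading the whole document into memory first. This is a hedged illustration in Python using the standard library's `xml.etree.ElementTree.iterparse`; the function name `count_records`, the `item` tag, and the tiny sample document are all assumptions made for the example. Perl and PHP offer comparable streaming interfaces, which are not shown here.

```python
import io
import xml.etree.ElementTree as ET


def count_records(source, tag):
    """Count <tag> elements while streaming through the document."""
    count = 0
    # "end" events fire once an element has been read completely.
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == tag:
            count += 1
            # Release the element's text and children after processing,
            # so finished records do not accumulate their full content.
            elem.clear()
    return count


# Hypothetical sample input; a real feed would be a large file on disk.
sample = io.BytesIO(b"<feed><item>a</item><item>b</item><item>c</item></feed>")
total = count_records(sample, "item")
print(total)
```

In a real import job, the body of the loop is where each record would be parsed and inserted into the database before being cleared.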
A: 

As far as I'm aware, the PCRE library behind PHP's regex functions (which I assume is what you'll use) came from Perl. So if you have a lot of non-XML parsing, you need to test both and see which one runs faster. I'm not sure which one is faster for your needs.

They both handle XML well (finally).

However, PHP is just a massive community. There is no other scripting language on the planet as large. So if that matters to you then use PHP since you can find everything under-the-sun about it.

However, Perl also has a large following and I'm sure there are plenty of tutorials for everything you would want to do.

Python is also a language you might want to look into. Heck, since everyone realized Ruby was God's gift to the world, it has exploded too! You can honestly do what you want in any language, so you need to look at the syntax of each of them and figure out which one you like best. From there you can run a simple example benchmark in each one to see which language is the fastest for your needs.

Whatever you do - don't use a "framework" like WordPress or Drupal. They are CMSs, not frameworks, and are slow and bloated. WordPress takes 8MB just to load the index page!

We had a PHP project, and a guy from a Java background joined us and was up and running in a week or two once he got the hang of everything.

Xeoncross
Clarification: The PCRE library was based on Perl regexes, but the two aren't exactly the same.
Robert P
A: 

Why don't you try Ruby on Rails?

Coming back to your question, I would say PHP, since you want to learn something new and at the same time should have a great community where you can find support.

PHP does everything you have requested.

So what is your recommendation, RoR or PHP?
Achim Tromm
RoR is the one my heart says to go with. But since the OP's question was Perl or PHP, I recommend PHP, as it has a lot of support.
+4  A: 

As it appears the bulk of your work will be processing data rather than presentation, in my opinion this is what Perl does best. Perl performs very well with regular expressions, and the vast array of modules on CPAN can help you parse commonplace formats. There are also a good few frameworks in Perl that will make life easier in the presentation of the data. The major disadvantage for a newcomer is that, with tens of distributions on CPAN for each of the various problems you may encounter (XML parsing, web framework, ORM etc.), it can be hard to decide which one to use. Thanks to Plack/PSGI, talking to web servers with Perl has recently gotten much, much better.

It's important to note that "load" is a completely language-agnostic problem: it is not which language you choose but how you engineer your system that will determine how well it handles increased load. Perl, Java and PHP have all been used in small setups all the way through to some of the most heavily trafficked websites on the net. If growth is among your future needs, decouple where appropriate and design for future expansion first. Multiple database servers, caching and message/work queues can all be used at small scale, and putting them in while things are small is easier than having to rewrite or quickly hack them in when demand grows.

squeeks
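As one hedged illustration of the caching mentioned above, here is a minimal in-process sketch in Python. The class name `TTLCache`, the `render_page` function and the 30-second lifetime are all hypothetical; a production setup would more likely use a shared cache such as memcached. The idea is that during a burst of visitors, an expensive page is built once and then served from the cache until the entry expires.

```python
import time


class TTLCache:
    """Tiny cache that recomputes a value only after its entry has expired."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock          # injectable so expiry can be tested
        self._store = {}            # key -> (expiry_time, value)

    def get_or_compute(self, key, compute):
        now = self.clock()
        entry = self._store.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]         # still fresh: skip the expensive work
        value = compute()
        self._store[key] = (now + self.ttl, value)
        return value


calls = []


def render_page():
    # Stand-in for an expensive database query plus template rendering.
    calls.append(1)
    return "<html>report</html>"


cache = TTLCache(ttl_seconds=30)
first = cache.get_or_compute("/report", render_page)
second = cache.get_or_compute("/report", render_page)
print(len(calls))
```

The trade-off is staleness: visitors may see data up to one lifetime old, which is usually acceptable for pages that summarize already-processed documents.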
+5  A: 

It is, indeed, very much a subjective question. I can totally conceive that in 2010, Perl or PHP (and even Python or Ruby) could equally serve you for such a project. The difference is not going to come from the language itself as much as the tools, best practices and community.

Among these languages, I am most familiar with Perl, so let me try to offer an answer from that perspective, regarding your needs.

Text and XML parsing: Perl has very robust support for text parsing of even very long files (as long as you don't slurp), and allows powerful, clear and easy regex programming. It has clear built-in Unicode support and standard trans-encoding tools (the Encode module), which is very handy when it comes to user interfaces. It also has a direct binding for libxml2 in the form of a standard, fast and well-maintained module: XML::LibXML.

Relational DB Support: In addition to the standard database interface (DBI), which allows direct SQL queries to a number of DBMSes, there are a number of frameworks that make DB-to-Webdoc management easier while remaining powerful, the most famous probably being Catalyst.

HTML Document presentation: Mason is my favorite web application delivery engine. The integration with Perl is so elegant, yet it does not sacrifice templating patterns or language features.

Heavy load handling: There are as many solutions as there are load problems to solve. Perl offers bindings for memcached: Cache::Memcached (written in Perl) and Cache::Memcached::Fast (written in C).

Balance that out with your personal preferences regarding syntax and general language philosophy, and you could very much join the Enlightened Perl community quite soon :)

Coox
+6  A: 

Therefore, every single item on your list can be done using both languages. You should choose the one you believe will make you most productive taking into account your own strengths and weaknesses.

Sinan Ünür
Sinan, thank you for these references. Like I asked Gordon above, can you think of any caveats I might experience while using Perl?
Clinton
A: 

I would use Common Lisp.

  • Closure XML for parsing XML
  • cl-ppcre is a perl-compatible regular expression library, but depending on what kind of text you want to parse, you can perhaps find specialized parsers at the Common Lisp Directory.
  • I don't know what database you want to use, but Postmodern is very nice for Postgres. There is also the more generic CLSQL.
  • You can use Hunchentoot as a webserver and, e.g., CL-WHO to produce HTML pages. 5 pages per second should be no problem.
Svante
A: 

OK, since everyone has been subjective in their answers, I'll add mine too.

Use Java: the core supports all you need (no frameworks needed), it's free and open source, and it's 2 to 3 times faster than Perl or PHP.

Seriously... PHP is designed for Web projects, it's easy, and it supports all you need to do (try Zend Framework). It has a decent learning curve (Java is harder to learn), and there is a huge community of developers out there to help you if you run into something unexpected (bigger than Pearl's and Java's). On performance, it's a little slower than Pearl (I'm talking about plain old PHP scripts, no weird voodoo optimizations), but it's enough for what you probably need.

In the end, I'm pretty sure you will get a smaller, more consistent app if you use PHP (and follow all the coding and design best practices) than you will ever get using Perl.

(Java is way better... but I don't want to be verbally lynched by some PHP zealot)

Chepech
The question rules out one and only one language … Java. And "Pearl"? Really?
David Dorward
Well, as I thought, I just got lynched by all the zealots available. No one really read my answer through... =) C'mon, the first paragraph is a JOKE, people! In retrospect, not a very good one considering the results, but try to read the rest... shame on you people! =P
Chepech
A: 

Use Perl, if you have experience with neither and your goal is to make yourself more marketable.

It's much easier to fake PHP experience if you need to defend both entries in your 'professional experience' section.

BojanG
+1  A: 

Your architecture and algorithms will have more impact on speed and scalability than choice of language.

Perl, PHP or Java will all do the job.

I'd do this in Perl since I know it well and prefer it to PHP (which I also know well). Your mileage will vary.

daotoad