Task: Process 3 text files, each close to 1 GB in size, and turn them into CSV files. The source files have a custom structure, so regular expressions would be useful.

Problem: There is no problem. I use PHP for it and it's fine. I don't actually need to process the files faster. I'm just curious how you would approach the problem in general. In the end I'd like to see simple and convenient solutions that might perform faster than PHP.

@felix I'm sure about that. :) Once I'm done with the whole project I'll probably post this as a cross-language code ping-pong.

@mark My approach currently works like that, with the exception that I cache a few hundred lines to keep file writes low. A well-thought-through memory trade-off would probably squeeze out some more time. But I'm sure that other approaches can beat PHP by far, like a full utilization of a *nix toolset.
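
A minimal, illustrative sketch of that batching idea in Python (the file names, the regex, and the 500-line buffer size are placeholder assumptions, not the actual code):

    import re

    BUFFER_LINES = 500                       # arbitrary batch size: memory vs. write frequency
    pattern = re.compile(r"(\w+)\s+(\d+)")   # stand-in for the real record structure

    buffered = []
    with open("input.txt", encoding="utf-8") as src, \
            open("output.csv", "w", encoding="utf-8") as dst:
        for line in src:
            m = pattern.search(line)
            if m:
                buffered.append(",".join(m.groups()) + "\n")
            if len(buffered) >= BUFFER_LINES:
                dst.writelines(buffered)     # one write call per batch instead of per line
                buffered.clear()
        if buffered:
            dst.writelines(buffered)         # flush the remainder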

+5  A: 

Firstly, it probably doesn't matter much which language you use for this, as the task will most likely be I/O bound. What is more important is that you use an efficient approach/algorithm. In particular, you want to avoid reading the entire file into memory if possible, and avoid concatenating the result into a huge string before writing it to disk.

Instead use a streaming approach: read a line of input, process it, then write a line of output.
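
For illustration, a minimal Python sketch of that loop (the file names and the per-line transform are assumptions; any language that can iterate a file line by line works the same way):

    # Streaming: one line read, processed, and written at a time; memory use stays constant.
    def transform(line):
        return line.rstrip("\n").replace("\t", ",") + "\n"   # assumption: tab-separated input

    with open("input.txt", encoding="utf-8") as src, \
            open("output.csv", "w", encoding="utf-8") as dst:
        for line in src:                 # the file object iterates lazily, line by line
            dst.write(transform(line))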

Mark Byers
A: 

http://hadoop.apache.org/pig/

GameBit
+1  A: 

I'd be reaching for sed.

High Performance Mark
A: 

Perl is the old grand master of text processing, for good reasons. A lot of Perl's strengths are, I believe, found in Python today, in a more accessible way, so when it comes to text parsing I usually reach for Python (I've parsed multi-GB files with Python before).

AWK or sed are probably lightning-fast as well, but not as easily extendable as Perl and Python. In your particular case you don't need to do much more than parse and reformat output, but if you wanted to do more it would be easier to already be using Perl/Python.

I can't really find arguments against Python compared to the alternatives, so I guess that would be my suggestion.
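
As a sketch of what that could look like in Python, assuming a made-up record format (the regex, column names, and file names below are illustrative, not the asker's actual structure):

    import csv
    import re

    # Hypothetical record format: lines such as "alice: 42; 3.14" become one CSV row each.
    record = re.compile(r"^(?P<name>\w+):\s*(?P<count>\d+);\s*(?P<score>[\d.]+)\s*$")

    with open("input.txt", encoding="utf-8") as src, \
            open("output.csv", "w", newline="", encoding="utf-8") as dst:
        writer = csv.writer(dst)
        writer.writerow(["name", "count", "score"])          # header row
        for line in src:
            m = record.match(line)
            if m:
                writer.writerow([m["name"], m["count"], m["score"]])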

Daniel Andersson
A: 

How would I process large quantities of text data, you ask? perl -pe 's/regex/magic/eg' (or some similar, more complex variation).

It's pretty much perfect for the job, excluding the rare situations that demand absolutely maximum performance (where almost any scripting language falls short).

It's widely available, fast, and concise. I'm in the process of teaching Perl to a few coworkers, and they seem to be in a continuous state of awe at the seemingly miraculous feats it can perform in one or two lines of code. And joking aside, it's completely viable to do this while remaining quite readable (assuming you have a reasonable understanding of the language and no desire to create hell for future maintainers).

pdehaan
A: 

Have a look at Node.js. Create a static server in Node.js (a few lines of JS code) and try to read the files using its evented I/O mechanism. It scales really well and is highly concurrent. You will observe really fast I/O operations. Give it a try and let me know.

A_Var