Greetings,

I have a large dataset (1GB of pure compressed text).

Right now I'm rewriting the dataset based on information in the data, for example:

  • Turn 2009-10-16 into Friday
  • Count the number of times something happens and how long each occurrence lasts

Right now I'm doing all this in Java. I'm wondering if anyone knows of a tool or language that was actually designed for this type of work. It is possible in Java, but I'm writing a lot of boilerplate code.

Thanks

+5  A: 

Perl is the answer. It was created for manipulation of text data.

jitter
Whatever other position you may have in the Perl/PHP/Python wars, text manipulation is an area where Perl really stands out.
mobrule
+2  A: 

Perl

Bob
+3  A: 

An extended discussion of manipulating large data sets of string data can be found here. It covers more languages and their specific advantages, plus Unix/Linux shell scripting as an alternative.

luvieere
+2  A: 

I use Python to do this type of stuff at work all the time. The scripts are straightforward to write, as Python is dead easy to learn and has wonderful documentation for its libraries and core language features. Python, coupled with the command line, makes my life easy.

In your case, for just one file, I would write the script and just do:

zcat big_file.dat.gz | my_script.py
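As a rough idea of what the core of such a script might look like, here is a minimal sketch for the "count how often something happens and how long it lasts" task. The input format is an assumption (one record per line: an ISO timestamp, a tab, then a state label), and summarize is a hypothetical helper name, not something from the question.

```python
import sys
from collections import Counter
from datetime import datetime


def summarize(lines):
    """Count runs of each label and the total seconds spent in each.

    Assumed input: "<ISO timestamp>\t<label>" per line, sorted by time.
    """
    counts = Counter()   # number of separate runs per label
    seconds = Counter()  # total duration per label, in seconds
    prev_label = None
    prev_time = None
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        stamp, label = line.split("\t", 1)
        now = datetime.fromisoformat(stamp)
        if prev_label is not None:
            # Time since the previous record is charged to the previous label.
            seconds[prev_label] += (now - prev_time).total_seconds()
        if label != prev_label:
            counts[label] += 1  # a new run of this label starts here
        prev_label, prev_time = label, now
    return counts, seconds


def main(stream=sys.stdin):
    """Read records from the stream and print label, run count, seconds."""
    counts, seconds = summarize(stream)
    for label in counts:
        print(f"{label}\t{counts[label]}\t{seconds[label]:.0f}")
```

To use it in the pipe above, the script would end with a call to main() and be marked executable. Since the last record has no end time, its run is counted but adds no duration.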

Or one could use Python's libraries for processing compressed files if you don't like command line work.
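For example, a minimal sketch using the standard gzip and datetime modules to turn dates into weekday names. The file layout is an assumption (tab-separated, with a YYYY-MM-DD date in the first column), and weekday_name and rewrite_dates are illustrative names only.

```python
import gzip
from datetime import date

# Fixed English names, so the output does not depend on the system locale.
DAYS = ("Monday", "Tuesday", "Wednesday", "Thursday",
        "Friday", "Saturday", "Sunday")


def weekday_name(iso_date):
    """Return the weekday name for a YYYY-MM-DD string."""
    return DAYS[date.fromisoformat(iso_date).weekday()]


def rewrite_dates(in_path, out_path):
    """Copy a gzip-compressed file, replacing the first column's date
    with its weekday name. Assumes tab-separated columns."""
    with gzip.open(in_path, "rt", encoding="utf-8") as src, \
         gzip.open(out_path, "wt", encoding="utf-8") as dst:
        for line in src:  # streams line by line, so 1GB is not held in memory
            first, sep, rest = line.rstrip("\n").partition("\t")
            dst.write(weekday_name(first) + sep + rest + "\n")
```

With this, weekday_name("2009-10-16") gives "Friday", matching the example in the question.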

As also mentioned by others, Perl would be just as good. Either will do the trick.

Dr. Watson
+1  A: 

Depending on how the data is structured, you might want to focus on the storage rather than the language: is this something you can feed into a database and let the database do the heavy lifting?
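To illustrate the idea, here is a small sketch using SQLite through Python's standard sqlite3 module. The events table and its day and name columns are made up for the example; the real schema would depend on the data.

```python
import sqlite3


def weekday_counts(rows):
    """Load (day, name) pairs into an in-memory SQLite table and let SQL
    count occurrences per weekday and event name.

    strftime('%w', ...) in SQLite yields '0' (Sunday) through '6' (Saturday).
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (day TEXT, name TEXT)")
    conn.executemany("INSERT INTO events VALUES (?, ?)", rows)
    query = """
        SELECT strftime('%w', day) AS dow, name, COUNT(*)
        FROM events
        GROUP BY dow, name
        ORDER BY dow, name
    """
    result = conn.execute(query).fetchall()
    conn.close()
    return result
```

For a 1GB data set, a file-backed database (passing a path instead of ":memory:") would probably be the more sensible choice.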

Joe
A: 

I'd suggest using AWK. The first line of the Wikipedia entry says it all.

AWK is a programming language that is designed for processing text-based data, either in files or data streams

cdiggins
Perl is at least as common as AWK, and was designed because AWK was insufficient and/or clumsy for certain tasks. And it's hard (and silly) to say that tool X is "more designed" for task Y than tool Z was.
Chris Lutz
Removed comparison to Perl. I don't agree with you, but I shouldn't have made such an incendiary remark.
cdiggins
A: 

I ended up using Scala for this. I find it quite powerful for the job I'm doing, and I can easily integrate it into my Java code.

steve