views: 255
answers: 7

So for some research work, I need to analyze a ton of raw movement data (currently almost a gig of data, and growing) and spit out quantitative information and plots.

I wrote most of it using Groovy (with JFreeChart for charting) and when performance became an issue, I rewrote the core parts in Java.

The problem is that analysis and plotting takes about a minute, whereas loading all of the data takes about 5-10 minutes. As you can imagine, this gets really annoying when I want to make small changes to plots and see the output.

I have a couple ideas on fixing this:

1) Load all of the data into a SQLite database. Pros: It'll be fast. I'll be able to run SQL to get aggregate data if I need to.

Cons: I have to write all that code. Also, for some of the plots I need access to each point of data, so with a couple hundred thousand files to load, some parts may still be slow.
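
For reference, a minimal sketch of roughly what that loading code could look like with the Xerial sqlite-jdbc driver (the table layout and column names here are just illustrative, not my actual schema):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.Statement;

    public class SqliteLoader {
        public static void main(String[] args) throws Exception {
            Class.forName("org.sqlite.JDBC"); // Xerial sqlite-jdbc driver on the classpath

            Connection conn = DriverManager.getConnection("jdbc:sqlite:movement.db");
            conn.setAutoCommit(false); // batching inserts inside one transaction is much faster

            Statement ddl = conn.createStatement();
            ddl.execute("CREATE TABLE IF NOT EXISTS points (track_id INTEGER, t REAL, x REAL, y REAL)");
            ddl.close();

            PreparedStatement insert =
                    conn.prepareStatement("INSERT INTO points (track_id, t, x, y) VALUES (?, ?, ?, ?)");
            // The real loader would walk the raw movement files here instead of generating dummy rows.
            for (int i = 0; i < 1000; i++) {
                insert.setInt(1, 1);
                insert.setDouble(2, i * 0.1);
                insert.setDouble(3, Math.random());
                insert.setDouble(4, Math.random());
                insert.addBatch();
            }
            insert.executeBatch();
            insert.close();

            conn.commit();
            conn.close();
        }
    }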

2) Use Java RMI to return the object. All the data gets loaded into one root object, which, when serialized, is about 200 MB. I'm not sure how long it would take to transfer a 200 MB object through RMI (same client).

I'd have to run the server and load all the data but that's not a big deal.

Major pro: this should take the least amount of time to write.
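
A rough sketch of what the server side of option 2 could look like (DataSet, DataService, and the load method are simplified stand-ins, not my actual classes):

    import java.io.Serializable;
    import java.rmi.Remote;
    import java.rmi.RemoteException;
    import java.rmi.registry.LocateRegistry;
    import java.rmi.registry.Registry;
    import java.rmi.server.UnicastRemoteObject;

    // Stand-in for the real root object holding all the parsed movement data.
    class DataSet implements Serializable {
        private static final long serialVersionUID = 1L;
        // ... roughly 200 MB of parsed records ...
    }

    // Remote interface: the server loads the data once and hands it out on request.
    interface DataService extends Remote {
        DataSet getDataSet() throws RemoteException;
    }

    class DataServiceImpl extends UnicastRemoteObject implements DataService {
        private final DataSet dataSet;

        DataServiceImpl(DataSet dataSet) throws RemoteException {
            this.dataSet = dataSet;
        }

        public DataSet getDataSet() {
            return dataSet; // serialized and shipped to the client on every call
        }
    }

    public class DataServer {
        public static void main(String[] args) throws Exception {
            DataSet data = loadAllTheData();              // the slow 5-10 minute step, done once
            Registry registry = LocateRegistry.createRegistry(1099);
            registry.rebind("data", new DataServiceImpl(data));
            System.out.println("Data loaded; server ready.");
        }

        // Placeholder for the existing loading code.
        private static DataSet loadAllTheData() {
            return new DataSet();
        }
    }

The client would just look up "data" in the registry and call getDataSet(); each call still pays the full serialization cost of the whole object graph, which is exactly the part I'm unsure about.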

3) Run a server that loads the data and executes a Groovy script on command within the server VM. Overall, this seems like the best idea (for implementation time vs. performance, as well as other long-term benefits).

What I'd like to know is: have other people tackled this problem?

+1  A: 

If your data has relational properties, there's nothing more natural than storing it in a SQL database. That would solve your biggest problem -- performance -- at the cost of "just" writing the appropriate SQL code.

Seems very straightforward to me.

Rubens Farias
+5  A: 

Databases are very scalable, if you are going to have massive amounts of data. In MS SQL we currently group/sum/filter about 30GB of data in 4 minutes (somewhere around 17 million records I think).

If the data is not going to grow very much, then I'd try out approach #2. You can make a simple test application that creates a 200-400 MB object with random data and test the performance of transferring it before deciding whether you want to go that route.
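
Something along these lines would do as a first check, using in-memory serialization as a rough stand-in for the actual RMI transfer:

    import java.io.ByteArrayInputStream;
    import java.io.ByteArrayOutputStream;
    import java.io.ObjectInputStream;
    import java.io.ObjectOutputStream;
    import java.util.Random;

    public class SerializationTest {
        public static void main(String[] args) throws Exception {
            // Roughly 200 MB of random doubles standing in for the real root object.
            // Run with a generous heap, e.g. -Xmx2g.
            double[] payload = new double[25000000];
            Random rnd = new Random();
            for (int i = 0; i < payload.length; i++) {
                payload[i] = rnd.nextDouble();
            }

            long start = System.nanoTime();
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            ObjectOutputStream out = new ObjectOutputStream(bytes);
            out.writeObject(payload);
            out.close();
            long serializeMs = (System.nanoTime() - start) / 1000000;

            start = System.nanoTime();
            ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()));
            in.readObject();
            in.close();
            long deserializeMs = (System.nanoTime() - start) / 1000000;

            System.out.println("size: " + (bytes.size() / (1024 * 1024)) + " MB, serialize: "
                    + serializeMs + " ms, deserialize: " + deserializeMs + " ms");
        }
    }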

Ztranger
I know databases are overall the best bet and the most scalable and whatnot; if I were writing an actual application, it would be no question. I think you're right though: if #2 can be accomplished with a minimal performance hit (since it can be implemented in about 5 lines of code), that may be my best bet.
Milan Ramaiya
@Rev - not "the most scalable". Technologies like Hadoop are more scalable.
Stephen C
+1  A: 

I'd look into analysis using R. It's a statistical language with graphing capabilities. It could put you ahead, especially if that's the kind of analysis you intend to do. Why write all that code?

duffymo
That's a good idea, but not exactly feasible right now or for this project. While I've heard of R, I can't rewrite all my data analysis in a different language while learning it.
Milan Ramaiya
A: 

Ah, yes: large data structures in Java. Good luck with that, surviving "death by garbage collection" and all. What Java seems to do best is wrapping a UI around some other processing engine, although it does free developers from most memory management tasks -- for a price. If it were me, I would most likely do the heavy crunching in Perl (having had to recode several chunks of a batch system in Perl instead of Java in a past job for performance reasons), then spit the results back to your existing graphing code.

However, given your suggested choices, you probably want to go with the SQL DB route. Just make sure that it really is faster for a few sample queries, and watch the query-plan data and all that (assuming your system will log or interactively show such details).

Edit (to Jim Ferrans), re: Java big-N faster than Perl (comment below): the benchmarks you referenced are primarily little "arithmetic" loops, rather than something that does a few hundred MB of I/O and stores it in a Map / %hash / Dictionary / associative array for later revisiting. Java I/O might have gotten better, but I suspect all the abstraction still makes it comparatively slow, and I know the GC is a killer. I haven't checked this lately; I don't process multi-GB data files on a daily basis at my current job like I used to.

Feeding the trolls (12/21): I measured Perl to be faster than Java for doing a bunch of sequential string processing. In fact, depending on which machine I used, Perl was between 3 and 25 times faster than Java for this kind of work (batch + string). Of course, the particular thrash-test I put together did not involve any numeric work, at which I suspect Java would have done a bit better, nor did it involve caching a lot of data in a Map/hash, at which I suspect Perl would have done a bit better. Note that Java did much better at using large numbers of threads, though.

Roboprog
Huh?? Perl is 30-100x *slower* than Java, see http://www.coderanch.com/t/201887/Performance/java/Java-vs-Perl-Speed or http://shootout.alioth.debian.org/u32/perl.php.
Jim Ferrans
There are plenty of IO sins to commit in Java, but simply not doing it wrong can help a lot: http://java.sun.com/developer/technicalArticles/Programming/PerfTuning/
Carl
-1 - Skimmed your blog, and it's mostly opinion (no verifiable facts) with a lot of factual inaccuracies. For instance, no modern JVM uses a mark-and-sweep garbage collector. I suspect that a lot of your "poor results" with Java were actually caused by doing things the wrong way. But of course, there's no way of knowing without concrete examples.
Stephen C
Roboprog
This may have been true 8 years ago but Java has moved on even if your opinions of it haven't.
Jared
+1, simply to negate the drive-by -1's from all the Java programmers...
MagicAndi
A: 

I would recommend running a profiler to see what part of the loading process is taking the most time and if there's a possible quick win optimization. You can download an evaluation license of JProfiler or YourKit.

Jason Gritman
+2  A: 

Before you make a decision, it's probably worth understanding what is going on with your JVM as well as your physical system resources.

There are several factors that could be at play here:

  • jvm heap size
  • garbage collection algorithms
  • how much physical memory you have
  • how you load the data - is it from a file that is fragmented all over the disk?
  • do you even need to load all of the data at once - can it be done in batches
  • if you are doing it in batches you can vary the batch size and see what happens
  • if your system has multiple cores perhaps you could look at using more than one thread at a time to process/load data (see the sketch below)
  • if using multiple cores already and disk I/O is the bottleneck, perhaps you could try loading from different disks at the same time

You should also look at http://java.sun.com/javase/technologies/hotspot/vmoptions.jsp if you aren't familiar with the settings for the VM.
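
For the multi-core point, here's a minimal sketch of farming the per-file parsing out to a fixed-size thread pool; the data directory and parseFile are placeholders for your real loading code:

    import java.io.File;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.Callable;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    public class ParallelLoader {
        public static void main(String[] args) throws Exception {
            // Placeholder: a directory full of raw movement files.
            File[] files = new File("data").listFiles();
            if (files == null) {
                System.err.println("no data directory found");
                return;
            }

            int threads = Runtime.getRuntime().availableProcessors();
            ExecutorService pool = Executors.newFixedThreadPool(threads);

            // Submit one parse task per file.
            List<Future<List<double[]>>> pending = new ArrayList<Future<List<double[]>>>();
            for (final File f : files) {
                pending.add(pool.submit(new Callable<List<double[]>>() {
                    public List<double[]> call() {
                        return parseFile(f);
                    }
                }));
            }

            // Collect the results as they finish.
            List<double[]> allPoints = new ArrayList<double[]>();
            for (Future<List<double[]>> future : pending) {
                allPoints.addAll(future.get());
            }
            pool.shutdown();

            System.out.println("Loaded " + allPoints.size() + " points using " + threads + " threads");
        }

        // Placeholder for the real per-file parsing code.
        private static List<double[]> parseFile(File file) {
            return new ArrayList<double[]>();
        }
    }

This only pays off if parsing is CPU-bound; if the disk turns out to be the bottleneck, more threads won't help, which is where the multiple-disks point above comes in.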

anger
A: 

A profiler (for example, YourKit: http://www.yourkit.com) will give you an immediate answer.

Serge
An immediate answer for what? Did you even read the question?
Milan Ramaiya
An answer to what the real performance bottleneck is, what should be optimized, and why your loading algorithm is so slow. Most probably, you'll get some other ideas as well.
Serge