views:

249

answers:

3

Heya,

I'm doing to preface this with the fact I'm a relative Java/Scala newbie so I wouldn't rule out that there is something obvious I'm not doing.

I've got a Scala application which connects via Hibernate to a MySQL database. The Application is designed to process a large amount of data, about 2,750,000 records so I've tried to optimise it as much as possible.

It's running on my workstation which is a QuadCore Intel Xeon with 6Gb of RAM (at 1033Mhz) and it runs nicely and speedily for the first 70k records, completing them in about 15 minutes. By the time, it's got to 90k, it's taken about 25 minutes so something is making it slow to a crawl.

I've checked the timers on the Hibernate code and the database retrieval is taking the same about of time as usual. I've even tried forcing manual Garbage Collection to try and do that but that isn't working either.

The code in question looks something like:

val recordCount = repo.recordCount
val batchSize = 100
val batches = (0 to recordCount by batchSize).toList
val batchJobs = {
    for (batchStart <- batches) yield {
        future(new RecordFormatter().formatRecords(new Repo(sessionFactory.openSession),batchStart,batchSize)
    }
awaitAll(100000,batchJobs: *_)

Inside the RecordFormatter (which isn't actually named that in case you wonder at my naming scheme madness), it does a query for the next 100 records then another query to pull back the actual records (using between on the start and end values) then writes them out to a text file as CSV. Looking at the timer output, each operation within the record formatter takes about 5 seconds to pull back the records and then 0.1 of a second to output it to file.

Despite this once it has slowed down, it is only processing about 12 batches of 100 records per minute as opposed to 40 batches of 100 records per minute when the process first starts.

It's flushing the Session at regular intervals and closing it at the end of each RecordFormatter run (each RecordFormatter has its own session).

I'm mostly looking for any known gotchas with Scala and the Futures. I have noticed that when its slowing down, it doesn't seem to be using all eight possible threads which could certainly explain the drop in speed but it's a mystery to me why it would suddenly stop and always around the 75k record mark.

Thanks!

EDIT: Updated code to show it uses yield and awaitAll in case that makes a difference.

+2  A: 

It looks like a memory problem. I'd get a dump on memory usage, see how it is behaving. If the gc times are increasing too much, you have your culprit. You could, then, just increase the memory available to the JVM to get it going again.

Anyway, don't convert batches into a List. It is unnecessary. It would be necessary if you were using for/yield (on Scala 2.7), but since you are not yielding anything, then Range is a better choice.

Daniel
Sorry. I should have included that it does do a yield because it keeps a list of the Futures then does awaitAll() to wait for them to be done before moving onto the next section.Memory usage could be the issue but I'm not sure why it wouldn't be releasing the memory as I can't spot anything leaky. I'm allocating 800M as it currently stands.
Wysawyg
@Wysawyg - have you used the `jconsole` (`$JAVA_HOME/bin/jconsole`) to attach to the application? This is very good for telling you a few things: 1. Is the app spending all of its time doing GC? 2. What are my threads doing?
oxbow_lakes
+2  A: 

The jconsole application, which comes bundled with the JDK (in $JAVA_HOME/bin/jconsole) can be used to attach to the application as it runs. This is very good for telling you a few things:

  1. Is the app spending all of its time doing GC?
  2. What are the application threads doing?

Could you post the results here?

oxbow_lakes
Hey, thanks for the suggestion. I'm running jconsole but nothing is standing out as bad. GC has so far spend 2 minutes 30 seconds out of 1 hour 8 minutes runtime. I can't see anything off about what the threads are doing either: I suppose I need to profile it when it's speeding along and then again once it's started crawling and play spot the difference. Thanks for the advice.
Wysawyg
The normal thing to spot if the issue is GC, is that the GC number starts to go up more or less in realtie. i.e. it might be 2 mins out of 60 now, and 4 out of 62 in a few minutes' time. This means the last 2 mins was spent entirely in GC
oxbow_lakes
As an example, I have an app of mine that has spent less than 3 seconds in GC out of 24 hours! Another one (with >20k actors) takes over 10% of its time in GC!
oxbow_lakes
One thing I have noticed is that the thread responsible for the thread managing is spending most of its time in waiting, more so than at the start.All the other 8 threads are marked as Runnable and have about 1k time in blocked and none in waiting so it seems like those threads are ready to do the work but somehow the work isn't being assigned to them.Does that sound at all plausible?I'm trying it now with separating the full record count into 8 batches and firing off an actor for each and they run each 250 batch. That way I can see if it is anything with my use of actors or my other code.
Wysawyg
OK - so this seems like a difficult one. As I said, I have an app running over 20k actors processing market data across all regions in real time. This is probably not a bug with actors. If I were you, what I'd start looking for are things like comparisons (possibly hash-collisions?) which are being made by Hibernate, or some operation which is walking over all the results on every batch invocation
oxbow_lakes
Thanks for all your help. I'd upvote your answer if I had enough reputation. Hopefully I can get this solved before I lose the shreds of my sanity.
Wysawyg
+2  A: 

Try bounding the maximum number of threads that the actor library will create (futures are backed by actors). The actor threads are EXTREMELY heavyweight, and under certain conditions the scheduler will create them like there's no tomorrow. This uses up a ton of heap space and can make your program spend huge amounts of time performing garbage collection.

This can be done by setting the actors.maxPoolSize parameter at the command line...which would be something like: -Dactors.maxPoolSize=32 or whatever the max number of threads you want is.

I also highly recommend running your program -Xprof to see how much time the GC is consuming.

Erik Engbrecht