ansaurus

Question

Using Hibernate's ScrollableResults to slowly read 90 million records

Answer 1

A:

With 90 million records, it sounds like you should be batching your SELECTs. I've done with with Oracle when doing the initial load into a distrbuted cache. Looking at the MySQL documentation, the equivalent seems to be using the LIMIT clause: http://dev.mysql.com/doc/refman/5.0/en/select.html

Here's an example:

SELECT * from Person
LIMIT 200, 100

This would return rows 201 through 300 of the Person table.

You'd need to get the record count from your table first and then divide it by your batch size and work out your looping and LIMIT parameters from there.

The other benefit of this would be parallelism - you can execute multiple threads in parallel on this for faster processing.

Processing 90 million records also doesn't sound like the sweet spot for using Hibernate.

hbunny 2010-05-13 11:45:34

That doesn't work either... Try doing a select (batch or otherwise) where the offset is in the millions, it will take a very long time. I'm willing to bypass Hibernate, any suggestions for doing this without Hibernate?

at 2010-05-13 17:01:50

Try this article for a solution to the LIMIT performance: http://www.facebook.com/note.php?note_id=206034210932

hbunny 2010-05-13 21:02:17

Great article stevendick! Definitely helpful, thanks.

at 2010-05-15 10:24:43

Answer 2

+2 A:

Using setFirstResult and setMaxResults is your only option that I'm aware of.

Traditionally a scrollable resultset would only transfer rows to the client on an as required basis. Unfortunately the MySQL Connector/J actually fakes it, it executes the entire query and transports it to the client, so the driver actually has the entire result set loaded in RAM and will drip feed it to you (evidenced by your out of memory problems). You had the right idea, it's just shortcomings in the MySQL java driver.

I found no way to get around this, so went with loading large chunks using the regular setFirst/max methods. Sorry to be the bringer of bad news.

Just make sure to use a stateless session so there's no session level cache or dirty tracking etc.

EDIT:

Your UPDATE 2 is the best you're going to get unless you break out of the MySQL J/Connector. Though there's no reason you can't up the limit on the query. Provided you have enough RAM to hold the index this should be a somewhat cheap operation. I'd modify it slightly, and grab a batch at a time, and use the highest id of that batch to grab the next batch.

Note: this will only work if other_conditions use equality (no range conditions allowed) and have the last column of the index as id.

select * 
from person 
where id > <max_id_of_last_batch> and <other_conditions> 
order by id asc  
limit <batch_size>

Michael 2010-05-13 11:56:15

Using a StatelessSession is especially nice tip!

javashlook 2010-05-13 14:15:49

setFirstResult and setMaxResults is not a viable option. I was right in my guess that it would be unusably slow. Maybe that works for tiny tables, but very quickly it just takes way too long. You can test this in the MySQL console by simply running "select * from anything limit 1 offset 3000000". That might take 30 minutes...

at 2010-05-13 16:59:26

Running "select * from geoplanet_locations limit 1 offset 1900000;" against the YAHOO Geoplanet dataset (5 mil rows), returns in 1.34 seconds. If you have enough RAM to keep the index in RAM then I think your 30 minutes numbers are aways off. Funnily enough "select * from geoplanet_locations where id > 56047142 limit 10;" returns in essentially no time (regular client just returns 0.00).

Michael 2010-05-14 00:01:24

5 million rows is a big difference from 90 million..

at 2010-05-15 10:24:19

Answer 3

A:

I've used the Hibernate scroll functionality successfully before without it reading the entire result set in. Someone said that MySQL does not do true scroll cursors, but it claims to based on the JDBC dmd.supportsResultSetType(ResultSet.TYPE_SCROLL_INSENSITIVE) and searching around it seems like other people have used it. Make sure it's not caching the Person objects in the session - I've used it on SQL queries where there was no entity to cache. You can call evict at the end of the loop to be sure or test with a sql query. Also play around with setFetchSize to optimize the number of trips to the server.

Brian Deterling 2010-05-13 18:25:53

Answer 4

A:

The problem could be, that Hibernate keeps references to all objests in the session until you close the session. That has nothing to do with query caching. Maybe it would help to evict() the objects from the session, after you are done writing the object to the file. If they are no longer references by the session, the garbage collector can free the memory and you won't run out of memory anymore.

Reboot 2010-07-15 13:19:11

the problem is hibernate doesn't even return from the query until all rows are retrieved, so I couldn't even evict() anything until it's all loaded anyway.

at 2010-07-16 05:39:51

Sorry, I missed that in the question. If it is really a problem with the MySQL driver then there is probably no other options then splitting the query yourself into multiple queries, as it i was already posted. I'm using ScrollableResults with the jTDS driver for MSSQL and that helped to prevent OutOfMemoryErrors when processing large datasets from a database, so the idea itself is probably not wrong.

Reboot 2010-07-19 09:35:17

ansaurus

tags:

views:

answers:

Using Hibernate's ScrollableResults to slowly read 90 million records

related questions