views:

83

answers:

3

Hello,

My application currently stores millions of Double values for a calculation. These values are only temporary; they are used by a specific algorithm that runs at the end of the calculation. Once this calculation is done, the millions of values can be discarded.

The full story is here, if you need more details.

One of the solutions that was proposed is to use an in-memory database.

So if I go with this solution, I will use this database to store my values in a table to replace my current Map<String, List<Double>>, like:

create table CALCULATION_RESULTS_XXX (
  deal_id varchar2(32),
  value   number
);

(one table per calculation, XXX is the calculation ID)

So during the calculation, I will do the following (a rough JDBC sketch follows the list):

  1. When the calculation is started, I create the CALCULATION_RESULTS_XXX table.
  2. Every time I need to add a value, I insert a record in this table.
  3. At the end of the calculation, I use the table content for my algorithm.
  4. Finally, I drop this table.
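
A minimal sketch of these four steps, assuming plain JDBC against H2 in its in-memory mode (jdbc:h2:mem:); the calculation ID 123 and the H2 column types (varchar/double, standing in for the Oracle-style varchar2/number above) are just illustrative:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;

public class CalculationResultsStore {

    public static void main(String[] args) throws Exception {
        // One in-memory database for the whole calculation (step 1).
        Connection conn = DriverManager.getConnection("jdbc:h2:mem:calc");
        Statement ddl = conn.createStatement();
        ddl.execute("create table CALCULATION_RESULTS_123 (deal_id varchar(32), value double)");

        // Insert a value every time one is produced (step 2).
        PreparedStatement insert = conn.prepareStatement(
                "insert into CALCULATION_RESULTS_123 (deal_id, value) values (?, ?)");
        insert.setString(1, "deal-1");
        insert.setDouble(2, 42.0);
        insert.executeUpdate();

        // Read the values back for the final algorithm (step 3).
        ResultSet rs = conn.createStatement().executeQuery(
                "select deal_id, value from CALCULATION_RESULTS_123");
        while (rs.next()) {
            String dealId = rs.getString(1);
            double value = rs.getDouble(2);
            // ... feed dealId / value into the algorithm ...
        }

        // Drop the table; closing the connection discards the mem: database anyway (step 4).
        ddl.execute("drop table CALCULATION_RESULTS_123");
        conn.close();
    }
}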

As explained in the other question, my calculation may currently hold several hundred MB of data in memory: a list of 30 * 1,000,000 Double values needs about 240 MB (30,000,000 values * 8 bytes per double).

The questions now:

  1. If I go with an in-memory database, will my memory consumption decrease?
  2. What specific points will I have to take care of regarding database usage (or table creation), data insertion, etc.?
  3. I think I will choose the H2 database. Do you think it's the best choice for my needs?
+1  A: 

A simple HashMap backed by Terracotta would do better and would allow you to store a collection bigger than the JVM's memory.

Embedded databases, especially the SQL-based ones, will add complexity and overhead to your code, so it isn't worth it. If you really need persistent storage with random access, try one of the NoSQL DBs, like CouchDB, Cassandra, or Neo4j.

Eugene Kuleshov
downvoter could say why...
Thorbjørn Ravn Andersen
I agree, the downvoter should say why. For only a few million entries (240MB on my 4GB desktop...), I'd be tempted to try a Map first just to see what happens. Eugene's answer is easy to implement and thus would be a good baseline test if nothing else.
Tony Ennis
@Tony Ennis I don't disagree that Terracotta might be easier, but you'll still have to find the memory somewhere. And 240M of data can also be handled by a single JVM, no need for a clustered JVM to do that.
extraneon
@extraneon with Terracotta the "memory" will simply go to the hard drive. You can get a decent SSD to make it quite fast.
Eugene Kuleshov
@Eugene Kuleshov wouldn't that be more easily achieved by using ehcache with overflowToDisk true and an appropriate expiry time? It's not strictly a cache then, of course, but should also work.
extraneon
A: 

I don't know whether it will be faster, so you'd have to try it. What I do want to recommend is to do batch inserts of an entire list when you don't immediately need that list anymore. Don't save value by value :)
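
A hedged sketch of what such a batch insert could look like with plain JDBC (the table name and the helper method are made up for illustration; the point is addBatch/executeBatch instead of one statement per value):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

class BatchInsertExample {

    // Insert one deal's whole List<Double> in a single batch instead of row by row.
    static void insertAll(Connection conn, String dealId, List<Double> values) throws SQLException {
        PreparedStatement ps = conn.prepareStatement(
                "insert into CALCULATION_RESULTS_123 (deal_id, value) values (?, ?)");
        for (Double v : values) {
            ps.setString(1, dealId);
            ps.setDouble(2, v);
            ps.addBatch();      // queue the row locally
        }
        ps.executeBatch();      // send the whole list in one round trip
        ps.close();
    }
}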

If your end algorithm can be expressed in SQL, it might also be worth your while to do that and not load all the lists back in. In any case, don't put anything like an index or constraint on the values, and preferably don't allow NULL either (if possible). Maintaining indexes and constraints costs time, and allowing NULL can also cost time or create overhead. The deal_ids can (and are) of course indexed, as they're primary keys.
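
A purely illustrative sketch of that idea (the real end algorithm isn't shown in the question; assume here it reduces to a per-deal average, and the table/column names follow the example above):

import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;

class SqlAlgorithmExample {

    // Run the aggregation inside the database instead of loading the lists back into memory.
    static void printPerDealAverages(Connection conn) throws SQLException {
        ResultSet rs = conn.createStatement().executeQuery(
                "select deal_id, avg(value) from CALCULATION_RESULTS_123 group by deal_id");
        while (rs.next()) {
            System.out.println(rs.getString(1) + " -> " + rs.getDouble(2));
        }
    }
}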

This isn't very much, but at least it's better than a single down-voted answer :)

extraneon
A: 

The problem is sufficiently simple that you really need to just give it a go and see how the (performance) results work out.

You already have an implementation that just uses simple in-memory structures. Personally, given that even the cheapest computer from Dell comes with 1GB+ of RAM, you might as well stick with that. That aside, it should be fairly simple to whack in a database or two. I'd consider Sleepycat Berkeley DB (which is now owned by Oracle...), because you don't need to use SQL and it should be quite efficient. (It does support Java.)
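
For reference, a minimal sketch of what the Berkeley DB route could look like, assuming the Java Edition (JE) API; the environment directory, database name, and key are made up, and sorted duplicates are enabled so one deal id can hold several values:

import java.io.File;
import com.sleepycat.bind.tuple.DoubleBinding;
import com.sleepycat.je.Database;
import com.sleepycat.je.DatabaseConfig;
import com.sleepycat.je.DatabaseEntry;
import com.sleepycat.je.Environment;
import com.sleepycat.je.EnvironmentConfig;

public class BerkeleyDbSketch {

    public static void main(String[] args) throws Exception {
        // Open (creating if necessary) an environment in a scratch directory.
        File dir = new File("/tmp/calc-env");
        dir.mkdirs();
        EnvironmentConfig envConfig = new EnvironmentConfig();
        envConfig.setAllowCreate(true);
        Environment env = new Environment(dir, envConfig);

        DatabaseConfig dbConfig = new DatabaseConfig();
        dbConfig.setAllowCreate(true);
        dbConfig.setSortedDuplicates(true); // several values per deal id
        Database db = env.openDatabase(null, "calc-results", dbConfig);

        // Store one value under a deal id; no SQL involved.
        DatabaseEntry key = new DatabaseEntry("deal-1".getBytes("UTF-8"));
        DatabaseEntry data = new DatabaseEntry();
        DoubleBinding.doubleToEntry(42.0, data);
        db.put(null, key, data);

        db.close();
        env.close();
    }
}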

If the results are promising, I'd then consider further investigation, but this really should only take a few days' work, at most, including the benchmarking.

Arafangion
I did a test last Friday using the `H2 database`, but it wasn't really satisfactory regarding both memory and CPU usage. As I finally resolved my original problem without using a complex solution, I will not investigate further.
romaintaz
Indeed, to quote Ockham, "Frustra fit per plura quod potest fieri per pauciora" ("It is futile to do with more what can be done with less"). :)
Arafangion