views:

133

answers:

3

I have a table with around 20 columns with mostly consisting of varchars and decimals. This table has almost 1.5M rows. But few things are common in them like column1 consists of only 100 distinct strings , column2 has almost 1000 and column3 has almost 500.

Right now, I am storing all these column values in a map with Key as first 5 columns and Data as rest of columns. My task is such, I need to initialize all these at the start of the task.

What pattern(like Flyweight, etc) or data structure should I use to minimize my Object storage?

Why I need pre-load of all data?

Assume the whole data of the table as a tree and the victims can be at any leaf, trunk or at root. So for each entry[this is coming from different place], I need to see if there is any match in the tree.

+2  A: 

Internalizing is not the best option. Garbage collecting from the PermSpace is possible but nothing the VM is optimized for.

You can implement your own CharSequence implementation that is backed by shared char[] arrays.

With a CharSequence implementation you'll be able to implement basic sharing semantics like internalized strings or more complicated ones taking substrings and other projections into account.

A custom CharSequence implementation can also be optimized to perform fewer memory allocations than the String class which is copying char[] around (for safety reasons that are not necessary if you have the backing char[] under your full control). Even new String("..").intern() will intantiate a new String instance (char[] array) that is rapidly garbage collected.

Thomas Jung
You mean http://java.sun.com/j2se/1.5.0/docs/api/java/lang/CharSequence.html but how this will help in my case
DKSRathore
If you are free to replace String with CharSequence in your context you can have a custom implementation that performs better than the general purpose String implementation.
Thomas Jung
I think that this could help, but not able to realize this. Can you provide some implementation or some good links for details?
DKSRathore
Thomas Jung
Ok. Thanks, I'll look into this
DKSRathore
Just a note on Java's String class... it will already share substrings, etc. as you mention in your response. Maybe you can elaborate on what you mean by the "perform fewer memory allocations" part. String rarely copies char[] arrays around... it's an immutable class and so doesn't have to worry about sharing them internally. It only has to copy the char[] when you try to pull it out. You might be onto something but it wasn't clear (to me) from your description.
PSpeed
String implementations won't share a the char[] except for substrings. But other transformations can be backed by the same char[] arrays: deletion, transposition, insertion, replacement, concatination.
Thomas Jung
So in those latter cases, there would be some data structures keeping track of the changes to the underlying char[] array? I suppose there would be an easy trade off calculation for string length versus edit count and where it starts making sense to just copy that char[].
PSpeed
+1  A: 

My first question would be, what does you task plan with doing with the data in the table? Preloading a complete table into memory is not always the best approach, for instance keeping your current setup but loading on demand might be a better solution. And you might want to investigate flushing data that isn't used for a while, i.e. a kind of recently used map.

Could you elaborate what your task tries to achieve with all that data cached in a map?

Is the "victim" identification part of the key or part of the object? If part of the object, how do you select the keys that select the objects that you need? In other words; it sounds like you try to reproduce functionality that the database is very good at.

If your problem is that your table contents does not map easily on a tree-like structure, you could add that information in a way that is useable through the DB interface.

rsp
RSp, I added this in the question.
DKSRathore
A: 

If your data loading process can support it then it isn't too difficult to implement something like String.intern() without the GC permgen side effects.

For any hashable data element, you can simply have a Map<T,T> to look-up preexisting instances. So for String:

Map<String,String> stringCache = new HashMap<String,String>();
...
String sharedValue = stringCache.get(loadedValue);

The process that loads the data from wherever will still be creating temporary strings but these will be rapidly GC'ed. Without knowing more about the specifics of where the data is coming from, it's difficult to comment on whether those temporary objects are necessary... though I have trouble seeing a way around it. They would be reclaimed rapidly during the load process anyway.

PSpeed