[BACKGROUND] We are currently trying to solve a performance problem: searching for data and presenting it in a paginated way takes about 2-3 minutes.

Upon further investigation (and after several rounds of SQL tuning), it seems that searching is slow simply because of the sheer amount of data.

A possible solution I'm currently investigating is to replicate the data into a searchable cache. This cache could live in the database (i.e., a materialized view) or outside it (a NoSQL approach). However, since I would like the cache to be horizontally scalable, I am leaning towards caching outside the database.

I've created a proof of concept, and indeed, searching my cache is faster than searching the db. However, the initial full replication takes a long time to complete. Although the full replication happens only once, and subsequent replications are incremental against only the rows changed since the last run, it would still be great if I could speed up the initial full replication.

However, during full replication, aside from the slow query execution, I also have to battle network latency. In fact, I can live with the slow query execution time, but the network latency is really slowing the replication down.

[ACTUAL QUESTION] Which leads me to my question: how can I speed up my replication? Should I spawn several threads, each one running its own query? Should I use a scrollable result set? .....or?
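For reference, the thread-per-query idea can be sketched like this: split the table's id range into disjoint chunks and fetch them in parallel so the network latency of the round trips overlaps. Everything here is an assumption for illustration; `fetchRange` stands in for a real `SELECT ... WHERE id BETWEEN ? AND ?` query over JDBC.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelReplicator {

    // Simulated query: a real implementation would run
    // "SELECT * FROM YOUR_TABLE WHERE id >= from AND id < to" here.
    static List<Long> fetchRange(long from, long to) {
        List<Long> rows = new ArrayList<>();
        for (long id = from; id < to; id++) rows.add(id);
        return rows;
    }

    // Split totalRows into `workers` disjoint chunks and fetch them concurrently.
    public static List<Long> replicate(long totalRows, int workers) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        long chunk = (totalRows + workers - 1) / workers;
        List<Future<List<Long>>> futures = new ArrayList<>();
        for (int w = 0; w < workers; w++) {
            long from = w * chunk;
            long to = Math.min(totalRows, from + chunk);
            futures.add(pool.submit(() -> fetchRange(from, to)));
        }
        List<Long> all = new ArrayList<>();
        for (Future<List<Long>> f : futures) all.addAll(f.get());
        pool.shutdown();
        return all;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(replicate(10, 3).size()); // prints 10
    }
}
```

Whether this actually helps depends on where the bottleneck is: parallel connections hide per-round-trip latency, but they won't help if the database's disk or CPU is already saturated.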

A: 
  1. SELECT * FROM YOUR_TABLE
  2. Map results into an object or data structure
  3. Assign a unique key for each object or data structure
  4. Load the key and object or data structure into a WeakHashMap to act as your cache.
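A minimal sketch of those four steps in Java. The `Record` type and the in-memory rows are made up for illustration; a real load would iterate a JDBC `ResultSet` from step 1. One caveat worth knowing: `WeakHashMap` entries can be garbage-collected once nothing else strongly references the key, so a plain `HashMap` may be safer if the cache must stay fully populated.

```java
import java.util.Map;
import java.util.WeakHashMap;

public class CacheLoader {
    // Hypothetical mapped row object (step 2).
    static class Record {
        final long id;
        final String name;
        Record(long id, String name) { this.id = id; this.name = name; }
    }

    // Steps 2-4: map each raw row to a Record, key it by id, load into a WeakHashMap.
    static Map<Long, Record> load(Object[][] rows) {
        Map<Long, Record> cache = new WeakHashMap<>();
        for (Object[] row : rows) {
            Long key = (Long) row[0];                          // step 3: unique key
            cache.put(key, new Record(key, (String) row[1]));  // step 4: load into cache
        }
        return cache;
    }

    public static void main(String[] args) {
        // Step 1 would be SELECT * FROM YOUR_TABLE; simulated here with two rows.
        Object[][] rows = { { 1L, "alice" }, { 2L, "bob" } };
        Map<Long, Record> cache = load(rows);
        System.out.println(cache.get(1L).name); // prints "alice"
    }
}
```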

I don't see why you need sorting, because your cache should access values by unique key in O(1) time. What is sorting buying you?

Be sure to think about thread safety.
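One way to handle the thread-safety point, assuming multiple threads read and write the cache concurrently, is to back it with a `ConcurrentHashMap` (note that `WeakHashMap` itself is not thread-safe):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class SafeCache {
    // ConcurrentHashMap gives lock-free reads and safe concurrent writes.
    private final Map<Long, String> cache = new ConcurrentHashMap<>();

    public void put(Long key, String value) { cache.put(key, value); }
    public String get(Long key) { return cache.get(key); }
    public int size() { return cache.size(); }

    public static void main(String[] args) throws InterruptedException {
        SafeCache c = new SafeCache();
        // Two writer threads populating disjoint key ranges concurrently.
        Thread t1 = new Thread(() -> { for (long i = 0; i < 1000; i++) c.put(i, "v" + i); });
        Thread t2 = new Thread(() -> { for (long i = 1000; i < 2000; i++) c.put(i, "v" + i); });
        t1.start(); t2.start();
        t1.join(); t2.join();
        System.out.println(c.size()); // prints 2000
    }
}
```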

I'm assuming that this is a read-only cache, and you're doing this to avoid the constant network latency. I'm also assuming that you'll do this once on start up.

How much data per record? 12M records at 1KB per record means you'll need 12GB of RAM just to hold your cache.

duffymo
Aren't 12M records not that many for a DBMS? I mean, with indexing and other tricks...
hvgotcodes
Franz See
Actually, the problem is when you join it with other tables and do sorting for pagination. Our DBA has already optimized everything that could be optimized, but the amount of data to be sorted (for pagination) is just too big that it still takes 2 to 3 minutes per query.
Franz See
So are you then going to replicate the other tables in your cache too? And the logic for joining and sorting? Sounds like you're on a slippery slope to implementing your own DBMS in Java...
David Gelhar
I already have the other data. And I'm going to index it using Lucene (because I need search functionality).
Franz See
Also, I don't have to store everything in memory. Just as the database doesn't keep everything in memory, the store I'm replicating the data to doesn't keep everything in memory either.
Franz See
A: 

Replicating the data in a cache seems like replicating the functionality of the database.

From reading the other comments, I see that you are not doing this to avoid network round trips, but because of costly joins. In many DBMSs you can create a temporary table, like this:

CREATE TEMPORARY TABLE abTable AS SELECT * FROM a, b;

If a and b are large (relatively permanent) tables, then you will pay a one-time cost of 2-3 minutes to create the temporary table. However, if you use abTable for many queries, then the subsequent per-query cost will be much smaller than

SELECT name, city, ... FROM a, b;

Other database systems have a view concept, which lets you do something like this:

CREATE VIEW abView AS SELECT * FROM a, b;

Changes in the underlying a and b tables will be reflected in abView.

If you really are concerned about network round trips, then you may be able to replicate parts of the database on the local computer.

A good database management system should be able to handle your data needs. So why reinvent the wheel?

emory
Pardon the confusion again. I'm not reinventing a caching solution or a search solution. I just need to read the data (fast enough) from the database, store it in the cache I'm using, and index it with my search solution. Also, although I could do the caching in the database, it would be preferable for the cache to be horizontally scalable (which is why I'm trying to avoid the RDBMS for caching).
Franz See
Also, if I'm not mistaken, a VIEW (unlike a materialized VIEW) is just a shortcut for a query, which means the query behind the view is still executed every time. Of course, it may be faster due to in-memory caching and fewer disk hits, but I don't think we can rely on that for consistently fast queries.
Franz See