views: 221

answers: 9

I have a database containing a single huge table. At the moment a query can take anything from 10 to 20 minutes and I need that to go down to 10 seconds. I have spent months trying different products like GridSQL. GridSQL works fine, but is using its own parser which does not have all the needed features. I have also optimized my database in various ways without getting the speedup I need.

I have a theory on how one could scale out queries, meaning that I would utilize several nodes to run a single query in parallel. A precondition is that the data is partitioned (horizontally), with one partition placed on each node. The idea is to take an incoming SQL query and simply run it exactly as it is on all the nodes. When the results are returned to a coordinator node, the same query is run on the union of the result sets. I realize that an aggregate function like average needs to be rewritten into a count and a sum for the nodes, and that the coordinator then divides the sum of the sums by the sum of the counts to get the average.
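
As a rough sketch of that rewrite (the table and column names here are hypothetical, and the table holding the per-node results at the coordinator is my own assumption about how the results might be collected):

    -- Original query, as the client writes it:
    SELECT region, AVG(amount) AS avg_amount
    FROM facts
    GROUP BY region;

    -- Rewritten query sent unchanged to every node:
    SELECT region, SUM(amount) AS sum_amount, COUNT(amount) AS cnt_amount
    FROM facts
    GROUP BY region;

    -- Coordinator query over the UNION ALL of the per-node result sets,
    -- gathered here into a table called node_results:
    SELECT region, SUM(sum_amount) / SUM(cnt_amount) AS avg_amount
    FROM node_results
    GROUP BY region;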

What kinds of problems could not easily be solved using this model? I believe one issue would be COUNT(DISTINCT ...).
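
To illustrate the COUNT DISTINCT issue with a hypothetical customer_id column: per-node distinct counts cannot simply be added at the coordinator, because the same value may occur on several nodes, so the nodes would have to return the distinct values themselves:

    -- This per-node result cannot be summed at the coordinator:
    SELECT COUNT(DISTINCT customer_id) FROM facts;

    -- A workable (but potentially data-heavy) rewrite: each node returns its distinct values ...
    SELECT DISTINCT customer_id FROM facts;

    -- ... and the coordinator de-duplicates again over the combined result set:
    SELECT COUNT(DISTINCT customer_id) FROM node_results;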

Edit: I am getting so many nice suggestions, but none have addressed the method.

+5  A: 

One huge table - can this be normalised at all?

If you are doing mostly select queries, have you considered either normalising to a data warehouse that you then query, or running analysis services and a cube to do your pre-processing for you?

From your question, what you are doing sounds like the sort of thing a cube is optimised for, and could be done without you having to write all the plumbing.
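
As a rough illustration of the kind of pre-aggregation a cube gives you (the names are hypothetical, and CUBE/GROUPING SETS syntax varies by database and is not available everywhere):

    -- Pre-compute aggregates for every combination of two dimensions,
    -- so later queries read the small summary instead of the raw table.
    CREATE TABLE sales_cube AS
    SELECT region,
           product_id,
           SUM(amount) AS sum_amount,
           COUNT(*)    AS row_count
    FROM facts
    GROUP BY CUBE (region, product_id);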

Paddy
Think of the current database structure as perfect for my purpose. I am asking about whether the method would work.
David
Not sure, to be honest; it seems like a fairly large question for quite a small description. Without really knowing your requirements, it does sound like some kind of OLAP solution may help to speed up your queries (as it does a lot of this aggregate work prior to being queried), without re-inventing the wheel.
Paddy
+1  A: 

What kind of indexing are you using?

HLGEM
is this an answer?
Tim Drisdelle
+1 for raising the issue of indexing but this should be a comment not an answer.
Andi
I am not asking about optimizing performance, but about a method for scaling out.
David
Indexes can help scalability, too ...
meriton
+4  A: 

By trying a custom solution (grid) you introduce a lot of complexity. Maybe it's your only option, but did you first try partitioning the table (the native solution)?
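
For example, a minimal sketch of declarative range partitioning (PostgreSQL-style syntax assumed; other engines have their own partitioning DDL, and the table and column names are illustrative):

    CREATE TABLE facts (
        event_time  timestamp NOT NULL,
        customer_id bigint,
        amount      numeric
    ) PARTITION BY RANGE (event_time);

    CREATE TABLE facts_2010_q1 PARTITION OF facts
        FOR VALUES FROM ('2010-01-01') TO ('2010-04-01');
    CREATE TABLE facts_2010_q2 PARTITION OF facts
        FOR VALUES FROM ('2010-04-01') TO ('2010-07-01');

    -- A query with a matching WHERE clause only scans the relevant partition:
    SELECT customer_id, SUM(amount)
    FROM facts
    WHERE event_time >= DATE '2010-04-01' AND event_time < DATE '2010-05-01'
    GROUP BY customer_id;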

grigory
Yes, I have looked into it; it won't help in my case.
David
It's just that partitioning looks compatible with what you are trying to do...
grigory
+1  A: 

My guess (based on nothing but my gut) is that any gains you might see from parallelization will be eaten up by re-aggregation and subsequent queries on the results. Further, I would think that writing might get more complicated with PK/FK constraints. If this were my world, I would probably create many indexed views on top of my table (and other views), each optimized for the particular queries I need to execute (an approach I have used successfully on 10 million+ row tables).
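
A minimal sketch of such an indexed view in SQL Server syntax (other databases use materialized views instead; the table and column names are hypothetical, and amount is assumed NOT NULL because indexed views disallow SUM over nullable expressions):

    CREATE VIEW dbo.v_region_totals
    WITH SCHEMABINDING
    AS
    SELECT region,
           product_id,
           SUM(amount)  AS sum_amount,
           COUNT_BIG(*) AS row_count   -- COUNT_BIG(*) is required when the view has GROUP BY
    FROM dbo.facts
    GROUP BY region, product_id;
    GO

    -- The unique clustered index is what materializes the view:
    CREATE UNIQUE CLUSTERED INDEX ix_v_region_totals
        ON dbo.v_region_totals (region, product_id);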

Jacob G
I have seen near linear performance increase for each added node using GridSQL.
David
+5  A: 

It's a data volume problem, not necessarily an architecture problem.

Whether on 1 machine or 1000 machines, if you end up summarizing 1,000,000 rows, you're going to have problems.

Rather than normalizing your data, you need to de-normalize it.

You mention in a comment that your database is "perfect for your purpose" when, obviously, it's not. It's too slow.

So, something has to give. Your perfect model isn't working, as you need to process too much data in too short a time. It sounds like you need some higher-level data sets than your raw data. Perhaps a data warehousing solution. Who knows; there's not enough information to really say.

But there are a lot of things you can do to satisfy a specific subset of queries with a good response time, while still allowing ad hoc queries that respond in "10-20 minutes".

Edit regarding comment:

I am not familiar with "GridSQL", or what it does.

If you send several identical SQL queries to individual "shard" databases, each containing a subset of the data, then the simple selection query will scale to the network (i.e. you will eventually become network bound at the controller), as this is a truly parallel, stateless process.

The problem becomes, as you mentioned, the secondary processing, notably sorting and aggregates, as this can only be done on the final, "raw" result set.

That means that your controller ends up, inevitably, becoming your bottleneck and, in the end, regardless of how "scaled out" you are, you still have to contend with a data volume issue. If you send your query out to 1,000 nodes and inevitably have to summarize or sort the 1,000-row result set from each node, resulting in 1M rows, you still have a long result time and a large data processing demand on a single machine.

I don't know what database you are using, and I don't know the specifics of individual databases, but you can see how, if you actually partition your data across several disk spindles and have a decent, modern, multi-core processor, the database implementation itself can handle much of this scaling for you in terms of parallel disk spindle requests. Which implementations actually DO this, I can't say; I'm just suggesting that it's possible for them to (and some may well do it).

But, my general point, is if you are running, specifically, aggregates, then you are likely processing too much data if you're hitting the raw sources each time. If you analyze your queries, you may well be able to "pre-summarize" your data at various levels of granularity to help avoid the data saturation problem.

For example, if you are storing individual web hits, but are more interested in activity based on each hour of the day (rather than the subsecond data you may be logging), summarizing to the hour of the day alone can reduce your data demand dramatically.
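
A rough sketch of that kind of roll-up (the table and column names are hypothetical; date_trunc is PostgreSQL syntax and other engines have equivalents):

    -- Collapse per-hit rows into one row per hour and page:
    CREATE TABLE hits_hourly AS
    SELECT date_trunc('hour', hit_time) AS hit_hour,
           page_id,
           COUNT(*) AS hit_count
    FROM hits_raw
    GROUP BY date_trunc('hour', hit_time), page_id;

    -- Reports then query the much smaller summary table:
    SELECT hit_hour, SUM(hit_count) AS hits
    FROM hits_hourly
    GROUP BY hit_hour;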

So, scaling out can certainly help, but it may well not be the only solution to the problem, rather it would be a component. Data warehousing is designed to address these kinds of problems, but does not work well with "ad hoc" queries. Rather you need to have a reasonable idea of what kinds of queries you want to support and design it accordingly.

Will Hartung
My data is denormalized, which is why I only have one single table. I did not want to get into the denormalizing/normalizing discussion, so I deliberately did not provide any such details. The single table is the fact table in a star schema, containing only numbers and IDs. All joins to replace IDs with text are made after the final result set is finished; the performance of this part is excellent. I have seen a near-linear performance increase for each added node using GridSQL, so 1000 machines do help.
David
Mostly I only get about 20 rows from each node, so it would be 20*1000, which is not that bad, but you do have a valid point when we are talking about 1000 nodes. My problem is that I am actually not writing the SQL queries myself; they are generated by a graphing tool. In addition, it will not be me using the graphing tool, but rather my customers, and I basically have no idea what sort of queries they will end up running. Creating cubes would limit them and reduce the value of my product.
David
+1  A: 

David,

Are you using all of the features of GridSQL? You can also use constraint exclusion partitioning, effectively breaking out your big table into several smaller tables. Depending on your WHERE clause, when the query is processed it may look at a lot less data and return results much faster.
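
For reference, a rough sketch of inheritance-based constraint exclusion partitioning as done in classic PostgreSQL (which, as far as I know, is what GridSQL builds on); the table and column names are illustrative, and facts is assumed to be a plain parent table here:

    -- Child tables carry CHECK constraints describing their slice of the data:
    CREATE TABLE facts_2010_01 (
        CHECK (event_time >= DATE '2010-01-01' AND event_time < DATE '2010-02-01')
    ) INHERITS (facts);

    CREATE TABLE facts_2010_02 (
        CHECK (event_time >= DATE '2010-02-01' AND event_time < DATE '2010-03-01')
    ) INHERITS (facts);

    -- With constraint_exclusion enabled, the planner skips child tables whose
    -- CHECK constraint contradicts the WHERE clause:
    SET constraint_exclusion = on;
    SELECT customer_id, SUM(amount)
    FROM facts
    WHERE event_time >= DATE '2010-02-01' AND event_time < DATE '2010-03-01'
    GROUP BY customer_id;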

Also, are you using multiple logical nodes per physical server? Configuring it that way can take advantage of otherwise idle cores.

If you monitor the servers during execution, is the bottleneck IO or CPU?

Also alluded to here is that you may want to roll up rows in your fact table into summary tables/cubes. I do not know enough about Tableau: will it automatically use the appropriate cube and drill down only when necessary? If so, it seems like you would get big gains from doing something like this.

Mason
Normally my queries are SELECT 15 different columns FROM my_one_and_only_table GROUP BY different_columns_from_time_to_time. My WHERE clauses mostly do not exclude a lot of rows, so I don't see how I could benefit from constraint exclusion partitioning in my case. I believe my bottleneck is CPU, but I will look into it when I am using several nodes. Tableau, unfortunately, does not have the ability to drill down into the fact table as needed. Also, it is not normally me using Tableau; it is my customers, and they will not understand that they have to connect to different cubes/tables.
David
+2  A: 

I'd seriously be looking into an OLAP solution. The trick with a cube is that, once built, it can be queried in lots of ways that you may not have considered. And as @HLGEM mentioned, have you addressed indexing?

Even at millions of rows, a good search should be logarithmic, not linear. If you have even one query which results in a scan, then your performance will be destroyed. We might need an example of your structure to see if we can help more.
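
To make that concrete with hypothetical names: an index on the filtered column lets a selective query seek instead of scanning, which is roughly logarithmic in the table size:

    -- Without an index on customer_id this is a full table scan;
    -- with it, the lookup becomes an index seek plus row fetches.
    CREATE INDEX ix_facts_customer_time ON facts (customer_id, event_time);

    SELECT event_time, amount
    FROM facts
    WHERE customer_id = 42
      AND event_time >= DATE '2010-01-01';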

I also agree fully with @Mason: have you profiled your query and investigated the query plan to see where your bottlenecks are? The fact that adding nodes improves speed makes me think your query might be CPU bound.

Spence
A: 

I think Google already solved your problem.

Do BigTable and MapReduce ring a bell?

Marcel
One cannot utilize SQL to query those engines.
David
Of course you cannot, because you don't have access to them! But you do have access to Hadoop and HBase as open source alternatives that can work with Pig (which has a certain resemblance to SQL), don't you think?
Marcel
+1  A: 

If you run the incoming query, unpartitioned, on each node, why will any node finish before a single node running the same query would finish? Am I misunderstanding your execution plan?

I think this is, in part, going to depend on the nature of the queries you're executing and, in particular, how many rows contribute to the final result set. But surely you'll need to partition the query somehow among the nodes.

Larry Lustig
I forgot to say that the data should be partitioned, with one partition on each node. Thanks for pointing that out. In my case nearly all rows contribute to the final result.
David