Ever since Microsoft introduced the SQL Server version code-named "Madison", massively parallel processing (MPP) has come into the picture. What exactly is it, and how is SQL Server going to benefit from it?

Further, is massively parallel processing (MPP) related to parallel computing?

I read about Madison here and about parallel computing here.

Thanks in advance.

+1  A: 

It is the ability to offload work to a different computer, not just to another core on the same computer. So if you have 4 servers with 64 cores each, you can tap into 256 cores.

And IIRC you can't install this yourself; you need to buy a pre-configured system. What you basically get is a rack of computers.

SQLMenace
I disagree with "It is the ability to offload work to a different computer, not just to another core on the same computer" - massively parallel simply means lots of cores.
Mitch Wheat
But in that case SQL Server 2005 and 2008 already have this too... probably more a marketing gimmick than anything else.
SQLMenace
The link I mentioned in the question says the following: "In a traditional, symmetric multi-processing (SMP) architecture, query processing occurs entirely within one physical instance of a database. CPU, memory and storage impose physical limits upon speed and scale. A Madison MPP data warehouse appliance partitions large tables across multiple physical nodes, each node having dedicated CPU, memory and storage, and each running its own instance of SQL Server, in a patented parallel design known as Ultra Shared Nothing™. All components are balanced against each other, thus reducing bottlenecks."
HotTester
Madison works quite differently to base SQL Server in how it shards the data for both storage and queries - it was the result of the purchase of DATAllegro. Having had the briefings, I wouldn't class it as a gimmick - but I wouldn't class it as something 99.9% of SQL Server databases would ever need to consider.
Andrew
+1  A: 

The wiki entry defines massively parallel computing as:

Massive parallel processing (MPP) is a term used in computer architecture to refer to a computer system with many independent arithmetic units or entire microprocessors, that run in parallel. The term massive connotes hundreds if not thousands of such units. Early examples of such a system are the Distributed Array Processor, the Goodyear MPP, the Connection Machine, and the Ultracomputer.

SQL Server will benefit in the same way it does already, by performing certain query steps in parallel. BUT only a relatively small class of algorithms can take advantage of massively parallel computing; speed-up does not increase linearly with the addition of more cores. A good example of where it can be used is where tables are partitioned into separately searchable silos, for example partitioning on date range.
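To make the silo idea concrete, here is a minimal Python sketch of scanning date-partitioned silos in parallel and combining the results. The partition names, the in-memory rows, and the predicate are all illustrative assumptions, not anything from SQL Server itself:

```python
# Sketch: each "silo" holds rows for one date range and can be
# scanned independently, so scans can run in parallel.
from concurrent.futures import ThreadPoolExecutor

# Hypothetical partitions: date-range name -> list of (date, value) rows.
silos = {
    "2009_q1": [("2009-01-15", 10), ("2009-02-20", 25)],
    "2009_q2": [("2009-04-02", 40), ("2009-05-11", 5)],
}

def search_silo(rows, predicate):
    # Scan one partition on its own; no coordination with other silos.
    return [r for r in rows if predicate(r)]

def parallel_search(predicate):
    # Fan the scan out over all partitions, then combine the partial results.
    with ThreadPoolExecutor() as pool:
        parts = pool.map(lambda rows: search_silo(rows, predicate), silos.values())
    return [row for part in parts for row in part]

result = parallel_search(lambda r: r[1] > 20)
```

The key property is that the combining step is trivial (a concatenation), which is why partitioned scans parallelize well while many other algorithms do not.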

Mitch Wheat
+1  A: 

This is basically the strategy that Teradata has used. You have dedicated server processing, memory and storage, and the data is partitioned across the processing units. Each unit needs its own redundancy built in, since the data is not stored anywhere else - if you lost an AMP, you would lose its data.

In Teradata, the magic which enables the partitioning is the PRIMARY INDEX. This determines which AMP the data lives on. The query is distributed to all the AMPs and they return the data which is then combined. Performance suffers when there is skew and data needs to be redistributed from the AMP where it lives to the AMP which needs it for processing.

So the inter-process communication system, the query processor and the hash system are the key components to this kind of system.

In many cases, the massively parallel approach works well when data shares very similar primary indexes (millions of customers, millions of customer invoices, millions of customer click-stream events). This is great for a large class of problems, because things are often partitioned by customer, or by date or something similar.

It fails when you deal with things like Kimball-style star schemas or attempting to navigate a very complex 3NF model in a single query. In these cases, you are better off building intermediate temporary or volatile tables and specifying the primary index to get the data distributed well over the AMPs and matching whatever it is you are going to join on in the next join. Or remodeling your warehouse.
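The redistribution step described above can be sketched as follows: re-bucket both tables on the join key so matching keys end up on the same AMP, after which each AMP can join its local buckets with no cross-AMP traffic. The table contents and the stand-in hash are assumptions for illustration:

```python
# Sketch: redistribute rows on the join key so the join becomes AMP-local.
N_AMPS = 4

def amp_for(key):
    return hash(key) % N_AMPS  # stand-in for the real distribution hash

def redistribute(rows, key_fn):
    # Re-bucket rows so matching join keys land on the same AMP.
    buckets = {n: [] for n in range(N_AMPS)}
    for row in rows:
        buckets[amp_for(key_fn(row))].append(row)
    return buckets

orders = [("c1", 100), ("c2", 50), ("c1", 75)]    # (customer, amount)
invoices = [("c1", "inv9"), ("c2", "inv3")]       # (customer, invoice id)

o_buckets = redistribute(orders, lambda r: r[0])
i_buckets = redistribute(invoices, lambda r: r[0])

# Each AMP joins only its own buckets - no data crosses AMPs here.
joined = []
for amp in range(N_AMPS):
    for cust, amount in o_buckets[amp]:
        for c2, inv in i_buckets[amp]:
            if cust == c2:
                joined.append((cust, amount, inv))
```

When the tables already share a distribution key, this shuffle is unnecessary; the cost of the redistribute step is exactly what the intermediate-table trick above tries to pay once instead of repeatedly.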

In MPP systems, adding capacity involves adding memory, storage and processing all at the same time, which gives fairly good scalability.

Cade Roux
Interesting observation. So is all that Microsoft is offering in Madison just a fluke?
HotTester
@HotTester - I'm not sure what you mean about a fluke, it's a fairly typical approach to a distributed architecture for RDBMS. It's not a panacea - but then at those sizes, nothing is.
Cade Roux