Currently migrating from SQL Server to PostgreSQL and attempting to improve a couple of key areas on the way:

I have an Articles table:

CREATE TABLE [dbo].[Articles](
    [server_ref] [int] NOT NULL,
    [article_ref] [int] NOT NULL,
    [article_title] [varchar](400) NOT NULL,
    [category_ref] [int] NOT NULL,
    [size] [bigint] NOT NULL
)
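
For reference, I am assuming the rough PostgreSQL equivalent of that definition will be something like this (the type mappings are my guess at the closest matches):

    CREATE TABLE articles (
        server_ref    integer      NOT NULL,
        article_ref   integer      NOT NULL,
        article_title varchar(400) NOT NULL,
        category_ref  integer      NOT NULL,
        size          bigint       NOT NULL
    );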

Data (comma delimited text files) is dumped on the import server by ~500 (out of ~1000) servers on a daily basis.

Importing:

  • Indexes are disabled on the Articles table.
  • For each dumped text file:
    • Data is BULK copied to a temporary table.
    • Temporary table is updated.
    • Old data for the server is dropped from the Articles table.
    • Temporary table data is copied to Articles table.
    • Temporary table dropped.

Once this process is complete for all servers the indexes are built and the new database is copied to a web server.
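
In PostgreSQL terms I expect the per-file step to look roughly like this (the file path, server_ref value and staging table name are placeholders, not my real ones):

    BEGIN;

    -- Staging table matching Articles; dropped automatically at commit.
    CREATE TEMP TABLE articles_import (LIKE articles) ON COMMIT DROP;

    -- Bulk load the comma delimited dump for one server.
    COPY articles_import FROM '/import/server_33.csv' WITH CSV;

    -- (any clean-up/updates of the staged data would happen here)

    -- Replace the old data for this server.
    DELETE FROM articles WHERE server_ref = 33;
    INSERT INTO articles SELECT * FROM articles_import;

    COMMIT;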

I am reasonably happy with this process, but there is always room for improvement as I strive for a real-time (haha!) system. Is what I am doing correct? The Articles table contains ~500 million records and is expected to grow. Searching across this table is okay but could be better, e.g. SELECT * FROM Articles WHERE server_ref=33 AND article_title LIKE '%criteria%' has been satisfactory, but I want to improve the speed of searching. Obviously the "LIKE" is my problem here. Suggestions? SELECT * FROM Articles WHERE article_title LIKE '%criteria%' is horrendous.

Partitioning is a feature of SQL Server Enterprise, but that costs $$$, so getting it for free is one of the many exciting prospects of PostgreSQL. What performance hit will be incurred by the import process (drop data, insert data) and by building indexes? Will the database grow by a huge amount?

The database currently stands at 200 GB and will grow. Copying this across the network is not ideal, but it works. I am putting thought into changing the hardware structure of the system. The thinking behind having an import server and a web server is that the import server can do the dirty work (WITHOUT indexes) while the web server (WITH indexes) presents reports. Maybe reducing the system to one server would let me skip the copy-across-the-network stage. That one server would hold two versions of the database: one with indexes for delivering reports and one without for importing new data. The databases would swap daily. Thoughts?

This is a fantastic system, and believe it or not there is some method to my madness in giving it a big shake-up.

UPDATE: I am not looking for general relational database help, but hoping to bounce ideas around with data warehouse experts.

+1  A: 

I am not a data warehousing expert, but here are a couple of pointers.

It seems like your data can be partitioned easily. See the PostgreSQL documentation on partitioning for how to split data into different physical tables. This lets you manage data at your natural per-server granularity.
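
For example, with inheritance-based partitioning a child table per server could look like this (the table name and server_ref value are purely illustrative):

    -- One child table per server; the CHECK constraint is what lets the
    -- planner exclude partitions that cannot contain the requested server.
    CREATE TABLE articles_server_33 (
        CHECK (server_ref = 33)
    ) INHERITS (articles);

    -- Make sure constraint exclusion is enabled so pruning can happen.
    SET constraint_exclusion = on;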

You can use PostgreSQL's transactional DDL to avoid some copying. The process would then look something like this for each input file:

  1. Create a new table to store the data.
  2. Use COPY to bulk load the data into the table.
  3. Create any necessary indexes and do any processing that is required.
  4. In a transaction, drop the old partition, rename the new table, and add it as a partition.

If you do it like this, you can swap out partitions on the fly if you want to. Only the last step requires locking the live table, and it is a quick DDL metadata update.
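
A minimal sketch of that last step, assuming one inheritance child per server and illustrative names (the replacement table articles_server_33_new has already been loaded, indexed and given the same CHECK constraint):

    BEGIN;
    ALTER TABLE articles_server_33 NO INHERIT articles;
    DROP TABLE articles_server_33;
    ALTER TABLE articles_server_33_new RENAME TO articles_server_33;
    ALTER TABLE articles_server_33 INHERIT articles;
    COMMIT;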

Avoid deleting and reloading data in an indexed table; that will lead to considerable table and index bloat due to the MVCC mechanism PostgreSQL uses. If you just swap out the underlying table, you get a nice compact table and indexes. If your queries have any data locality on top of the partitioning, either order your input data on that or, if that is not possible, use PostgreSQL's CLUSTER functionality to reorder the data physically.
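
For example, assuming a partition and an index on the column you want locality on already exist (names are illustrative):

    -- Physically reorder one partition by an existing index.
    CLUSTER articles_server_33 USING articles_server_33_category_idx;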

To speed up the text searches, use a GIN full text index if the constraints are acceptable (it can only search at word boundaries), or a trigram index (supplied by the pg_trgm module) if you need to search for arbitrary substrings.
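
Rough sketches of both options (index names are illustrative, pg_trgm has to be installed separately, and with inheritance partitioning the indexes have to be created on each child table):

    -- Full text search: matches whole words only; the query has to use
    -- the same expression as the index.
    CREATE INDEX articles_title_fts_idx
        ON articles USING gin (to_tsvector('english', article_title));

    SELECT * FROM articles
     WHERE to_tsvector('english', article_title) @@ to_tsquery('criteria');

    -- Trigram search: arbitrary substrings; on recent PostgreSQL versions
    -- a LIKE '%...%' query can use this index directly.
    CREATE INDEX articles_title_trgm_idx
        ON articles USING gin (article_title gin_trgm_ops);

    SELECT * FROM articles WHERE article_title LIKE '%criteria%';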

Ants Aasma
This is great. So a partition could be created for each server? Are there restrictions on the number of partitions? I remember reading that in MySQL the limit is 1024, but I cannot be sure and cannot find a figure for PostgreSQL.
youwhut
There is no hard limit on the number of partitions, but you should not go much above a hundred or so due to the way PostgreSQL partitioning works. It is a lot more generic than it has to be, allowing any kind of partitioning you can express with SQL expressions; the downside is that when optimizing a query on the master table, PostgreSQL is not able to take advantage of any structure in the partitioning expressions and has to exclude every partition separately. This can cause excessive query planning time.
Ants Aasma
If you can direct the queries to the correct partition from inside the query, having a huge number of tables won't be a big issue. A union across them will of course be slower than across one big table, or across a smaller number of partitions based on a hash function.
Ants Aasma
Okay. So my initial thought of having 500 partitions (one for each server) would be a no-go...? I see this figure growing. I suppose this has provoked me to think about the data differently. Do you have any experience with hardware setups for data warehouses and a web server for reporting? I am looking for some best-practice guidelines.
youwhut
It will work, but queries over all partitions will be somewhat slower than they would be over one huge table, and planning time for queries that hit all tables will be significantly longer. This might not be an issue; test to see what kind of results you get. As for data warehousing and reporting, the most significant departure from normal databases is precalculating commonly used aggregates. It depends heavily on what your reporting needs are.
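As a rough illustration, a precalculated aggregate can be as simple as a summary table that is rebuilt after each import (names are made up):

    CREATE TABLE article_stats AS
    SELECT server_ref, category_ref,
           count(*)  AS article_count,
           sum(size) AS total_size
      FROM articles
     GROUP BY server_ref, category_ref;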
Ants Aasma
The problem with having lots of partitions (more than 100) is that your query planner will gradually become slower: when querying the main table it has to work out which tables to select from. However, you could simply solve that by selecting the correct partition from the client if you only need data from one partition.
WoLpH