Currently migrating from SQL Server to PostgreSQL and attempting to improve a couple of key areas on the way:

I have an Articles table:

CREATE TABLE [dbo].[Articles](
    [server_ref] [int] NOT NULL,
    [article_ref] [int] NOT NULL,
    [article_title] [varchar](400) NOT NULL,
    [category_ref] [int] NOT NULL,
    [size] [bigint] NOT NULL
)
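
For reference, I am assuming the rough PostgreSQL equivalent of that definition will be something like this (the type mappings are my guess at the closest matches):

    CREATE TABLE articles (
        server_ref    integer      NOT NULL,
        article_ref   integer      NOT NULL,
        article_title varchar(400) NOT NULL,
        category_ref  integer      NOT NULL,
        size          bigint       NOT NULL
    );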

Data (comma delimited text files) is dumped on the import server by ~500 (out of ~1000) servers on a daily basis.

Importing:

  • Indexes are disabled on the Articles table.
  • For each dumped text file:
    • Data is BULK copied to a temporary table.
    • Temporary table is updated.
    • Old data for the server is dropped from the Articles table.
    • Temporary table data is copied to Articles table.
    • Temporary table dropped.

Once this process is complete for all servers the indexes are built and the new database is copied to a web server.
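
In PostgreSQL terms I expect the per-file step to look roughly like this (the file path, server_ref value and staging table name are placeholders, not my real ones):

    BEGIN;

    -- Staging table matching Articles; dropped automatically at commit.
    CREATE TEMP TABLE articles_import (LIKE articles) ON COMMIT DROP;

    -- Bulk load the comma delimited dump for one server.
    COPY articles_import FROM '/import/server_33.csv' WITH CSV;

    -- (any clean-up/updates of the staged data would happen here)

    -- Replace the old data for this server.
    DELETE FROM articles WHERE server_ref = 33;
    INSERT INTO articles SELECT * FROM articles_import;

    COMMIT;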

I am reasonably happy with this process, but there is always room for improvement as I strive for a real-time (haha!) system. Is what I am doing correct? The Articles table contains ~500 million records and is expected to grow. Searching across this table is okay but could be better, e.g. SELECT * FROM Articles WHERE server_ref=33 AND article_title LIKE '%criteria%' has been satisfactory, but I want to improve the speed of searching. Obviously the "LIKE" is my problem here. Suggestions? SELECT * FROM Articles WHERE article_title LIKE '%criteria%' is horrendous.

Partitioning is a feature of SQL Server Enterprise, but that costs $$$, so getting it for free is one of the many exciting prospects of PostgreSQL. What performance hit will be incurred by the import process (drop data, insert data) and by building indexes? Will the database grow by a huge amount?

The database currently stands at 200 GB and will grow. Copying this across the network is not ideal, but it works. I am putting thought into changing the hardware structure of the system. The thinking behind having an import server and a web server is that the import server can do the dirty work (WITHOUT indexes) while the web server (WITH indexes) presents reports. Maybe reducing the system to one server would let me skip the copy-across-the-network stage. That one server would hold two versions of the database: one with indexes for delivering reports and one without for importing new data. The databases would swap daily. Thoughts?

This is a fantastic system, and believe it or not there is some method to my madness in giving it a big shake-up.

UPDATE: I am not looking for general relational database help, but hoping to bounce ideas around with data warehouse experts.

+1  A: 

I am not a data warehousing expert, but here are a couple of pointers.

It seems like your data can be partitioned easily. See the PostgreSQL documentation on partitioning for how to split data into different physical tables. This lets you manage data at your natural per-server granularity.
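
For example, with inheritance-based partitioning a child table per server could look like this (the table name and server_ref value are purely illustrative):

    -- One child table per server; the CHECK constraint is what lets the
    -- planner exclude partitions that cannot contain the requested server.
    CREATE TABLE articles_server_33 (
        CHECK (server_ref = 33)
    ) INHERITS (articles);

    -- Make sure constraint exclusion is enabled so pruning can happen.
    SET constraint_exclusion = on;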

You can use PostgreSQL's transactional DDL to avoid some copying. The process would then look something like this for each input file:

  1. Create a new table to store the data.
  2. Use COPY to bulk load the data into the table.
  3. Create any necessary indexes and do any processing that is required.
  4. In a transaction, drop the old partition, rename the new table, and add it as a partition.

If you do it like this, you can swap out partitions on the fly if you want to. Only the last step requires locking the live table, and it is a quick DDL metadata update.
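
A minimal sketch of that last step, assuming one inheritance child per server and illustrative names (the replacement table articles_server_33_new has already been loaded, indexed and given the same CHECK constraint):

    BEGIN;
    ALTER TABLE articles_server_33 NO INHERIT articles;
    DROP TABLE articles_server_33;
    ALTER TABLE articles_server_33_new RENAME TO articles_server_33;
    ALTER TABLE articles_server_33 INHERIT articles;
    COMMIT;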

Avoid deleting and reloading data in an indexed table; that will lead to considerable table and index bloat due to the MVCC mechanism PostgreSQL uses. If you just swap out the underlying table, you get a nice compact table and indexes. If your queries have any data locality on top of the partitioning, either order your input data on that or, if that is not possible, use PostgreSQL's CLUSTER functionality to reorder the data physically.
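
For example, assuming a partition and an index on the column you want locality on already exist (names are illustrative):

    -- Physically reorder one partition by an existing index.
    CLUSTER articles_server_33 USING articles_server_33_category_idx;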

To speed up the text searches, use a GIN full text index if the constraints are acceptable (it can only search at word boundaries), or a trigram index (supplied by the pg_trgm module) if you need to search for arbitrary substrings.
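
Rough sketches of both options (index names are illustrative, pg_trgm has to be installed separately, and with inheritance partitioning the indexes have to be created on each child table):

    -- Full text search: matches whole words only; the query has to use
    -- the same expression as the index.
    CREATE INDEX articles_title_fts_idx
        ON articles USING gin (to_tsvector('english', article_title));

    SELECT * FROM articles
     WHERE to_tsvector('english', article_title) @@ to_tsquery('criteria');

    -- Trigram search: arbitrary substrings; on recent PostgreSQL versions
    -- a LIKE '%...%' query can use this index directly.
    CREATE INDEX articles_title_trgm_idx
        ON articles USING gin (article_title gin_trgm_ops);

    SELECT * FROM articles WHERE article_title LIKE '%criteria%';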

Ants Aasma
This is great. So a partition could be created for each server? Are there restrictions on the number of partitions? I remember reading that in MySQL the limit is 1024, but I cannot be sure and cannot find a figure for PostgreSQL.
youwhut
There is no hard limit on the number of partitions, but you should not go much above a hundred or so due to the way PostgreSQL partitioning works. It is a lot more generic than it has to be, allowing any kind of partitioning you can express with SQL expressions; the downside is that when optimizing a query on the master table, PostgreSQL is not able to take advantage of any structure in the partitioning expressions and has to exclude every partition separately. This can cause excessive query planning time.
Ants Aasma
If you can direct the queries to the correct partition from inside the query, having a huge number of tables won't be a big issue. A union across them will of course be slower than across one big table, or across a smaller number of partitions based on a hash function.
Ants Aasma
Okay. So my initial thought of having 500 partitions (one for each server) would be a no-go...? I see this figure growing. I suppose this has provoked me to think about the data differently. Do you have any experience with hardware setups for data warehouses and a web server for reporting? I am looking for some best-practice guidelines.
youwhut
It will work, but queries over all partitions will be somewhat slower than they would be over one huge table, and planning time for queries that hit all tables will be significantly longer. This might not be an issue; test to see what kind of results you get. As for data warehousing and reporting, the most significant departure from normal databases is precalculating commonly used aggregates. It depends heavily on what your reporting needs are.
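As a rough illustration, a precalculated aggregate can be as simple as a summary table that is rebuilt after each import (names are made up):

    CREATE TABLE article_stats AS
    SELECT server_ref, category_ref,
           count(*)  AS article_count,
           sum(size) AS total_size
      FROM articles
     GROUP BY server_ref, category_ref;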
Ants Aasma
The problem with having lots of partitions (more than 100) is that your query planner will gradually become slower: when querying the main table it has to work out which tables to select from. However, you could simply solve that by selecting the correct partition from the client if you only need data from one partition.
WoLpH