Hey all,

I'm not much of a database guru so I would like some advice.

Background

We have 4 tables that are currently stored in Sybase IQ. We don't currently have any choice over this; we're basically stuck with what someone else decided for us. Sybase IQ is a column-oriented database that is perfect for a data warehouse. Unfortunately, my project needs to do a lot of transactional updating (we're more of an operational database), so I'm looking for more mainstream alternatives.

Question

  1. Given these tables' dimensions, would anyone consider SQL Server or Oracle to be a viable alternative?

    • Table 1 : 172 columns * 32 million rows
    • Table 2 : 453 columns * 7 million rows
    • Table 3 : 112 columns * 13 million rows
    • Table 4 : 147 columns * 2.5 million rows
  2. Given the size of the data, what should I be concerned about in terms of database choice, server configuration, memory, platform, etc.?

+2  A: 

With suitably sized hardware and an I/O subsystem to meet your demands, both are quite adequate. Whilst you have a lot of columns, the row counts are really very low; we regularly use datasets whose row counts are expressed in billions, not millions. (Just do not try it on SQL 2000 :) )

If you know your usage and I/O requirements, most I/O vendors will translate that into hardware specs for you. Memory, processors, etc. again depend on workloads that only you can model.

Andrew
Thank you, I figured workload was kind of subjective but threw it out there anyway ... just in case!
Jeffrey Cameron
A: 

Nope.

SQL Server or Oracle would be fine. It will come down to your hardware setup.

kevchadders
+7  A: 

Yes, both should be able to handle your tables (if your server is suited to it). But I would consider redesigning your database a bit. Even in a data warehouse where you denormalize your data, a table with 453 columns is not normal.

Maximilian Mayerl
Believe it or not, the data is normalized! This is Census data, and the tables that describe people, for instance, have a lot of variables on people. We do further break down the data based on particular subject matter (in other tables), but that isn't always a clean cut for us. Thanks for the advice however!
Jeffrey Cameron
For a *column-oriented* database such as Sybase IQ this is not a problem.
Remus Rusanu
It's a "rule of thumb" (thus: there's always exceptions, e.g. possibly Cameron's case) that if your table has so many columns (e.g. >30) then it probably represents more than one type of entity. For example, in Census data I'd wonder whether all those columns are always non-null for every person? Perhaps there are subsets of people for whom some columns are inapplicable? If so these could be moved to separate tables. I'm not saying this MUST happen, just a suggestion.
Jeffrey Kemp
@JeffreyKemp: We do distinguish between people who get a long form and a short form and have two "people" databases for that (the 453 and 172 column tables) and do the same thing with households (the remaining two tables). This prevents nulls for all values. Believe it or not though, at the end of processing every one of those cells has something in it!
Jeffrey Cameron
+1  A: 

Hi,

Oracle 11g has no problems with such data and structure.

More info at: http://neworacledba.blogspot.com/2008/05/database-limits.html

Regards.

ATorras
+4  A: 

It really depends on what's in the columns. If there are lots of big VARCHAR columns -- and they are frequently filled to near capacity -- then you could be in for some problems. If it's all integer data then you should be fine.

453 * 4 = 1812      # columns are 4 byte integers, row size is ~1.8k
453 * 255 = 115,515 # columns are VARCHAR(255), theoretical row size is ~112k

The rule of thumb is that row size should not exceed the disk block size, which is generally 8k. As you can see, your big table is not a problem in this regard if it consists entirely of 4-byte integers, but if it consists of 255-char VARCHAR columns then you could exceed the limit substantially. This 8k limit used to be a hard limit in SQL Server, but I think these days it's just a soft limit and a performance guideline.

Note that VARCHAR columns don't necessarily consume space commensurate with the size you specify for them. That is the maximum size, but they only consume as much as they need. If the actual data in the VARCHAR columns is always 3-4 chars long, then their size will be similar to that of integer columns regardless of whether you created them as VARCHAR(4) or VARCHAR(255).

The general rule is that you want the row size to be small so that there are many rows per disk block; this reduces the number of disk reads necessary to scan the table. Once you get above 8k you need two reads per row.
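
To sanity-check this before committing to a migration, you can total the declared column widths from SQL Server's catalog views. This is only a sketch, assuming a hypothetical table named dbo.Person; max_length is reported in bytes and is -1 for MAX types (which are stored off-row anyway).

    -- Sketch: worst-case (declared) row size for a hypothetical table dbo.Person.
    SELECT t.name            AS table_name,
           COUNT(*)          AS column_count,
           SUM(c.max_length) AS declared_row_bytes  -- -1 = a MAX type, stored off-row
    FROM sys.tables t
    JOIN sys.columns c ON c.object_id = t.object_id
    WHERE t.name = 'Person'
    GROUP BY t.name;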

Oracle has another potential problem, which is that ANSI joins have a hard limit on the total number of columns across all tables in the join. You can avoid this by avoiding the Oracle ANSI join syntax. (There are equivalents that don't suffer from this bug.) I don't recall what the limit is or which versions it applies to (I don't think it's been fixed yet).
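
For illustration, this is the kind of rewrite meant by avoiding the ANSI syntax; the table and column names below are hypothetical, and both statements express the same inner join.

    -- ANSI join syntax (the form reported to hit the column-count limit in some Oracle versions):
    SELECT p.person_id, h.region_code
    FROM person p
    JOIN household h ON h.household_id = p.household_id;

    -- Traditional Oracle join syntax, logically equivalent, which avoids that code path:
    SELECT p.person_id, h.region_code
    FROM person p, household h
    WHERE h.household_id = p.household_id;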

The numbers of rows you're talking about should be no problem at all, presuming you have adequate hardware.

Nate C-K
Very useful answer! Thank you
Jeffrey Cameron
A: 

Even if your tables have already been normalized, I would probably still consider breaking them up into separate tables rather than having so many columns in each.

James
+1  A: 

Oracle limitations

SQL Server limitations

You might be close to the limit on SQL Server, depending on what data types you have in that 453-column table (note the bytes-per-row limitation, but also read the footnote). I know you said that this is normalized, but I suggest looking at your workflow and considering ways of reducing the column count.

Also, these tables are big enough that hardware considerations are a major issue with performance. You'll need an experienced DBA to help you spec and set up the server with either RDBMS. Properly configuring your disk subsystem will be vital. You will probably also want to consider table partitioning among other things to help with performance, but this all depends on exactly how the data is being used.
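
As a rough illustration of the partitioning idea only: the partition key, boundary values, and table/column names below are hypothetical, and a real scheme would be driven by how the data is actually queried. (In SQL Server 2005/2008, table partitioning is an Enterprise edition feature.)

    -- Sketch: range partitioning in SQL Server on a hypothetical census_year column.
    CREATE PARTITION FUNCTION pf_census_year (int)
        AS RANGE RIGHT FOR VALUES (2001, 2006, 2011);

    CREATE PARTITION SCHEME ps_census_year
        AS PARTITION pf_census_year ALL TO ([PRIMARY]);

    CREATE TABLE dbo.person_facts (
        person_id   bigint NOT NULL,
        census_year int    NOT NULL
        -- ... remaining columns ...
    ) ON ps_census_year (census_year);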

Donnie
A: 

Are all of the columns in all of those tables updated by your application?

You could consider having a data mart (a.k.a. an operational or online data store) that is updated during the day, with the new records migrated into the main warehouse at night. I say this because rows with massive numbers of columns are going to be slower to insert and update, so you may want to tailor your online architecture to your application's update requirements.
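
A minimal sketch of the nightly migration step, assuming hypothetical ods/dw schemas, a last_updated timestamp on the operational table, and SQL Server syntax:

    -- Sketch: push the last day's changes from the operational store into the warehouse.
    INSERT INTO dw.person_history (person_id, census_year, review_state, loaded_at)
    SELECT person_id, census_year, review_state, GETDATE()
    FROM ods.person_current
    WHERE last_updated >= DATEADD(day, -1, GETDATE());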

James B
No, we often update only a handful of columns at a time.
Jeffrey Cameron
If that's the case, then an online data store/data mart for quicker updates might be the way to go. You then have the bonus of the weight of data warehousing theory behind your design decision, and a long history of ETL tools and data modelling techniques that you can read up on and apply to your architecture (and it would be familiar to others looking at it afresh). I would say that the choice of database vendor shouldn't be made until you have a rough idea of the architecture you will employ.
James B
A: 

Asking one DB to act as an operational and a warehouse system at the same time is still a bit of a tall order. I would consider using SQL Server or Oracle for the operational system and having a separate DW for reporting and analytics, probably keeping the system you have.

Expect some table redesign and normalization to happen on the operational side to fit the one-row-per-page limitations of row-based storage.

If you need fast updates of the DW, you may consider an EP approach to ETL, as opposed to standard (scheduled) ETL.

Considering that you are at an early stage of this, take a look at Microsoft's project Madison, which is an auto-scalable DW appliance that goes up to hundreds of TB. They have already shipped some installations.

Damir Sudarevic
A: 

I would very carefully consider switching from a column-oriented database to a relational one. Column-oriented databases are indeed inadequate for operational work, as updates are very slow, but they are more than adequate for reporting and business intelligence support.

More often than not, one has to split the operational work into an OLTP database containing the current activity needed for operations (accounts, inventory, etc.) and use an ETL process to populate the data warehouse (history, trends). A column-oriented DW will beat a relational one hands down in almost any circumstance, so I wouldn't give up Sybase IQ so easily. Perhaps you can design your system to have an operational OLTP side using your relational product of choice (I would choose SQL Server, but I'm biased) and keep the OLAP part you have now.

Remus Rusanu
This is a good thought, thanks. I don't think the increased speed of using a column-oriented database would trump the efficiency (in toolset alone, not to mention the slower update speed!) of using a more widely used database.
Jeffrey Cameron
+1  A: 

Based on your comments in the other answers I think what I'd recommend is:

1) Isolate which data is actually updated vs. which data is more or less read-only (or infrequently updated).
2) Move the updated data to separate tables joined on an id to the bigger tables (deleting those columns from the big tables).
3) Do your OLTP transactions against the smaller, more relational tables.
4) Use inner joins to hook back up to the big tables to retrieve data when necessary (see the sketch below).
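
A minimal sketch of steps 2-4, using hypothetical table and column names (SQL Server syntax):

    -- Narrow, frequently-updated table keyed to the wide, mostly read-only table.
    CREATE TABLE dbo.person_status (
        person_id    bigint      NOT NULL PRIMARY KEY,
        review_state varchar(20) NOT NULL,
        last_updated datetime    NOT NULL
    );

    -- OLTP work touches only the narrow table:
    UPDATE dbo.person_status
    SET review_state = 'VERIFIED', last_updated = GETDATE()
    WHERE person_id = 12345;

    -- Reads join back to the wide table only when the full record is needed:
    SELECT w.*, s.review_state
    FROM dbo.person_wide w
    INNER JOIN dbo.person_status s ON s.person_id = w.person_id
    WHERE w.person_id = 12345;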

As others have noted, you are trying to make the DB do both OLTP and OLAP at the same time, and that is difficult. Server settings need to be tuned differently for each scenario.

Either SQL Server or Oracle should work. I use census data as well, and my giganto table has around 300+ columns. I use SQL Server 2005, and it complains that if all the columns were filled to capacity the row would exceed the maximum possible size for a record. We use our census data in an OLAP fashion, so it isn't such a big deal to have so many columns.

jeffa00
interesting, thanks!
Jeffrey Cameron
A: 

Sybase has a product called RAP that combines IQ with an in-memory instance of ASE (their relational database), which is designed to help in situations such as this.

Your data isn't so vast that you couldn't consider moving to a row-oriented database but, depending on the structure of the data, you could end up using considerably more disk space and slowing down many kinds of queries.

Disclaimer: I do work for Sybase but not currently on the ASE/IQ/RAP side.

Stephen Darlington