views: 153 · answers: 5

A table in our schema already has 100 columns. If we follow the horizontal data storage approach, we need to add some 600 more columns to it. If we go for vertical data storage instead, which means creating a new table with referential integrity back to the 100-column table, joining becomes a problem: the 100-column table has 53 million records and the new table will have far more than that. Which is the better approach in this case?
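
For concreteness, a rough sketch of the two layouts being compared (all table and column names here are invented; Oracle-style syntax):

    -- Option 1: horizontal - widen the existing table
    ALTER TABLE big_table ADD (new_col_601 VARCHAR2(100), new_col_602 NUMBER /* ... and so on */);

    -- Option 2: vertical - a separate attribute table referencing the existing one,
    -- holding one row per record per attribute (hence many more rows than 53 million)
    CREATE TABLE big_table_attribute (
        big_table_id   NUMBER        NOT NULL,   -- FK to the 53-million-row table
        attribute_name VARCHAR2(30)  NOT NULL,
        attribute_val  VARCHAR2(4000),
        CONSTRAINT pk_big_table_attr PRIMARY KEY (big_table_id, attribute_name),
        CONSTRAINT fk_big_table_attr FOREIGN KEY (big_table_id) REFERENCES big_table (id)
    );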

I would like to add an interesting test case here. I added 600 columns to my table, which already had 87 columns and 53 million records, and then tried to update them in batches:

  • Time taken to update 1,000 records: 2.10 secs
  • Time taken to update 10,000 records: 5.57 secs
  • Time taken to update 1,000,000 records: 5.42 mins
  • Time taken to update 53 million records: 4.5 hrs (the tablespace was exhausted and we needed to extend it)
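
The batching was done roughly like this (a simplified sketch in Oracle PL/SQL; the real column list, values and batch size are omitted or invented):

    DECLARE
        CURSOR c IS SELECT rowid AS rid FROM big_table;
        TYPE t_rid_tab IS TABLE OF ROWID;
        l_rids t_rid_tab;
    BEGIN
        OPEN c;
        LOOP
            FETCH c BULK COLLECT INTO l_rids LIMIT 10000;   -- one batch
            EXIT WHEN l_rids.COUNT = 0;
            FORALL i IN 1 .. l_rids.COUNT
                UPDATE big_table
                   SET new_col_601 = 'some value'           -- plus the other new columns
                 WHERE rowid = l_rids(i);
            COMMIT;                                          -- commit per batch
        END LOOP;
        CLOSE c;
    END;
    /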

Can anyone suggest a faster method of update?

A: 

It highly depends on the nature of your data and how it is used.

Maybe it would be appropriate to write your data to an XML document and then store that document in the DB...
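
A rough sketch of what that could look like (Oracle syntax shown, column and element names invented; SQL Server has a comparable XML data type):

    -- Collapse the ~600 optional fields into a single XML column
    ALTER TABLE big_table ADD (extra_attributes XMLTYPE);

    UPDATE big_table
       SET extra_attributes = XMLTYPE('<attrs><colour>red</colour><size>42</size></attrs>')
     WHERE id = 12345;

    -- Individual fields can still be queried, at some cost:
    SELECT EXTRACTVALUE(extra_attributes, '/attrs/colour') AS colour
      FROM big_table
     WHERE id = 12345;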

Thomas Weller
+1  A: 

Edit: This is actually a very interesting question, and I'm curious about it now. I suggest doing some real-world tests, one big table vs. many tables, with as much data as possible. It might be worth the extra effort! Remember, even relational DBs can hang if they are designed poorly and there are millions of records (I experienced this while contracted to an underwriting company; not an easy thing to fix after the fact). So your one-table design could also work - the proof is in the testing.

53 million records? I hope you're using a real relational database engine like MySQL/SQL; they are designed to handle big tables.

600+ columns in a single table sounds like overkill to me. I assume it's not a one-to-many record structure, which is why you're opting for the all-in-one-row approach? Even so, it might be a better idea to have separate tables, depending on what your data is.

Wez
SQL Server and Oracle can both handle at least 1000 columns per table.
skaffman
Oh, OK, purging outdated constraints from my mind, then.
Joey
1000 columns per table, but still a 30-character name limit... it makes the baby Jesus cry.
skaffman
Hm, we have names of up to 27 characters here and I already hate typing them in queries. (Naming convention: [Table name]_[Column name], hence lots of redundancy in the names.)
Joey
+2  A: 

Questions you need to ask yourself:

  • Do most of the fields in my wide row have default or empty values? If this is the case, a vertical schema may be more suitable.
  • When you query, do you usually need to retrieve all the fields from a row, or do fields naturally classify into groups? If so, a horizontal schema is likely fine, but you'll probably want to chop your main table into subtables, each with a natural group of fields, all in a 1:1 relationship with the main table (see the sketch below).
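
A rough sketch of such a 1:1 split (all names are invented):

    -- The main table keeps the core fields...
    CREATE TABLE customer_core (
        customer_id NUMBER PRIMARY KEY,
        name        VARCHAR2(100)
        -- ... and the other core columns
    );

    -- ...and each natural group of the new fields moves to its own subtable.
    CREATE TABLE customer_billing (
        customer_id  NUMBER PRIMARY KEY,      -- same key enforces the 1:1 relationship
        billing_addr VARCHAR2(200),
        credit_limit NUMBER,
        CONSTRAINT fk_cust_billing FOREIGN KEY (customer_id)
            REFERENCES customer_core (customer_id)
    );

    -- Queries touch only the subtables they actually need:
    SELECT c.name, b.credit_limit
      FROM customer_core c
      JOIN customer_billing b ON b.customer_id = c.customer_id;
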
dnagirl
Can the use of abstract data types solve my problem here? Say, reducing 600 columns to only 20 columns.
Nishant
@Nishant: Give us some sample data and we can make better recommendations.
dnagirl
A: 

You might consider using a column-oriented database. Take a look at HBase (http://hadoop.apache.org/hbase/); it's a distributed, column-oriented store modeled after Google's Bigtable.

theglump
I've never heard the term column-oriented database. Is it basically that every file/entity has a large number of variable attributes, kinda like CouchDB?
sheepsimulator
+1  A: 

Without offending anyone... I wonder whether the data you store in 100 columns times 53 million records is really normalized.

If not, you really should start doing this. You could probably reduce the number of rows a lot (for instance, in theory it could be split into three tables of 1000, 1000, and 53 records; I know it is not as easy as that, it is just to show how small the numbers could theoretically be). Most probably there will still be a 53 million record table around after normalization, but it could hopefully be kept narrow; it might even consist only of foreign keys. Usually you never need all the data at once. Ideally you could run many queries against tables with just a few thousand records.
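
To illustrate the idea with invented names, taking the 1000/1000/53 split above literally:

    -- Repeated descriptive values move out into small lookup tables...
    CREATE TABLE product  (product_id  NUMBER PRIMARY KEY, product_name  VARCHAR2(100));  -- ~1000 rows
    CREATE TABLE location (location_id NUMBER PRIMARY KEY, location_name VARCHAR2(100));  -- ~1000 rows
    CREATE TABLE status   (status_id   NUMBER PRIMARY KEY, status_name   VARCHAR2(50));   -- ~53 rows

    -- ...and the 53-million-row table shrinks to mostly foreign keys.
    CREATE TABLE fact_record (
        record_id   NUMBER PRIMARY KEY,
        product_id  NUMBER REFERENCES product  (product_id),
        location_id NUMBER REFERENCES location (location_id),
        status_id   NUMBER REFERENCES status   (status_id),
        measured_at DATE
    );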

Don't be too afraid of joins if you normalize. In the end it will be faster anyway, though there are indeed exceptions.

Stefan Steinegger
Thanks for your comments, Stefan. I know that the table design after adding 600 columns mocks the basic concept of normalization. In our case the nature of the data is such that creating different tables and joining them would create big performance contention for us.
Nishant
Did you *try*? I don't believe that your huge table performs better than a normalized database. The DBMS is *designed* for managing normalized data and joins; there are indexes and other means to make it fast. I also sometimes denormalize data or add some redundancy for performance. It's always a trade-off, it hurts somewhere else, and I try to avoid it. But in this case I think it's much too extreme and I can't believe that it works well. But I might be wrong; I've never tried to create such a huge table.
Stefan Steinegger