We have a requirement to store 500 measurements per second, coming from several devices. Each measurement consists of a timestamp, a quantity type, and several vector values. Right now there are 8 vector values per measurement, and we may consider this number constant for the needs of our prototype project. We are using NHibernate. Tests are done on SQLite (disk file db, not in-memory), but production will probably be MS SQL Server.

Our Measurement entity class is the one that holds a single measurement, and looks like this:

public class Measurement
{
    public virtual Guid Id { get; private set; }
    public virtual Device Device { get; private set; }
    public virtual Timestamp Timestamp { get; private set; }
    public virtual IList<VectorValue> Vectors { get; private set; }
}

Vector values are stored in a separate table, so that each of them references its parent measurement through a foreign key.

We have done a couple of things to ensure that the generated SQL is (reasonably) efficient: we are using Guid.Comb for generating IDs, we are flushing around 500 items in a single transaction, and the ADO.NET batch size is set to 100 (I think SQLite does not support batch updates, but it might be useful later).
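
For reference, the batching part looks roughly like this (a simplified sketch; sessionFactory and measurements are placeholders, Guid.Comb is set on the Id generator in the mapping, and adonet.batch_size is set in the NHibernate configuration, not here):

// Simplified sketch of how we flush ~500 items per transaction.
// sessionFactory and measurements are placeholders; guid.comb and
// adonet.batch_size are configured in the mappings/configuration.
using (var session = sessionFactory.OpenSession())
using (var tx = session.BeginTransaction())
{
    for (int i = 0; i < measurements.Count; i++)
    {
        session.Save(measurements[i]);

        // Clear the first-level cache periodically so the session
        // does not grow without bound during long runs.
        if ((i + 1) % 500 == 0)
        {
            session.Flush();
            session.Clear();
        }
    }
    tx.Commit();
}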

The problem

Right now we can insert 150-200 measurements per second (which is not fast enough, although this is SQLite we are talking about). Looking at the generated SQL, we can see that in a single transaction we insert (as expected):

  • 1 timestamp
  • 1 measurement
  • 8 vector values

which means that we are actually doing 10x more single table inserts: 1500-2000 per second.

If we placed everything (all 8 vector values and the timestamp) into the measurement table (adding 9 dedicated columns), it seems that we could increase our insert speed up to 10 times.
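
For illustration, the flattened entity would look something like this (just a sketch; the property names and the use of double are assumptions):

// Hypothetical flattened version: the timestamp and all 8 vector values
// become dedicated columns on the measurement row itself.
// Property names and the use of double are illustrative only.
public class FlatMeasurement
{
    public virtual Guid Id { get; private set; }
    public virtual Device Device { get; private set; }
    public virtual DateTime Timestamp { get; private set; }

    public virtual double V1 { get; private set; }
    public virtual double V2 { get; private set; }
    public virtual double V3 { get; private set; }
    public virtual double V4 { get; private set; }
    public virtual double V5 { get; private set; }
    public virtual double V6 { get; private set; }
    public virtual double V7 { get; private set; }
    public virtual double V8 { get; private set; }
}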

Switching to SQL Server will improve performance, but we would like to know if there might be a way to avoid unnecessary performance costs related to the way the database is organized right now.

[Edit]

With in-memory SQLite I get around 350 items/sec (3500 single table inserts), which I believe is about as good as it gets with NHibernate (taking this post for reference: http://ayende.com/Blog/archive/2009/08/22/nhibernate-perf-tricks.aspx).

But I might as well switch to SQL server and stop assuming things, right? I will update my post as soon as I test it.

[Update]

I've moved to SQL Server and flattened my hierarchy. I tested it by storing 3000 measurements/sec for several hours and it seems to be working fine.

+8  A: 

Personally, I'd say go for it: denormalize, and then create an ETL process to bring this data into a more normalized format for analysis/regular use.

Basically the ideal situation for you might be to have a separate database (or even just separate tables in the same database if need be) that treats the acquisition of data as an entirely separate matter from having it in the format in which you need to process it.

That doesn't mean that you need to throw away the entities that you've created around your current database structure: just that you should also create those denormalized tables and make an ETL to bring them in. You could use SSIS (though it's still quite buggy and irritable) to bring the data into your normalized set of tables periodically, or even a C# app or other bulk loading process.
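
As a rough sketch of the "C# app or other bulk loading process" option, the periodic ETL step could be as simple as an INSERT ... SELECT from a flat staging table into the normalized tables (all table and column names below are hypothetical):

using System.Data.SqlClient;

// Rough sketch of a periodic ETL step in plain ADO.NET. Table and column
// names (MeasurementStaging, VectorValue, V1, ...) are hypothetical; the
// real statements depend on the actual schema. A real job would also track
// which rows have been moved rather than issuing a blanket DELETE.
public static class MeasurementEtl
{
    public static void MoveStagedRows(string connectionString)
    {
        using (var connection = new SqlConnection(connectionString))
        {
            connection.Open();
            using (var tx = connection.BeginTransaction())
            {
                // Unpivot the flat staging rows into the normalized table.
                var copy = new SqlCommand(
                    @"INSERT INTO VectorValue (MeasurementId, VectorIndex, Value)
                      SELECT Id, 1, V1 FROM MeasurementStaging
                      UNION ALL
                      SELECT Id, 2, V2 FROM MeasurementStaging
                      -- ... and so on for the remaining vector columns",
                    connection, tx);
                copy.ExecuteNonQuery();

                // Remove the rows that have just been moved.
                var cleanup = new SqlCommand(
                    "DELETE FROM MeasurementStaging", connection, tx);
                cleanup.ExecuteNonQuery();

                tx.Commit();
            }
        }
    }
}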

EDIT: This is assuming, of course, that your analysis doesn't need to be done in real time: just the collection of data. Quite often, people don't need (and sometimes, would actually prefer not to have) real time updating of analysis data. It's one of those things that sounds good on paper, but in practice it's unnecessary.

If some people who analyze this data require real time access, you could build a toolset against the "bare metal" denormalized transactional data if desired. But quite frequently, when you really dig into requirements, the people performing analysis don't need genuine real time access (and in some cases they would prefer to have a more static set of data to work with!), and in that case the periodic ETL would work quite well. You just have to get together with your target users and find out what they genuinely need.

EdgarVerona
You do want to be careful of the potential for denormalization causing inefficiencies, however. In particular, be wary of constraints that could slow down inserts into these transactional tables. Keep them lean and dirty, let the ETL process sort out data consistency issues, and let the transactional tables do what they do best: insert data as fast as they can.
EdgarVerona
Thanks. There actually is a request for real-time display of data, but there would only be queries by device or timestamp, no complex real-time querying. By *constraints that could slow down inserts*, you mean any foreign key constraints, or?
Groo
Aye. My advice would be that if you do end up having to go with this approach, you should forsake foreign key constraints and even indexes if need be. You'll want inserting to be as fast as possible, and you can give it the relations it needs via the ETL. As far as the real time display of data goes, is this coming from some external device that you're creating/using... one that may already show its real time data, or that can be configured to do so? Just trying to get a better idea for it. =)
EdgarVerona
It's coming from several external devices, their purpose is to calculate the data, not display it. Maybe "real-time display" is a bit misleading: data should be available for web clients, so it's not time critical, but it should update itself every couple of seconds or so. So display performance is not so significant.
Groo
Hmm, every couple of seconds? It might be a little awkward to have an ETL in that situation, though admittedly this is where my experience reaches its current endpoint. I've dealt a lot with people who need relatively static data (aka they can go for an hour to a day without seeing new data, and in some cases they greatly desire to have this time delay for analysis purposes), but never really with 'close to real time' as it sounds like you need. Some others here may have good advice to give for this: I'll be listening intently to their answers so I can learn too. =)
EdgarVerona
A: 

Don't just denormalize. Design for results, using a useful design pattern. Sometimes, a useful design pattern for performance gives a different design than the one you get by following the normalization rules.

I don't think your situation is going to be helped by denormalizing. Almost all the people who advocate denormalizing say that the performance gains don't come when you are storing new data. They come when you retrieve data. You will have to work out how that applies to your case.

I can tell you this much. If you end up storing via multiple concurrent processes, your design will result in severe bottlenecks, and might well run slower than a normalized design.

But don't take my word for it. Experiment. Analyze. Learn. Prosper.

Walter Mitty
+3  A: 

First, move to the target database; performance based on SQLite may not be indicative of performance based on MS SQL.

Second, measure where the performance bottleneck is; offhand I'd venture that it's the disk and an in-memory database would perform much better.

Then denormalize if necessary, with an ETL process as suggested above.

Event-stream processing has a saying: "if you hit the disk, you're dead." ;-)

Steven A. Lowe
Very good points, I agree with Steven that you should move to MS SQL before getting too up in arms about performance. The in memory database idea is also a good approach if there's no good reason for you to persist to disk immediately. Definitely give these a shot!
EdgarVerona
@EdgarVerona: with in-memory SQLite I get around 350 items/sec (3500 single table inserts), which I believe is about as good as it gets with NHibernate (taking this post for reference: http://ayende.com/Blog/archive/2009/08/22/nhibernate-perf-tricks.aspx). So I decided to focus my attention on optimizing the schema - although I might as well change that bloody connection string, just to keep things clear.
Groo
A: 

Yes. I would consider reducing the overhead of the inserts by both denormalization (flattening of the data) and chunking the data by time. I would design my database so that each record stores a whole second's worth of data per device:

public class Measurement
{
    public Guid ID { get; private set; }
    public Device Device { get; private set; }
    public Sample[] Samples { get; private set; }

    public DateTime FirstTimestamp { get; private set; }
    public DateTime LastTimestamp { get; private set; }
}

public class Sample
{
    public DateTime Timestamp { get; private set; }
    public VectorValue[] Vectors { get; private set; }
}

There are various ways of storing complex types (such as a list of lists, in this case) in a single record. XML columns and CLR user-defined types are two examples.
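
For example, with an xml column the samples could be serialized to a string along these lines (a sketch; it uses a simple serializable DTO rather than the entity above, since XmlSerializer needs public setters):

using System;
using System.IO;
using System.Xml.Serialization;

// Simple serializable shape for one second's worth of samples. This is an
// illustrative DTO, not the entity above, because XmlSerializer requires
// public parameterless constructors and public setters.
public class SampleDto
{
    public DateTime Timestamp { get; set; }
    public double[] Vectors { get; set; }
}

public static class SampleXml
{
    private static readonly XmlSerializer Serializer =
        new XmlSerializer(typeof(SampleDto[]));

    // Produces the XML string that would be stored in the xml column.
    public static string ToXml(SampleDto[] samples)
    {
        using (var writer = new StringWriter())
        {
            Serializer.Serialize(writer, samples);
            return writer.ToString();
        }
    }
}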

Allon Guralnek
+4  A: 

Well, it would depend. Are the 8 vector values a hard and fast number that will never change? Then denormalizing in your case could make sense (but only testing on the real hardware and database you are using will tell). If it could be 9 measurements next week, don't do it.

I would say that you need to switch first to SQL Server and the equipment you will be running on before trying to decide what to do.

Once you have switched, run Profiler. It is entirely possible that NHibernate is not creating the best performing SQL for your insert.

The fact that you have a set of vectors which are probably being split on the insert may be part of your performance problem. It might be better to have 8 separate variables rather than a set that has to be split up.

You are talking about over 40 million records a day; this is going to require some major hardware and a very well designed database. It is also possible that a relational database is not the best choice for this (I have no idea how you want to use this amount of data). How long are you keeping this data? The size here is going to get out of hand very, very quickly.

Is it possible to bulk insert the records in a group once a minute instead? Bulk insert is far faster than row-by-row inserts.
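
(For illustration, something along these lines: buffer the rows in memory and flush them once a minute with SqlBulkCopy. The table and column names are made up, and a single vector column is shown to keep the sketch short.)

using System;
using System.Data;
using System.Data.SqlClient;
using System.Timers;

// Sketch of buffering rows in memory and bulk inserting them once a minute.
// The table name and columns (FlatMeasurement, DeviceId, Timestamp, V1) are
// illustrative only.
public class MeasurementBuffer
{
    private readonly DataTable buffer = new DataTable("FlatMeasurement");
    private readonly Timer timer = new Timer(60000);
    private readonly string connectionString;
    private readonly object sync = new object();

    public MeasurementBuffer(string connectionString)
    {
        this.connectionString = connectionString;
        buffer.Columns.Add("DeviceId", typeof(Guid));
        buffer.Columns.Add("Timestamp", typeof(DateTime));
        buffer.Columns.Add("V1", typeof(double));
        timer.Elapsed += (sender, args) => Flush();
        timer.Start();
    }

    // Called by the producers for every incoming measurement.
    public void Add(Guid deviceId, DateTime timestamp, double v1)
    {
        lock (sync) { buffer.Rows.Add(deviceId, timestamp, v1); }
    }

    // Runs once a minute and writes the accumulated rows in one bulk copy.
    private void Flush()
    {
        DataTable toWrite;
        lock (sync)
        {
            if (buffer.Rows.Count == 0) return;
            toWrite = buffer.Copy();
            buffer.Clear();
        }
        using (var bulk = new SqlBulkCopy(connectionString))
        {
            bulk.DestinationTableName = "FlatMeasurement";
            bulk.WriteToServer(toWrite);
        }
    }
}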

Your design has to take into consideration how you are using the data as well as inserting it. Generally things done to speed up inserts can slow down selects and vice versa. You may need a data warehouse that is loaded once a day for analysis (and a quick query to be able to show the raw up to the second data).

HLGEM
A good point as well, definitely take HLGEM's advice on that: if the number of measurements could change, this whole situation could get really messy really quickly.
EdgarVerona
+1  A: 

"We have a requirement to store 500 measurements per second, coming from several devices."

Don't use a DBMS to store that kind of data.

What are the reasons people use a DBMS?

(a) They can enforce constraints for you on the data you are trying to register. But you don't have any. The measurement data are what they are and they need to be accepted. No constraints.

(b) They can ensure consistency and integrity of your precious business data in the case of (1) constraint violations and (2) severe system failures such as disk I/O errors. But since you don't have constraints, (1) doesn't apply. And as for (2), what would you do with your measurements if a disk I/O error prevented them from being recorded? Your measurements are lost no matter what.

So IMO you don't have any reason whatsoever to use a DBMS. Dump your load of measurements in a flat file and process that as needed.

Erwin Smout
He needs that data to (1) be shared with a web service in near real-time and (2) be queryable. A simple flat file is not a great solution since (1) it is not thread safe and (2) it is not queryable or searchable - no indexes.
Allon Guralnek
+1  A: 

You might consider other database alternatives. MS SQL provides a lot of functionality, but that adds some overhead.

An excellent resource for high performance processing (like what you are trying to do) is at http://highscalability.com/

One of the case studies they had is storing thousands of device statistics in a database. The solution was multiple MySQL databases, with requests routed based on the device ID. Overall, the site can provide excellent case studies. Maybe you can find a possible solution there.

Timur

Timur Fanshteyn
Thanks, there seem to be some interesting articles there, I'll take a look.
Groo
A: 

Use the right DBMS and hardware. Testing on another platform with different hardware will tell you nothing about performance.

Denormalization is unlikely to help write performance because by definition it means you are creating redundant data and therefore you would be doing more work for each write, not less.

The figures you have quoted are not exceptional for streaming data scenarios and are perfectly achievable using the right hardware, but I think NHibernate is going to be a major limiting factor for you. It's unlikely NHibernate is a sensible choice for this kind of thing.

Have you considered using some of the technologies which provide special support for streaming data sources and CEP? For example: OSIsoft PI, Microsoft StreamInsight and SQL Server's FILESTREAM feature.

dportas
+1  A: 

You have to ask yourself, "why do we normalize?"

There are three primary reasons:

  1. Data consistency
  2. Update Speed
  3. Size

Data Consistency

It's nice having drop downs and all the rows that mean the same thing having the same FK, right? Pretty obvious. This is really important for DBs with multiple data "editors". But it is only as good as our processes. Let's say it's a flight database and there's an entry for National Airport in Washington DC... and someone adds a NEW entry for Reagan National Airport in Washington DC... the FKs will be there, and will be used in the child table, but won't be worth much... But it's still a good thing to do...

Update Speed

What we should have done is update the row for National Airport with the new name. Because there's just one parent row, it's a very simple change. If my flight table had stored the text, I would have been updating millions of rows.

Size

If I did store "Reagan National Airport" on every record, it would take more space than an FK of say, 19. Size used to be a really big deal, but SAN makes it pretty irrelevant.


Conclusions

OK, so are you worried that your SOLO data collection app can't keep the names of the instruments straight? Is data consistency going to be a challenge?

OK, so how many times do you think you'll change the name of an instrument or data point? I mean, Dissolved O2 is Dissolved O2, Turbidity is Turbidity, right? But if you did need to do a mass update, I bet you'll have downtime between runs to do it. So this isn't an issue.

OK, so size, sure... that's a lot of measurements; but don't make the measurement "Dissolved Oxygen", DO2 is fine... how much bigger is that than some FK like 7? Spend the space to save time.

Don't normalize just because you've always been told that it's the thing good database designers do. Know why you're doing it and why you're choosing what you're choosing.

Stephanie Page
+2  A: 

Have you considered using SqlBulkCopy? It works really fast. I've used it in a production environment and achieved 10,000+ inserts into a single table in less than a second on a SQL Server 2005 machine. You just need to prepare the DataTable(s) to be inserted in bulk from your application. Here is a sample.

public static void SQLBulkCopyInsert(DataTable dtInsertRows, string destinationTableName, string[] columnMappings)
{
    using (SqlBulkCopy sbc = new SqlBulkCopy(DBHelper.Secim2009DB.ConnectionString, SqlBulkCopyOptions.UseInternalTransaction))
    {
        sbc.DestinationTableName = destinationTableName;
        // Number of records to be processed in one go
        sbc.BatchSize = 30000;

        // Map the source columns from the DataTable to the destination columns in the SQL Server 2005 table
        foreach (string columnMapping in columnMappings)
        {
            sbc.ColumnMappings.Add(columnMapping, columnMapping);
        }

        // Number of records after which the client has to be notified about its status
        sbc.NotifyAfter = dtInsertRows.Rows.Count;
        // Event that gets fired when NotifyAfter number of records are processed
        sbc.SqlRowsCopied += new SqlRowsCopiedEventHandler(sbc_SqlRowsCopied);
        // Finally write to the server
        sbc.WriteToServer(dtInsertRows);
        sbc.Close();
    }
}

public static void sbc_SqlRowsCopied(object sender, SqlRowsCopiedEventArgs e)
{
    // Progress notification callback; left empty in this sample.
}
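
Usage would look roughly like this (the table and column names here are placeholders; they have to match the destination table in the database):

// Hypothetical usage: build a DataTable whose columns match the
// destination table and hand it to the helper above.
DataTable dtInsertRows = new DataTable("Measurement");
dtInsertRows.Columns.Add("Id", typeof(Guid));
dtInsertRows.Columns.Add("Timestamp", typeof(DateTime));
dtInsertRows.Columns.Add("Value", typeof(double));

dtInsertRows.Rows.Add(Guid.NewGuid(), DateTime.UtcNow, 42.0);

SQLBulkCopyInsert(dtInsertRows, "Measurement",
    new[] { "Id", "Timestamp", "Value" });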
hakan
Thanks, I was aware of this but since I am using NHibernate I actually wanted to avoid using DataTables or coding SQL by hand. From what I've seen, it does provide substantial performance benefits, but if I can satisfy my requirements with NHibernate, then I'll stick to that. Here is the actual CodeProject link: http://www.codeproject.com/KB/cs/MultipleInsertsIn1dbTrip.aspx
Groo
@Groo, you shouldn't use normal SQL (with or without an ORM) for any operations that are bulk in nature; the best solution is the one given by hakan in this scenario, because SqlBulkCopy was designed specifically to solve the kind of issues you're facing, and in a production environment you need to make sure you have some headroom (i.e., the limit won't be reached for a while). What happens if someone one day decides to attach 3 devices or more to the system and suddenly you have to do 2000 or more inserts per second?
Pop Catalin
@Pop Catalin: Actually, this is a demonstration project where the number of devices is known and fixed, so 500 inserts per second is the maximum theoretical number of inserts for all devices, and it will not actually be reached. Right now with NH+MSSQL I can do 3000/s (and potentially more, since that is where I stopped simulating). And since we have a pretty short deadline, I didn't want to experiment too much if tools I am comfortable with would do the job. NHibernate turned out to be adequate for this phase; we have abstracted our limited set of commands and will easily replace it if necessary.
Groo