We have a requirement to store 500 measurements per second, coming from several devices. Each measurement consists of a timestamp, a quantity type, and several vector values. Right now there are 8 vector values per measurement, and we may consider this number constant for the needs of our prototype project. We are using NHibernate. Tests are done on SQLite (disk file db, not in-memory), but production will probably be MS SQL Server.

Our Measurement entity class is the one that holds a single measurement, and looks like this:

public class Measurement
{
    public virtual Guid Id { get; private set; }
    public virtual Device Device { get; private set; }
    public virtual Timestamp Timestamp { get; private set; }
    public virtual IList<VectorValue> Vectors { get; private set; }
}

Vector values are stored in a separate table, so that each of them references its parent measurement through a foreign key.

We have done a couple of things to ensure that the generated SQL is (reasonably) efficient: we are using Guid.Comb for generating IDs, we are flushing around 500 items in a single transaction, and the ADO.NET batch size is set to 100 (I think SQLite does not support batch updates, but it might be useful later).
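
For reference, the batching part looks roughly like this (a simplified sketch; sessionFactory and measurements are placeholders, Guid.Comb is set on the Id generator in the mapping, and adonet.batch_size is set in the NHibernate configuration, not here):

// Simplified sketch of how we flush ~500 items per transaction.
// sessionFactory and measurements are placeholders; guid.comb and
// adonet.batch_size are configured in the mappings/configuration.
using (var session = sessionFactory.OpenSession())
using (var tx = session.BeginTransaction())
{
    for (int i = 0; i < measurements.Count; i++)
    {
        session.Save(measurements[i]);

        // Clear the first-level cache periodically so the session
        // does not grow without bound during long runs.
        if ((i + 1) % 500 == 0)
        {
            session.Flush();
            session.Clear();
        }
    }
    tx.Commit();
}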

The problem

Right now we can insert 150-200 measurements per second (which is not fast enough, although this is SQLite we are talking about). Looking at the generated SQL, we can see that in a single transaction we insert (as expected):

  • 1 timestamp
  • 1 measurement
  • 8 vector values

which means that we are actually doing 10x more single table inserts: 1500-2000 per second.

If we placed everything (all 8 vector values and the timestamp) into the measurement table (adding 9 dedicated columns), it seems that we could increase our insert speed up to 10 times.
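
For illustration, the flattened entity would look something like this (just a sketch; the property names and the use of double are assumptions):

// Hypothetical flattened version: the timestamp and all 8 vector values
// become dedicated columns on the measurement row itself.
// Property names and the use of double are illustrative only.
public class FlatMeasurement
{
    public virtual Guid Id { get; private set; }
    public virtual Device Device { get; private set; }
    public virtual DateTime Timestamp { get; private set; }

    public virtual double V1 { get; private set; }
    public virtual double V2 { get; private set; }
    public virtual double V3 { get; private set; }
    public virtual double V4 { get; private set; }
    public virtual double V5 { get; private set; }
    public virtual double V6 { get; private set; }
    public virtual double V7 { get; private set; }
    public virtual double V8 { get; private set; }
}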

Switching to SQL Server will improve performance, but we would like to know if there might be a way to avoid unnecessary performance costs related to the way the database is organized right now.

[Edit]

With in-memory SQLite I get around 350 items/sec (3500 single table inserts), which I believe is about as good as it gets with NHibernate (taking this post for reference: http://ayende.com/Blog/archive/2009/08/22/nhibernate-perf-tricks.aspx).

But I might as well switch to SQL server and stop assuming things, right? I will update my post as soon as I test it.

[Update]

I've moved to SQL Server and flattened my hierarchy. I tested it by storing 3000 measurements/sec for several hours and it seems to be working fine.

+8  A: 

Personally, I'd say go for it: denormalize, and then create an ETL process to bring this data into a more normalized format for analysis/regular use.

Basically the ideal situation for you might be to have a separate database (or even just separate tables in the same database if need be) that treats the acquisition of data as an entirely separate matter from having it in the format in which you need to process it.

That doesn't mean that you need to throw away the entities that you've created around your current database structure: just that you should also create those denormalized tables and make an ETL to bring them in. You could use SSIS (though it's still quite buggy and irritable) to bring the data into your normalized set of tables periodically, or even a C# app or other bulk loading process.
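
As a rough sketch of the "C# app or other bulk loading process" option, the periodic ETL step could be as simple as an INSERT ... SELECT from a flat staging table into the normalized tables (all table and column names below are hypothetical):

using System.Data.SqlClient;

// Rough sketch of a periodic ETL step in plain ADO.NET. Table and column
// names (MeasurementStaging, VectorValue, V1, ...) are hypothetical; the
// real statements depend on the actual schema. A real job would also track
// which rows have been moved rather than issuing a blanket DELETE.
public static class MeasurementEtl
{
    public static void MoveStagedRows(string connectionString)
    {
        using (var connection = new SqlConnection(connectionString))
        {
            connection.Open();
            using (var tx = connection.BeginTransaction())
            {
                // Unpivot the flat staging rows into the normalized table.
                var copy = new SqlCommand(
                    @"INSERT INTO VectorValue (MeasurementId, VectorIndex, Value)
                      SELECT Id, 1, V1 FROM MeasurementStaging
                      UNION ALL
                      SELECT Id, 2, V2 FROM MeasurementStaging
                      -- ... and so on for the remaining vector columns",
                    connection, tx);
                copy.ExecuteNonQuery();

                // Remove the rows that have just been moved.
                var cleanup = new SqlCommand(
                    "DELETE FROM MeasurementStaging", connection, tx);
                cleanup.ExecuteNonQuery();

                tx.Commit();
            }
        }
    }
}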

EDIT: This is assuming, of course, that your analysis doesn't need to be done in real time: just the collection of data. Quite often, people don't need (and sometimes, would actually prefer not to have) real time updating of analysis data. It's one of those things that sounds good on paper, but in practice it's unnecessary.

If some people who analyze this data require real time access, you could build a toolset against the "bare metal" denormalized transactional data if desired. But quite frequently, when you really dig into requirements, the people performing analysis don't need genuine real time access (and in some cases they would prefer to have a more static set of data to work with!), and in that case the periodic ETL would work quite well. You just have to get together with your target users and find out what they genuinely need.

EdgarVerona
You do want to be careful of the potential for denormalization causing inefficiencies, however. In particular, be wary of constraints that could slow down inserts into these transactional tables. Keep them lean and dirty, let the ETL process sort out data consistency issues, and let the transactional tables do what they do best: insert data as fast as they can.
EdgarVerona
Thanks. There actually is a request for real-time display of data, but there would only be queries by device or timestamp, no complex real-time querying. By *constraints that could slow down inserts*, you mean any foreign key constraints, or?
Groo
Aye. My advice would be that if you do end up having to go with this approach, you should forsake foreign key constraints and even indexes if need be. You'll want inserting to be as fast as possible, and you can give it the relations it needs via the ETL. As far as the real time display of data goes, is this coming from some external device that you're creating/using... one that may already show its real time data, or that can be configured to do so? Just trying to get a better idea for it. =)
EdgarVerona
It's coming from several external devices, their purpose is to calculate the data, not display it. Maybe "real-time display" is a bit misleading: data should be available for web clients, so it's not time critical, but it should update itself every couple of seconds or so. So display performance is not so significant.
Groo
Hmm, every couple of seconds? It might be a little awkward to have an ETL in that situation, though admittedly this is where my experience reaches its current endpoint. I've dealt a lot with people who need relatively static data (aka they can go for an hour to a day without seeing new data, and in some cases they greatly desire to have this time delay for analysis purposes), but never really with 'close to real time' as it sounds like you need. Some others here may have good advice to give for this: I'll be listening intently to their answers so I can learn too. =)
EdgarVerona
A: 

Don't just denormalize. Design for results, using a useful design pattern. Sometimes, a useful design pattern for performance gives a different design than the one you get by following the normalization rules.

I don't think your situation is going to be helped by denormalizing. Almost all the people who advocate denormalizing say that the performance gains don't come when you are storing new data. They come when you retrieve data. You will have to work out how that applies to your case.

I can tell you this much. If you end up storing via multiple concurrent processes, your design will result in severe bottlenecks, and might well run slower than a normalized design.

But don't take my word for it. Experiment. Analyze. Learn. Prosper.

Walter Mitty
+3  A: 

First, move to the target database; performance based on SQLite may not be indicative of performance based on MS SQL.

Second, measure where the performance bottleneck is; offhand I'd venture that it's the disk and an in-memory database would perform much better.

Then denormalize if necessary, with an ETL process as suggested above.

Event-stream processing has a saying: "if you hit the disk, you're dead." ;-)

Steven A. Lowe
Very good points, I agree with Steven that you should move to MS SQL before getting too up in arms about performance. The in memory database idea is also a good approach if there's no good reason for you to persist to disk immediately. Definitely give these a shot!
EdgarVerona
@EdgarVerona: with in-memory SQLite I get around 350 items/sec (3500 single table inserts), which I believe is about as good as it gets with NHibernate (taking this post for reference: http://ayende.com/Blog/archive/2009/08/22/nhibernate-perf-tricks.aspx). So I decided to focus my attention on optimizing the schema - although I might as well change that bloody connection string, just to keep things clear.
Groo
A: 

Yes. I would consider reducing the overhead of the inserts by both denormalization (flattening of the data) and chunking the data by time. I would design my database so that each record stores a whole second's worth of data per device:

public class Measurement
{
    public Guid ID { get; private set; }
    public Device Device { get; private set; }
    public Sample[] Samples { get; private set; }

    public DateTime FirstTimestamp { get; private set; }
    public DateTime LastTimestamp { get; private set; }
}

public class Sample
{
    public DateTime Timestamp { get; private set; }
    public VectorValue[] Vectors { get; private set; }
}

There are various ways of storing complex types (such as a list of lists, in this case) in a single record. XML columns and CLR user-defined types are two examples.
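
For example, with an xml column the samples could be serialized to a string along these lines (a sketch; it uses a simple serializable DTO rather than the entity above, since XmlSerializer needs public setters):

using System;
using System.IO;
using System.Xml.Serialization;

// Simple serializable shape for one second's worth of samples. This is an
// illustrative DTO, not the entity above, because XmlSerializer requires
// public parameterless constructors and public setters.
public class SampleDto
{
    public DateTime Timestamp { get; set; }
    public double[] Vectors { get; set; }
}

public static class SampleXml
{
    private static readonly XmlSerializer Serializer =
        new XmlSerializer(typeof(SampleDto[]));

    // Produces the XML string that would be stored in the xml column.
    public static string ToXml(SampleDto[] samples)
    {
        using (var writer = new StringWriter())
        {
            Serializer.Serialize(writer, samples);
            return writer.ToString();
        }
    }
}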

Allon Guralnek
+4  A: 

Well, it would depend. Are the 8 vector values a hard and fast number that will never change? Then denormalizing in your case could make sense (but only testing on the real hardware and database you are using will tell). If it could be 9 measurements next week, don't do it.

I would say that you need to switch first to SQL Server and the equipment you will be running on before trying to decide what to do.

Once you have switched, run Profiler. It is entirely possible that NHibernate is not creating the best performing SQL for your insert.

The fact that you have a set of vectors which are probably being split on the insert may be part of your performance problem. It might be better to have 8 separate variables rather than a set that has to be split up.

You are talking about over 40 million records a day; this is going to require some major hardware and a very well designed database. It is also possible that a relational database is not the best choice for this (I have no idea how you want to use this amount of data). How long are you keeping this data? The size here is going to get out of hand very, very quickly.

Is it possible to bulk insert the records in a group once a minute instead? Bulk insert is far faster than row-by-row inserts.
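
(For illustration, something along these lines: buffer the rows in memory and flush them once a minute with SqlBulkCopy. The table and column names are made up, and a single vector column is shown to keep the sketch short.)

using System;
using System.Data;
using System.Data.SqlClient;
using System.Timers;

// Sketch of buffering rows in memory and bulk inserting them once a minute.
// The table name and columns (FlatMeasurement, DeviceId, Timestamp, V1) are
// illustrative only.
public class MeasurementBuffer
{
    private readonly DataTable buffer = new DataTable("FlatMeasurement");
    private readonly Timer timer = new Timer(60000);
    private readonly string connectionString;
    private readonly object sync = new object();

    public MeasurementBuffer(string connectionString)
    {
        this.connectionString = connectionString;
        buffer.Columns.Add("DeviceId", typeof(Guid));
        buffer.Columns.Add("Timestamp", typeof(DateTime));
        buffer.Columns.Add("V1", typeof(double));
        timer.Elapsed += (sender, args) => Flush();
        timer.Start();
    }

    // Called by the producers for every incoming measurement.
    public void Add(Guid deviceId, DateTime timestamp, double v1)
    {
        lock (sync) { buffer.Rows.Add(deviceId, timestamp, v1); }
    }

    // Runs once a minute and writes the accumulated rows in one bulk copy.
    private void Flush()
    {
        DataTable toWrite;
        lock (sync)
        {
            if (buffer.Rows.Count == 0) return;
            toWrite = buffer.Copy();
            buffer.Clear();
        }
        using (var bulk = new SqlBulkCopy(connectionString))
        {
            bulk.DestinationTableName = "FlatMeasurement";
            bulk.WriteToServer(toWrite);
        }
    }
}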

Your design has to take into consideration how you are using the data as well as inserting it. Generally things done to speed up inserts can slow down selects and vice versa. You may need a data warehouse that is loaded once a day for analysis (and a quick query to be able to show the raw up to the second data).

HLGEM
A good point as well, definitely take HLGEM's advice on that: if the number of measurements could change, this whole situation could get really messy really quickly.
EdgarVerona
+1  A: 

"We have a requirement to store 500 measurements per second, coming from several devices."

Don't use a DBMS to store that kind of data.

What are the reasons people use a DBMS?

(a) They can enforce constraints for you on the data you are trying to register. But you don't have any. The measurement data are what they are and they need to be accepted. No constraints.

(b) They can ensure consistency and integrity of your precious business data in the case of (1) constraint violations and (2) severe system failures such as disk I/O errors. But since you don't have constraints, (1) doesn't apply. And as for (2), what would you do with your measurements if a disk I/O error prevented them from being recorded? Your measurements are lost no matter what.

So IMO you don't have any reason whatsoever to use a DBMS. Dump your load of measurements in a flat file and process that as needed.

Erwin Smout
He needs that data to (1) be shared with a web service in near real-time and (2) be queryable. A simple flat file is not a great solution since (1) it is not thread safe and (2) it is not queryable or searchable - no indexes.
Allon Guralnek
+1  A: 

You might consider other database alternatives. MS SQL provides a lot of functionality, but that adds some overhead.

An excellent resource for high performance processing (like what you are trying to do) is at http://highscalability.com/

One of the case studies they had is storing thousands of device statistics in a database. The solution was multiple MySQL databases, with requests routed based on the device ID. Overall, the site can provide excellent case studies. Maybe you can find a possible solution there.

Timur

Timur Fanshteyn
Thanks, there seem to be some interesting articles there, I'll take a look.
Groo
A: 

Use the right DBMS and hardware. Testing on another platform with different hardware will tell you nothing about performance.

Denormalization is unlikely to help write performance because by definition it means you are creating redundant data and therefore you would be doing more work for each write, not less.

The figures you have quoted are not exceptional for streaming data scenarios and are perfectly achievable using the right hardware, but I think NHibernate is going to be a major limiting factor for you. It's unlikely NHibernate is a sensible choice for this kind of thing.

Have you considered using some of the technologies which provide special support for streaming data sources and CEP? For example: OSIsoft PI, Microsoft StreamInsight and SQL Server's FILESTREAM feature.

dportas
+1  A: 

You have to ask yourself, "why do we normalize?"

There are three primary reasons:

  1. Data consistency
  2. Update Speed
  3. Size

Data Consistency

It's nice having drop downs and all the rows that mean the same thing having the same FK, right? Pretty obvious. This is really important for DBs with multiple data "editors". But it is only as good as our processes. Let's say it's a flight database and there's an entry for National Airport in Washington DC... and someone adds a NEW entry for Reagan National Airport in Washington DC... the FKs will be there, and will be used in the child table, but won't be worth much... But it's still a good thing to do...

Update Speed

What we should have done is update the row for National Airport with the new name. Because there's just one parent row, it's a very simple change. If my flight table had stored the text, I would have been updating millions of rows.

Size

If I did store "Reagan National Airport" on every record, it would take more space than an FK of say, 19. Size used to be a really big deal, but SAN makes it pretty irrelevant.


Conclusions

OK, so are you worried that your SOLO data collection app can't keep the names of the instruments straight? Is data consistency going to be a challenge?

OK, so how many times do you think you'll change the name of an instrument or data point? I mean, Dissolved O2 is Dissolved O2, Turbidity is Turbidity, right? But if you did need to do a mass update, I bet you'll have downtime between runs to do it. So this isn't an issue.

OK, so size, sure... that's a lot of measurements; but don't make the measurement "Dissolved Oxygen", DO2 is fine... how much bigger is that than some FK like 7? Spend the space to save time.

Don't normalize just because you've always been told that it's the thing good database designers do. Know why you're doing it and why you're choosing what you're choosing.

Stephanie Page
+2  A: 

Have you considered using SqlBulkCopy? It works really fast. I've used it in a production environment and achieved 10,000+ inserts into a single table in less than a second on a SQL Server 2005 machine. You just need to prepare the DataTable(s) to be inserted in bulk from your application. Here is a sample.

public static void SQLBulkCopyInsert(DataTable dtInsertRows, string destinationTableName, string[] columnMappings)
{
    using (SqlBulkCopy sbc = new SqlBulkCopy(DBHelper.Secim2009DB.ConnectionString, SqlBulkCopyOptions.UseInternalTransaction))
    {
        sbc.DestinationTableName = destinationTableName;
        // Number of records to be processed in one go
        sbc.BatchSize = 30000;

        // Map the source columns from the DataTable to the destination columns in the SQL Server 2005 table
        foreach (string columnMapping in columnMappings)
        {
            sbc.ColumnMappings.Add(columnMapping, columnMapping);
        }

        // Number of records after which the client has to be notified about its status
        sbc.NotifyAfter = dtInsertRows.Rows.Count;
        // Event that gets fired when NotifyAfter number of records are processed
        sbc.SqlRowsCopied += new SqlRowsCopiedEventHandler(sbc_SqlRowsCopied);
        // Finally write to the server
        sbc.WriteToServer(dtInsertRows);
        sbc.Close();
    }
}

public static void sbc_SqlRowsCopied(object sender, SqlRowsCopiedEventArgs e)
{
    // Progress notification callback; left empty in this sample.
}
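
Usage would look roughly like this (the table and column names here are placeholders; they have to match the destination table in the database):

// Hypothetical usage: build a DataTable whose columns match the
// destination table and hand it to the helper above.
DataTable dtInsertRows = new DataTable("Measurement");
dtInsertRows.Columns.Add("Id", typeof(Guid));
dtInsertRows.Columns.Add("Timestamp", typeof(DateTime));
dtInsertRows.Columns.Add("Value", typeof(double));

dtInsertRows.Rows.Add(Guid.NewGuid(), DateTime.UtcNow, 42.0);

SQLBulkCopyInsert(dtInsertRows, "Measurement",
    new[] { "Id", "Timestamp", "Value" });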
hakan
Thanks, I was aware of this but since I am using NHibernate I actually wanted to avoid using DataTables or coding SQL by hand. From what I've seen, it does provide substantial performance benefits, but if I can satisfy my requirements with NHibernate, then I'll stick to that. Here is the actual CodeProject link: http://www.codeproject.com/KB/cs/MultipleInsertsIn1dbTrip.aspx
Groo
@Groo, you shouldn't use normal SQL (with or without an ORM) for any operations that are bulk in nature; the best solution is the one given by hakan in this scenario, because SqlBulkCopy was designed specifically to solve the kind of issues you're facing, and in a production environment you need to make sure you have some headroom (i.e., the limit won't be reached for a while). What happens if someone one day decides to attach 3 devices or more to the system and suddenly you have to do 2000 or more inserts per second?
Pop Catalin
@Pop Catalin: Actually, this is a demonstration project where the number of devices is known and fixed, so 500 inserts per second is the maximum theoretical number of inserts for all devices, and it will not actually be reached. Right now with NH+MSSQL I can do 3000/s (and potentially more, since that is where I stopped simulating). And since we have a pretty short deadline, I didn't want to experiment too much if tools I am comfortable with would do the job. NHibernate turned out to be adequate for this phase; we have abstracted our limited set of commands and will easily replace it if necessary.
Groo