views: 325
answers: 5

Say I have a few tables in an MSSQL database, each with about 5-10 attributes. There are some simple associations between the tables, but each table has 500,000 to 1,000,000 rows.

There is an algorithm that runs on that data (all of it), so before running the algorithm, I have to retrieve all the data from the database. The algorithm does not change the data, only reads it, so I just need to retrieve the data.

I am using LINQ to SQL. Retrieving all the data takes about two minutes. What I want to know is whether serializing the data to a file and deserializing it later (when needed) would actually load it faster.

The data is about 200 MB, and I don't mind saving it to disk. So, would it be faster to deserialize the objects from a file or to load them through the LINQ to SQL DataContext?
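To make it concrete, here is roughly the kind of file cache I have in mind; just a minimal sketch, where the DTO type, its fields and the file path are placeholders for my real classes:

    using System;
    using System.Collections.Generic;
    using System.IO;
    using System.Runtime.Serialization.Formatters.Binary;

    // Hypothetical plain DTO mirroring one row of a table.
    [Serializable]
    public class MeasurementDto
    {
        public int Id;
        public double Value;
    }

    public static class DataCache
    {
        // Write the rows loaded from the database to a local file once.
        public static void Save(List<MeasurementDto> rows, string path)
        {
            using (FileStream fs = File.Create(path))
                new BinaryFormatter().Serialize(fs, rows);
        }

        // On later runs, read the rows back from the file instead of the database.
        public static List<MeasurementDto> Load(string path)
        {
            using (FileStream fs = File.OpenRead(path))
                return (List<MeasurementDto>)new BinaryFormatter().Deserialize(fs);
        }
    }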

Any experiences with this?

A: 

Why not try it?

Chris Pebble
+2  A: 

You should try to use ADO.NET directly without the LINQ to SQL layer on top of it, i.e. using an SqlDataReader to read the data.

If you work sequentially with the data, you can get the records from the reader when you need them without having to read them all into memory first.
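A minimal sketch of what that could look like (the connection string, table and column names are only placeholders):

    using System.Data.SqlClient;

    static class DataLoader
    {
        // Stream the rows one by one instead of materializing entity objects first.
        public static void ReadAll(string connectionString)
        {
            using (SqlConnection conn = new SqlConnection(connectionString))
            using (SqlCommand cmd = new SqlCommand("SELECT Id, Value FROM dbo.Measurements", conn))
            {
                conn.Open();
                using (SqlDataReader reader = cmd.ExecuteReader())
                {
                    while (reader.Read())
                    {
                        int id = reader.GetInt32(0);
                        double value = reader.GetDouble(1);
                        // Feed (id, value) to the algorithm here.
                    }
                }
            }
        }
    }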

Guffa
+2  A: 

I would argue that LINQ to SQL may not be the best choice for this kind of application. When you are talking about so many objects, you incur considerable overhead just creating the object instances (your persistent classes).

I would choose a solution where a stored procedure retrieves only the necessary data via ADO.NET, the application stores it in memory (memory is cheap nowadays, 200MB should not be a problem) and the analyzing algorithm is run on the in-memory data.
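As a rough sketch of the loading part (the stored procedure name and the columns are made up):

    using System.Collections.Generic;
    using System.Data;
    using System.Data.SqlClient;

    static class DataLoader
    {
        // Call the stored procedure once and keep every row in memory.
        public static List<double[]> LoadAll(string connectionString)
        {
            List<double[]> rows = new List<double[]>();
            using (SqlConnection conn = new SqlConnection(connectionString))
            using (SqlCommand cmd = new SqlCommand("dbo.GetAnalysisData", conn))
            {
                cmd.CommandType = CommandType.StoredProcedure;
                conn.Open();
                using (SqlDataReader reader = cmd.ExecuteReader())
                {
                    while (reader.Read())
                    {
                        rows.Add(new[] { reader.GetDouble(0), reader.GetDouble(1) });
                    }
                }
            }
            return rows;   // the algorithm can now work on the in-memory list
        }
    }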

I don't think you should store the data on file. In the end, your database is also simply one or more files that are read by the database engine. So you either

  • let the database engine read your data and you analyze it, or
  • let the database engine read your data, write it to a file, read the file back (the same data again, but now you do the reading yourself), and analyze the data

The latter option involves a lot of overhead without any advantages as far as I can see.

EDIT: If your data changes very infrequently, you may consider preprocessing it before analysis and caching the preprocessed data somewhere (in the database or on the file system). This only makes sense if the preprocessed data can be analyzed (a lot) faster than the raw data. Maybe some preprocessing can even be done in the database itself.

Ronald Wildenberg
The second option indeed makes no sense, but I was thinking about loading from the database once and then saving it once to a file. Then, instead of loading from the DB every time the program starts, load from the file. So the difference is more like loading-from-db vs. loading-from-file. It seems that LINQ to SQL objects are not serializable by default anyway, so I guess it's a lot of work and partial classes to make them serializable.
kornelijepetak
So it is more of a caching solution then? The data changes very infrequently, so you can read it once and store it for a longer period without worrying about changes? Then you may gain some performance by caching it in a format that your application can analyze more easily. I added some more suggestions to my answer.
Ronald Wildenberg
A: 

If you have a process that operates on most of the data in a database... then that sounds like a job for a stored procedure. It won't be object oriented, but it will be a lot faster and less brittle.

Marc Gravell
Well, it's a genetic algorithm. I wouldn't even think about doing that with stored procedures. :)
kornelijepetak
Indeed, that would be done better in C#/etc
Marc Gravell
A: 

Since you are doing this in C# and your database is MS SQL Server (since you use LINQ to SQL), could you not run your code in a managed stored procedure? That would allow you to keep your current code as it is, but loading the data would be much faster since the code would be running inside SQL Server.
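A rough sketch of such a SQL CLR procedure (the table, columns and procedure name are made up; the assembly still has to be registered on the server with CREATE ASSEMBLY and CREATE PROCEDURE):

    using System.Data.SqlClient;
    using Microsoft.SqlServer.Server;

    public partial class StoredProcedures
    {
        [SqlProcedure]
        public static void RunAnalysis()
        {
            // The context connection runs in-process, so the data never leaves the server.
            using (SqlConnection conn = new SqlConnection("context connection=true"))
            {
                conn.Open();
                SqlCommand cmd = new SqlCommand("SELECT Id, Value FROM dbo.Measurements", conn);
                using (SqlDataReader reader = cmd.ExecuteReader())
                {
                    while (reader.Read())
                    {
                        // Feed the row to the algorithm here.
                    }
                }
                SqlContext.Pipe.Send("Analysis finished");   // report back to the caller
            }
        }
    }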

Rune Grimstad