This is sort of the next step of the LINQ to DB2 question I asked here.

Following zb_z's answer, I poked around a bit with the code for DB_Linq and have managed to add working DB2 support. (It's still in its infancy, not ready to be contributed back to the project yet.) The proof of concept worked great; it was pretty exciting, actually. However, I've run into another hiccup along the way.

As it turns out, our DB2 database is big. 8,306 tables big. So the code that was generated turned out to be over 5.2 million lines of code. In one file. Needless to say, Visual Studio didn't much care for it :)

So I further modified the generator to spit out each table class into its own file. This left me with 8,307 files (the data context and one for each table, which extend the data context with table properties). Visual Studio still didn't like it, understandably, so I wrapped up the code generation and compilation in a script and just run that to output a DLL for my projects to use.

A 36 MB DLL.

Now, searching around a bit on performance led me to this SO question (which itself references this one); I've followed the answers and the links and see what they're saying. So this leads me to wonder whether it's perhaps the existence of over 8,000 classes within the same namespace that's the culprit behind the noticeable performance issues.

My test for performance was a little console app that initializes the data context, grabs the data with LINQ, prints a row count, then grabs the data with classic ADO and prints another row count. Each statement logs a timestamp. Adding more queries to the test always yields the same result: the LINQ code takes several seconds to run, while ADO fills the dataset in the blink of an eye.

So I guess this ends up being a somewhat open-ended (and long-winded, sorry about that) question. Does anybody have any ideas on speeding up performance here? Anything simple to tweak, or design considerations I could apply?

EDIT

A few things to note:

  1. If I restrict the code generation to a subset of tables (say, 200) then it runs much faster.
  2. Stepping through in the debugger, the time is spent on the line var foo = from t in bank1.TMX9800F where t.T9ADDEP > 0 select t.T9ADDEP. When I expand the property in the debugger to enumerate the results (or let execution continue to the next line, which calls .Count()), that part takes no time at all.
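For contrast, with a well-behaved provider the query-construction line should be nearly free, because LINQ defers execution until the results are enumerated. A self-contained LINQ to Objects sketch of that expected behavior (the Stopwatch numbers are illustrative only, not from my DB2 runs):

```csharp
using System;
using System.Diagnostics;
using System.Linq;

class DeferredExecutionDemo
{
    static void Main()
    {
        var numbers = Enumerable.Range(0, 1000000).AsQueryable();

        var sw = Stopwatch.StartNew();
        // Composing the query only builds an expression tree; nothing runs yet.
        var query = from n in numbers where n % 7 == 0 select n;
        Console.WriteLine("Compose: {0} ms", sw.ElapsedMilliseconds);

        sw.Restart();
        // Execution is deferred to this point, when the query is enumerated.
        int count = query.Count();
        Console.WriteLine("Count = {0}, Execute: {1} ms", count, sw.ElapsedMilliseconds);
    }
}
```

If composition itself is slow, the time is going into the provider's setup work rather than into running SQL.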

EDIT

I can't post the entire generated DLLs, but here's the code for the test app:

static void Main(string[] args)
{
    Console.WriteLine(string.Format("{0}: Process Started", DateTime.Now.ToLongTimeString()));

    // Initialize your data contexts
    var bank1 = new BNKPRD01(new iDB2Connection(ConfigurationManager.ConnectionStrings["DB2"].ConnectionString));
    var bank6 = new BNKPRD06(new iDB2Connection(ConfigurationManager.ConnectionStrings["DB2"].ConnectionString));
    Console.WriteLine(string.Format("{0}: Data contexts initialized", DateTime.Now.ToLongTimeString()));

    var foo = from t in bank1.TMX9800F where t.T9ADDEP > 0 select t; // <- runs slow
    Console.WriteLine(string.Format("{0}: {1} records found in BNKPRD01 test table", DateTime.Now.ToLongTimeString(), foo.Count().ToString()));

    var baz = from t in bank6.TMX9800F where t.T9ADDEP > 0 select t; // <- runs slow
    Console.WriteLine(string.Format("{0}: {1} records found in BNKPRD06 test table", DateTime.Now.ToLongTimeString(), baz.Count().ToString()));

    var ds = new DataSet();
    using (var conn = new iDB2Connection(ConfigurationManager.ConnectionStrings["DB2"].ConnectionString))
    {
        using (var cmd = conn.CreateCommand())
        {
            cmd.CommandText = "SELECT * FROM BNKPRD01.TMX9800F WHERE T9ADDEP > 0";
            new IBM.Data.DB2.iSeries.iDB2DataAdapter(cmd).Fill(ds);
        }
    }
    Console.WriteLine(string.Format("{0}: {1} records found in BNKPRD01 test table", DateTime.Now.ToLongTimeString(), ds.Tables[0].Rows.Count.ToString()));

    ds = new DataSet();
    using (var conn = new iDB2Connection(ConfigurationManager.ConnectionStrings["DB2"].ConnectionString))
    {
        using (var cmd = conn.CreateCommand())
        {
            cmd.CommandText = "SELECT * FROM BNKPRD06.TMX9800F WHERE T9ADDEP > 0";
            new IBM.Data.DB2.iSeries.iDB2DataAdapter(cmd).Fill(ds);
        }
    }
    Console.WriteLine(string.Format("{0}: {1} records found in BNKPRD06 test table", DateTime.Now.ToLongTimeString(), ds.Tables[0].Rows.Count.ToString()));

    Console.WriteLine("Press return to exit.");
    Console.ReadLine();
}

Maybe I'm missing something obvious or there's something about LINQ I didn't grok?

EDIT

Upon discussion with Jon and Brian below, I've stepped further into the DB_Linq code that gets called when the LINQ query is created and came across the long step:

public override IEnumerable<MetaTable> GetTables()
{
    const BindingFlags scope = BindingFlags.GetField |
        BindingFlags.GetProperty | BindingFlags.Static |
        BindingFlags.Instance | BindingFlags.NonPublic |
        BindingFlags.Public;
    var seen = new HashSet<Type>();
    foreach (var info in _ContextType.GetMembers(scope))
    {
        // Only look for Fields & Properties.
        if (info.MemberType != MemberTypes.Field && info.MemberType != MemberTypes.Property)
            continue;
        Type memberType = info.GetMemberType();

        if (memberType == null || !memberType.IsGenericType ||
                memberType.GetGenericTypeDefinition() != typeof(Table<>))
            continue;
        var tableType = memberType.GetGenericArguments()[0];
        if (tableType.IsGenericParameter)
            continue;
        if (seen.Contains(tableType))
            continue;
        seen.Add(tableType);

        MetaTable metaTable;
        if (_Tables.TryGetValue(tableType, out metaTable))
            yield return metaTable;
        else
            yield return AddTableType(tableType);
    }
}

That loop iterates 16,718 times.
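Since the context type can't gain members at runtime, that walk only ever needs to happen once. A self-contained sketch of the memoization idea using Lazy<T> (TableMap, ScanCount, and the string list are hypothetical stand-ins, not DbLinq's actual types):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class TableMap
{
    // Hypothetical stand-in for DbLinq's mapping source; the point is the
    // Lazy<T>, which guarantees the expensive scan runs exactly once.
    private readonly Lazy<List<string>> _tables;
    public int ScanCount; // how many times the expensive scan actually ran

    public TableMap()
    {
        _tables = new Lazy<List<string>>(() =>
        {
            ScanCount++;
            // Stand-in for the 16,718-iteration member walk.
            return Enumerable.Range(0, 16718).Select(i => "T" + i).ToList();
        });
    }

    public IEnumerable<string> GetTables() { return _tables.Value; }
}

class Program
{
    static void Main()
    {
        var map = new TableMap();
        Console.WriteLine(map.GetTables().Count()); // first call does the scan
        Console.WriteLine(map.GetTables().Count()); // cached, no rescan
        Console.WriteLine(map.ScanCount);
    }
}
```

Applied to the real GetTables, the reflection loop would populate a cached list on first call and every later call would return the cache.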

+2  A: 

Posting the console app would really help.

Having many classes in a namespace and in an assembly will slow down compilation and there will be a one-time hit of JITting for each method in each type... but I wouldn't expect it to slow down LINQ queries.

You should check what SQL is actually being generated from your LINQ queries. I would expect the problem to lie there.
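For instance, System.Data.Linq's DataContext exposes a Log TextWriter for exactly this purpose; assuming DbLinq's generated context mirrors that API (whether the DB2 port wires it up is an assumption), something like this would echo the emitted SQL:

```csharp
// Hedged sketch: bank1 is the context from the question. If the Log
// property is hooked up, the generated SELECT appears on stdout when
// the query executes.
bank1.Log = Console.Out;

var foo = from t in bank1.TMX9800F where t.T9ADDEP > 0 select t;
Console.WriteLine(foo.Count());
```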

Jon Skeet
I'd thought so as well, but if I restrict the code generation to a subset of tables (say, 200 of them) then it runs _much_ faster. I'll see if I can grab the generated SQL and update the question with it...
David
I was just looking for a place to post the application when I realized that I'd be publicly posting the entire structure of the core database of a financial institution. We can debate all day whether or not that's a big deal, but ultimately that's not our call :(
David
@David: Ah, the point about getting a speed boost when you've got a subset of tables is potentially very significant. Definitely have a close look at the SQL, even if you can't post it.
Jon Skeet
@David: How interconnected are these tables? And have you tried using precompiled queries? (That way you may still get a big hit the first time, but much faster the second.)
Jon Skeet
To my knowledge the structure inside the database is very flat. They don't make much use (if any) of foreign keys and constraints, so I've simply excluded those from the code generation steps when generating against DB2. Though something just struck me as we talked about the difference with a subset of tables: I'm now thinking it's highly likely that the delay is in the DB_Linq library generating the SQL. I'll have to hook that code up to the debugger and see in more depth where that long step is.
David
@David: Ah - so this isn't a "normal" MS LINQ provider? That could certainly explain things.
Jon Skeet
@Jon: I wish it were :) I'm only slowly seeing how much the DB_Linq library has overridden. Understandably it had to be a lot. But now I'm wondering how many overrides I can comment out and still get _just_ my DB2 part working. I don't need all the other stuff the library supports.
David
+2  A: 

I just created a small test project with 10,000 classes in a namespace, and while there's a noticeable overhead when loading/JITting the assembly, I wouldn't say that it is particularly slow. So it is probably not the number of classes itself that's the reason for the bad performance you're seeing.

I'm with Jon here; it would be helpful to have more info on your test app.

Brian Rasmussen
I think it was your comment above that tipped the scales on this, so I'll give the answer to you. At this point we're past the original question (which you and Jon both effectively answered) so if I come across another potential road block in my development I'll just formulate a new question. Thanks!
David