views:

203

answers:

4

EDIT: Problem wasn't related to the question. It was indeed something wrong with my code, and actually, it was so simple that I don't want to put it on the internet. Thanks anyway.

I read in roughly 550k Active directory records and store them in a List, the class being a simple wrapper for an AD user. I then split the list of ADRecords into four lists, each containing a quarter of the total. After I do this, I read in about 400k records from a database, known as EDR records, into a DataTable. I take the four quarters of my list and spawn four threads, passing each one of the four quarters. I have to match the AD records to the EDR records using email right now, but we plan to add more things to match on later.

I have a foreach on the list of AD records, and inside of that, I have to run a for loop on the EDR records to check each one, because if an AD record matches more than one EDR record, then that isn't a direct match, and should not be treated as a direct match.

My problem, by the time I get to this foreach on the list, my ADRecords list only has about 130 records in it, but right after I pull them all in, I Console.WriteLine the count, and it's 544k.

I am starting to think that even though I haven't set the list to null to be collected later, C# or Windows or something is actually taking my list away to make room for the EDR records because I haven't used the list in a while. The database that I have to use to read EDR records is a linked server, so it takes about 10 minutes to read them all in, so my list is actually idle for 10 minutes, but it's never set to null.

Any ideas?

//splitting list and passing in values to threads.
List<ADRecord> adRecords = GetAllADRecords();
        for (int i = 0; i < adRecords.Count/4; i++)
        {
            firstQuarter.Add(adRecords[i]);
        }
        for (int i = adRecords.Count/4; i < adRecords.Count/2; i++)
        {
            secondQuarter.Add(adRecords[i]);
        }
        for (int i = adRecords.Count/2; i < (adRecords.Count/4)*3; i++)
        {
            thirdQuarter.Add(adRecords[i]);
        }
        for (int i = (adRecords.Count/4)*3; i < adRecords.Count; i++)
        {
            fourthQuarter.Add(adRecords[i]);
        }
        DataTable edrRecordsTable = GetAllEDRRecords();

        DataRow[] edrRecords = edrRecordsTable.Select("Email_Address is not null and Email_Address <> ''", "Email_Address");
        Dictionary<string, int> letterPlaces = FindLetterPlaces(edrRecords);
        Thread one = new Thread(delegate() { ProcessMatches(firstQuarter, edrRecords, letterPlaces); });
        Thread two = new Thread(delegate() { ProcessMatches(secondQuarter, edrRecords,  letterPlaces); });
        Thread three = new Thread(delegate() { ProcessMatches(thirdQuarter, edrRecords,  letterPlaces); });
        Thread four = new Thread(delegate() { ProcessMatches(fourthQuarter, edrRecords, letterPlaces); });
        one.Start();
        two.Start();
        three.Start();
        four.Start();

In ProcessMatches, there is a foreach on the List of ADRecords passed in. The first line in the foreach is AdRecordsProcessed++; which is a global static int, and the program finishes with it at 130 instead of the 544k.

+4  A: 

The variable is never set to null and is still in scope? If so, it shouldn't be collected and idle time isn't your problem.

First issue I see is:

AdRecordsProcessed++; 

Are you locking that global variable before updating it? If not, and depending on how fast the records are processed, it's going to be lower than you expect.

Try running it from a single thread (i.e. pass in adRecords instead of firstQuarter and don't start the other threads. Does it work as expected with 1 thread?

Kendrick
What does one do with an "answer" such as this? :) Sounds like you're asking the OP a question -- i.e. a comment.
Kirk Woll
My question is answered in the post now, but you're right, the second half of my answer should have been a comment.
Kendrick
What about the first half?
jalf
@jalf the first half was the answer. "idle time isn't your problem". I've left that in place.
Kendrick
In response to your "First issue", you can't lock on ints. the lock command requires a reference type.
seekerOfKnowledge
I haven't tried, but I still assume the ++ operation is not thread safe and therefore may be an issue. Lock the block of code that modifies the variable and then you've effectivly locked the variable. Either way, I'd run it with a single thread (possibly on a reduced rowset) and see if it behaves there. If so, threading may be your issue.
Kendrick
"static readonly object _locker = new object();" with this at the class level, I can lock on it around my ++ operations, and my numbers are more accurate. Thank you.
seekerOfKnowledge
+2  A: 

Firstly, you don't set a list to null. What you might do is set every reference to a list to null (or to another list), or all such references might simply fall out of scope. This may seem like a nitpick point, but if you are having to examine what is happening to your data it's time to be nitpicky on such things.

Secondly, getting the GC to deallocate something that has a live reference is pretty hard to do. You can fake it with a WeakReference<> or think you've found it when you hit a bug in a finaliser (because the reference isn't actually live, and even then its a matter of the finaliser trying to deal with a finalised rather than deallocated object). Bugs can happen everywhere, but that you've found a way to make the GC deallocate something that is live is highly unlikely.

The GC will be likely do two things with your list:

  1. It is quite likely to compact the memory used by it, which will move its component items around.
  2. It is quite likely to promote it to a higher generation.

Neither of these are going to have any changes you will detect unless you actually look for them (obviously you'll notice a change in generation if you keep calling GetGeneration(), but aside from that you aren't really going to).

The memory used could also be paged out, but it will be paged back in when you go to use the objects. Again, no effect you will notice.

Finally, if the GC did deallocate something, you wouldn't have a reduced number of items, you'd have a crash, because if objects just got deallocated the system will still try to use the supposedly live references to them.

So, while the GC or the OS may do something to make room for your other object, it isn't something observable in code, and it does not stop the object from being available and in the same programmatic state.

Something else is the problem.

Jon Hanna
This is almost worth a downvote. GC is almost certainly not a problem here, and your answer suggests that it might be.
John Saunders
@John, where do I suggest that GC might be causing problems? I've explained how the things GC would do won't deallocate anything, and that if it somehow did it would cause a crash rather than a reduction in items, before saying "Something else is the problem". What in that suggests GC is the problem?
Jon Hanna
@Jon: the OP was looking for the problem with his code, and suggested that the problem might be the GC. That was almost certainly not the case, so I felt you should not have mentioned the GC in your answer.
John Saunders
@John. Since I don't know what actually is the problem, not mentioning the GC in explaining why it probably isn't the problem is quite difficult.
Jon Hanna
A: 

Is there a reason you have to get all the data all at once? If you break the data up into chunks it should be more manageable. All I know is having to get into GC stuff is a little smelly. Best to look at refactoring your code.

Wix
I suppose this will have to be the case. I'll add a final comment to the question giving another detail that might stump some people, but who knows.
seekerOfKnowledge
A: 

The garbage collector will not collect:

  • A global variable
  • Objects managed by static objects
  • A local variable
  • A variable referencable by any method on the call stack

So if you can reference it from your code, there is no possibility that the garbage collector collected it. No way, no how.

In order for the collector to collect it, all references to it must have gone away. And if you can see it, that's most definitely not the case.

Mike Hofer