I want to process some data. I have about 25k items in a Dictionary. In a foreach loop, I query a database for results on each item and store them as that item's value in the Dictionary.

foreach (KeyValuePair<string, Type> pair in allPeople)
{
    MySqlCommand comd = new MySqlCommand("SELECT * FROM `logs` WHERE IP = '" + pair.Key + "' GROUP BY src", con);
    MySqlDataReader reader2 = comd.ExecuteReader();
    Dictionary<string, Dictionary<int, Log>> allViews = new Dictionary<string, Dictionary<int, Log>>();
    while (reader2.Read())
    {
        if (!allViews.ContainsKey(reader2.GetString("src")))
        {
            allViews.Add(reader2.GetString("src"), reader2.GetInt32("time"));
        }
    }
    reader2.Close();
    reader2.Dispose();
    allPeople[pair.Key].View = allViews;
}

I was hoping to be able to do this faster with multi-threading. I have 8 threads available, and CPU usage is about 13%. I just don't know if it will help, because the work relies on the MySQL server. On the other hand, maybe 8 threads would open 8 DB connections and so be faster.

Anyway, if multi-threading would help in my case, how? o.O I've never worked with (multiple) threads, so any help would be great :D

+2  A: 

The biggest problem that comes to mind is that you are going to use multithreading to add values to a dictionary, which isn't thread safe.

You'll have to synchronise access to the dictionary (e.g. with a lock) to make it work, and you might not get that much of a benefit from implementing it this way, as each thread still has to lock the dictionary object to add a value.
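
For illustration, here is a minimal sketch of the kind of locking meant (the shared dictionary, lock object and helper are illustrative names, not from the original code):

    // Sketch only: sharedViews, viewsLock and AddView are illustrative names.
    private static readonly object viewsLock = new object();
    private static readonly Dictionary<string, int> sharedViews = new Dictionary<string, int>();

    private static void AddView(string src, int time)
    {
        lock (viewsLock) // only one thread at a time may touch the dictionary
        {
            if (!sharedViews.ContainsKey(src))
            {
                sharedViews.Add(src, time);
            }
        }
    }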

Kevin
Well, the dictionary you see is only a temporary one. For each item in the main dictionary, the value is altered in this loop using the temporary dictionary. No entries in the main dictionary are removed or added, only altered. I thought that making it multi-threaded would let 8 threads handle the first 8 entries of the dictionary, and when one thread finishes, it takes entry 9, and so on.
lordstyx
The allViews dictionary is the problem here. See, if there were two threads trying to write to allViews at the same time without a lock, that'd create a problem. While you could use locks to get around the issue, I'm not sure how much gain you'd get from it. That said, I have to agree with Andrey's statement about the reader - reader2 is the real culprit. There are workarounds for most other things, but not for reader2.
diadem
Multithreading works however you tell it to work. Generally everything happens at the same time, which can lead to its own set of issues. Without safeguards it's like an intersection with no traffic light.
diadem
Also, reading from the same MySqlDataReader in multiple threads isn't safe.
nos
It depends on whether you are talking about the allViews or allPeople dictionary and how this code can be written to take advantage of parallelism. I assumed that all the code within the foreach would be the processing unit, so allViews is only touched by one thread and allPeople only has a property updated on one of its values. Whether the values are unique determines whether updating the property is thread safe. Looking at the code snippet, it seems reasonable to assume this is the case.
btlog
Since the OP is using C# 4.0, he/she should use the [ConcurrentDictionary](http://msdn.microsoft.com/en-us/library/dd287191.aspx) and not have any problems with multithreaded modification of the dictionary.
Lirik
+5  A: 

MySqlDataReader is stateful - you call Read() on it and it moves to the next row, so each thread needs its own reader, and you need to concoct a query so that they get different values. That might not be too hard, as you naturally have many queries with different values of pair.Key.

You also need to either have a temp dictionary per thread, and then merge them, or use a lock to prevent concurrent modification of the dictionary.

The above assumes that MySQL will allow a single connection to perform concurrent queries; otherwise you may need multiple connections too.

First though, I'd see what happens if you only ask the database for the data you need ("SELECT src,time FROM `logs` WHERE IP = '" + pair.Key + "' GROUP BY src") and use GetString(0) and GetInt32(1) instead of using the names to look up the src and time; also only get the values once from the result.

I'm also not sure on the logic - you are not ordering the log events by time, so which one is the first returned (and so is stored in the dictionary) could be any of them.

Something like the logic below - where each of N threads only operates on every Nth pair, each thread has its own reader, and nothing actually changes allPeople, only the properties of the values in allPeople:

    private void RunSubQuery(Dictionary<string, Type> allPeople, MySqlConnection con, int threadNumber, int threadCount)
    {
        int hoppity = 0; // used to hop over the keys not processed by this thread

        foreach (var pair in allPeople)
        {
            // each of the (threadCount) threads only processes the (threadCount)th key
            if ((hoppity % threadCount) == threadNumber)
            {
                // you may need con per thread, or it might be that you can share con; I don't know
                MySqlCommand comd = new MySqlCommand("SELECT src,time FROM `logs` WHERE IP = '" + pair.Key + "' GROUP BY src", con);

                using (MySqlDataReader reader = comd.ExecuteReader())
                {
                    var allViews = new Dictionary<string, Dictionary<int, Log>>();

                    while (reader.Read())
                    {
                        string src = reader.GetString(0);
                        int time = reader.GetInt32(1);

                        // do whatever to allViews with src and time
                    }

                    // no thread will be modifying the same pair.Value, so this is safe
                    pair.Value.View = allViews;
                }
            }

            ++hoppity;
        }
    }

This isn't tested - I don't have MySQL on this machine, nor do I have your database and the other types you're using. It's also rather procedural (kind of how you would do it in Fortran with OpenMPI) rather than wrapping everything up in task objects.

You could launch threads for this like so:

    void RunQuery(Dictionary<string, Type> allPeople, MySqlConnection connection)
    {
        lock (allPeople)
        {
            const int threadCount = 8; // the number of threads

            // if it takes 18 minutes currently and you're not at .net 4 yet, then you may as well create
            // the threads here as any saving of using a pool will not matter against 18 minutes
            //
            // it could be more efficient to use a pool so that each thread takes a pair off of 
            // a queue, as doing it this way means that each thread has the same number of pairs to process,
            // and some pairs might take longer than others
            Thread[] threads = new Thread[threadCount];

            for (int threadNumber = 0; threadNumber < threadCount; ++threadNumber)
            {
                // copy the loop variable so the lambda doesn't capture the shared loop counter
                int localThreadNumber = threadNumber;
                threads[threadNumber] = new Thread(new ThreadStart(() => RunSubQuery(allPeople, connection, localThreadNumber, threadCount)));
                threads[threadNumber].Start();
            }

            // wait for all threads to finish
            for (int threadNumber = 0; threadNumber < threadCount; ++threadNumber)
            {
                threads[threadNumber].Join();
            }
        }
    }

The extra lock held on allPeople is done so that there is a write barrier after all the threads return; I'm not quite sure if it's needed. Any object would do.

Nothing in this guarantees any performance gain - it might be that the MySQL libraries are single threaded, but the server certainly can handle multiple connections. Measure with various numbers of threads.


If you're using .net 4, then you don't have to mess around creating the threads or skipping the items you aren't working on:

    // this time using .net 4 parallel; assumes that connection is thread safe
    static void RunQuery(Dictionary<string, Type> allPeople, MySqlConnection connection)
    {
        Parallel.ForEach(allPeople, pair => RunPairQuery(pair, connection));
    }

    private static void RunPairQuery(KeyValuePair<string, Type> pair, MySqlConnection connection)
    {
        MySqlCommand comd = new MySqlCommand("SELECT src,time FROM `logs` WHERE IP = '" + pair.Key + "' GROUP BY src", connection);

        using (MySqlDataReader reader = comd.ExecuteReader())
        {
            var allViews = new Dictionary<string, Dictionary<int, Log>>();

            while (reader.Read())
            {
                string src = reader.GetString(0);
                int time = reader.GetInt32(1);

                // do whatever to allViews with src and time
            }

            // no iteration will be modifying the same pair.Value, so this is safe
            pair.Value.View = allViews;
        }
    }
Pete Kirkham
Please bear in mind that segmenting the SQL statements may or may not degrade performance further - testing would need to be done.
diadem
I am currently testing with a query that only takes the fields I need.
lordstyx
Okay, when I only query the fields I need it still takes 18 minutes; it's only a few ms faster. I will try your code. I have one question though xD I have no idea how to "// run this in each thread, with successive values of threadNumber". Could you explain? Like I said, I'm completely new to multi-threading.
lordstyx
A: 

I think if you are running this on a multi-core machine you could gain benefits from multi-threading.

However, the way I would approach it is to first look at unblocking the thread you are currently using by making asynchronous database calls. The callbacks will execute on background threads, so you will get some multi-core benefit there and you won't be blocking threads waiting for the db to come back.

For IO-intensive apps like this one, you are likely to see improved throughput, depending on what load the db can handle. Assuming the db scales to handle more than one concurrent request, you should be good.
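
To sketch the idea (rough and untested; connectionString and the per-row handling are placeholders, and it assumes opening one connection per work item is acceptable):

    // Rough sketch only: push each blocking query onto the thread pool so the
    // calling thread isn't blocked waiting on MySQL. connectionString and the
    // row handling below are placeholders, not from the original post.
    foreach (KeyValuePair<string, Type> pair in allPeople)
    {
        KeyValuePair<string, Type> current = pair; // copy for the closure

        ThreadPool.QueueUserWorkItem(_ =>
        {
            // this callback runs on a background (thread pool) thread
            using (MySqlConnection con = new MySqlConnection(connectionString))
            {
                con.Open();

                MySqlCommand comd = new MySqlCommand(
                    "SELECT src, time FROM `logs` WHERE IP = ?key GROUP BY src", con);
                comd.Parameters.AddWithValue("?key", current.Key);

                using (MySqlDataReader reader = comd.ExecuteReader())
                {
                    while (reader.Read())
                    {
                        // build the per-person view here, as in the original loop
                    }
                }
            }
        });
    }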

btlog
+1  A: 

Before you do anything else, find out exactly where the time is being spent. Check the execution plan of the query. The first thing I'd suspect is a missing index on logs.IP.
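
For example (assuming MySQL and that the column really is called IP; the IP value is just a sample), something along these lines would show the query plan and add the missing index:

    EXPLAIN SELECT src, time FROM logs WHERE IP = '1.2.3.4' GROUP BY src;

    ALTER TABLE logs ADD INDEX idx_logs_ip (IP);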

18 minutes for something like this seems much too long to me. Even if you could cut the execution time by a factor of eight by adding more threads (which is unlikely!), you would still end up at more than 2 minutes. You could probably read the whole 25k rows into memory in less than five seconds and do the necessary processing in memory...

EDIT: Just to clarify, I'm not advocating actually doing this in memory, just saying that it looks like there's a bigger bottleneck here that can be removed.

odd parity
A: 

Thanks everyone for your help. Currently I am using this:

for (int i = 0; i < 8; i++)
{
    ThreadPool.QueueUserWorkItem(addDistinctScres, i);
}

That uses the ThreadPool to run all the threads. I use the method provided by Pete Kirkham, and I'm creating a new connection per thread. Times went down to 4 minutes.

Next I'll make something wait for the thread pool work items to finish before performing other functions.
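
One way to do that (a sketch, assuming .NET 4's CountdownEvent and the existing addDistinctScres(object) worker method):

    const int workerCount = 8;
    using (CountdownEvent done = new CountdownEvent(workerCount))
    {
        for (int i = 0; i < workerCount; i++)
        {
            int workerIndex = i; // copy so the lambda doesn't capture the loop counter
            ThreadPool.QueueUserWorkItem(_ =>
            {
                try
                {
                    addDistinctScres(workerIndex); // existing worker method
                }
                finally
                {
                    done.Signal(); // mark this worker as finished
                }
            });
        }

        done.Wait(); // blocks until all workers have signalled
    }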

I think the bottleneck now is the MySQL server, because CPU usage drops periodically.

@odd parity I thought about that, but the real thing is waaay more than 25k rows. Idk if that'd work.

lordstyx
Hehe, I didn't mean that you should _actually_ load the whole table into memory, just that if you compare the time an "ideal" approach would take with the current time spent it is obvious that there is a bottleneck somewhere that hasn't been addressed. All I'm saying is that your time would probably be better spent finding that bottleneck than optimizing the current code.
odd parity
Oh okay. Well, after removing the single-thread bottleneck (now using 8 :D) I think the new bottleneck is the database, as CPU usage periodically drops during the process. I don't know what else it could be.
lordstyx
+1  A: 

Assumptions:

  1. There is a table People in your database
  2. There are a lot of people in your database

Each database query adds overhead, and you are doing one db query for each of the people in your database. I would suggest it would be faster to get all the data back in one query than to make repeated calls:

select l.ip,l.time,l.src 
  from logs l, people p 
  where l.ip = p.ip
  group by l.ip, l.src

Try this with a loop in a single thread; I believe it will be much faster than your existing code.
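
A rough sketch of consuming that single query in one loop (untested; the column order and how the View gets built are assumptions based on the original snippet):

    MySqlCommand comd = new MySqlCommand(
        "SELECT l.ip, l.time, l.src FROM logs l, people p WHERE l.ip = p.ip GROUP BY l.ip, l.src", con);

    using (MySqlDataReader reader = comd.ExecuteReader())
    {
        while (reader.Read())
        {
            string ip = reader.GetString(0);
            int time = reader.GetInt32(1);
            string src = reader.GetString(2);

            // route each row back to the right person by IP
            Type person;
            if (allPeople.TryGetValue(ip, out person))
            {
                // add src/time to person.View here, as in the original inner loop
            }
        }
    }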

Within your existing code, another thing you can do is take the creation of the MySqlCommand out of the loop, prepare it in advance and just change the parameter. This should speed up execution of the SQL. See http://dev.mysql.com/doc/refman/5.0/es/connector-net-examples-mysqlcommand.html#connector-net-examples-mysqlcommand-prepare

MySqlCommand comd = new MySqlCommand("SELECT * FROM `logs` WHERE IP = ?key GROUP BY src", con);
comd.Parameters.Add("?key", "example");
comd.Prepare();
foreach (KeyValuePair<string, Type> pair in allPeople)
{
    comd.Parameters[0].Value = pair.Key;
    // execute the reader and process the rows as before
}

If you are using multiple threads, each thread will still need its own Command. At least in MS-SQL this would still be faster even if you recreated and prepared the statement every time, due to the SQL server's ability to cache the execution plan of a parameterised statement.

David Waters
No, there's one table with the columns IP | time | src. I am actually trying to make something that will analyse those logs and find spam/patterns in them. It's nothing too serious, and I was just wondering how I could load it faster :P
lordstyx
@lordstyx in your post where does the allPeople variable come from and how many elements are there in it?
David Waters
allPeople is a global Dictionary with the IP as key, and the values are assigned in this function. The IPs are fetched in a different function.
lordstyx
Interesting. Not creating a new command every loop and just changing the parameter cut another 30 sec off the time o.O @Pete Kirkham: when I use that Parallel.ForEach method I get a bunch of errors saying that the reader on that connection must be closed first. I could try to make a connection every time, but I don't know how that would influence times.
lordstyx
A: 

This sounds like the perfect job for map/reduce. I am not a .NET programmer, but this seems like a reasonable guide: http://ox.no/posts/minimalistic-mapreduce-in-net-4-0-with-the-new-task-parallel-library-tpl

olle