I have written a multithreaded crawler. The process simply creates threads and has them work through a list of URLs to crawl; each thread fetches its URLs and parses the HTML content. All of this seems to work fine. The issues start when I need to write to tables in a database.

I have two ArrayLists that hold the content each thread parses: the first holds the RSS feed links and the second holds the posts. I then use a For Each loop to iterate over one list while incrementing an index into the other, and write to the database. My problem is that each time a new thread accesses one of the lists, its content changes, and this affects the iteration. I tried nested loops, but that did not work either. Everything works fine with a single thread. I hope this makes sense. Here is my code:

    SyncLock dlock
        For Each l As String In links
            finallinks.Add(l)
        Next
    End SyncLock

    SyncLock dlock
        For Each p As String In posts
            finalposts.Add(p)
        Next
    End SyncLock
    ...
    Dim i As Integer = 0
    SyncLock dlock
        For Each rsslink As String In finallinks
            postlink = finalposts.Item(i)
            i = i + 1

finallinks and finalposts are the two ArrayLists. I did not include the rest of the code, which shows the threads working, but this is the essential part. The error occurs here:

    postlink = finalposts.Item(i)
    i = i + 1

ERROR: Index was out of range. Must be non-negative and less than the size of the collection.

Is there an alternative?

+1  A: 

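It looks like the collection finallinks is larger than finalposts; that's all. The loop walks finallinks but indexes into finalposts, so i runs past the end of the shorter list.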
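A minimal sketch of one way to guard against the mismatch, assuming both lists hold strings and that entries at the same index belong together. `PairUp` is a hypothetical helper, not part of the original code; it stops at the length of the shorter list, so the index can never run past the end of either one.

```vb
Imports System
Imports System.Collections
Imports System.Collections.Generic

Module PairingSketch
    ' Hypothetical helper: pairs entries by index, stopping at the shorter list
    ' so that indexing can never run past the end of either list.
    Public Function PairUp(ByVal links As ArrayList, ByVal posts As ArrayList) As List(Of KeyValuePair(Of String, String))
        Dim pairs As New List(Of KeyValuePair(Of String, String))()
        Dim count As Integer = Math.Min(links.Count, posts.Count)
        For i As Integer = 0 To count - 1
            pairs.Add(New KeyValuePair(Of String, String)(CStr(links(i)), CStr(posts(i))))
        Next
        Return pairs
    End Function

    Sub Main()
        ' Illustrative data only: three links but two posts.
        Dim links As New ArrayList() From {"feedA", "feedB", "feedC"}
        Dim posts As New ArrayList() From {"post1", "post2"}
        For Each pair As KeyValuePair(Of String, String) In PairUp(links, posts)
            Console.WriteLine(pair.Key & " -> " & pair.Value)
        Next
    End Sub
End Module
```

This only hides the symptom. If the two lists are meant to stay the same length, the threads probably need to add a link and its post together under a single lock.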

Andrey
OK, I feel dumb now; thanks for your response. This works fine with a single thread, but how else can I lock finallinks and finalposts to write them to a table, while the lists that get copied into finallinks and finalposts stay in use by the threads and are not locked, for performance reasons? Does this make sense? Do you have an alternative way I should do this?
vbNewbie
Is there anyone who can help with this?
vbNewbie
Your request is getting complicated. Read some articles about multithreading, then you will be able to work it out yourself.
Andrey
+1  A: 

See if a ProducerConsumer class will work for you. Your parsing threads will be the producers and your database threads will be the consumers.

If you read the linked page, and try out the code, you should be able to adapt it to your needs.
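The linked page is not reproduced here, so the following is only a minimal sketch of the general pattern, built on `Monitor.Wait` and `Monitor.Pulse`. The class name `WorkQueue` and the demo strings are hypothetical; the `Console.WriteLine` call stands in for the database write.

```vb
Imports System
Imports System.Collections.Generic
Imports System.Threading

' Hypothetical minimal blocking queue built on Monitor.Wait/Pulse.
Public Class WorkQueue(Of T)
    Private ReadOnly gate As New Object()
    Private ReadOnly items As New Queue(Of T)()

    ' Called by producer (parser) threads.
    Public Sub Produce(ByVal item As T)
        SyncLock gate
            items.Enqueue(item)
            Monitor.Pulse(gate) ' wake one waiting consumer, if any
        End SyncLock
    End Sub

    ' Called by the consumer (database) thread; blocks until an item arrives.
    Public Function Consume() As T
        SyncLock gate
            While items.Count = 0
                Monitor.Wait(gate) ' releases the lock while waiting
            End While
            Return items.Dequeue()
        End SyncLock
    End Function
End Class

Module QueueDemo
    Sub Main()
        Dim work As New WorkQueue(Of String)()
        Dim producer As New Thread(
            Sub()
                For i As Integer = 1 To 3
                    work.Produce("post " & i)
                Next
            End Sub)
        producer.Start()
        For i As Integer = 1 To 3
            Console.WriteLine(work.Consume()) ' stand-in for the database write
        Next
        producer.Join()
    End Sub
End Module
```

The parsing threads never touch the list the database thread is iterating; they only hand items over through the queue.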

magnifico
+1  A: 

I believe there is a .NET container called BlockingCollection (in System.Collections.Concurrent, .NET 4 and later) that is suitable for producer-consumer patterns. I assume you are working in VB.NET.
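A minimal sketch of the producer-consumer pattern with `BlockingCollection(Of T)` from `System.Collections.Concurrent`, assuming .NET 4 or later. `RunPipeline`, the feed names, and the `saved` list are hypothetical stand-ins for the crawler threads and the database write. Each link travels together with its post as one pair, so the two can never get out of step.

```vb
Imports System
Imports System.Collections.Concurrent
Imports System.Collections.Generic
Imports System.Threading.Tasks

Module BlockingCollectionSketch
    ' Hypothetical stand-in for crawl + database write: two producers add
    ' (link, post) pairs; one consumer drains them and returns what it "saved".
    Public Function RunPipeline() As List(Of String)
        Dim pending As New BlockingCollection(Of KeyValuePair(Of String, String))()
        Dim saved As New List(Of String)()

        Dim consumer As Task = Task.Factory.StartNew(
            Sub()
                ' Blocks while empty; ends once CompleteAdding was called and the queue is drained.
                For Each pair As KeyValuePair(Of String, String) In pending.GetConsumingEnumerable()
                    saved.Add(pair.Key & " | " & pair.Value)
                Next
            End Sub)

        Dim producers(1) As Task
        For n As Integer = 0 To 1
            Dim feed As String = "feed" & n
            producers(n) = Task.Factory.StartNew(
                Sub()
                    For i As Integer = 1 To 3
                        pending.Add(New KeyValuePair(Of String, String)(feed, "post" & i))
                    Next
                End Sub)
        Next

        Task.WaitAll(producers)
        pending.CompleteAdding() ' tells the consumer no more items are coming
        consumer.Wait()
        Return saved
    End Function

    Sub Main()
        Console.WriteLine(RunPipeline().Count)
    End Sub
End Module
```

Only the consumer touches `saved`, so that list needs no lock of its own.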

dutt
+1  A: 

About the question you asked Andrey:

You cannot (or rather, you shouldn't) read from and write to finallinks and finalposts at the same time, so you need to lock: the ArrayList instance methods are not thread-safe.

To put it simply, you cannot add items to the lists while you are reading them to write to a datatable. What you can do is lock the lists, create a clone of each to write to the datatable, clear the originals, and unlock them. This way you have one list to write to the DB and another to be filled by the threads.
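A minimal sketch of the lock, clone, clear, unlock sequence, reusing the names `dlock`, `finallinks`, and `finalposts` from the question. `AddResult` and `WriteBatch` are hypothetical helpers, and `Console.WriteLine` stands in for the real database insert.

```vb
Imports System
Imports System.Collections

Module SnapshotSketch
    Private ReadOnly dlock As New Object()
    Private finallinks As New ArrayList()
    Private finalposts As New ArrayList()

    ' Called by crawler threads: add a link and its post under one lock
    ' so both lists always have the same length.
    Public Sub AddResult(ByVal link As String, ByVal post As String)
        SyncLock dlock
            finallinks.Add(link)
            finalposts.Add(post)
        End SyncLock
    End Sub

    ' Called by the writer: copy both lists and clear the originals while locked,
    ' then do the slow database work on the copies outside the lock.
    Public Function WriteBatch() As Integer
        Dim linksCopy As ArrayList
        Dim postsCopy As ArrayList
        SyncLock dlock
            linksCopy = CType(finallinks.Clone(), ArrayList)
            postsCopy = CType(finalposts.Clone(), ArrayList)
            finallinks.Clear()
            finalposts.Clear()
        End SyncLock

        For i As Integer = 0 To linksCopy.Count - 1
            ' Stand-in for the real database insert.
            Console.WriteLine(CStr(linksCopy(i)) & " | " & CStr(postsCopy(i)))
        Next
        Return linksCopy.Count
    End Function

    Sub Main()
        AddResult("feedA", "post1")
        AddResult("feedB", "post2")
        WriteBatch()
    End Sub
End Module
```

The lock is held only for the copy and clear, so crawler threads wait briefly rather than for the whole database write.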

I hope this helps.

SoMoS
Here is what I did before seeing everyone else's posts: SyncLock dlock For Each l As String In links finallinks.Add(l) Next End SyncLock 'lock temp posts to add to list SyncLock dlock For Each p As String In posts finalposts.Add(p) Next End SyncLock Try SyncLock dlock Dim outfile As New StreamWriter(Environment.CurrentDirectory, True) Dim i As Integer = 0 For Each link As String In finallinks While i < postlink.Count() postlink = finalposts.Item(i) i = i + 1 End While
vbNewbie
It hurts to read your code here :) Can you put it in an edit to your question?
SoMoS