I would like to implement a multithreaded crawler using the single-threaded crawler code I have now. Basically I read the URLs from a text file, take each one, and crawl and parse it. I know the basics of creating a thread and assigning a procedure to it, but I am not sure how to implement it in the following way:

I need at least 3 threads, and I need to assign a URL to each thread from a list of URLs; each thread then needs to fetch and parse its URL before adding the contents to a database.

Dim gthread, tthread, ithread As Thread

        ' processUrl must take a single Object argument to be started with a url;
        ' each thread should be handed a different url from the list
        gthread = New Thread(AddressOf processUrl)
        gthread.Start(url)

        tthread = New Thread(AddressOf processUrl)
        tthread.Start(url)

        ithread = New Thread(AddressOf processUrl)
        ithread.Start(url)

WaitUntilAllAreOver:

        ' Busy-wait until all three threads have finished
        If gthread.IsAlive OrElse tthread.IsAlive OrElse ithread.IsAlive Then
            Thread.Sleep(5)
            GoTo WaitUntilAllAreOver
        End If

'etc..

Now the code may not make sense, but what I need to do is assign a unique URL to each thread to process.

Any ideas appreciated

+2  A: 

The best way to wait for the Thread instances to finish is to call the .Join method. Take the following example:

Public Sub ParseAll(ByVal ParamArray urls As Uri())
  ' Spawn one thread per url; ProcessUrl must accept a single Object argument
  Dim list As New List(Of Thread)
  For Each url In urls
    Dim thread = New Thread(AddressOf ProcessUrl)
    thread.Start(url)
    list.Add(thread)
  Next
  ' Join blocks the calling thread until each worker has finished
  For Each thread In list
    thread.Join()
  Next
End Sub

Though you may want to consider using the ThreadPool here. The ThreadPool is designed for spawning off lots of small tasks very efficiently.
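For illustration, a minimal sketch of the ThreadPool route under the same assumptions as above (a ProcessUrl routine taking a single Object argument); the CountdownEvent used to wait for completion requires .NET 4:

Imports System.Threading

Public Sub ParseAllPooled(ByVal ParamArray urls As Uri())
  ' Track outstanding work items so we know when every url is done
  Using pending As New CountdownEvent(urls.Length)
    For Each url In urls
      ThreadPool.QueueUserWorkItem(
        Sub(state)
          Try
            ProcessUrl(state)
          Finally
            pending.Signal() ' one fewer item outstanding
          End Try
        End Sub, url)
    Next
    pending.Wait() ' block until every queued url has been processed
  End Using
End Sub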

JaredPar
JaredPar: depending on the size of the list, this could be very inefficient. Besides, the question asked to limit it to 3 threads.
Lucas B
Let me see if I understand -- list is the list of threads; then for each url in the url list, create a new thread, assign it the function processUrl, start the thread, and add it to the thread list. From what I have read, .Join blocks the calling thread until the other is complete. But I want the threads to fetch the url, parse it, and add to the database concurrently to speed up the process?
vbNewbie
@Lucas B, essentially any hand management of threading can be inefficient. Scheduling of many threads is best left up to APIs like the ThreadPool, but the OP asked for straight Thread instances.
JaredPar
Thanks JaredPar, appreciate everyone else's responses as well. This seems to work ok for now.
vbNewbie
Ok, one issue has turned up. I have a logging class with functions where I write to a log file and an error file. There was a clash with an error: "The process cannot access the file because it is being used by another process."
vbNewbie
You're going to need to wrap the log access in a lock.
Steven Sudit
This is my code to write the parsed content to a file and I am getting the collision error:

Try
  SyncLock outfile
    outfile.WriteLine(link)
  End SyncLock
  outfile.Close()
Catch ex As Exception
  execError = ex.Message
End Try
vbNewbie
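For what it's worth, a minimal sketch of the locking pattern Steven describes, assuming all threads share a single writer and a dedicated lock object (the names logLock, sharedWriter, and crawl.log are made up). Locking on a private object, rather than on the writer itself, and keeping the writer open for the whole crawl avoids the file-in-use error:

Imports System.IO
Imports System.Threading

Module Logger
  ' One lock object and one writer shared by every thread
  Private ReadOnly logLock As New Object()
  Private ReadOnly sharedWriter As New StreamWriter("crawl.log", True)

  Public Sub Log(ByVal line As String)
    SyncLock logLock
      sharedWriter.WriteLine(line)
      sharedWriter.Flush() ' make the entry visible right away
    End SyncLock
  End Sub
End Module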
@vbNewbie, to get around logging issues I recommend you use an aspect oriented solution, like log4net.
Lucas B
+2  A: 

You could use a synchronized Queue that you push the URLs to, and every crawler takes the next URL it visits out of this Queue. When they detect new URLs, they push them into the Queue, too.
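A minimal sketch of that approach, assuming three worker threads and a fetch-and-parse routine; every name here (workQueue, queueLock, Worker) is illustrative:

Imports System.Collections.Generic
Imports System.Threading

Module CrawlerQueue
  Private ReadOnly workQueue As New Queue(Of Uri)()
  Private ReadOnly queueLock As New Object()

  Public Sub Crawl(ByVal seeds As IEnumerable(Of Uri))
    For Each seed In seeds
      workQueue.Enqueue(seed)
    Next
    ' Start the three crawler threads from the question
    Dim workers As New List(Of Thread)
    For i As Integer = 1 To 3
      Dim t As New Thread(AddressOf Worker)
      t.Start()
      workers.Add(t)
    Next
    For Each t In workers
      t.Join()
    Next
  End Sub

  Private Sub Worker()
    Do
      Dim url As Uri
      SyncLock queueLock
        If workQueue.Count = 0 Then Exit Do ' queue drained; stop this worker
        url = workQueue.Dequeue()
      End SyncLock
      ' Fetch and parse url here; newly found links are enqueued under the
      ' same lock. Note this simple worker quits as soon as the queue is
      ' momentarily empty; a real crawler would also track in-flight work.
    Loop
  End Sub
End Module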

ZeissS
This is a much better approach than hand-coding the threads.
Steven Sudit
+1  A: 

I recommend using a BackgroundWorker to accomplish this.
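A minimal sketch of the BackgroundWorker route, assuming a WinForms app and a ProcessUrl routine like the one above (the names here are made up); DoWork runs on a thread-pool thread while RunWorkerCompleted fires back on the UI thread:

Imports System.ComponentModel

Public Sub StartCrawl(ByVal url As Uri)
  Dim worker As New BackgroundWorker()
  AddHandler worker.DoWork,
    Sub(sender, e)
      ProcessUrl(e.Argument) ' runs on a thread-pool thread
    End Sub
  AddHandler worker.RunWorkerCompleted,
    Sub(sender, e)
      ' back on the UI thread; safe to update controls or report status here
    End Sub
  worker.RunWorkerAsync(url) ' the url becomes e.Argument in DoWork
End Sub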

Lucas B
+1  A: 

Look into the Concurrency and Coordination Runtime (CCR). I have built a few crawlers based on that framework, and it makes things very easy once you understand how the CCR works.

Should take you a few hours to get up to speed with the CCR.

Bryan Batchelder
I will look into this some other time, thanks. I barely understand threads right now. Excuse my lack of intelligence.
vbNewbie