I'm writing a web crawler for a specific site. The application is a VB.NET Windows Forms application that does not use multiple threads; each web request is made sequentially. However, after ten successful page retrievals, every subsequent request times out.

I have reviewed the similar questions already posted here on SO, and have implemented the recommended techniques into my GetPage routine, shown below:

' Requires "Imports System.Net" and "Imports System.IO" at the top of the file
Public Function GetPage(ByVal url As String) As String
    Dim result As String = String.Empty

    Dim uri As New Uri(url)
    Dim sp As ServicePoint = ServicePointManager.FindServicePoint(uri)
    sp.ConnectionLimit = 100

    Dim request As HttpWebRequest = DirectCast(WebRequest.Create(uri), HttpWebRequest)
    request.KeepAlive = False
    request.Timeout = 15000

    Try
        Using response As HttpWebResponse = DirectCast(request.GetResponse, HttpWebResponse)
            Using dataStream As Stream = response.GetResponseStream()
                Using reader As New StreamReader(dataStream)
                    If response.StatusCode <> HttpStatusCode.OK Then
                        Throw New Exception("Got response status code: " & response.StatusCode.ToString())
                    End If
                    result = reader.ReadToEnd()
                End Using
            End Using
            response.Close()
        End Using

    Catch ex As Exception
        Dim msg As String = "Error reading page """ & url & """. " & ex.Message
        Logger.LogMessage(msg, LogOutputLevel.Diagnostics)
    End Try

    Return result

End Function

Have I missed something? Am I not closing or disposing of an object that should be? It seems strange that it always happens after ten consecutive requests.

Notes:

  1. In the constructor for the class in which this method resides I have the following:

    ServicePointManager.DefaultConnectionLimit = 100

  2. If I set KeepAlive to true, the timeouts begin after five requests.

  3. All the requests are for pages in the same domain.

EDIT

I added a delay of between two and seven seconds between each web request so that I do not appear to be "hammering" the site or attempting a DoS attack. However, the problem still occurs.
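The delay itself is just a randomized pause before each call to GetPage, roughly like this (a sketch only; the variable names are illustrative):

Dim delayGenerator As New Random()

' Inside the crawl loop, before each call to GetPage:
Threading.Thread.Sleep(delayGenerator.Next(2000, 7001)) ' wait 2-7 seconds between requests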

A: 

I think the site has some sort of DoS protection, which kicks in when it's hit with a number of rapid requests. You may want to try setting the UserAgent on the web request.
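Something along these lines (the User-Agent string shown is only an example):

Dim request As HttpWebRequest = DirectCast(WebRequest.Create(uri), HttpWebRequest)
' Some servers throttle or drop requests that arrive without a User-Agent header,
' so identify the client explicitly.
request.UserAgent = "Mozilla/5.0 (Windows NT 6.1; rv:5.0) Gecko/20100101 Firefox/5.0"
request.KeepAlive = False
request.Timeout = 15000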

pb
Thanks for the reply. If DoS protection is the cause, why the difference in timeouts when KeepAlive is true vs. when it is false?
Bob Mc
Setting UserAgent improved the situation. I'm still getting some timeouts, but they're more random and not nearly as frequent. So before I mark this as correct, can you expand your answer to explain why setting UserAgent has this effect?
Bob Mc
After further testing, it helps a little, but I still eventually get a timeout on a page and on all subsequent pages, so I'm starting to believe the DoS protection theory.
Bob Mc
I don't think DoS protection is the answer - see my answer for a possible solution.
Geoff
A: 

I ran into this issue today and my resolution was to ensure that the response was closed at all times.

I think you need to add a response.Close() before you throw your exception inside the Using block.

Using response As HttpWebResponse = DirectCast(request.GetResponse, HttpWebResponse)
    Using dataStream As Stream = response.GetResponseStream()
        Using reader As New StreamReader(dataStream)
            If response.StatusCode <> HttpStatusCode.OK Then
                ' Close explicitly before throwing
                response.Close()
                Throw New Exception("Got response status code: " & response.StatusCode.ToString())
            End If
            result = reader.ReadToEnd()
        End Using
    End Using
    response.Close()
End Using
Geoff
@Geoff: the `Using response` ensures response gets closed.
John Saunders
@Geoff: I gave it a try anyway, but it didn't help. I believe John is correct - the `Using response` should take care of closing things properly. Thanks for the reply though.
Bob Mc