I have a crawler that downloads pages and processes them. For the first hour or so every address is downloaded in under a second, but after that every request to the sources takes a minute or more to complete. I suspected the destination web sites were throttling my requests, but when I close the program and run it again, performance returns to normal, so the issue must be somewhere in my code. Any help would be greatly appreciated.

using System;
using System.IO;
using System.IO.Compression;
using System.Net;
using System.Text;

public class PageFetcher
{
    public PageFetcher() { }
    public PageFetcher(string urlAddress) { URLAddress = urlAddress; }
    private int relayPageCount = 0;

    public string URLAddress { get; set; }
    public string FetchingEncoding { get; set; }

    public PageFetchResult Fetch()
    {
        PageFetchResult fetchResult = new PageFetchResult();
        HttpWebRequest req = null;
        HttpWebResponse resp = null;
        try
        {
            req = (HttpWebRequest)WebRequest.Create(URLAddress);
            req.UserAgent = "Mozilla/4.0";
            req.Headers.Add(HttpRequestHeader.AcceptEncoding, "gzip, deflate");
            resp = (HttpWebResponse)req.GetResponse();

            byte[] reqHTML = ResponseAsBytes(resp);
            Stream resultStream = new MemoryStream(reqHTML);
            resultStream.Position = 0;

            // Unwrap the body if the server compressed it; a response carries
            // at most one of these content encodings.
            string contentEncoding = resp.ContentEncoding.ToLower();
            if (contentEncoding.Contains("gzip"))
                resultStream = new GZipStream(resultStream, CompressionMode.Decompress);
            else if (contentEncoding.Contains("deflate"))
                resultStream = new DeflateStream(resultStream, CompressionMode.Decompress);

            // Choose a text encoding: the explicit override first, then the
            // response's declared character set, then UTF-8 as a fallback so
            // the reader is never null (the original could throw here).
            Encoding textEncoding;
            if (!string.IsNullOrEmpty(FetchingEncoding))
                textEncoding = Encoding.GetEncoding(FetchingEncoding);
            else if (!string.IsNullOrEmpty(resp.CharacterSet))
                textEncoding = Encoding.GetEncoding(resp.CharacterSet);
            else
                textEncoding = Encoding.UTF8;

            string resultHTML;
            using (StreamReader readerStream = new StreamReader(resultStream, textEncoding))
                resultHTML = readerStream.ReadToEnd();

            fetchResult.IsOK = true;
            fetchResult.ResultHTML = resultHTML;
            URLAddress = resp.ResponseUri.AbsoluteUri;
        }
        catch (Exception ex)
        {
            if (req != null)
                req.Abort();
            fetchResult.IsOK = false;
            fetchResult.ErrorMessage = ex.Message;
        }
        finally
        {
            // Always close the response so its connection goes back to the
            // pool; connections left open will stall future requests.
            if (resp != null)
                resp.Close();
        }
        return fetchResult;
    }
}
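
(ResponseAsBytes and PageFetchResult aren't shown in the question. A minimal sketch of what ResponseAsBytes presumably does, as a method inside PageFetcher, assuming it only buffers the response body into a byte array; disposing the response stream matters here, because a stream left open pins its pooled connection:)

private byte[] ResponseAsBytes(HttpWebResponse resp)
{
    using (Stream respStream = resp.GetResponseStream())
    using (MemoryStream buffer = new MemoryStream())
    {
        respStream.CopyTo(buffer); // .NET 4+; on older frameworks, loop over Read()
        return buffer.ToArray();
    }
}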
A:

Your code looks good. There are a few possible reasons you could be seeing a slowdown.

First: upstream caching. For example, if you're running a test multiple times on the same group of URLs, once you hit a URL it may be cached in many places: your proxy server (if you have one), your internet provider's proxies (if they have them), the destination site's cache, and so on.

So for any crawler speed test, make sure you're picking URLs at random from a very large set, large enough that the chance of any of them having been cached by your earlier crawling is very small.
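
One way to build such a test set is to shuffle a large URL pool and keep a random sample. A quick sketch (SampleUrls, urls, and sampleSize are hypothetical names, not part of the question's code):

using System;
using System.Collections.Generic;
using System.Linq;

static class CrawlTestHelpers
{
    // Shuffle-and-take: order the pool by a random key and keep the first
    // sampleSize entries. Good enough for a timing test.
    public static List<string> SampleUrls(IList<string> urls, int sampleSize)
    {
        Random rng = new Random();
        return urls.OrderBy(u => rng.Next()).Take(sampleSize).ToList();
    }
}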

Next, I'd look at the rest of your app. Are you sure it's the actual fetching that's slow, or could some other part (e.g. the part that stores the fetched data) be causing the slowdown? Since most data-storage mechanisms (in RAM, in a database, etc.) get slower as you put more data into them, you should make sure you're timing only your HTTP access and not overall crawler throughput, storage included.
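
To separate the two, put a stopwatch around only the HTTP call. A sketch, assuming the PageFetcher class from the question and the sampled urls list above:

using System;
using System.Diagnostics;

foreach (string url in urls)
{
    PageFetcher fetcher = new PageFetcher(url);
    Stopwatch sw = Stopwatch.StartNew();
    PageFetchResult result = fetcher.Fetch();   // timed region: network only
    sw.Stop();
    Console.WriteLine("{0}\t{1} ms", url, sw.ElapsedMilliseconds);
    // Store result.ResultHTML here, outside the timed region.
}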

Justin Grant
Yes, the fetching is what slows down, and I looked into it: HttpWebResponse.GetResponse() is the main culprit! Do you think it is related to the destination website, or to something on my end?
Ehsan
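
If GetResponse() is where the time goes, one well-known cause in .NET is the ServicePoint connection pool: by default each host allows only two concurrent connections, so responses that are never closed or disposed make later GetResponse() calls queue until the process restarts, which matches the symptom described. A minimal sketch of raising the limit at startup (the value 10 is an arbitrary example):

using System.Net;

// Allow more simultaneous connections per host; set once at startup,
// before the first request is created.
ServicePointManager.DefaultConnectionLimit = 10;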