I'm revisiting some old code of mine and have stumbled upon a method for getting the title of a website based on its URL. It's not what you would call a stable method: it often fails to produce a result and sometimes even produces incorrect results. It also sometimes fails to show some of the characters from the title because they use a different encoding.

Does anyone have suggestions for improvements over this old version?

public static string SuggestTitle(string url, int timeout)
{
    WebResponse response = null;
    string line = string.Empty;

    try
    {
        WebRequest request = WebRequest.Create(url);
        request.Timeout = timeout;

        response = request.GetResponse();
        Stream streamReceive = response.GetResponseStream();
        Encoding encoding = System.Text.Encoding.GetEncoding("utf-8");
        StreamReader streamRead = new System.IO.StreamReader(streamReceive, encoding);

        while(streamRead.EndOfStream != true)
        {
            line = streamRead.ReadLine();
            if (line.Contains("<title>"))
            {
                line = line.Split(new char[] { '<', '>' })[2];
                break;
            }
        }
    }
    catch (Exception) { }
    finally
    {
        if (response != null)
        {
            response.Close();
        }
    }

    return line;
}

One final note - I would like the code to run faster as well, since it blocks until the page has been fetched. If I can get only the site header and not the entire page, that would be great.

A: 

In order to accomplish this you are going to need to do a couple of things.

  • Make your app threaded, so that you can process multiple requests at a time and maximize the number of HTTP requests being made.
  • During the async request, download only as much data as you want to pull back. You could probably parse the data as it comes back, looking for
  • You will probably want to use a regex to pull out the title

I have done this before with SEO bots and I have been able to handle almost 10,000 requests at a time. You just need to make sure that each web request can be self-contained in a thread.
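A rough sketch of the "stop as soon as you have found your answer" idea from the list above. The names `TitleSniffer`, `ReadTitle` and `maxChars` are made up for illustration. The sketch reads from any `TextReader` in small chunks and returns once a complete title element has been seen, so whatever follows in the stream is never read. It assumes the reader was created with a suitable encoding.

```csharp
using System;
using System.IO;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original code.
public static class TitleSniffer
{
    // Captures the text between <title ...> and </title>, ignoring case
    // and trimming surrounding whitespace.
    private static readonly Regex TitlePattern = new Regex(
        @"<title\b[^>]*>\s*(?<Title>[\s\S]*?)\s*</title>",
        RegexOptions.IgnoreCase);

    // Reads from 'reader' in chunks and stops as soon as a complete title
    // element has been seen, or once 'maxChars' characters have been read.
    // Returns an empty string when no title is found within that limit.
    public static string ReadTitle(TextReader reader, int maxChars)
    {
        var seen = new StringBuilder();
        var buffer = new char[1024];

        while (seen.Length < maxChars)
        {
            int wanted = Math.Min(buffer.Length, maxChars - seen.Length);
            int read = reader.Read(buffer, 0, wanted);
            if (read <= 0)
            {
                break; // end of stream
            }
            seen.Append(buffer, 0, read);

            // Re-scanning everything read so far is simple rather than fast;
            // 'maxChars' keeps the total work bounded.
            Match match = TitlePattern.Match(seen.ToString());
            if (match.Success)
            {
                return match.Groups["Title"].Value;
            }
        }
        return string.Empty;
    }
}
```

With the code from the question, this could be called as `TitleSniffer.ReadTitle(streamRead, 65536)` in place of the `ReadLine` loop. Closing the response afterwards abandons the rest of the body. The 65536 limit is an arbitrary choice.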

Nick Berardi
You certainly *don't* want to give each request its own thread if you want to handle 10,000 requests at a time! (The stacks involved would eat your memory like crazy.) Using an async API will parallelize the operation *without* costing you a thread per request.
Jon Skeet
It's a moot point, as I only need to perform a single request at a time. The need for speed is because the user is waiting for the reply.
Morten Christiansen
@Jon, well, like I said, mine was an SEO bot that analyzes, and obviously you want to put limits on the number of requests at a time per analysis to keep the memory use reasonable. However, the 10,000 was a stress-test scenario, and the async was a suggestion on how to download just the header.
Nick Berardi
@Morten, I was just going off the very basic details you gave me. You said you wanted it to run faster and that you only wanted to download the header. The async request is the best way to limit the size that is downloaded, because you can stop the process when you have found your answer.
Nick Berardi
@Jon, that is a pretty definite statement, that you don't want a thread for each request. That may be true, but you are forgetting about the analysis that goes along with each request. There would be a horrible queue build-up if the analysis processor were single-threaded.
Nick Berardi
+11  A: 

A simpler way to get the content:

WebClient x = new WebClient();
string source = x.DownloadString("http://www.singingeels.com/");

A simpler, more reliable way to get the title:

string title = Regex.Match(source, @"\<title\b[^>]*\>\s*(?<Title>[\s\S]*?)\</title\>", RegexOptions.IgnoreCase).Groups["Title"].Value;
Timothy Khouri
Is there any way to set a timeout when using WebClient?
Morten Christiansen
I think the only thing to add is that you have to add @ (for the escape stuff) to the pattern, that's to say: @"\<title\b[^>]*\>\s*(?<Title>[\s\S]*?)\</title\>"
netadictos
For adding timeout (and other stuff) to the WebClient class, this guide provides a good solution: http://codegator.com/mcook/archive/2006/07/17/extending-webclient-using-c.aspx
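For reference, a common way to get a timeout on `WebClient` is to subclass it and set the timeout on the request it creates. This is a minimal sketch of that general approach and is not taken from the linked guide. `TimeoutWebClient` and `TimeoutMilliseconds` are names invented here.

```csharp
using System;
using System.Net;

// Hypothetical helper: a WebClient whose underlying requests use a
// caller-supplied timeout (in milliseconds).
public class TimeoutWebClient : WebClient
{
    public int TimeoutMilliseconds { get; set; }

    public TimeoutWebClient(int timeoutMilliseconds)
    {
        TimeoutMilliseconds = timeoutMilliseconds;
    }

    // WebClient calls this to build each request, so the timeout set here
    // applies to DownloadString and the other synchronous methods.
    protected override WebRequest GetWebRequest(Uri address)
    {
        WebRequest request = base.GetWebRequest(address);
        if (request != null)
        {
            request.Timeout = TimeoutMilliseconds;
        }
        return request;
    }
}
```

Usage would then look like `new TimeoutWebClient(5000).DownloadString(url)`, where 5000 is just an example value.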
Morten Christiansen