views:

222

answers:

5

I'm trying to web-scrape a website, and it appears to be feeding me bogus HTML when I use the WebClient.DownloadData() method.

Is there a way for me to "fool" the website into thinking I'm a browser of sorts?

Edit:

Adding this header still doesn't fix the issue:

Client.Headers.Add("user-agent", "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; .NET CLR 1.0.3705;)");

Is there something else I can try? :)

Edit 2:

If it helps at all, I'm trying to download the source of a ThePirateBay search.

This URL: http://thepiratebay.org/search/documentary/0/7/200

As you can see, the source shows what it should: seed information for the movies, etc. But when I use the DownloadData() method, I get random torrent results, nothing at all related to what I'm searching for.

A: 
WebClient client = new WebClient ();
client.Headers.Add ("user-agent", "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; .NET CLR 1.0.3705;)");
SpliFF
I've added those headers you said and it's still feeding me whatever HTML it wants, completely unrelated to my search. I download the HTML off of a URL, so I format a search URL and then download the source HTML. What can I do about this problem? :(
Sergio Tapia
Sounds like your problem is actually related to cookies: you're probably being redirected and your script isn't providing the keys needed to maintain your session. Alternatively, there is some JavaScript involved in the sorting, or you haven't correctly URL-encoded your parameters. If you want to pursue this further, it might be helpful to ask it as a new question, since this one, as worded, has been answered.
SpliFF
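If cookies do turn out to be the issue, one common workaround (just a sketch, not something from this thread; note that WebClient does not track cookies on its own) is to subclass WebClient so every request shares a single CookieContainer:

using System;
using System.Net;

// Cookie-aware WebClient: attaches one shared CookieContainer to every
// request, so session cookies survive across redirects and later requests.
public class CookieAwareWebClient : WebClient
{
    private readonly CookieContainer cookies = new CookieContainer();

    protected override WebRequest GetWebRequest(Uri address)
    {
        WebRequest request = base.GetWebRequest(address);
        HttpWebRequest httpRequest = request as HttpWebRequest;
        if (httpRequest != null)
        {
            httpRequest.CookieContainer = cookies;
        }
        return request;
    }
}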
A: 

Try adding a user-agent header so it thinks you are one of the major browsers (IE, FF, etc.):

client.Headers.Add ("user-agent", "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; .NET CLR 1.0.3705;)");
Matt Wrock
A: 

Try printing out the WebClient's headers; maybe there is something strange in there by default that is tipping the site off that you're not a browser?

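A minimal sketch of that check (the URL is just the one from the question; the request-header collection is empty unless you add to it, and ResponseHeaders is only populated after a request completes):

using System;
using System.Net;

class Program
{
    static void Main()
    {
        WebClient client = new WebClient();
        string html = client.DownloadString("http://thepiratebay.org/search/documentary/0/7/200");
        Console.WriteLine("downloaded {0} characters", html.Length);

        // Headers that would be sent with the request (empty unless you add some)
        foreach (string key in client.Headers.AllKeys)
            Console.WriteLine("request  {0}: {1}", key, client.Headers[key]);

        // Headers the server sent back on the last request
        foreach (string key in client.ResponseHeaders.AllKeys)
            Console.WriteLine("response {0}: {1}", key, client.ResponseHeaders[key]);
    }
}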
Ben
+2  A: 

Maybe I'm missing something, but the following code ran without problems:

Regex torrents = new Regex(
    @"<tr>[\s\S]*?<td><a href=""(?<link>.*?)"".*?>" + 
    @"(?<name>.*?)</a>[\s\S]*?<td><a href=""(?<torrent>.*?)""[\s\S]*?>" + 
    @"(?<size>\d+\.?\d*)&nbsp;(?<unit>.)iB</td>");
Uri url = new Uri("http://thepiratebay.org/search/documentary/0/7/200");

WebClient client = new WebClient();
string html = client.DownloadString(url);
//string html = Encoding.Default.GetString(client.DownloadData(url));

foreach (Match torrent in torrents.Matches(html))
{
    Console.WriteLine("{0} ({1:0.00}{2}b)", 
        torrent.Groups["name"].Value, 
        Double.Parse(torrent.Groups["size"].Value), 
        torrent.Groups["unit"].Value);
    Console.WriteLine("\t{0}", 
        new Uri(url, torrent.Groups["link"].Value).LocalPath);
    Console.WriteLine("\t{0}",
        new Uri(torrent.Groups["torrent"].Value).LocalPath);
}
Rubens Farias
Yes, testcase error.
Henk Holterman
@Henk, what's wrong?
Rubens Farias
"testcase error" means the error in the question failed to be reproduced.
Henk Holterman
@Henk, misunderstood, again; ty
Rubens Farias
A: 

HTTP is a textual protocol that is very human-readable. Connect to the site using telnet and type in the HTTP requests by hand. This gives you full control over the user-agent string and other associated information, and it's dead simple.

When you get this working by hand, you should be able to add this functionality to your app with some very basic socket programming.
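A rough sketch of what that might look like in C# (plain HTTP on port 80; the path and user-agent string below are just the ones from this question, not anything verified against the site):

using System;
using System.IO;
using System.Net.Sockets;
using System.Text;

class Program
{
    static void Main()
    {
        string host = "thepiratebay.org";
        using (TcpClient tcp = new TcpClient(host, 80))
        using (NetworkStream stream = tcp.GetStream())
        {
            // Hand-written HTTP/1.1 request: exactly what you would type over telnet
            string request =
                "GET /search/documentary/0/7/200 HTTP/1.1\r\n" +
                "Host: " + host + "\r\n" +
                "User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2)\r\n" +
                "Connection: close\r\n" +
                "\r\n";

            byte[] bytes = Encoding.ASCII.GetBytes(request);
            stream.Write(bytes, 0, bytes.Length);

            // Read the raw response: status line, headers, then the HTML body
            using (StreamReader reader = new StreamReader(stream, Encoding.UTF8))
            {
                Console.WriteLine(reader.ReadToEnd());
            }
        }
    }
}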

More: http://en.wikipedia.org/wiki/Hypertext%5FTransfer%5FProtocol

I'd post links to the RFC and the Wikipedia page on the user-agent string, but I just joined.

foxostro