views:

222

answers:

5

I'm trying to web-scrape a website, and it appears to be feeding me bogus HTML when I use the WebClient.DownloadData() method.

Is there a way for me to "fool" the website into thinking I'm a browser of sorts?

Edit:

Adding this header still doesn't fix the issue:

Client.Headers.Add("user-agent", "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; .NET CLR 1.0.3705;)");

Is there something else I can try? :)

Edit 2:

If it helps at all, I'm trying to download the source of a ThePirateBay search.

This URL: http://thepiratebay.org/search/documentary/0/7/200

As you can see, the source shows what it should: seed information for the movies, etc. But when I use the DownloadData() method, I get random torrent results, nothing at all related to what I'm searching for.

A: 
WebClient client = new WebClient ();
client.Headers.Add ("user-agent", "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; .NET CLR 1.0.3705;)");
SpliFF
I've added those headers you said and it's still feeding me whatever HTML it wants, completely unrelated to my search. I download the HTML off of a URL, so I format a search URL and then download the source HTML. What can I do about this problem? :(
Sergio Tapia
Sounds like your problem is actually related to cookies: you're probably being redirected and your script isn't providing the keys needed to maintain your session. Alternatively, there is some JavaScript involved in the sorting, or you haven't correctly URL-encoded your parameters. If you want to pursue this further, it might be helpful to ask it as a new question, since this one, as worded, has been answered.
SpliFF
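If cookies do turn out to be the issue, one common workaround (just a sketch, not something from this thread; note that WebClient does not track cookies on its own) is to subclass WebClient so every request shares a single CookieContainer:

using System;
using System.Net;

// Cookie-aware WebClient: attaches one shared CookieContainer to every
// request, so session cookies survive across redirects and later requests.
public class CookieAwareWebClient : WebClient
{
    private readonly CookieContainer cookies = new CookieContainer();

    protected override WebRequest GetWebRequest(Uri address)
    {
        WebRequest request = base.GetWebRequest(address);
        HttpWebRequest httpRequest = request as HttpWebRequest;
        if (httpRequest != null)
        {
            httpRequest.CookieContainer = cookies;
        }
        return request;
    }
}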
A: 

Try adding a user-agent header so it thinks you are one of the major browsers (IE, FF, etc.):

client.Headers.Add ("user-agent", "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; .NET CLR 1.0.3705;)");
Matt Wrock
A: 

Try printing out the WebClient's headers; maybe there is something strange in there by default that is tipping the site off that you're not a browser?

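A minimal sketch of that check (the URL is just the one from the question; the request-header collection is empty unless you add to it, and ResponseHeaders is only populated after a request completes):

using System;
using System.Net;

class Program
{
    static void Main()
    {
        WebClient client = new WebClient();
        string html = client.DownloadString("http://thepiratebay.org/search/documentary/0/7/200");
        Console.WriteLine("downloaded {0} characters", html.Length);

        // Headers that would be sent with the request (empty unless you add some)
        foreach (string key in client.Headers.AllKeys)
            Console.WriteLine("request  {0}: {1}", key, client.Headers[key]);

        // Headers the server sent back on the last request
        foreach (string key in client.ResponseHeaders.AllKeys)
            Console.WriteLine("response {0}: {1}", key, client.ResponseHeaders[key]);
    }
}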
Ben
+2  A: 

Maybe I'm missing something, but the following code ran without problems:

Regex torrents = new Regex(
    @"<tr>[\s\S]*?<td><a href=""(?<link>.*?)"".*?>" + 
    @"(?<name>.*?)</a>[\s\S]*?<td><a href=""(?<torrent>.*?)""[\s\S]*?>" + 
    @"(?<size>\d+\.?\d*)&nbsp;(?<unit>.)iB</td>");
Uri url = new Uri("http://thepiratebay.org/search/documentary/0/7/200");

WebClient client = new WebClient();
string html = client.DownloadString(url);
//string html = Encoding.Default.GetString(client.DownloadData(url));

foreach (Match torrent in torrents.Matches(html))
{
    Console.WriteLine("{0} ({1:0.00}{2}b)", 
        torrent.Groups["name"].Value, 
        Double.Parse(torrent.Groups["size"].Value), 
        torrent.Groups["unit"].Value);
    Console.WriteLine("\t{0}", 
        new Uri(url, torrent.Groups["link"].Value).LocalPath);
    Console.WriteLine("\t{0}",
        new Uri(torrent.Groups["torrent"].Value).LocalPath);
}
Rubens Farias
Yes, testcase error.
Henk Holterman
@Henk, what's wrong?
Rubens Farias
"testcase error" means the error in the question failed to be reproduced.
Henk Holterman
@Henk, misunderstood, again; ty
Rubens Farias
A: 

HTTP is a textual protocol that is very human-readable. Connect to the site using telnet and type in the HTTP requests by hand. This gives you full control over the user-agent string and other associated information, and it's dead simple.

When you get this working by hand, you should be able to add this functionality to your app with some very basic socket programming.
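A rough sketch of what that might look like in C# (plain HTTP on port 80; the path and user-agent string below are just the ones from this question, not anything verified against the site):

using System;
using System.IO;
using System.Net.Sockets;
using System.Text;

class Program
{
    static void Main()
    {
        string host = "thepiratebay.org";
        using (TcpClient tcp = new TcpClient(host, 80))
        using (NetworkStream stream = tcp.GetStream())
        {
            // Hand-written HTTP/1.1 request: exactly what you would type over telnet
            string request =
                "GET /search/documentary/0/7/200 HTTP/1.1\r\n" +
                "Host: " + host + "\r\n" +
                "User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2)\r\n" +
                "Connection: close\r\n" +
                "\r\n";

            byte[] bytes = Encoding.ASCII.GetBytes(request);
            stream.Write(bytes, 0, bytes.Length);

            // Read the raw response: status line, headers, then the HTML body
            using (StreamReader reader = new StreamReader(stream, Encoding.UTF8))
            {
                Console.WriteLine(reader.ReadToEnd());
            }
        }
    }
}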

More: http://en.wikipedia.org/wiki/Hypertext%5FTransfer%5FProtocol

I'd post links to the RFC and the Wikipedia page on the user-agent string, but I just joined.

foxostro