ansaurus

Question

Any thoughts on why I can't scrape a site?

Answer 1

+2 A:

I would start by checking what response you get from something simple like wget, or by using a tool like Fiddler to inspect the response and any headers you are getting back.

Sometimes sites return different responses for different User-Agent strings and so on, so you may need to adjust your request headers and masquerade as a different browser to get the data you are looking for. If you are using Fiddler on the same machine that is running the script, you should be able to see exactly what differs between a request for the page from your browser and a request for the page from your script.

There may even be a simple 302 redirect or something like that going on that your code isn't following.

If you can access the page with a browser then you will definitely be able to access it by sending exactly the same request as your browser would send.

Edit: Fiddler is slightly trickier to use from your own code because it behaves as a proxy. It sets itself up with regular browsers automatically, but you have to tell your code manually to go through a proxy on 127.0.0.1 port 8888 in order for Fiddler to see your traffic.
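As a rough illustration of routing a request through Fiddler, here is a minimal sketch. It assumes Fiddler is listening on its usual 127.0.0.1:8888; the class and method names are made up for this example.

```csharp
using System.Net;

static class FiddlerProxySketch
{
    // Builds a request that is sent via a local proxy so that a tool such as
    // Fiddler can record it. Host and port are assumptions taken from the
    // answer above; adjust them if your Fiddler instance listens elsewhere.
    public static HttpWebRequest BuildProxiedRequest(string url)
    {
        var request = (HttpWebRequest)WebRequest.Create(url);
        request.Proxy = new WebProxy("127.0.0.1", 8888);
        return request;
    }
}
```

Calling `FiddlerProxySketch.BuildProxiedRequest(strUrl).GetResponse()` while Fiddler is running should make the request show up in Fiddler's session list. If Fiddler is not running, the call will fail because nothing is listening on that port.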

glenatron 2010-01-18 12:49:33

I have Firebug, but I don't know what to look for in the headers. Sorry.

Dejan.S 2010-01-18 13:03:16

This is why Fiddler is useful: what you want to find is the _difference_ between the request your ASP.NET script is making and the request your browser is making. Headers in each direction will be useful. You could probably get these from the request and response objects in your ASP.NET code.

glenatron 2010-01-18 14:22:42

I have Fiddler running now and I watched some videos on it, but I don't see any difference between the requests, at least not from what I can tell. I would be so grateful if you could take one minute to see whether they are different. You don't have to, but I would appreciate it. The address is http://www.raggarportalen.se/Kalender.html

Dejan.S 2010-01-18 15:18:55

Right, so if you type the URL into Fiddler using the "Request Builder" tab, you will get the page. Then try deleting the User-Agent from the request there. You will get an empty response. It doesn't matter what the User-Agent is, but you have to have a User-Agent set in the request.
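In code, that finding could translate into something like the following sketch. The class name, method names, and the User-Agent value are placeholders invented for this example; per the observation above, any non-empty value should do for this site.

```csharp
using System.IO;
using System.Net;

static class UserAgentSketch
{
    // Builds a GET request that carries an explicit User-Agent header.
    // The default string is a hypothetical placeholder, not a value the site requires.
    public static HttpWebRequest BuildRequest(
        string url,
        string userAgent = "Mozilla/5.0 (compatible; ExampleScraper/1.0)")
    {
        var request = (HttpWebRequest)WebRequest.Create(url);
        request.UserAgent = userAgent;
        return request;
    }

    // Sends the request and returns the response body as a string.
    public static string Fetch(string url)
    {
        var request = BuildRequest(url);
        using (var response = (HttpWebResponse)request.GetResponse())
        using (var reader = new StreamReader(response.GetResponseStream()))
        {
            return reader.ReadToEnd();
        }
    }
}
```

With this in place, `string result = UserAgentSketch.Fetch(strUrl);` sends a request that includes the header, which is the one difference Fiddler revealed between the browser and the script.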

glenatron 2010-01-18 16:44:14

Answer 2

+1 A:

To troubleshoot, check the value of objResponse.StatusCode and objResponse.StatusDescription:

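string result;
// Create() is declared to return the abstract WebRequest type.
var objRequest = System.Net.HttpWebRequest.Create(strUrl);
// GetResponse() returns WebResponse; cast it to reach StatusCode and StatusDescription.
var objResponse = (System.Net.HttpWebResponse) objRequest.GetResponse();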

Console.WriteLine(objResponse.StatusCode);
Console.WriteLine(objResponse.StatusDescription);
...
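One thing worth knowing when checking status codes this way: for 4xx and 5xx responses, `GetResponse()` throws a `WebException` instead of returning normally. A hedged sketch of handling both paths follows; the class and method names are invented for illustration.

```csharp
using System.Net;

static class StatusSketch
{
    // Returns a short description of how the server answered, covering both
    // the normal path and the WebException path used for error statuses.
    public static string Describe(string url)
    {
        var request = (HttpWebRequest)WebRequest.Create(url);
        try
        {
            using (var response = (HttpWebResponse)request.GetResponse())
            {
                return (int)response.StatusCode + " " + response.StatusDescription;
            }
        }
        catch (WebException ex)
        {
            // For HTTP error statuses the response is attached to the exception.
            var errorResponse = ex.Response as HttpWebResponse;
            if (errorResponse != null)
            {
                using (errorResponse)
                {
                    return (int)errorResponse.StatusCode + " " + errorResponse.StatusDescription;
                }
            }
            // No HTTP response at all (DNS failure, refused connection, timeout).
            return "No HTTP response: " + ex.Status;
        }
    }
}
```

Printing `StatusSketch.Describe(strUrl)` gives a quick indication of whether the server answered at all and, if so, with which status.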

codeape 2010-01-18 12:50:21

I cannot access objResponse.StatusCode and objResponse.StatusDescription.

Dejan.S 2010-01-18 12:53:29

Why not? I assume you have tried something like: ``Console.WriteLine(objResponse.StatusCode);`` just before your using statement. What happens when you try that? Does the code not compile? Does it crash at runtime? What value is printed?

codeape 2010-01-18 12:57:30

objResponse does not contain that option, StatusCode.

Dejan.S 2010-01-18 12:59:58

I see, it is an abstract WebResponse, I guess. Change the "var objResponse = " line to "var objResponse = (HttpWebResponse) objRequest.GetResponse();"

codeape 2010-01-18 13:03:52

Dejan.S 2010-01-18 13:07:16

What does it look like if you browse to ``strUrl`` in a web browser? Make 100% sure you try the same URL as your code tries; copy/paste it from a debugger view or something. And double check that your browser does not add anything to the request URL (like a missing trailing slash).

codeape 2010-01-18 13:12:13

I have checked the strUrl several times; it is correct.

Dejan.S 2010-01-18 13:16:45

It is a bit unclear whether the result variable is null after the result = sr.ReadToEnd(); statement. Is this the case?

codeape 2010-01-18 13:16:53

Yes, it is still null after the sr.Close();

Dejan.S 2010-01-18 13:21:18

Answer 3

A:

This is from the W3C validator. Might it have something to do with this?

Sorry, I am unable to validate this document because on line 422 it contained one or more bytes that I cannot interpret as utf-8 (in other words, the bytes found are not valid values in the specified Character Encoding). Please check both the content of the file and the character encoding indication.

The error was: utf8 "\xA9" does not map to Unicode

Dejan.S 2010-01-18 12:57:03

Hmmm, it seems your page encoding is set to UTF-8 (which implies that sequence is not valid). Try a default encoding, e.g. Latin1, instead.
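On the scraping side, "trying Latin1" would mean decoding the response bytes with that encoding rather than re-encoding the site. A minimal sketch, with made-up class and method names, and assuming the body really is ISO-8859-1:

```csharp
using System.IO;
using System.Text;

static class EncodingSketch
{
    // Reads a response body using an explicit character set instead of the
    // StreamReader default (UTF-8). "iso-8859-1" is the Latin1 encoding name.
    public static string Decode(Stream body, string charset = "iso-8859-1")
    {
        using (var reader = new StreamReader(body, Encoding.GetEncoding(charset)))
        {
            return reader.ReadToEnd();
        }
    }
}
```

In the scraping code this would be called as `EncodingSketch.Decode(objResponse.GetResponseStream())`. Under Latin1 the byte `\xA9` that the validator complained about decodes to the copyright sign.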

leppie 2010-01-18 13:12:46

When you have more info on the original question, it is generally better to edit the question. I have copied and pasted into the original question.

codeape 2010-01-18 13:14:57

I don't really know how to encode the site in Latin1 because it comes from the partner site. Any ideas? @codeape: I will do that from now on, thanks.

Dejan.S 2010-01-18 13:19:31

Answer 4

+1 A:

The problem appears to be the character in the comment on line 421:

<!-- KalenderMx v1.4 � by shiba-design.de -->

which is outside of the declared character encoding iso-8859-1:

<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">

You might try running the parsed document string through a filter to convert or remove the offending characters in the string before evaluating it with the htmlAgilityPack LoadHtml().
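Such a filter might look roughly like the sketch below. It assumes the offending bytes have already been decoded into the Unicode replacement character (U+FFFD) or into stray control characters; the class and method names are hypothetical.

```csharp
using System.Text;

static class HtmlCleanupSketch
{
    // Removes replacement characters (U+FFFD) and control characters other
    // than tab, line feed and carriage return from a decoded HTML string.
    public static string StripUndecodable(string html)
    {
        var cleaned = new StringBuilder(html.Length);
        foreach (char c in html)
        {
            bool allowedWhitespace = c == '\t' || c == '\n' || c == '\r';
            if (c == '\uFFFD' || (char.IsControl(c) && !allowedWhitespace))
            {
                continue; // drop the offending character
            }
            cleaned.Append(c);
        }
        return cleaned.ToString();
    }
}
```

The cleaned string can then be handed to LoadHtml(). This only helps once the download itself returns content, so it does not address an empty response.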

Mads Hansen 2010-01-18 14:18:59

My problem occurs before I can do anything with htmlAgilityPack. My result is null when I scrape the site. I bet it has to do with the comment, but I don't know how to solve it.

Dejan.S 2010-01-18 14:38:53

Answer 5

A:

It is possible that the site doesn't allow any scraping. Check with the site.

Bob 2010-02-23 05:36:18
