views:

305

answers:

5

Hi. I am building a site that needs to scrape information from a partner site. My scraping code works great with other sites but not with this one. It is a regular .html site. My guess is that the pages might be generated somehow with PHP (the site is built with PHP).

I have no idea; I am only guessing about the generated part, and I would appreciate your expert help on this. In case it matters, here is the code I use. The HtmlDocument comes from HtmlAgilityPack, but that has nothing to do with the problem. The result is null on the site I am trying to scrape.

        string result;
        var objRequest = System.Net.HttpWebRequest.Create(strUrl);
        var objResponse = objRequest.GetResponse();

        using (var sr = new StreamReader(objResponse.GetResponseStream()))
        {
            result = sr.ReadToEnd();
            sr.Close();

            var doc = new HtmlDocument();
            doc.LoadHtml(result);                

            foreach (var c in doc.DocumentNode.SelectNodes("//a[@href]"))
            {
                litStatus.Text += c.Attributes["href"].Value + "<br />";
            }
        }

EDIT:

This is from the W3C validator. Might it have something to do with this?

Sorry, I am unable to validate this document because on line 422 it contained one or more bytes that I cannot interpret as utf-8 (in other words, the bytes found are not valid values in the specified Character Encoding). Please check both the content of the file and the character encoding indication.

The error was: utf8 "\xA9" does not map to Unicode

+2  A: 

I would start by seeing what response I got from something simple like wget or using a tool like Fiddler to test the response and check any headers you are getting back.

Sometimes sites will return different responses from different agent strings and so on, so you may need to adjust your request headers and masquerade as a different browser to get the data you are looking for. If you are using Fiddler on the same machine that is running the script you should be able to see exactly what is different between a request for the page from your browser and a request for the page from your script.
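If Fiddler is not at hand, a rough way to compare headers is to print them from the request and response objects themselves. The sketch below is only an illustration: `HeaderDump` and `Format` are hypothetical names, and the request's header collection only shows what your code set explicitly, not everything the framework adds on the wire.

```csharp
using System;
using System.Net;
using System.Text;

static class HeaderDump
{
    // Hypothetical helper: renders a header collection as "Name: value" lines.
    public static string Format(WebHeaderCollection headers)
    {
        var sb = new StringBuilder();
        foreach (string name in headers.AllKeys)
        {
            sb.Append(name).Append(": ").Append(headers[name]).Append('\n');
        }
        return sb.ToString();
    }
}
```

Calling `Console.Write(HeaderDump.Format(objResponse.Headers));` after `GetResponse()` would then list what the server sent back, which can be set side by side with what the browser receives.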

There may even be a simple 302 redirect or something like that going on that your code isn't following.

If you can access the page with a browser then you will definitely be able to access it by sending exactly the same request as your browser would send.

Edit- Fiddler is slightly trickier to use from your own code because it behaves as a proxy- it sets itself up with regular browsers, but you would manually have to tell your code to run through a proxy on 127.0.0.1 port 8888 in order for Fiddler to see your results.
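A minimal sketch of routing a request through Fiddler, assuming it is listening on its default 127.0.0.1:8888 as described above. `FiddlerSketch` and `BuildViaFiddler` are hypothetical names used only for illustration.

```csharp
using System;
using System.Net;

static class FiddlerSketch
{
    // Hypothetical helper: creates a request that is sent via a local proxy
    // so that a tool such as Fiddler can record it.
    public static HttpWebRequest BuildViaFiddler(string url)
    {
        var request = (HttpWebRequest)WebRequest.Create(url);
        // Assumption: Fiddler is running on this machine on port 8888.
        request.Proxy = new WebProxy("127.0.0.1", 8888);
        return request;
    }
}
```

The proxy line should be removed again once debugging is done, since the request will fail if nothing is listening on that port.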

glenatron
I have Firebug, but I don't know what to look for in the headers. Sorry.
Dejan.S
This is why Fiddler is useful: what you want is to find the _difference_ between the request your ASP.NET script is making and the request your browser is making. Headers in each direction will be useful. You could probably find these from the request and response objects in your ASP.NET code.
glenatron
I have Fiddler running now and I watched some videos on it, but I don't see any difference between the requests, at least not from what I can tell. I would be so grateful if you took one minute to see whether they are different. You don't have to, but I would appreciate it. The address is http://www.raggarportalen.se/Kalender.html
Dejan.S
Right, so if you type the url into fiddler using the "Request Builder" tab you will get the page. Then try deleting the User-Agent from the request there. You will get an empty response. It doesn't matter what the User-Agent is, but you have to have a User-Agent set in the request.
glenatron
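A minimal sketch of what the comment above suggests, applied to the code from the question: set a User-Agent on the request before calling `GetResponse()`. The class and method names and the User-Agent string are made up for illustration; according to the comment, any non-empty value should do for this site.

```csharp
using System;
using System.IO;
using System.Net;

static class ScrapeSketch
{
    // Hypothetical helper: builds a GET request that always carries a User-Agent.
    public static HttpWebRequest BuildRequest(string url, string userAgent)
    {
        var request = (HttpWebRequest)WebRequest.Create(url);
        request.UserAgent = userAgent; // HttpWebRequest sends no User-Agent unless one is set
        return request;
    }

    // Downloads the page body as a string using an arbitrary User-Agent value.
    public static string Download(string url)
    {
        var request = BuildRequest(url, "Mozilla/5.0 (compatible; MyScraper/1.0)");
        using (var response = (HttpWebResponse)request.GetResponse())
        using (var reader = new StreamReader(response.GetResponseStream()))
        {
            return reader.ReadToEnd();
        }
    }
}
```

The string returned by `ScrapeSketch.Download(strUrl)` can then be passed to `doc.LoadHtml(...)` as in the question.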
+1  A: 

To troubleshoot, check the value of objResponse.StatusCode and objResponse.StatusDescription:

string result;
var objRequest = System.Net.HttpWebRequest.Create(strUrl);
var objResponse = (System.Net.HttpWebResponse) objRequest.GetResponse();

Console.WriteLine(objResponse.StatusCode);
Console.WriteLine(objResponse.StatusDescription);
...
codeape
I cannot access objResponse.StatusCode and objResponse.StatusDescription.
Dejan.S
Why not? I assume you have tried something like: ``Console.WriteLine(objResponse.StatusCode);`` just before your using statement. What happens when you try that? Does the code not compile? Does it crash at runtime? What value is printed?
codeape
objResponse does not contain that member, StatusCode.
Dejan.S
I see, it is an abstract WebResponse, I guess. Change the "var objResponse = " line to "var objResponse = (HttpWebResponse) objRequest.GetResponse();"
codeape
Dejan.S
What does it look like if you browse to ``strUrl`` in a web browser? Make 100% sure you try the same URL as your code tries, copy/paste from a debugger view or something. And double check that your browser does not add anything to the request URL (like a missing trailing slash).
codeape
I have checked strUrl several times; it is correct.
Dejan.S
It is a bit unclear whether the result variable is null after the result = sr.ReadToEnd(); statement. Is this the case?
codeape
Yes, it is still null after sr.Close();
Dejan.S
A: 

This is from the W3C validator. Might it have something to do with this?

Sorry, I am unable to validate this document because on line 422 it contained one or more bytes that I cannot interpret as utf-8 (in other words, the bytes found are not valid values in the specified Character Encoding). Please check both the content of the file and the character encoding indication.

The error was: utf8 "\xA9" does not map to Unicode

Dejan.S
Hmmm, it seems your page encoding is set to UTF-8 (which implies that byte sequence is not valid). Try a default encoding, e.g. Latin1, instead.
leppie
When you have more info on the original question, it is generally better to edit the question. I have copied and pasted into the original question.
codeape
I don't really know how to encode the site in Latin1 because it belongs to the partner site. Any ideas? @codeape - I will do that from now on, thanks.
Dejan.S
+1  A: 

The problem appears to be the character in the comment on line 421:

<!-- KalenderMx v1.4 © by shiba-design.de -->

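which is the byte 0xA9. That byte is the copyright sign in iso-8859-1 but is not valid on its own in UTF-8, which is how the validator read the page, even though the page declares iso-8859-1: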

<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">

You might try running the parsed document string through a filter to convert or remove the offending characters in the string before evaluating it with the htmlAgilityPack LoadHtml().
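One way to do the conversion, rather than stripping characters, is to decode the response bytes with the encoding the page declares. The sketch below assumes the body really is iso-8859-1; `EncodingSketch` and `DecodeLatin1` are hypothetical names.

```csharp
using System;
using System.IO;
using System.Text;

static class EncodingSketch
{
    // Hypothetical helper: reads a response body as iso-8859-1 (Latin1)
    // instead of StreamReader's default of UTF-8.
    public static string DecodeLatin1(Stream body)
    {
        var latin1 = Encoding.GetEncoding("iso-8859-1");
        using (var reader = new StreamReader(body, latin1))
        {
            return reader.ReadToEnd();
        }
    }
}
```

With the code from the question this would be called as `EncodingSketch.DecodeLatin1(objResponse.GetResponseStream())`, and the resulting string passed to `LoadHtml()`. This only affects how bytes such as 0xA9 are turned into characters; it would not by itself explain a completely empty response.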

Mads Hansen
My problem occurs before I can do anything with HtmlAgilityPack. My result is null when I scrape the site. I bet it has to do with the comment, but I don't know how to solve it.
Dejan.S
A: 

It is possible that the site doesn't allow scraping. Check with the site owner.

Bob