Hi. I am building a site that need to scrape information from a partner site. Now my scraping code works great with other sites but not this one. It is a regular .html site. My thoughts is that it might be generated some how with php (site is build with php).
I have no idea I am just taking a guess about the generated part and I would need your pros help on this. If it matters here is my code I use. The htmlDocument is htmlAgilityPack but that has nothing to do with it. Result is null on the site I try.
string result;
var objRequest = System.Net.HttpWebRequest.Create(strUrl);
var objResponse = objRequest.GetResponse();
using (var sr = new StreamReader(objResponse.GetResponseStream()))
{
result = sr.ReadToEnd();
sr.Close();
var doc = new HtmlDocument();
doc.LoadHtml(result);
foreach (var c in doc.DocumentNode.SelectNodes("//a[@href]"))
{
litStatus.Text += c.Attributes["href"].Value + "<br />";
}
}
EDIT:
this is from the w3 validator, might have something with this?
Sorry, I am unable to validate this document because on line 422 it contained one or more bytes that I cannot interpret as utf-8 (in other words, the bytes found are not valid values in the specified Character Encoding). Please check both the content of the file and the character encoding indication.
The error was: utf8 "\xA9" does not map to Unicode