Hi, I am working on a scraping app and wanted to try to get it working, but ran into a problem. I have replaced the original scraping destination in the code below with Google's web page, just for testing. It seems that my download doesn't get everything: I noticed that the <body> and <html> tags are missing their closing tags. How do I get it to download everything? What's wrong with my sample code:

// Requires: System.IO, System.Net, System.Text, System.Web
string filename = "test.html";

using (WebClient client = new WebClient())
{
    // Build the query string: ?q=<search term>&hl=en
    string searchTerm = HttpUtility.UrlEncode(textBox2.Text);
    client.QueryString.Add("q", searchTerm);
    client.QueryString.Add("hl", "en");
    string data = client.DownloadString("http://www.google.com/search");

    // Save the response to disk
    using (StreamWriter writer = new StreamWriter(filename, false, Encoding.Unicode))
    {
        writer.Write(data);
    }
}
A: 

...Google's page doesn't have the closing tags for <body> and <html>. Talk about crazy optimization...

Matti Virkkunen
A: 

http://www.google.com/search doesn't have closing tags.

Marcelo Cantos
+3  A: 

Google's web pages are now HTML5, meaning the closing BODY and HTML tags can be omitted, which is why Google leaves them out (believe it or not, it saves them bandwidth.)

See this article.
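
In fact, you can verify that the download itself is complete; the closing tags just never appear in the markup. A minimal console sketch (using the same WebClient call as in the question):

// Sketch: confirm the response is complete even though the closing
// tags are absent from Google's markup itself.
using System;
using System.Net;

class Check
{
    static void Main()
    {
        using (WebClient client = new WebClient())
        {
            string html = client.DownloadString("http://www.google.com/");
            Console.WriteLine("Downloaded {0} characters", html.Length);
            Console.WriteLine("Has </body>: {0}", html.Contains("</body>")); // expected: False
            Console.WriteLine("Has </html>: {0}", html.Contains("</html>")); // expected: False
        }
    }
}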

You can write HTML5 in either "HTML/SGML" mode (which allows omitting closing tags, as HTML did prior to XHTML) or in "XHTML" mode, which follows the rules of XML and requires all tags to be closed.
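
If you need the file you save to contain the closing tags anyway, one option is to re-serialize the download through a tolerant parser. A minimal sketch, assuming the third-party HtmlAgilityPack library is referenced:

// Sketch: HtmlAgilityPack parses the forgiving HTML/SGML syntax and can
// close any tags still open at the end of the document when saving.
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.OptionAutoCloseOnEnd = true; // close tags left open at end-of-document
doc.LoadHtml(data);              // "data" is the string from DownloadString above
doc.Save("test.html");           // the saved markup should now be fully closed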

Which mode the browser uses to parse the page depends on the Content-Type header you send: text/html for HTML/SGML syntax, or application/xhtml+xml for XHTML syntax. (Source: http://stackoverflow.com/questions/1076897/html5-syntax-html-vs-xhtml)
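
To illustrate, here is a minimal self-hosted sketch (using HttpListener; the address and markup are placeholders) showing that the header, not the markup, selects the parsing mode:

// Sketch: the Content-Type header decides which parser the browser
// applies to the response body.
using System;
using System.Net;
using System.Text;

class ContentTypeDemo
{
    static void Main()
    {
        HttpListener listener = new HttpListener();
        listener.Prefixes.Add("http://localhost:8080/"); // placeholder address
        listener.Start();

        HttpListenerContext context = listener.GetContext();
        // text/html             -> forgiving HTML/SGML parsing (closing tags optional)
        // application/xhtml+xml -> strict XML parsing (every tag must be closed)
        context.Response.ContentType = "text/html";
        byte[] body = Encoding.UTF8.GetBytes(
            "<!DOCTYPE html><title>demo</title><p>No closing tags required here");
        context.Response.OutputStream.Write(body, 0, body.Length);
        context.Response.Close();
        listener.Stop();
    }
}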

Andy Shellam