I am "screen scraping" my own pages as a temporary hack, using .NET's WebRequest.

This works well, but accented and diacritical characters are not decoded correctly.

I am wondering if there is a way to make them come through correctly using .NET's many built-in properties and methods.

Here is the code I am using to grab the pages:

private string getArticle(string urlToGet)
{
    //Here's the work horse of what we're doing: the WebRequest object
    //fetches the URL
    WebRequest objRequest = WebRequest.Create(urlToGet);

    //The WebResponse object gets the request's response (the HTML)
    using (WebResponse objResponse = objRequest.GetResponse())
    //Dump the contents of the response stream into a StreamReader
    using (StreamReader oSR = new StreamReader(objResponse.GetResponseStream()))
    {
        //...and read the StreamReader into a string
        string strContent = oSR.ReadToEnd();

        //Set up our regular expression to snatch what's between the
        //BEGIN and END markers
        Regex regex = new Regex("<!-- content_starts_here //-->((.|\n)*?)<!-- content_ends_here //-->",
            RegexOptions.IgnoreCase);

        //Apply the regular expression to the string
        Match oM = regex.Match(strContent);

        //Return the value from our Match, and we're in business
        return oM.Value;
    }
}
+1  A: 

try using:

System.Net.WebClient client = new System.Net.WebClient();
string html = client.DownloadString(urlToGet);
string decoded = System.Web.HttpUtility.HtmlDecode(html);

Also, check out client.Encoding — it controls how DownloadString decodes the response bytes.
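A minimal sketch of that approach, setting the encoding before downloading. ISO-8859-1 is an assumption here; use whatever charset your pages are actually served with (urlToGet is the variable from the question):

```csharp
using System.Net;
using System.Text;

WebClient client = new WebClient();
//Tell DownloadString how to decode the response bytes;
//ISO-8859-1 is an assumed charset -- match it to your pages
client.Encoding = Encoding.GetEncoding("ISO-8859-1");
string html = client.DownloadString(urlToGet);
string decoded = System.Web.HttpUtility.HtmlDecode(html);
```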

Igor Soarez
A: 

There's another way to handle that, using the second parameter of the StreamReader constructor, like this:

new StreamReader(webRequest.GetResponse().GetResponseStream(), 
                 Encoding.GetEncoding("ISO-8859-1"));

That should do it.
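If the charset varies from page to page, it can be read from the response headers rather than hard-coded. A sketch using HttpWebResponse.CharacterSet, with ISO-8859-1 as an assumed fallback when the server declares nothing:

```csharp
using System.Net;
using System.Text;
using System.IO;

HttpWebRequest request = (HttpWebRequest)WebRequest.Create(urlToGet);
using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
{
    //CharacterSet comes from the Content-Type header; it may be empty
    string charset = response.CharacterSet;
    Encoding encoding = string.IsNullOrEmpty(charset)
        ? Encoding.GetEncoding("ISO-8859-1")   //assumed fallback
        : Encoding.GetEncoding(charset);

    using (StreamReader reader = new StreamReader(response.GetResponseStream(), encoding))
    {
        string html = reader.ReadToEnd();
    }
}
```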

Tucif