We've got a page which posts data to our ASP.NET app in ISO-8859-1:

<head>
    <META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859-1">
    <title>Sample Search Invoker</title>
</head>
<body>

<form name="advancedform" method="post" action="SearchResults.aspx">
    <input class="field" name="SearchTextBox" type="text" />
    <input class="button" name="search" type="submit" value="Search &gt;" />
</form>

and in the code-behind (SearchResults.aspx.cs):

System.Collections.Specialized.NameValueCollection postedValues = Request.Form;
String nextKey;
for (int i = 0; i < postedValues.AllKeys.Length; i++)
{
    nextKey = postedValues.AllKeys[i];

    if (nextKey.Substring(0, 2) != "__")
    {
        // Get basic search text
        if (nextKey.EndsWith(XAEConstants.CONTROL_SearchTextBox))
        {
            // Get search text value
            String sSentSearchText = postedValues[i];

            System.Text.Encoding iso88591 = System.Text.Encoding.GetEncoding("iso-8859-1");
            System.Text.Encoding utf8 = System.Text.Encoding.UTF8;

            byte[] abInput = iso88591.GetBytes(sSentSearchText);

            sSentSearchText = utf8.GetString(System.Text.Encoding.Convert(iso88591, utf8, abInput));

            this.SearchText = sSentSearchText.Replace('<', ' ').Replace('>',' ');
            this.PreviousSearchText.Value = this.SearchText;
        }
    }
}

When we pass through Merkblätter, it gets pulled out of postedValues[i] as Merkbl�tter. The raw string is Merkbl%ufffdtter.

Any ideas?

+1  A: 

I think adding your encoding to web.config like this will probably solve your problem:

<configuration>
   <system.web>
      <globalization
           fileEncoding="iso-8859-1"
           requestEncoding="iso-8859-1"
           responseEncoding="iso-8859-1"
           culture="en-US"
           uiCulture="en-US"
        />
   </system.web>
</configuration>
Canavar
Yeah, that is an option I had considered, but unfortunately there are other issues with doing that...
Gordon Carpenter-Thompson
A: 

That's because you are encoding the string as ISO-8859-1 and decoding it as if it was a string encoded as UTF-8. This will surely mess up the data.

The form isn't posting the data as ISO-8859-1 just because you send the page using that encoding. You haven't specified any encoding for the form data, so the browser will choose an encoding that is capable of handling the data in the form. It may choose ISO-8859-1, but it may just as well choose some other encoding.

The data is sent to the server, where it's decoded and put in the Request.Form collection, according to the encoding that the browser specifies.

All you have to do is to read the string that has already been decoded from the Request.Form collection. You don't have to loop through all the items in the collection either, as you already know the name of the text box.

Just do:

string sentSearchText = Request.Form["SearchTextBox"];
Guffa
"The form isn't posting the data as ISO-8859-1 at all." I don't think that is true,browsers use the Content-Type header of the received HTML to determine what encoding it will use to post the content of a form.
AnthonyWJones
Hmm, how do I post the form as ISO-8859-1? Thanks for the comment on the Request.Form stuff; this is inherited code and it worked, so I never looked into fixing it.
Gordon Carpenter-Thompson
Use accept-charset="ISO-8859-1" in the form tag to specify the encoding.
Guffa
@Guffa: The problem is that the post is going as ISO-8859-1 already; even with this explicit accept-charset attribute, the server still doesn't know what the encoding of the incoming request is. The data is sent as application/x-www-form-urlencoded, which a) doesn't carry a charset (because it's application/* data) and b) the only sensible value would be US-ASCII, because that's the encoding used in URL encoding.
AnthonyWJones
It's what happens to the character octets during URL decoding where things are getting messed up. The server assumes that once the %xx byte values are resolved, the complete set of bytes for each name and value should be treated as UTF-8. The only place that this particular server behaviour can be modified is web.config (according to Canavar; I haven't checked that myself).
AnthonyWJones
If the server is decoding the data as UTF-8, you should use that in your form: accept-charset="UTF-8".
Guffa
+3  A: 

You have this line of code:-

String sSentSearchText = postedValues[i];

The decoding of the octets in the post has already happened here.

The problem is that META http-equiv doesn't tell the server about the encoding.
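To make the point concrete, here is a minimal sketch of what happens to the posted bytes, assuming the browser really did post ISO-8859-1 and the server decoded them as UTF-8. The class and method names are hypothetical and are not part of the question's code; the Reconverted method only mirrors the conversion the question attempts.

```csharp
using System;
using System.Text;

static class PostDecodingSketch
{
    // "Merkblätter" as ISO-8859-1 bytes: the ä is the single byte 0xE4 (%E4 on the wire).
    public static byte[] Latin1Bytes()
    {
        return Encoding.GetEncoding("ISO-8859-1").GetBytes("Merkbl\u00E4tter");
    }

    // What the server ends up with if it treats those bytes as UTF-8:
    // 0xE4 on its own is not a valid UTF-8 sequence, so it is replaced by U+FFFD.
    public static string DecodedAsUtf8()
    {
        return Encoding.UTF8.GetString(Latin1Bytes());
    }

    // The same bytes decoded with the encoding they were actually written in.
    public static string DecodedAsLatin1()
    {
        return Encoding.GetEncoding("ISO-8859-1").GetString(Latin1Bytes());
    }

    // Mirrors the question's conversion: re-encode the already decoded string
    // as ISO-8859-1, convert the bytes to UTF-8, and decode again.
    public static string Reconverted(string damaged)
    {
        Encoding latin1 = Encoding.GetEncoding("ISO-8859-1");
        byte[] bytes = latin1.GetBytes(damaged);
        return Encoding.UTF8.GetString(Encoding.Convert(latin1, Encoding.UTF8, bytes));
    }
}
```

DecodedAsUtf8() has U+FFFD where the ä should be, and passing that string to Reconverted does not bring the ä back, because the original byte was discarded when the server decoded the post.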

You could just add RequestEncoding="ISO-8859-1" to the @Page directive and stop trying to fiddle around with the decoding yourself (since it's already happened).

That doesn't help either. It seems you can only specify the Request encoding in the web.config.

Better would be to stop using ISO-8859-1 altogether and leave it with the default UTF-8 encoding. I can see no gain and only pain with using a restrictive encoding.

Edit

If changing the posting form's encoding is not a possibility, then we seem to be left with no alternative but to handle the decoding ourselves. To that end, include these two static methods in your receiving code-behind:-

// Requires: using System.Collections.Specialized; using System.Text; using System.Web;

// Reads the raw post body (URL-encoded, so plain ASCII) and decodes it with the given encoding.
private static NameValueCollection GetEncodedForm(System.IO.Stream stream, Encoding encoding)
{
    System.IO.StreamReader reader = new System.IO.StreamReader(stream, Encoding.ASCII);
    return GetEncodedForm(reader.ReadToEnd(), encoding);
}

// Splits name=value pairs and URL-decodes each part with the given encoding.
private static NameValueCollection GetEncodedForm(string urlEncoded, Encoding encoding)
{
    NameValueCollection form = new NameValueCollection();
    string[] pairs = urlEncoded.Split("&".ToCharArray(), StringSplitOptions.RemoveEmptyEntries);

    foreach (string pair in pairs)
    {
        // Keep empty entries so that "=value" is not mistaken for a name
        // and "name=" yields an empty value.
        string[] pairItems = pair.Split("=".ToCharArray(), 2, StringSplitOptions.None);
        string name = HttpUtility.UrlDecode(pairItems[0], encoding);
        string value = (pairItems.Length > 1) ? HttpUtility.UrlDecode(pairItems[1], encoding) : null;
        form.Add(name, value);
    }
    return form;
}

Now instead of assigning:-

postedValues = Request.Form;

use:-

postedValues = GetEncodedForm(Request.InputStream, Encoding.GetEncoding("ISO-8859-1"));

You can now remove the encoding malarkey from the rest of the code.

AnthonyWJones
Setting the input page to be UTF-8 would be my ideal option; however, the form is embedded in a customer site and they don't seem to want to change the encoding to UTF-8, so I'm investigating alternatives. Why is encoding such a ballache? I'd happily hunt down and have stern words with the people who came up with this mess if I had the resources :-)
Gordon Carpenter-Thompson
Encoding isn't a problem in ASP.NET; it's very simple. Leave encoding alone, don't touch it; the default UTF-8 works fine.
AnthonyWJones
In an ideal world I would be using UTF-8, but alas it's not that easy in this app...
Gordon Carpenter-Thompson
A: 

What I ended up doing was forcing our app to be in ISO-8859-1. Unfortunately the underlying data may contain characters which don't fit nicely into that codepage, so we go through the data before displaying it and convert everything above character code 127 into an entity. Not ideal, but it works for us...
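The entity conversion could look roughly like the sketch below. It is only an illustration of the idea, not the code we actually run; the class and method names are made up, and it assumes decimal numeric character references are acceptable in the output.

```csharp
using System;
using System.Text;

static class EntityEncoder
{
    // Replaces every character above code 127 with a decimal numeric
    // character reference, so the output is pure ASCII.
    public static string ToNumericEntities(string text)
    {
        StringBuilder sb = new StringBuilder(text.Length);
        for (int i = 0; i < text.Length; i++)
        {
            int codePoint = text[i];

            // Combine a valid surrogate pair into a single code point.
            if (char.IsHighSurrogate(text[i]) && i + 1 < text.Length && char.IsLowSurrogate(text[i + 1]))
            {
                codePoint = char.ConvertToUtf32(text[i], text[i + 1]);
                i++;
            }

            if (codePoint > 127)
            {
                sb.Append("&#").Append(codePoint).Append(';');
            }
            else
            {
                sb.Append((char)codePoint);
            }
        }
        return sb.ToString();
    }
}
```

For example, EntityEncoder.ToNumericEntities("Merkblätter") gives "Merkbl&#228;tter", which displays correctly whatever encoding the page is served in.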

Gordon Carpenter-Thompson
A: 

Hi, I had the same problem, solved like this:

  System.Text.Encoding iso_8859_2 = System.Text.Encoding.GetEncoding("ISO-8859-2");
  System.Text.Encoding utf_8 = System.Text.Encoding.UTF8;

  NameValueCollection n = HttpUtility.ParseQueryString("RT=A+v%E1s%E1rl%F3+nem+enged%E9lyezte+a+tranzakci%F3t", iso_8859_2);
  Response.Write(n["RT"]);

A+v%E1s%E1rl%F3+nem+enged%E9lyezte+a+tranzakci%F3t will return "A vásárló nem engedélyezte a tranzakciót" as expected.

balint
A: 
Function urlDecode(input)
 inp = Replace(input,"/","%2F")
 set conn = Server.CreateObject("MSXML2.ServerXMLHTTP")
 conn.setOption(2) = SXH_SERVER_CERT_IGNORE_ALL_SERVER_ERRORS
 conn.open "GET", "http://www.neoturk.net/urldecode.asp?url=" & inp, False
 conn.send ""
 urlDecode = conn.ResponseText
End Function

To speed this up, create a table in your database for decoded and encoded URLs and load it in the Application_OnStart section of global.asa, storing the entries on the Application object. Then have the function above check that Application object first: if the decoded URL is not in the application array, request it once from the remote page (tip: urldecode.asp should be on a different server, see http://support.microsoft.com/default.aspx?scid=kb;en-us;Q316451), insert it into your database and append it to the application array; otherwise, return the value from the Application object.

This is the best method I have ever found. If anybody wants further details on application object, database operations etc. contact me via [email protected]

You can see above method successfully working at: lastiktestleri.com/Home

I also used HeliconTech's ISAPI_Rewrite Lite version. Usage is simple: url = Request.ServerVariables("HTTP_X_REWRITE_URL") returns the exact URL that was directed to /404.asp.

neoturk.net
A: 

We had the same problem that you have. The topic is not straightforward at all.

The first tip is to set the Response encoding of the page that posts the data (usually the same page as the one that receives the data in .NET) to the desired form post encoding.

However, this is just a hint to the user's browser on how to interpret the characters sent from the server. The user might choose to override the encoding manually. And, if the user overrides the encoding of the page, the encoding of the data sent in the form is also changed (to whatever the user has set the encoding to).

There is a small trick, though. If you add a hidden field with the name _charset_ (notice the underscores) in your form, most browsers will fill out this form field with the name of the charset used when posting the form. This form field is also a part of the HTML5 specification.

So, you might think you're good to go. However, by the time your page runs, ASP.NET has already URL-decoded all the parameters posted by the form. So when you actually have the value of the _charset_ field, the value of the field containing Merkblätter has already been decoded incorrectly by .NET.

You have two options:

  1. In the ASP.NET page in question, perform the parsing of the request string manually
  2. In Application_BeginRequest, in Global.asax, parse the request parameters manually, extracting the _charset_ field. When you get the value, set Request.ContentEncoding to System.Text.Encoding.GetEncoding(<value of _charset_ field>). If you do this, you can read the value of the field containing Merkblätter as usual, no matter what charset the client sends the value in.

In either of the cases above, you need to read Request.InputStream manually to fetch the form data. I would recommend setting the Response encoding to UTF-8, to have the greatest range of characters you can accept, and then handling separately the special cases where the user has overridden the charset, as described above.
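As a rough sketch of the second option, the helper below pulls the _charset_ value out of a raw URL-encoded body and maps it to an Encoding. The class and method names are hypothetical, and the wiring into Application_BeginRequest is described only in the comments, since that part depends on your application.

```csharp
using System;
using System.Text;

static class CharsetSniffer
{
    // Returns the encoding named by the _charset_ field of a raw
    // application/x-www-form-urlencoded body, or the fallback when the
    // field is missing or names an encoding this runtime does not know.
    //
    // Assumed usage in Global.asax (not shown here): read Request.InputStream
    // as ASCII in Application_BeginRequest, pass the text to this method, and
    // assign the result to Request.ContentEncoding before Request.Form is read.
    public static Encoding DetectFormEncoding(string rawBody, Encoding fallback)
    {
        string[] pairs = rawBody.Split(new char[] { '&' }, StringSplitOptions.RemoveEmptyEntries);
        foreach (string pair in pairs)
        {
            int eq = pair.IndexOf('=');
            if (eq < 0 || pair.Substring(0, eq) != "_charset_")
            {
                continue;
            }

            // Charset names are ASCII, so unescaping needs no particular encoding.
            string charsetName = Uri.UnescapeDataString(pair.Substring(eq + 1));
            try
            {
                return Encoding.GetEncoding(charsetName);
            }
            catch (ArgumentException)
            {
                return fallback;
            }
        }
        return fallback;
    }
}
```

With a body such as "SearchTextBox=Merkbl%E4tter&_charset_=ISO-8859-1" this returns the ISO-8859-1 encoding; with no _charset_ field it returns whatever fallback you pass in, for instance UTF-8.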

Erik A. Brandstadmoen