ansaurus

Question

Why is this appearing in my c# strings: Â£

Answer 1

+7 A:

AppendAllText is writing out the text in UTF-8.

What are you using to look at it? Chances are it's something that doesn't understand UTF-8, or doesn't try UTF-8 first. Tell your editor/viewer that it's a UTF-8 file and all should be well. Alternatively, use the overload of AppendAllText which allows you to specify the encoding and use whichever encoding is going to be most convenient for you.
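To make the mismatch concrete, here is a minimal sketch of what a viewer does when it reads UTF-8 bytes as a one-byte-per-character encoding. It is written in Python purely to expose the bytes; it is not the .NET API, and the variable names are illustrative.

```python
# The pound sign is U+00A3; UTF-8 encodes it as the two bytes C2 A3.
utf8_bytes = "£".encode("utf-8")
print(utf8_bytes.hex())  # c2a3

# A viewer that assumes ISO-8859-1 shows each byte as its own character,
# so the single pound sign turns into two characters.
misread = utf8_bytes.decode("iso-8859-1")
assert misread == "Â£"

# Reading the same bytes as UTF-8 recovers the original text.
assert utf8_bytes.decode("utf-8") == "£"
```

The stray Â is simply the byte 0xC2 displayed on its own.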

EDIT: In response to your edited question, the reason it fails when you encode with ASCII is that £ is not in the ASCII character set (which is Unicode 0-127).
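A small sketch of the same point, again in Python rather than C#: a strict ASCII encoder has nothing to map £ to, and a lossy one substitutes a placeholder. As far as I know, .NET's Encoding.ASCII takes the lossy route by default and emits "?", but treat that as an assumption to check against your framework version.

```python
text = "£"

# Strict ASCII encoding rejects anything above code point 127.
try:
    text.encode("ascii")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# With a lossy error handler the character is replaced by "?".
lossy = text.encode("ascii", errors="replace")
print(strict_ok, lossy)  # False b'?'
```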

URL encoding is also using UTF-8, by the looks of it. Again, if you want to use a different encoding, specify it to the HttpUtility.UrlEncode overload which accepts an encoding.
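The effect of choosing the encoding can be sketched with Python's urllib.parse.quote, which plays roughly the role of HttpUtility.UrlEncode here (an analogy, not the same API): the character is first turned into bytes, and each byte is then percent-encoded.

```python
from urllib.parse import quote

# quote() uses UTF-8 unless told otherwise, so £ becomes two escaped bytes.
utf8_form = quote("£")

# Asking for ISO-8859-1 yields the single byte 0xA3.
latin1_form = quote("£", encoding="iso-8859-1")

print(utf8_form)    # %C2%A3
print(latin1_form)  # %A3
```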

Jon Skeet 2009-03-30 10:09:51

Answer 2

A:

Note that the byte 0xA3 (written as %a3 in a URL) cannot be encoded in ASCII (7-bit, Basic Latin).

The Pound Sign (down the page) is part of the Latin-1 encoding.

gimel 2009-03-30 10:35:03

Answer 3

+2 A:

The default character set of URLs when used in HTML pages and in HTTP headers is called ISO-8859-1 or ISO Latin-1.

It's not the same as UTF-8, and it's not the same as ASCII, but it does fit into one byte per character. The range 0 to 127 matches ASCII, and the whole range 0 to 255 is the same as the range 0000-00FF of Unicode.

So you can generate it from a C# string by casting each character to a byte, or you can use Encoding.GetEncoding("iso-8859-1") to get an object to do the conversion for you.

(In this character set, the UK pound symbol is 163.)
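A quick sketch of why the cast works, in Python for illustration: for characters below U+0100 the ISO-8859-1 byte value is the Unicode code point itself, so "casting" and using the codec agree. The sample string is made up.

```python
text = "café £3"  # every character here is below U+0100

# ISO-8859-1 byte values are the Unicode code points themselves.
by_cast = bytes(ord(ch) for ch in text)
by_codec = text.encode("iso-8859-1")
assert by_cast == by_codec
assert ord("£") == 163 == 0xA3

# The cast approach breaks down above U+00FF: the euro sign is U+20AC.
try:
    bytes([ord("€")])
except ValueError as exc:
    print("cannot cast:", exc)
```

Python refuses the out-of-range value; an unchecked cast to byte in C# would instead silently truncate, so the encoding object is the safer route when the input is not known to be Latin-1 only.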

Background

The RFC says that unencoded text must be limited to the traditional 7-bit US ASCII range, and anything else (plus the special URL delimiter characters) must be encoded. But it leaves open the question of what character set to use for the upper half of the 8-bit range, making it dependent on the context in which the URL appears.

And that context is defined by two other standards, HTTP and HTML, which do specify the default character set, and which together create a practically irresistible force on implementers to assume that the address bar contains percent-encodings that refer to ISO-8859-1.

ISO-8859-1 is the character set of text-based content sent via HTTP except where otherwise specified. So by the time a URL string appears in the HTTP GET header, it ought to be in ISO-8859-1.

The other factor is that HTML also uses ISO-8859-1 as its default, and URLs typically originate as links in HTML pages. So when you craft a simple minimal HTML page in Notepad, the URLs you type into that file are in ISO-8859-1.

It's sometimes described as a "hole" in the standards, but it's not really; it's just that HTML/HTTP fill in the blank left by the RFC for URLs.

Hence, for example, the advice on this page:

URL encoding of a character consists of a "%" symbol, followed by the two-digit hexadecimal representation (case-insensitive) of the ISO-Latin code point for the character.

(ISO-Latin is another name for ISO-8859-1.)
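The quoted rule can be written out by hand. This is only a sketch: percent_encode_latin1 is a hypothetical helper, and the choice of which characters to leave unescaped (letters, digits and "-_.~") is my assumption, not something the quoted advice specifies.

```python
def percent_encode_latin1(text: str) -> str:
    """Hypothetical helper: percent-encode via ISO-8859-1 code points."""
    unreserved = set(
        "abcdefghijklmnopqrstuvwxyz"
        "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
        "0123456789-_.~"
    )
    out = []
    # encode() raises UnicodeEncodeError for characters above U+00FF.
    for byte in text.encode("iso-8859-1"):
        ch = chr(byte)
        # "%" followed by the two-digit hex value of the code point.
        out.append(ch if ch in unreserved else f"%{byte:02x}")
    return "".join(out)

print(percent_encode_latin1("price=£5"))  # price%3d%a35
```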

So much for the theory. Paste this into Notepad, save it as an .html file, and open it in a few browsers. Click the link and Google should search for the UK pound sign.

<HTML>
  <BODY>
    <A href="http://www.google.com/search?q=%a3">Test</A>
  </BODY>
</HTML>

It works in IE, Firefox, Apple Safari, Google Chrome - I don't have any others available right now.

Daniel Earwicker 2009-03-30 10:39:01

This solved my problem perfectly. Just needed to put the iso-8859-1 encoding on my UrlEncode.

Tim Saunders 2009-03-30 10:46:23

Do you have a source for the info about the default character encoding of URLs? I thought it was one of those annoyingly unspecified things. I'm not disputing it, but I'd like to see where it's specified as the default. Btw, you can also use Encoding.GetEncoding(28591) to get ISO-8859-1.

Jon Skeet 2009-03-30 10:47:46

I'm interested in why URLEncode doesn't do this conversion automatically. Since the strings in C# are UTF-8, it would be fairly intuitive for the URLEncode methods to accept such a string and encode it correctly, rather than falling over unless I manually specify the correct encoding.

Tim Saunders 2009-03-30 10:59:24

@tsaunders: No, the strings aren't UTF-8. They're UTF-16. UrlEncode is using UTF-8 by default. As I said, I don't think the URL spec specifies the encoding to be used for UrlEncoded characters, which is a pain.

Jon Skeet 2009-03-30 11:26:38

@Jon Skeet - added some background; it's not in the URL spec, but it is in the HTML and HTTP specs.

Daniel Earwicker 2009-03-30 11:52:11

Are C# strings in UTF-16? Note that 16-bit characters were enough for Unicode version 1, but later versions require more. The 16-bit subset is called UCS-2. UTF-16 can use *multiple* 16-bit codes to represent one character, just as UTF-8 does with 8-bit codes.

Daniel Earwicker 2009-03-30 12:08:23

i.e. if C# strings are in UTF-16, then the 'char' data type is possibly misnamed! :) It might only be half a character.

Daniel Earwicker 2009-03-30 12:09:14

It looks like (as with Win32 generally) it depends on what API the strings are passed to, and how Windows is configured. http://www.catch22.net/tuts/neatpad/11

Daniel Earwicker 2009-03-30 13:26:56

.NET encodes non-BMP characters using surrogate pairs - i.e. it uses UTF-16. Fortunately most C# developers don't usually come across this. This is exactly what Java does too.

Jon Skeet 2009-03-30 13:53:21

From the C# spec, section 1.3: "The char type represents a UTF-16 code unit, and the string type represents a sequence of UTF-16 code units."
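To illustrate what "a sequence of UTF-16 code units" means in practice, here is a Python sketch that splits a character outside the BMP into its code units. The G clef character is an arbitrary example; in C# the same character would, as I understand it, give a string with Length 2.

```python
import struct

clef = "\U0001D11E"  # MUSICAL SYMBOL G CLEF, outside the BMP
raw = clef.encode("utf-16-be")

# Unpack the bytes as big-endian 16-bit code units.
units = struct.unpack(f">{len(raw) // 2}H", raw)
print([hex(u) for u in units])  # ['0xd834', '0xdd1e']

# A BMP character such as the pound sign needs only one code unit.
pound_units = len("£".encode("utf-16-be")) // 2
print(pound_units)  # 1
```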

Jon Skeet 2009-03-30 13:55:36

@Earwicker: I've been looking in RFC 2616 and 2396 btw, and I still can't see anything defining ISO-8859-1 as the default URL encoding. Could you give a spec/RFC reference?

Jon Skeet 2009-03-30 13:58:48

They don't say that. They just specify ISO-8859-1 in every context where a default encoding is discussed (you can use £ unencoded in HTML and HTTP), such that it would be perverse of browser authors to choose any other for URLs - hence "a practically irresistible force on implementers to assume..."

Daniel Earwicker 2009-03-30 14:32:43

Where do they say that though? I'm failing to find any reference to ISO-8859-1 in any spec I look at... Who is "they"?

Jon Skeet 2009-03-30 14:38:27

That's odd. Can you see the string "ISO-8859-1" in this page? http://www.rfc-editor.org/rfc/rfc2616.txt - I count 9.

Daniel Earwicker 2009-03-30 14:57:21

Yup, excellent - I think I must have been looking at a version which was split up into sections, and not looking at the right section...

Jon Skeet 2009-03-30 15:04:28

(It's still not exactly well written, IMO, but at least it's pinned down to some extent. :)

Jon Skeet 2009-03-30 15:07:38

By the standard of some RFCs these are masterpieces! :)

Daniel Earwicker 2009-03-30 15:09:17

Irrespective of whether C# strings are UTF-8 or UTF-16 it is still not clear to me why URLEncode would be implemented in such a way that it doesn't work in its default state with strings defined in C#. Surely it would be more intuitive for it to encode correctly?

Tim Saunders 2009-03-30 16:11:15

URLs' encoding of Unicode characters is not ISO-8859-1, it is undefined. This unfortunately means there is no ‘right’ answer to URL-encoding, but these days more and more web services are converging on using UTF-8, so it's definitely a reasonable default.

bobince 2009-03-30 16:20:19

For URLs generated by a web app, inserted in an HTML page as links and then sent back by the browser, they can use whatever character set they want. But for URLs generated from web form submission by the browser, you have to go with what browsers do.

Daniel Earwicker 2009-03-30 16:49:48

What Browsers Do is dependent on the settings, but more likely UTF-8 than ISO-8859-1. The Google example above only works because Google picks up non-UTF-8 sequences and re-interprets them; if you use the URL “http://www.google.com/search?q=%C2%a3” you do get a search for a single ‘£’ and not ‘Â£’.
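The kind of re-interpretation described here could look something like the sketch below. It is a guess at the general technique (try UTF-8, fall back to ISO-8859-1), not a description of what Google actually runs; decode_query_value is a hypothetical name.

```python
from urllib.parse import unquote, unquote_to_bytes

def decode_query_value(raw: str) -> str:
    """Hypothetical server-side fallback: try UTF-8, then ISO-8859-1."""
    data = unquote_to_bytes(raw)
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        # A lone 0xA3 is not valid UTF-8, so treat the bytes as Latin-1.
        return data.decode("iso-8859-1")

# Both spellings end up as a single pound sign.
assert decode_query_value("%C2%a3") == "£"
assert decode_query_value("%a3") == "£"

# Without the UTF-8 attempt, the two-byte form shows up as two characters.
assert unquote("%C2%a3", encoding="iso-8859-1") == "Â£"
```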

bobince 2009-03-31 13:31:10
