ansaurus

Question

How to generate all the characters in the UTF-8 charset in .NET

Answer 1

+1 A:

UTF-8 isn't a character set - it's a character encoding which is capable of encoding any character in the Unicode character set into binary data.

Could you give more information about what you're trying to do? You could encode all the possible Unicode characters (including ones which aren't allocated at the moment) although if you need to cope with characters outside the basic multilingual plane (i.e. those above U+FFFF) then it becomes slightly trickier...
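As a rough illustration of "all the possible Unicode characters", the sketch below walks every code point from U+0000 to U+10FFFF, skips the surrogate range, and encodes each one as UTF-8. The class and method names (`AllCodePoints`, `ScalarValues`, `ToUtf8`) are made up for this example; only `char.ConvertFromUtf32` and `Encoding.UTF8.GetBytes` come from the framework.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

static class AllCodePoints
{
    // Yields every Unicode scalar value: U+0000..U+10FFFF minus the
    // surrogate range U+D800..U+DFFF, whether or not it is allocated.
    public static IEnumerable<int> ScalarValues()
    {
        for (int cp = 0; cp <= 0x10FFFF; cp++)
        {
            if (cp >= 0xD800 && cp <= 0xDFFF) continue;
            yield return cp;
        }
    }

    public static byte[] ToUtf8(int codePoint)
    {
        // ConvertFromUtf32 returns one char for the basic multilingual plane
        // and a surrogate pair (two chars) for code points above U+FFFF.
        return Encoding.UTF8.GetBytes(char.ConvertFromUtf32(codePoint));
    }

    static void Main()
    {
        long totalBytes = 0;
        int count = 0;
        foreach (int cp in ScalarValues())
        {
            totalBytes += ToUtf8(cp).Length;
            count++;
        }
        Console.WriteLine("{0} code points, {1} UTF-8 bytes", count, totalBytes);
    }
}
```

Working with `int` code points and converting at the last moment sidesteps the "slightly trickier" part, since no loop variable ever has to hold half of a surrogate pair.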

Jon Skeet 2009-11-03 16:47:44

Answer 2

+2 A:

UTF-8 is not a charset, it's an encoding. Any Unicode character can be encoded in UTF-8, using one to four bytes depending on its code point.
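To make the varying byte lengths concrete, here is a small sketch that compares the length predicted from the UTF-8 range boundaries with what `Encoding.UTF8.GetByteCount` reports. The `Utf8Lengths` class and its method names are hypothetical, and the sample code points are arbitrary picks from each range.

```csharp
using System;
using System.Text;

static class Utf8Lengths
{
    // Number of bytes UTF-8 needs for one code point, derived from the
    // range boundaries of the encoding (surrogates are not valid input).
    public static int ExpectedLength(int codePoint)
    {
        if (codePoint < 0x80) return 1;      // U+0000..U+007F
        if (codePoint < 0x800) return 2;     // U+0080..U+07FF
        if (codePoint < 0x10000) return 3;   // U+0800..U+FFFF
        return 4;                            // U+10000..U+10FFFF
    }

    // Length as reported by the framework's own encoder.
    public static int ActualLength(int codePoint)
    {
        return Encoding.UTF8.GetByteCount(char.ConvertFromUtf32(codePoint));
    }

    static void Main()
    {
        int[] samples = { 0x41, 0xE9, 0x20AC, 0x1F600 };
        foreach (int cp in samples)
        {
            Console.WriteLine("U+{0:X4}: expected {1}, actual {2}",
                cp, ExpectedLength(cp), ActualLength(cp));
        }
    }
}
```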

For .net, the characters are 16-bit (it's not the complete set of unicode but is the most practical), so you can try this:

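 // Needs "using System.Text;" for Encoding.
 // i is an int because a char counter would wrap around before reaching 65536.
 for (int i = 0; i <= 0xFFFF; i++) {
     // Skip UTF-16 surrogates (U+D800..U+DFFF); alone they are not characters.
     if (i >= 0xD800 && i <= 0xDFFF) continue;
     string s = ((char)i).ToString();
     byte[] bytes = Encoding.UTF8.GetBytes(s);
     // do something with bytes
 }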

yuku 2009-11-03 16:48:05

Your code is correct, but your second paragraph is misleading. `System.Char` is a 16-bit value, true. But MSDN makes it clear that a `System.Char` is a UTF-16 code unit, which means that it is not technically a character. There are plenty of Unicode characters that can be represented in UTF-8 that have code points above U+FFFF (65535). You say "it's not the complete set of unicode but is the most practical" -- I'm not certain that's true, and it's certainly not a good reason to avoid testing code points above U+FFFF.
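The gap between 16-bit `char` values and actual code points can be seen in a short sketch. `SurrogateDemo` and `CodePoints` are names invented for this example, and the helper assumes well-formed UTF-16 input (`char.ConvertToUtf32` throws on a lone surrogate).

```csharp
using System;
using System.Collections.Generic;

static class SurrogateDemo
{
    // Walks a string by code point rather than by char.
    // Assumes well-formed UTF-16; a lone surrogate makes ConvertToUtf32 throw.
    public static List<int> CodePoints(string s)
    {
        List<int> result = new List<int>();
        for (int i = 0; i < s.Length; i++)
        {
            result.Add(char.ConvertToUtf32(s, i));
            // A high surrogate means the code point used two chars; skip the second.
            if (char.IsHighSurrogate(s[i])) i++;
        }
        return result;
    }

    static void Main()
    {
        // U+1F600 lies above U+FFFF, so it is stored as a surrogate pair.
        string s = "A" + char.ConvertFromUtf32(0x1F600);
        Console.WriteLine("chars: {0}", s.Length);                    // 3
        Console.WriteLine("code points: {0}", CodePoints(s).Count);   // 2
    }
}
```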

Daniel Pryden 2009-11-03 17:03:19

Answer 3

+3 A:

There are no "UTF-8 characters". Do you mean Unicode characters or the UTF-8 encoding of Unicode characters?

It's easy to convert an int to a Unicode character, provided of course that there is a mapping for that code and that the value fits in a 16-bit `char` (U+0000 to U+FFFF):

char c = (char)theNumber;

If you want the UTF-8 encoding for that character, that's not very hard either:

byte[] encoded = Encoding.UTF8.GetBytes(c.ToString());

You would have to check the Unicode standard to see the number ranges where there are Unicode characters defined.
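As an alternative to reading the standard by hand, the framework's own character tables can answer "is this code point defined?", though only for whatever Unicode version the installed runtime ships with. The sketch below is one possible approach; `AssignedCheck` and `IsAssigned` are hypothetical names, and treating surrogates as "not assigned" is a choice made for this example.

```csharp
using System;
using System.Globalization;

static class AssignedCheck
{
    // True if the runtime's Unicode data knows this code point.
    // The answer depends on the Unicode version bundled with the runtime.
    public static bool IsAssigned(int codePoint)
    {
        // Surrogates are code units, not characters; report them as unassigned.
        if (codePoint >= 0xD800 && codePoint <= 0xDFFF) return false;

        string s = char.ConvertFromUtf32(codePoint);
        // The (string, int) overload looks at a whole surrogate pair when present.
        UnicodeCategory category = CharUnicodeInfo.GetUnicodeCategory(s, 0);
        return category != UnicodeCategory.OtherNotAssigned;
    }

    static void Main()
    {
        int[] samples = { 0x41, 0x0378, 0x20AC };
        foreach (int cp in samples)
        {
            Console.WriteLine("U+{0:X4} assigned: {1}", cp, IsAssigned(cp));
        }
    }
}
```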

Guffa 2009-11-03 16:51:11

Answer 4

A:

As other people have said, UTF-8 is an encoding, not a character set; Unicode is the character set it encodes.

If you skim through http://www.joelonsoftware.com/articles/Unicode.html it should help clarify what Unicode is.

Kragen 2009-11-03 16:51:57

Answer 5

+4 A:

Even once you generate all the characters, you'll find it's not an effective test. Some of the characters are combining marks, which means they combine with the character that comes before them - a string full of nothing but combining marks won't make much sense. There are other special cases too. You'll be much better off using actual text in the languages you need to support.
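A small sketch of why combining marks complicate a "one character at a time" test: a letter plus a combining accent is two `char` values but a single text element as far as the user is concerned. `CombiningDemo` and `IsCombiningMark` are illustrative names, and grouping the three mark categories together is this example's own simplification.

```csharp
using System;
using System.Globalization;

static class CombiningDemo
{
    // True for characters in any of the Unicode "mark" categories.
    public static bool IsCombiningMark(char c)
    {
        UnicodeCategory category = CharUnicodeInfo.GetUnicodeCategory(c);
        return category == UnicodeCategory.NonSpacingMark
            || category == UnicodeCategory.SpacingCombiningMark
            || category == UnicodeCategory.EnclosingMark;
    }

    static void Main()
    {
        string decomposed = "e\u0301"; // 'e' plus U+0301 COMBINING ACUTE ACCENT
        Console.WriteLine("chars: {0}", decomposed.Length);
        Console.WriteLine("text elements: {0}",
            new StringInfo(decomposed).LengthInTextElements);
        Console.WriteLine("U+0301 is a mark: {0}", IsCombiningMark('\u0301'));
    }
}
```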

Mark Ransom 2009-11-03 16:52:04

Answer 6

A:

System.Net.WebClient client = new System.Net.WebClient();
string definedCodePoints = client.DownloadString(
                         "http://unicode.org/Public/UNIDATA/UnicodeData.txt");
System.IO.StringReader reader = new System.IO.StringReader(definedCodePoints);
System.Text.UTF8Encoding encoder = new System.Text.UTF8Encoding();
while(true) {
  string line = reader.ReadLine();
  if(line == null) break;
  int codePoint = Convert.ToInt32(line.Substring(0, line.IndexOf(";")), 16);
  if(codePoint >= 0xD800 && codePoint <= 0xDFFF) {
    //surrogate boundary; not valid codePoint, but listed in the document
  } else {
    string utf16 = char.ConvertFromUtf32(codePoint);
    byte[] utf8 = encoder.GetBytes(utf16);
    //TODO: something with the UTF-8-encoded character
  }
}

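The above code should iterate over the currently assigned Unicode characters, with one caveat: UnicodeData.txt lists some large blocks (such as the CJK ideographs and Hangul syllables) only as a pair of `<..., First>` and `<..., Last>` lines, so the loop visits just the two endpoints of each such range. You'll probably want to parse the UnicodeData file locally and fix any C# blunders I've made.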
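For the local parsing, one possible shape is sketched below. It takes lines that have already been read (for instance with `File.ReadLines` on a downloaded copy; the path is up to you) and also expands the `<..., First>` / `<..., Last>` pairs that UnicodeData.txt uses for large ranges. `UnicodeDataParser` and `AssignedCodePoints` are hypothetical names, and the parsing assumes the semicolon-separated layout of UnicodeData.txt with no malformed lines.

```csharp
using System;
using System.Collections.Generic;

static class UnicodeDataParser
{
    // Yields every code point described by the given UnicodeData.txt lines,
    // expanding First/Last range pairs and skipping surrogates.
    public static IEnumerable<int> AssignedCodePoints(IEnumerable<string> lines)
    {
        int rangeStart = -1;
        foreach (string line in lines)
        {
            if (line.Length == 0) continue;
            string[] fields = line.Split(';');
            int codePoint = Convert.ToInt32(fields[0], 16);
            string name = fields[1];

            if (name.EndsWith(", First>", StringComparison.Ordinal))
            {
                rangeStart = codePoint; // remember where the range begins
                continue;
            }

            int first = codePoint;
            if (name.EndsWith(", Last>", StringComparison.Ordinal) && rangeStart >= 0)
            {
                first = rangeStart;
            }
            rangeStart = -1;

            for (int cp = first; cp <= codePoint; cp++)
            {
                if (cp >= 0xD800 && cp <= 0xDFFF) continue;
                yield return cp;
            }
        }
    }

    static void Main()
    {
        // A few lines in the format of UnicodeData.txt, inlined so the
        // example runs without a download.
        string[] sample =
        {
            "0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;",
            "3400;<CJK Ideograph Extension A, First>;Lo;0;L;;;;;N;;;;;",
            "4DBF;<CJK Ideograph Extension A, Last>;Lo;0;L;;;;;N;;;;;"
        };
        int count = 0;
        foreach (int cp in AssignedCodePoints(sample)) count++;
        Console.WriteLine("{0} code points", count);
    }
}
```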

The set of currently assigned Unicode characters is less than the set that could be defined. Of course, whether you see a character when you print one of them out depends on a great many other factors, like fonts and the other applications it'll pass through before it is emitted to your eyeball.

McDowell 2009-11-03 22:34:15
