Is there a way I can convert a .txt file to Unicode using C#?

+3  A: 

Only if you know the original encoding used to produce the .txt file. That's not a restriction of C# or .NET; it's a general problem.

Read The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) to learn why "plain text" is meaningless if you don't know the encoding.
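
To illustrate why the original encoding matters, here is a minimal sketch that decodes the same four bytes two ways. The byte values and the names `EncodingGuessDemo` and `Decode` are made up for this example; they are not from the answer.

```csharp
using System;
using System.Text;

static class EncodingGuessDemo
{
    // Decodes the given bytes using whichever encoding the caller assumes.
    public static string Decode(byte[] bytes, Encoding assumedEncoding)
    {
        return assumedEncoding.GetString(bytes);
    }

    public static void Main()
    {
        // "café" as stored by an ISO-8859-1 editor: é is the single byte 0xE9.
        byte[] bytes = { 0x63, 0x61, 0x66, 0xE9 };

        string asLatin1 = Decode(bytes, Encoding.GetEncoding("iso-8859-1"));
        string asUtf8 = Decode(bytes, Encoding.UTF8);

        // The right guess recovers the text.
        Console.WriteLine(asLatin1 == "caf\u00E9");

        // The wrong guess cannot decode 0xE9 as UTF-8 and substitutes
        // the replacement character U+FFFD instead.
        Console.WriteLine(asUtf8.IndexOf('\uFFFD') >= 0);
    }
}
```

Nothing in the bytes themselves says which reading is correct, which is why the encoding has to come from somewhere else.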

Joachim Sauer
Thank you very much.
intrinsic
Joachim: *Fantastic* article from Joel, thanks for the link. I now have it in my armory of links to redistribute liberally to those what seem to need 'em... Cheers. - T.J.
T.J. Crowder
+3  A: 

Provided you're only using ASCII characters in your text file, they're already Unicode, encoded as UTF-8.

If you want a different encoding of the characters (UTF-16/UCS-2, etc.), any language that supports Unicode should be able to read in one encoding and write out another.

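The System.Text.Encoding stuff will do it, as per the following example: it converts a UTF-16 string to both UTF-8 and ASCII and then back again (code gratuitously stolen from here). Note that the ASCII round trip is lossy: the Π character has no ASCII representation, so it comes back as "?".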

using System;
using System.IO;
using System.Text;

class Test {
    public static void Main() {        
        using (StreamWriter output = new StreamWriter("practice.txt")) {
            string srcString = "Area = \u03A0r^2"; // PI.R.R

            // Convert the UTF-16 encoded source string to UTF-8 and ASCII.
            byte[] utf8String = Encoding.UTF8.GetBytes(srcString);
            byte[] asciiString = Encoding.ASCII.GetBytes(srcString);

            // Write the UTF-8 and ASCII encoded byte arrays. 
            output.WriteLine("UTF-8  Bytes: {0}",
                BitConverter.ToString(utf8String));
            output.WriteLine("ASCII  Bytes: {0}",
                BitConverter.ToString(asciiString));

            // Convert UTF-8 and ASCII encoded bytes back to UTF-16 encoded  
            // string and write.
            output.WriteLine("UTF-8  Text : {0}",
                Encoding.UTF8.GetString(utf8String));
            output.WriteLine("ASCII  Text : {0}",
                Encoding.ASCII.GetString(asciiString));

            Console.WriteLine(Encoding.UTF8.GetString(utf8String));
            Console.WriteLine(Encoding.ASCII.GetString(asciiString));
        }
    }
}
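
For the file-to-file case the question asks about, the same idea can be sketched with `File.ReadAllText` and `File.WriteAllText`. This is only a sketch under assumptions: the file names, the `Transcoder` class, and the choice of ASCII as the source encoding are placeholders, and the real source encoding has to be known in advance.

```csharp
using System;
using System.IO;
using System.Text;

static class Transcoder
{
    // Reads a text file in a known source encoding and rewrites it as UTF-16 LE.
    public static void ConvertToUtf16(string inputPath, string outputPath, Encoding sourceEncoding)
    {
        // Decode the bytes on disk into an in-memory (UTF-16) string.
        string text = File.ReadAllText(inputPath, sourceEncoding);

        // Encoding.Unicode is .NET's name for little-endian UTF-16;
        // WriteAllText also emits its byte order mark.
        File.WriteAllText(outputPath, text, Encoding.Unicode);
    }

    public static void Main()
    {
        // Hypothetical input: the word "Area" stored as plain ASCII.
        File.WriteAllBytes("input.txt", new byte[] { 0x41, 0x72, 0x65, 0x61 });

        ConvertToUtf16("input.txt", "output.txt", Encoding.ASCII);

        // Dump the result as hex to see the two-byte code units.
        Console.WriteLine(BitConverter.ToString(File.ReadAllBytes("output.txt")));
    }
}
```

Swapping `Encoding.Unicode` for `Encoding.UTF8` or `Encoding.UTF32` gives the other Unicode encodings without changing anything else.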
paxdiablo
Thank you so much for your help. I figured out I'm quite illiterate in Unicode and encoding stuff. Cheers :)
intrinsic
+1 - great answer
Russ Cam
@intrinsic, the vast majority of people are illiterate with regard to Unicode, especially those who think they're not :-) I only discovered how complex it really is in the last couple of years (we now ship software which is localized to twenty-plus different major locales and even more minor ones).
paxdiablo
@Pax, amazing :) I wanted the .txt-to-Unicode thing for my semester project (making a lexical analyzer for C#). Actually, we are trying many things to get the job done. Thanks again.
intrinsic
A: 

If you do really need to change the encoding (see Pax's answer about UTF-8 being valid Unicode), then yes, you can do that quite easily. Check out the System.Text.Encoding class.
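
As a rough sketch of what that class offers, `Encoding.Convert` re-encodes a byte array from one encoding to another in a single call. The `ConvertDemo` class and `Utf8ToUtf16` helper are names invented for this example.

```csharp
using System;
using System.Text;

static class ConvertDemo
{
    // Re-encodes UTF-8 bytes as little-endian UTF-16 bytes (no byte order mark).
    public static byte[] Utf8ToUtf16(byte[] utf8Bytes)
    {
        return Encoding.Convert(Encoding.UTF8, Encoding.Unicode, utf8Bytes);
    }

    public static void Main()
    {
        // U+03A0 (Greek capital pi) takes two bytes in UTF-8: CE A0.
        byte[] utf8 = Encoding.UTF8.GetBytes("\u03A0");
        byte[] utf16 = Utf8ToUtf16(utf8);

        Console.WriteLine(BitConverter.ToString(utf8));   // CE-A0
        Console.WriteLine(BitConverter.ToString(utf16));  // A0-03
    }
}
```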

T.J. Crowder
A: 

There is a nice page on MSDN about this, including a whole example:

    // Specify the code page to correctly interpret byte values
    Encoding encoding = Encoding.GetEncoding(737); //(DOS) Greek code page
    byte[] codePageValues = System.IO.File.ReadAllBytes(@"greek.txt");

    // Same content is now encoded as UTF-16
    string unicodeValues = encoding.GetString(codePageValues);
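
One caveat worth sketching: on .NET Core and .NET 5 or later, legacy code pages such as 737 are not available until the code pages provider is registered (on older targets this may also need the System.Text.Encoding.CodePages package). The example below assumes such a runtime; `CodePageDemo`, `DecodeCodePage`, and the sample bytes are hypothetical.

```csharp
using System;
using System.Text;

static class CodePageDemo
{
    // Decodes bytes stored in a legacy code page into a .NET (UTF-16) string.
    public static string DecodeCodePage(byte[] bytes, int codePage)
    {
        // Assumption: running on .NET Core / .NET 5+, where this registration
        // is required before GetEncoding can resolve legacy code pages.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        Encoding encoding = Encoding.GetEncoding(codePage);
        return encoding.GetString(bytes);
    }

    public static void Main()
    {
        // Three sample bytes from the upper half of the DOS Greek code page.
        byte[] codePageValues = { 0x80, 0x81, 0x82 };

        string unicodeValues = DecodeCodePage(codePageValues, 737);

        // Each single-byte code page value becomes one UTF-16 character.
        Console.WriteLine(unicodeValues.Length);
    }
}
```

On the classic .NET Framework the registration call is not needed, since those code pages are built in there.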
Stefan Steinegger