ansaurus

Question

Finding Unicode for non-English characters

Answer 1

A:

python -c "print repr('text goes here'.decode('utf-8'))"

It may not always be 'utf-8', but that is a sane starting point.

Ignacio Vazquez-Abrams 2010-01-17 06:53:15

Java? Python? ?_?

KennyTM 2010-01-17 07:23:00

Well... I just happen to know how to do it in Python...

Ignacio Vazquez-Abrams 2010-01-17 07:26:56

I tried this: python -c "print repr(u'श्रावण')" and this was the output I got: u'\xe0\xa4\xb6\xe0\xa4\xb6\xe0\xa5\x8d\xe0\xa4\xb0\xe0\xa4\xbe\xe0\xa4\xb5\xe0\xa4\xa3' (Note: when I try string.decode('utf-8') in the above command, I get an error saying the ascii codec cannot decode characters in positions 0-17.) I copied the result and pasted it into my Java program, but could not use it as such: the string was printed literally rather than as the corresponding characters. I then translated each entity in the result from the form \xab to \u00ab. After I did this, the result was nowhere near..

Aadith 2010-01-17 07:49:52

..the original string. Any idea what's wrong? What do I have to do to print the original string from my program using the generated Unicode? Any help would be great.

Aadith 2010-01-17 07:51:04

Drop the `u` at the beginning of the string literal if you use `.decode()`. The first output you got is the string in UTF-8.

Ignacio Vazquez-Abrams 2010-01-17 07:52:30
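
To illustrate why rewriting each \xab as \u00ab could not work: the values in that output are UTF-8 bytes, three per character here, not Unicode code points. The following is only a sketch in Java (the class and method names are made up for illustration); it decodes the 18 UTF-8 bytes of श्रावण properly and then, for contrast, treats every byte as its own character, which is what the \u00ab rewrite amounts to.

```java
import java.nio.charset.StandardCharsets;

final class Utf8BytesDemo {
    // UTF-8 bytes of the six code points that make up the word
    static final byte[] UTF8_BYTES = {
        (byte) 0xe0, (byte) 0xa4, (byte) 0xb6,  // U+0936
        (byte) 0xe0, (byte) 0xa5, (byte) 0x8d,  // U+094D
        (byte) 0xe0, (byte) 0xa4, (byte) 0xb0,  // U+0930
        (byte) 0xe0, (byte) 0xa4, (byte) 0xbe,  // U+093E
        (byte) 0xe0, (byte) 0xa4, (byte) 0xb5,  // U+0935
        (byte) 0xe0, (byte) 0xa4, (byte) 0xa3   // U+0923
    };

    // Interprets the bytes as UTF-8, yielding the original six characters.
    static String decodeAsUtf8() {
        return new String(UTF8_BYTES, StandardCharsets.UTF_8);
    }

    // Maps each byte to one character, like replacing every byte escape
    // with a code point escape of the same value.
    static String treatBytesAsCodePoints() {
        return new String(UTF8_BYTES, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        System.out.println(decodeAsUtf8().length());           // 6
        System.out.println(treatBytesAsCodePoints().length()); // 18
    }
}
```

The second string has 18 Latin-1 characters instead of 6 Devanagari ones, which matches the "nowhere near the original" result described above.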

I tried it without the 'u' at the beginning, and the command worked this time. But when I paste the result into my Java program and try to print it, it just prints a series of question marks. Do I have to do anything more than paste it in as a string literal?

Aadith 2010-01-17 08:08:33

That I don't know. But it may just be that your terminal/console can't print out Devanagari characters and so is doing an "invalid character" replacement.

Ignacio Vazquez-Abrams 2010-01-17 08:12:55
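
A rough Java sketch of that "invalid character" replacement, assuming the output encoding is the culprit (which may not be the case in every setup): a PrintStream whose charset cannot represent a character writes a question mark in its place. The class and helper names below are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

final class ConsoleEncodingDemo {
    // Wraps a stream so that text written to it is encoded with the named charset.
    static PrintStream streamFor(OutputStream target, String charsetName) {
        try {
            return new PrintStream(target, true, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charsetName, e);
        }
    }

    // Returns the bytes a PrintStream emits for the text in the given charset.
    static byte[] encodedOutput(String text, String charsetName) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        PrintStream out = streamFor(buffer, charsetName);
        out.print(text);
        out.flush();
        return buffer.toByteArray();
    }

    public static void main(String[] args) {
        String word = "\u0936\u094d\u0930\u093e\u0935\u0923";
        // US-ASCII cannot represent Devanagari, so each character becomes '?'
        System.out.println(encodedOutput(word, "US-ASCII").length + " bytes as US-ASCII");
        System.out.println(encodedOutput(word, "UTF-8").length + " bytes as UTF-8");
        // Printing through an explicit UTF-8 stream; the terminal must also expect UTF-8
        streamFor(System.out, "UTF-8").println(word);
    }
}
```

Whether the last line displays correctly still depends on the terminal's own encoding and font support.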

Thanks Ignacio, changing the encoding for the source file got it working.

Aadith 2010-01-17 08:26:40

Answer 2

A:

The code example at the end of this post might help

Mihir Mathuria 2010-01-17 07:25:40

Answer 3

+3 A:

In which code page do you have that string? Java sources can be in any encoding, so you can put that string right in the source and use the compiler's options to set the code page. See NetBeans -> Project node -> Properties -> Source -> Encoding.

Ondra Žižka 2010-01-17 07:41:55

I am using Eclipse on a Mac. The source files were being encoded as "MacRoman" (I found this under Project Properties -> Resource -> Text file encoding). I changed it to "UTF-8", then embedded the actual non-English string in the program and tried printing it. It worked. Can you please explain the underlying concept?

Aadith 2010-01-17 08:17:01

Answer 4

+1 A:

As previous answers said, you can definitely write strings containing characters that can't be encoded in the conventional ISO-8859-1 or US-ASCII character sets directly in the source file. You do need to make sure your IDE saves the file as UTF-8. And you may need to add "-encoding UTF-8" to your javac command to ensure javac reads it correctly.

But I think you're wondering about how to embed the string using "\uXXXX" syntax, perhaps to avoid any issues with the source file encoding. This short code snippet will probably work for you; it crudely assumes any character whose UTF-16 value is over 255 needs to be escaped.

public static void main(String[] args) {
  String s = args[0];
  for (int i = 0; i < s.length(); i++) {
    char c = s.charAt(i);
    if (c < 256) {
      // Latin-1 range: print the character unchanged
      System.out.print(c);
    } else {
      // Zero-pad to four hex digits; a Unicode escape needs exactly four
      System.out.printf("\\u%04x", (int) c);
    }
  }
}

Sean Owen 2010-01-17 12:30:31

@Sean Owen - I would opt for any value over U+007F (127).

McDowell 2010-01-17 14:21:49
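
A possible variant along those lines, offered as a sketch rather than a definitive version: it escapes every UTF-16 code unit above U+007F and returns the result as a string, so the output is pure ASCII. The class and method names are invented for this example.

```java
final class AsciiEscaper {
    // Returns the input with every char above U+007F replaced by a
    // backslash, the letter u, and exactly four hex digits.
    static String escapeNonAscii(String s) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c <= 0x7F) {
                out.append(c);
            } else {
                // %04x zero-pads, so U+00C0 becomes 00c0 rather than c0
                out.append(String.format("\\u%04x", (int) c));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String input = args.length > 0 ? args[0] : "\u0936\u094d\u0930\u093e\u0935\u0923";
        System.out.println(escapeNonAscii(input));
    }
}
```

Because it works on UTF-16 code units, a character outside the Basic Multilingual Plane comes out as two escapes (a surrogate pair), which is the form Java source expects.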

Answer 5

+2 A:

The source files were getting encoded using "MacRoman" (found this from Project Properties -> Resource -> Text file encoding). I changed it to "UTF-8" and then tried embedding the actual non-english string to the program and tried printing. it worked.

You were perhaps corrupting data either on save or during compilation. Source code doesn't carry any intrinsic encoding information, so it is easy to corrupt string literals that contain characters outside the basic "ASCII" range. Consider using Unicode escape sequences in your source files to avoid this problem. You either do that or you ensure that anyone who comes into contact with the source handles it appropriately at all times - the first way is easier.

If this is for a commercial application, consider externalizing the strings to a resource file.

McDowell 2010-01-17 14:19:36
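
As a small sketch of the two suggestions above taken together (escape sequences plus externalized strings): the properties format understands the same four-digit escapes, so a resource file can stay pure ASCII. The key name "label" and the class name are hypothetical, and the properties text is read from a string here only to keep the example self-contained; in a real project it would more likely live in a .properties file.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.PropertyResourceBundle;
import java.util.ResourceBundle;

final class ExternalizedStringDemo {
    // Properties-format text as it might sit in a resource file: pure ASCII,
    // with each non-ASCII character written as a four-digit hex escape.
    static final String PROPERTIES_TEXT =
        "label=\\u0936\\u094d\\u0930\\u093e\\u0935\\u0923\n";

    // Parses properties-format text into a resource bundle.
    static ResourceBundle load(String propertiesText) {
        try {
            return new PropertyResourceBundle(new StringReader(propertiesText));
        } catch (IOException e) {
            throw new IllegalStateException("Could not read properties text", e);
        }
    }

    public static void main(String[] args) {
        String label = load(PROPERTIES_TEXT).getString("label");
        System.out.println(label.length());  // 6
        // The same word written with escapes directly in a Java literal
        System.out.println(label.equals("\u0936\u094d\u0930\u093e\u0935\u0923"));  // true
    }
}
```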

I appreciate your point, but I am at a loss to fully understand: what problems exactly do you see with embedding the Unicode string? I thought this would be more legible; people touching the code would immediately see what they are doing, which wouldn't be the case with Unicode escape sequences. By the way, this is for a fun project. And thanks for those links.

Aadith 2010-01-17 17:27:54

For a range of (mostly) English alphabet characters, the byte values of encoded characters are the same for a file encoded as UTF-8, MacRoman or Windows-1252. But, a character like À (`\u00C0`) will be stored as the bytes `C3 80`, `CB` and `C0` respectively. Copying the file to another PC will require you to document the encoding so that other people can edit and compile the code correctly. That isn't a huge problem - and may be the best approach if you're writing a non-English application.

McDowell 2010-01-17 18:19:52
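
The byte values quoted in that comment can be checked with a short Java sketch. It assumes the runtime ships the "x-MacRoman" and "windows-1252" charsets, which is common but not guaranteed, so unsupported names are reported instead of failing. The class and method names are made up for illustration.

```java
import java.nio.charset.Charset;

final class EncodingBytesDemo {
    // Returns the encoded bytes of the text as space-separated hex,
    // or a placeholder if this JVM does not provide the charset.
    static String hexBytes(String text, String charsetName) {
        if (!Charset.isSupported(charsetName)) {
            return "(unsupported)";
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : text.getBytes(Charset.forName(charsetName))) {
            if (hex.length() > 0) {
                hex.append(' ');
            }
            hex.append(String.format("%02X", b & 0xFF));
        }
        return hex.toString();
    }

    public static void main(String[] args) {
        String aGrave = "\u00c0";
        for (String name : new String[] {"UTF-8", "x-MacRoman", "windows-1252"}) {
            System.out.println(name + ": " + hexBytes(aGrave, name));
        }
    }
}
```

A file saved in one of these encodings and read back in another turns those bytes into different characters, which is the corruption described earlier in this answer.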
