



I'm trying to read UTF-8 from a text file and do some tokenization, but I'm having issues with the encoding:

try {
    fis = new FileInputStream(fName);
} catch (FileNotFoundException ex) {
    // ...
}

DataInputStream myInput = new DataInputStream(fis);
try {
    while ((thisLine = myInput.readLine()) != null) {
        StringTokenizer st = new StringTokenizer(thisLine, ";");
        while (st.hasMoreElements()) {
            // do something with st.nextToken();
        }
    }
} catch (Exception e) {
    // ...
}

and DataInputStream doesn't have any parameters to set the encoding!

+1  A: 

Why not use InputStreamReader and specify the encoding? You can then wrap it in a BufferedReader to get the readLine() capability.
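
A minimal sketch of that approach, assuming Java 7 or later (for StandardCharsets and try-with-resources). The class name Utf8TokenReader, the helper readTokens and the "input.txt" default are made up for illustration; only the InputStreamReader/BufferedReader wrapping comes from the suggestion above.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

class Utf8TokenReader {

    // Hypothetical helper: decodes the stream as UTF-8 and splits each line on ';'.
    static List<String> readTokens(InputStream in) throws IOException {
        List<String> tokens = new ArrayList<>();
        // InputStreamReader does the byte-to-char decoding; BufferedReader adds readLine().
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                StringTokenizer st = new StringTokenizer(line, ";");
                while (st.hasMoreTokens()) {
                    tokens.add(st.nextToken());
                }
            }
        }
        return tokens;
    }

    public static void main(String[] args) throws IOException {
        // The file name comes from the command line; "input.txt" is only a placeholder.
        String fName = args.length > 0 ? args[0] : "input.txt";
        try (InputStream in = Files.newInputStream(Paths.get(fName))) {
            System.out.println(readTokens(in));
        }
    }
}
```

Passing the charset explicitly means the result does not depend on the platform default encoding.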

Brian Agnew
+4  A: 

You can use InputStreamReader:

BufferedReader br = new BufferedReader(new InputStreamReader(source, charset));
String line;
while ((line = br.readLine()) != null) { ... }

You can also try Scanner, but I'm not sure that it would work well.
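
For what it's worth, Scanner does have constructors that take a charset name, so something along these lines should be possible. This is only a sketch: the class name ScannerTokens, the helper readTokens and the sample input are invented here, and the delimiter pattern is one assumption about how the tokens should be split.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

class ScannerTokens {

    // Hypothetical helper: reads ';'-separated tokens from a UTF-8 encoded stream.
    static List<String> readTokens(InputStream in) {
        List<String> tokens = new ArrayList<>();
        // The second constructor argument names the charset used to decode the bytes.
        try (Scanner scanner = new Scanner(in, "UTF-8")) {
            // Treat any run of ';' and line-break characters as one separator,
            // so no empty tokens are produced.
            scanner.useDelimiter("[;\\r\\n]+");
            while (scanner.hasNext()) {
                tokens.add(scanner.next());
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Sample data, made up for the sketch.
        byte[] sample = "a;b\nc;d\n".getBytes(StandardCharsets.UTF_8);
        System.out.println(readTokens(new ByteArrayInputStream(sample)));
    }
}
```

One thing to keep in mind: Scanner does not throw IOException from its reading methods; the last one it hit is available from scanner.ioException().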

+5  A: 

Let me quote the Javadoc for DataInputStream.readLine():


Deprecated. This method does not properly convert bytes to characters. As of JDK 1.1, the preferred way to read lines of text is via the BufferedReader.readLine() method. Programs that use the DataInputStream class to read lines can be converted to use the BufferedReader class by replacing code of the form:

     DataInputStream d = new DataInputStream(in);

with:

     BufferedReader d
          = new BufferedReader(new InputStreamReader(in));

BTW: JDK 1.1 came out in Feb 1997 so this shouldn't be new to you.

Just think how much time everyone would have saved if you had read the Javadoc. ;)

Peter Lawrey

When you are reading text (not binary data), you should use a Reader (not an InputStream). You can then specify the encoding for the VM with -Dfile.encoding=utf-8, and a Reader created without an explicit charset (such as FileReader) will use that default, so you can even switch the encoding easily. You can wrap a FileReader in a BufferedReader to get readLine(). The readLine() method only has meaning when reading text; otherwise the line endings are just bytes.

Norbert Hartl
Changing the default encoding via the command line (-Dfile.encoding=...) is OK for small utilities, but can have unwanted side-effects for interactions with the system - affecting System.out, for example.
To me it sounded like a little utility, so you gain a lot of flexibility by letting Java do the magic. You are right that it is not a good idea to switch the encoding in a bigger application, but having hard-coded encodings throughout your code is not much better. And not specifying file.encoding, which means it is taken from the system, doesn't save you from side effects either.
Norbert Hartl
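
A possible middle ground between the two positions in these comments is to keep the encoding configurable without touching file.encoding. The sketch below is just one way to do that: the class ConfigurableReader, the property name app.input.encoding and the UTF-8 fallback are all assumptions made up for illustration, not anything either commenter proposed.

```java
import java.io.BufferedReader;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ConfigurableReader {

    // Hypothetical property name, chosen for this sketch only.
    static final String ENCODING_PROPERTY = "app.input.encoding";

    // Reads the input encoding from an application-specific system property,
    // falling back to UTF-8, so System.out and the rest of the VM are unaffected.
    static Charset inputCharset() {
        String name = System.getProperty(ENCODING_PROPERTY);
        return name != null ? Charset.forName(name) : StandardCharsets.UTF_8;
    }

    // The charset is decided in one place and passed in, not hard-coded at each call site.
    static BufferedReader open(InputStream in, Charset charset) {
        return new BufferedReader(new InputStreamReader(in, charset));
    }
}
```

It would be used as ConfigurableReader.open(stream, ConfigurableReader.inputCharset()), with -Dapp.input.encoding=ISO-8859-1 on the command line to override the fallback.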

One very simple way:

File myFile = ...
List<String> lines = FileUtils.readLines(myFile, "UTF-8");

for (String line : lines) {
    StringTokenizer st = new StringTokenizer(line, ";");
    while (st.hasMoreElements()) {
        // do something with st.nextToken();
    }
}

Where FileUtils is from Apache Commons IO.

Know and use the libraries, as I've become fond of saying, it seems. (Of course, reading the whole file at once may not suit all situations.)

Yeah, sadly Commons IO doesn't support generics, so the compiler would warn about the assignment to List<String>.

StringTokenizer is an extremely simple class for text tokenization. I can only recommend it for tasks that do not need to identify the tokens any further (e.g. using a dictionary lookup) and that will only be used for Western languages.

For more advanced cases involving Western languages, a simple tokenizer can be written based on Unicode character classes (this will pick up many kinds of whitespace, delimiting characters, etc.) and then extended using regexes to catch special cases (like 'that's', 'C++'...).
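
A rough sketch of what such a tokenizer might look like with java.util.regex. The class name SimpleTokenizer and the exact pattern are assumptions for illustration; a real one would need a longer list of special cases.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SimpleTokenizer {

    // Special cases are listed first so they win over the generic word rule.
    private static final Pattern TOKEN = Pattern.compile(
            "C\\+\\+"                            // special case: keep "C++" in one piece
            + "|[\\p{L}\\p{N}]+(?:'\\p{L}+)*");  // Unicode letters/digits, plus suffixes like 's

    // Returns every match of TOKEN in order; anything else (whitespace,
    // punctuation) is treated as a delimiter and dropped.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("I think that's C++, right?"));
        // prints [I, think, that's, C++, right]
    }
}
```

Because the word rule uses \p{L} and \p{N} rather than [A-Za-z0-9], accented letters stay inside their words without any extra handling.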
