views:

717

answers:

6

I'm trying to read UTF-8 from a text file and do some tokenization, but I'm having issues with the encoding:

try {
    fis = new FileInputStream(fName);
} catch (FileNotFoundException ex) {
    //...
}

DataInputStream myInput = new DataInputStream(fis);
try {
    while ((thisLine = myInput.readLine()) != null) {
        StringTokenizer st = new StringTokenizer(thisLine, ";");
        while (st.hasMoreElements()) {
            // do something with st.nextToken();
        }
    }
} catch (Exception e) {
    //...
}

and DataInputStream doesn't have any parameters to set the encoding!

+1  A: 

Why not use InputStreamReader and specify the encoding? You can then wrap it in a BufferedReader to get the readLine() capability.

Brian Agnew
+4  A: 

You can use InputStreamReader:

BufferedReader br = new BufferedReader(new InputStreamReader(source, charset));
String line;
while ((line = br.readLine()) != null) { ... }

You can also try Scanner, but I'm not sure it would work as well.
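For what it's worth, Scanner does let you name the charset when reading a file (the two-argument constructor). A minimal sketch, with the tokenizing factored into a method that works on any Scanner; the semicolon delimiter mirrors the question:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

public class ScannerTokens {
    // Collects tokens from any Scanner, splitting on ';' and line breaks.
    static List<String> tokens(Scanner sc) {
        sc.useDelimiter("[;\\r\\n]+");
        List<String> result = new ArrayList<>();
        while (sc.hasNext()) {
            result.add(sc.next());
        }
        return result;
    }

    public static void main(String[] args) {
        // For a file, pass the charset explicitly:
        //   Scanner sc = new Scanner(new File(fName), "UTF-8");
        System.out.println(tokens(new Scanner("a;b;c"))); // prints [a, b, c]
    }
}
```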

Roman
+5  A: 

Let me quote the Javadoc for this method.

DataInputStream.readLine()

Deprecated. This method does not properly convert bytes to characters. As of JDK 1.1, the preferred way to read lines of text is via the BufferedReader.readLine() method. Programs that use the DataInputStream class to read lines can be converted to use the BufferedReader class by replacing code of the form:

     DataInputStream d = new DataInputStream(in);

with:

     BufferedReader d
          = new BufferedReader(new InputStreamReader(in));

BTW: JDK 1.1 came out in February 1997, so this shouldn't be new to you.

Just think how much time everyone would have saved if you had read the Javadoc. ;)

Peter Lawrey
A: 

When you are reading text (not binary data) you should use a Reader (not an InputStream). You can then set the encoding for the whole VM with -Dfile.encoding=utf-8; a Reader created without an explicit charset picks it up automatically, so you can even switch encodings without touching the code. Wrap a FileReader in a BufferedReader to get readLine(). The readLine() method is only meaningful when reading text; otherwise the line endings are just bytes.

Norbert Hartl
Changing the default encoding via the command line (-Dfile.encoding=...) is OK for small utilities, but can have unwanted side-effects for interactions with the system - affecting System.out, for example.
McDowell
To me it sounded like a little utility, so you gain a lot of flexibility by letting Java do the magic. You are right that switching the encoding is not a good idea in a bigger application, but hardcoding encodings throughout your code is not much better. And leaving file.encoding unspecified, so that it is taken from the system, doesn't save you from side effects either.
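The trade-off discussed above can be made concrete: a Reader built with an explicit Charset decodes the same way no matter what -Dfile.encoding says, while one built without it depends on the JVM default. A small illustrative sketch:

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingChoice {
    // Decodes one line of bytes with an explicit charset: immune to -Dfile.encoding.
    static String firstLine(byte[] data, Charset cs) throws IOException {
        try (BufferedReader br = new BufferedReader(
                new InputStreamReader(new ByteArrayInputStream(data), cs))) {
            return br.readLine();
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] utf8 = "héllo".getBytes(StandardCharsets.UTF_8);
        // Explicit UTF-8 always round-trips correctly:
        System.out.println(firstLine(utf8, StandardCharsets.UTF_8)); // prints héllo
        // Relying on the default is exactly what -Dfile.encoding changes:
        System.out.println("JVM default: " + Charset.defaultCharset());
    }
}
```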
Norbert Hartl
A: 

One very simple way:

File myFile = ...
List<String> lines = FileUtils.readLines(myFile, "UTF-8");

for (String line : lines) {
    StringTokenizer st = new StringTokenizer(line, ";");
    while (st.hasMoreElements()) {
        // do something with st.nextToken();
    }
}

Where FileUtils is from Apache Commons IO.

Know and use the libraries, as I've become fond of saying, it seems. (Of course, reading the whole file at once may not suit all situations.)

Jonik
Yeah, sadly Commons IO doesn't support generics, so the compiler will warn about the assignment to List<String>.
Jonik
A: 

StringTokenizer is an extremely simple class for text tokenization. I can only recommend it for tasks that do not need to further identify the tokens (e.g. via a dictionary lookup) and that will only be used with Western languages.

For more advanced cases in Western languages, a simple tokenizer can be written based on Unicode character classes (these pick up many kinds of whitespace, delimiting characters, etc.) and then extended with regexes to catch special cases (like "that's" or "C++").
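A sketch of that idea with java.util.regex: the pattern tries a couple of hard-coded special cases first, then falls back to runs of Unicode letters and digits (\p{L}, \p{N}). The special-case list here is purely illustrative; a real tokenizer would grow it as needed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SimpleTokenizer {
    // Special cases first ("C++", contractions like "that's"),
    // then any run of Unicode letters or digits.
    private static final Pattern TOKEN =
            Pattern.compile("C\\+\\+|[\\p{L}\\p{N}]+'[\\p{L}]+|[\\p{L}\\p{N}]+");

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("that's C++ code")); // prints [that's, C++, code]
    }
}
```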

pudo