I've got some very basic code like

while (scan.hasNextLine())
{
    String temp = scan.nextLine();
    System.out.println(temp);
}

where scan is a Scanner over a file.

However, on one particular line, which is about 6k chars long, temp cuts out after something like 2470 characters. There's nothing special about when it cuts out; it's in the middle of the word "Australia." If I delete characters from the line, the place where it cuts out changes; e.g. if I delete characters 0-100 in the file then Scanner will get what was previously 100-2570.

I've used Scanner for larger strings before. Any idea what could be going wrong?

+6  A: 

At a guess, you may have a rogue character at the cut-off point: look at the file in a hex editor instead of just a text editor. Perhaps there's an embedded null character, or possibly \r in the middle of the string? It seems unlikely to me that Scanner.nextLine() would just chop it arbitrarily.

As another thought, are you 100% sure that it's not all there? Perhaps System.out.println is chopping the string - again due to some "odd" character embedded in it? What happens if you print temp.length()?

EDIT: I'd misinterpreted the bit about what happens if you cut out some characters. Sorry about that. A few other things to check:

  • If you read the lines with BufferedReader.readLine() instead of Scanner, does it get everything?
  • Are you specifying the right encoding? I can't see why this would show up in this particular way, but it's something to think about...
  • If you replace all the characters in the line with "A" (in the file) does that change anything?
  • If you add an extra line before this line (or remove a line before it) does that change anything?

Failing all of this, I'd just debug into Scanner.nextLine() - one of the nice things about Java is that you can debug into the standard libraries.
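The first two checks in the list above could be sketched roughly as follows: read the same file once with Scanner and once with BufferedReader, both with an explicit charset, and compare the line lengths. The class name `LineLengthCheck`, its two helper methods, and the choice of ISO-8859-1 are assumptions for illustration only; substitute whatever encoding the file is believed to use.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

// Hypothetical helper class: compares what Scanner and BufferedReader see.
class LineLengthCheck {

    // Length of every line as returned by Scanner.nextLine().
    static List<Integer> scannerLengths(Path file, Charset cs) throws IOException {
        List<Integer> lengths = new ArrayList<>();
        try (Scanner scan = new Scanner(file, cs.name())) {
            while (scan.hasNextLine()) {
                lengths.add(scan.nextLine().length());
            }
        }
        return lengths;
    }

    // Length of every line as returned by BufferedReader.readLine().
    static List<Integer> readerLengths(Path file, Charset cs) throws IOException {
        List<Integer> lengths = new ArrayList<>();
        try (BufferedReader in = Files.newBufferedReader(file, cs)) {
            String line;
            while ((line = in.readLine()) != null) {
                lengths.add(line.length());
            }
        }
        return lengths;
    }

    public static void main(String[] args) throws IOException {
        Path file = Paths.get(args[0]);
        // Assumption: the file is ISO-8859-1; change this to test other encodings.
        Charset cs = StandardCharsets.ISO_8859_1;
        System.out.println("Scanner:        " + scannerLengths(file, cs));
        System.out.println("BufferedReader: " + readerLengths(file, cs));
    }
}
```

If the two lists disagree, or a line is shorter than expected, that narrows down where the text is being lost. Scanner also treats a few extra characters (U+0085, for instance) as line terminators, so small differences in line counts are not necessarily a bug.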

Jon Skeet
It's definitely not all there when I print out the length. For context, this is a .csv file exported from Excel that I'm editing in vim. I don't think there are any special characters in there; as I said, if I delete characters, the cut-off point changes. So while it cuts off in the middle of "Australia", if I delete a hundred characters somewhere before "Australia", then "Australia" and the next ~90 characters after it print just fine. The same thing happens on the next line, only it cuts off at 112 rather than 2470. These are the only two lines that don't work. Some of the other lines are longer.
Ventrue
Just took a look at it in a hex editor and it's fine, just ascii values. The second line cuts out between a 't' and an apostrophe.
Ventrue
@Ventrue: LOL - I'd *just* added an edit to resuggest using a hex editor. Hmm. I've added a few other suggestions - but the "debugging into it" may turn out to be what you need...
Jon Skeet
Oh boy, it was the charset. It was being read as ASCII, but the file is ISO-LATIN. Thanks so much.
Ventrue
@Ventrue: Woot! Admittedly that's a pretty odd failure mode - was the apostrophe a non-ASCII one?
Jon Skeet
@Ventrue - `Scanner` tends to swallow errors, but you can check for them using the `ioException()` method.
McDowell
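
Putting the two findings from the comments together (pass the charset explicitly, and check `ioException()` once reading stops), a minimal sketch might look like the following. `CheckedScan` and `readLines` are hypothetical names, and ISO-8859-1 is assumed to be the encoding meant by "ISO-LATIN"; neither is confirmed by the thread.

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

// Hypothetical helper: reads all lines and surfaces errors Scanner swallowed.
class CheckedScan {

    static List<String> readLines(File file, String charsetName) throws IOException {
        List<String> lines = new ArrayList<>();
        // Name the charset explicitly instead of relying on the platform default.
        try (Scanner scan = new Scanner(file, charsetName)) {
            while (scan.hasNextLine()) {
                lines.add(scan.nextLine());
            }
            // Scanner reports underlying read failures only through ioException(),
            // so rethrow anything it recorded rather than silently stopping early.
            IOException swallowed = scan.ioException();
            if (swallowed != null) {
                throw swallowed;
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        // Assumption: the file is ISO-8859-1 (Latin-1).
        for (String line : readLines(new File(args[0]), "ISO-8859-1")) {
            System.out.println(line.length() + ": " + line);
        }
    }
}
```

Printing the length next to each line makes a truncated line easy to spot, and a rethrown exception points at the cause rather than leaving a short string as the only symptom.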