Hi Folks,

Why do some methods that write bytes/chars to streams take an int instead of a byte/char?

Someone told me, in the case of int instead of char: because a char in Java is only 2 bytes long, which is fine for most character symbols already in use, but certain character symbols (Chinese or whatever) are represented in more than 2 bytes, and hence we use int instead.

How close is this explanation to the truth?

EDIT: I use the word "stream" to mean both binary and character streams (not just binary streams).

Thanks.

A: 

It's correct that the maximum possible code point is 0x10FFFF, which doesn't fit in a char. However, the stream methods are byte-oriented, while the writer methods are 16-bit. OutputStream.write(int) writes a single byte, and Writer.write(int) only looks at the low-order 16 bits.
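
If it helps, here is a minimal sketch of that difference (the file names are just for illustration):

import java.io.*;

public class LowBitsDemo {
    public static void main(String[] args) throws IOException {
        // OutputStream.write(int) keeps only the low-order 8 bits of its argument
        OutputStream out = new FileOutputStream("bytes.bin");   // hypothetical file
        out.write(0x1FF);    // writes the single byte 0xFF
        out.close();

        // Writer.write(int) keeps only the low-order 16 bits
        Writer w = new FileWriter("chars.txt");                  // hypothetical file
        w.write(0x10041);    // writes the char 0x0041, i.e. 'A'
        w.close();
    }
}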

Matthew Flaschen
please see my edit
Mohammed
please see my comment
fuzzy lollipop
+3  A: 

I'm not sure exactly what you're referring to, but perhaps you are thinking of InputStream.read()? It returns an int instead of a byte because the return value is overloaded to also represent end of stream, which is represented as -1. Since there are 257 different possible return values, a byte is insufficient.
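
For example, the usual copy loop relies on exactly that distinction; this is just a sketch, with the streams passed in by the caller:

import java.io.*;

public class CopyLoop {
    public static void copy(InputStream in, OutputStream out) throws IOException {
        int b;
        // read() returns 0..255 for a data byte, or -1 at end of stream,
        // which is why its return type has to be int rather than byte
        while ((b = in.read()) != -1) {
            out.write(b);    // write(int) accepts the value straight back
        }
    }
}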

Otherwise, perhaps you could come up with some more specific examples.

Mark Byers
A: 

In Java, Streams are for raw bytes. To write characters, you wrap a Stream in a Writer.

While Writers do have write(int) (which writes the low 16 bits; the parameter is an int because byte is too small, and short is too small because it is signed), you should be using write(char[]) or write(String) instead.
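
A rough sketch of that wrapping (the file name and encoding are just examples):

import java.io.*;

public class WriterDemo {
    public static void main(String[] args) throws IOException {
        // Wrap the raw byte stream in a Writer that takes care of the character encoding
        OutputStream raw = new FileOutputStream("out.txt");       // hypothetical file
        Writer writer = new OutputStreamWriter(raw, "UTF-8");

        writer.write("Hello");                  // preferred: write(String)
        writer.write(new char[] {'!', '\n'});   // or write(char[])
        writer.close();                         // flushes and closes the underlying stream
    }
}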

R. Bemrose
A: 

Probably to be symmetric with the read() method, which returns an int. Nothing serious.

irreputable
+2  A: 

There are a few possible explanations.

First, as a couple of people have noted, it might be because read() necessarily returns an int, and so it can be seen as elegant to have write() accept an int to avoid casting:

int read = in.read();
if ( read != -1 )
   out.write(read);
//vs
   out.write((byte)read);

Second, it might just be nice to avoid other cases of casting:

//write a char (big-endian)
char c;
out.write(c >> 8);
out.write(c);

//vs
out.write( (byte)(c >> 8) );
out.write( (byte)c );
Mark Peters
+4  A: 

Someone told me, in the case of int instead of char: because a char in Java is only 2 bytes long, which is fine for most character symbols already in use, but certain character symbols (Chinese or whatever) are represented in more than 2 bytes, and hence we use int instead.

Assuming that at this point you are talking specifically about the Reader.read() method, the statement from "someone" that you have recounted is in fact incorrect.

It is true that some Unicode codepoints have values greater than 65535 and therefore cannot be represented as a single Java char. However, the Reader API actually produces a sequence of Java char values (or -1), not a sequence of Unicode codepoints. This is clearly stated in the javadoc.

If your input includes a (suitably encoded) Unicode code point that is greater than 65535, then you will actually need to call the read() method twice to see it. What you will get will be a UTF-16 surrogate pair; i.e. two Java char values that together represent the codepoint. In fact, this fits in with the way that the Java String, StringBuilder and StringBuffer classes all work; they all use a UTF-16 based representation ... with embedded surrogate pairs.
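
To see this in action, here is a small sketch; the code point U+1D11E is an arbitrary example above 65535:

import java.io.*;

public class SurrogateDemo {
    public static void main(String[] args) throws IOException {
        // Build a string containing one code point above 65535 (U+1D11E, arbitrary example)
        String s = new String(Character.toChars(0x1D11E));
        System.out.println(s.length());          // 2 -- stored as a surrogate pair

        Reader reader = new StringReader(s);
        int first = reader.read();               // high surrogate: 0xD834
        int second = reader.read();              // low surrogate:  0xDD1E
        System.out.printf("%04X %04X%n", first, second);
        System.out.println(reader.read());       // -1 -- end of stream
    }
}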

The real reason that Reader.read() returns an int not a char is to allow it to return -1 to signal that there are no more characters to be read. The same logic explains why InputStream.read() returns an int not a byte.

Hypothetically, I suppose that the Java designers could have specified that the read() methods throw an exception to signal the "end of stream" condition. However, that would have just replaced one potential source of bugs (failure to test the result) with another (failure to deal with the exception). Besides, exceptions are relatively expensive, and an end of stream is not really an unexpected / exceptional event. In short, the current approach is better, IMO.

(Another clue to the 16 bit nature of the Reader API is the signature of the read(char[], ...) method. How would that deal with codepoints greater than 65535 if surrogate pairs weren't used?)

EDIT

The case of DataOutputStream.writeChar(int) does seem a bit strange. However, the javadoc clearly states that the argument is written as a 2-byte value, and in fact the implementation clearly writes only the bottom two bytes to the underlying stream.
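
A small sketch of that behaviour; the value 0x12345 is an arbitrary example, and only 0x2345 reaches the stream:

import java.io.*;

public class WriteCharDemo {
    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buf);

        out.writeChar(0x12345);   // takes an int, but only the low 16 bits are written
        out.close();

        for (byte b : buf.toByteArray()) {
            System.out.printf("%02X ", b & 0xFF);   // prints: 23 45
        }
    }
}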

I don't think that there is a good reason for this. Anyway, there is a bug database entry for this (4957024), which is marked as "11-Closed, Not a Defect" with the following comment:

"This isn't a great design or excuse, but it's too baked in for us to change."

... which is kind of an acknowledgement that it is a defect, at least from the design perspective.

But this is not something worth making a fuss about, IMO.

Stephen C
Don't forget that both the designers and the original intended audience of Java were used to C, so they made read() as similar to fgetc() as they could get away with.
Licky Lindsay
Umm ... I disagree. If they had wanted to make it slavishly similar, they would have called the method `getC` or something. I'm sure that the designers were *informed* by the C libraries, but there are *lots* of indications that they did not set out to *imitate* them.
Stephen C
Your explanation of why the `read` methods return an int is clear. So, could you please explain why the DataOutputStream#writeChar method takes an int? http://java.sun.com/javase/6/docs/api/java/io/DataOutputStream.html#writeChar%28int%29
Mohammed
Don't forget that read() has to return an out-of-band value as well as in-band values. This is the true explanation, your guesswork about fgetc notwithstanding.
EJP
Any update Mr Stephen C?
Mohammed
I answered your question about DataOutputStream#writeChar; see the EDIT. Was there something else you were expecting?
Stephen C