Someone told me the reason for `int` instead of `char` is that a `char` in Java is only 2 bytes long, which is fine for most characters already in use, but certain characters (Chinese, for example) need more than 2 bytes to represent, and hence we use `int` instead.
Assuming that at this point you are talking specifically about the `Reader.read()` method, the statement from "someone" that you have recounted is in fact incorrect.

It is true that some Unicode code points have values greater than 65535 and therefore cannot be represented as a single Java `char`. However, the `Reader` API actually produces a sequence of Java `char` values (or -1), not a sequence of Unicode code points. This is clearly stated in the javadoc.
If your input includes a (suitably encoded) Unicode code point that is greater than 65535, then you will actually need to call the `read()` method twice to see it. What you will get is a UTF-16 surrogate pair; i.e. two Java `char` values that together represent the code point. In fact, this fits in with the way that the Java `String`, `StringBuilder` and `StringBuffer` classes all work; they all use a UTF-16 based representation ... with embedded surrogate pairs.
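Here is a minimal sketch of that behaviour (the class and variable names are just for illustration, and U+1D11E, MUSICAL SYMBOL G CLEF, is used as an example code point above 65535):

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

public class SurrogatePairDemo {
    public static void main(String[] args) throws IOException {
        // U+1D11E cannot fit in one char, so Java stores it as the
        // UTF-16 surrogate pair \uD834 \uDD1E.
        Reader reader = new StringReader("\uD834\uDD1E");

        int first = reader.read();   // high surrogate: 0xD834
        int second = reader.read();  // low surrogate:  0xDD1E
        int eof = reader.read();     // -1: end of stream

        System.out.printf("first  = 0x%04X%n", first);
        System.out.printf("second = 0x%04X%n", second);
        System.out.println("eof    = " + eof);

        // The two chars combine back into the original code point.
        int codePoint = Character.toCodePoint((char) first, (char) second);
        System.out.printf("codePoint = U+%X%n", codePoint); // U+1D11E
    }
}
```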
The real reason that `Reader.read()` returns an `int` and not a `char` is to allow it to return `-1` to signal that there are no more characters to be read. The same logic explains why `InputStream.read()` returns an `int` and not a `byte`.
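To see why this matters in practice, here is the conventional read loop. The loop variable must be an `int`: narrowing the result to `char` before the test would turn the end-of-stream marker into the valid character `'\uFFFF'`, which never compares equal to `-1`, giving an infinite loop.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

public class ReadLoopDemo {
    public static void main(String[] args) throws IOException {
        try (Reader reader = new StringReader("abc")) {
            int c;  // int, not char: -1 must survive the comparison below
            while ((c = reader.read()) != -1) {
                System.out.println((char) c);
            }
            // read() returned -1: end of stream reached
        }
    }
}
```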
Hypothetically, I suppose that the Java designers could have specified that the read()
methods throw an exception to signal the "end of stream" condition. However, that would have just replaced one potential source of bugs (failure to test the result) with another (failure to deal with the exception). Besides, exceptions are relatively expensive, and an end of stream is not really an unexpected / exceptional event. In short, the current approach is better, IMO.
(Another clue to the 16-bit nature of the `Reader` API is the signature of the `read(char[], ...)` method. How would that deal with code points greater than 65535 if surrogate pairs weren't used?)
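As a quick sketch of that point (again using U+1D11E as an example input), the array overload simply delivers the surrogate pair as two slots in the buffer:

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

public class BulkReadDemo {
    public static void main(String[] args) throws IOException {
        // "A" followed by U+1D11E, which UTF-16 stores as two chars
        try (Reader reader = new StringReader("A\uD834\uDD1E")) {
            char[] buf = new char[8];
            int n = reader.read(buf, 0, buf.length);
            System.out.println(n); // 3: 'A' plus the two surrogate chars
            System.out.printf("0x%04X 0x%04X 0x%04X%n",
                    (int) buf[0], (int) buf[1], (int) buf[2]);
            // prints 0x0041 0xD834 0xDD1E
        }
    }
}
```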
EDIT
The case of `DataOutputStream.writeChar(int)` does seem a bit strange. However, the javadoc clearly states that the argument is written as a 2-byte value. And in fact, the implementation clearly writes only the bottom two bytes to the underlying stream.
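You can observe that truncation directly; in this small sketch, a value wider than 16 bits is passed in and only the low-order two bytes come out:

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class WriteCharDemo {
    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);

        // 0x1D11E needs more than 16 bits; the top byte is silently
        // dropped and only 0xD11E is written, high byte first.
        out.writeChar(0x1D11E);
        out.flush();

        byte[] written = bytes.toByteArray();
        System.out.printf("%d bytes: %02X %02X%n",
                written.length, written[0] & 0xFF, written[1] & 0xFF);
        // prints "2 bytes: D1 1E"
    }
}
```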
I don't think that there is a good reason for this. Anyway, there is a bug database entry for this (4957024), which is marked as "11-Closed, Not a Defect" with the following comment:

"This isn't a great design or excuse, but it's too baked in for us to change."

... which is kind of an acknowledgement that it is a defect, at least from the design perspective.
But this is not something worth making a fuss about, IMO.