Hello all,

I am working on a TCP/IP socket listener that listens on port 80 for data arriving from remote hosts. The incoming data is in an unreadable format, so I first save it as-is in a string, then convert that string to a character array, and then convert the content at every index of the array to hex. Most of the data is converted to hex correctly, but in some places the conversion is wrong and the resulting hex is 'fffd': where the result should be 'bc' (0xBC), it is 'fffd' (0xFF 0xFD). I am forced to believe that some parts of the incoming data are not being read properly by my Java program. I'm using BufferedInputStream and InputStreamReader to read the incoming data, and I check for the end of the stream in the following way.

  BufferedInputStream is = new BufferedInputStream(connection.getInputStream());
  InputStreamReader isr = new InputStreamReader(is);
  while(isr.read()!=-1)
  {
  ...
  }

where 'connection' is a socket object.

The input data that I'm getting through the socket is #SR,IN-0002005,10:49:37,16/01/2010, $<49X ™™š@(bN>™™šBB ©: 4ä ýÕ 01300>ÀäCåKöA÷Л.

The hex conversion that my program does has 'fffd' in many places where other hex values should be. The conversion is correct, though, for around 60% of the input string.

Any pointers on why my resulting hex conversion is not what it should be would be of great help.

+4  A: 

I don't think you should be using a Reader. Readers are for reading characters, but you seem to be working with binary data. Use the InputStream directly and transform the bytes as you receive them. chars in Java are Unicode characters, which I am guessing is the source of your issues.

Thomas Lötzer
If it's not a problem, could you please help me with a small snippet demonstrating this?
ping
@ping From the code in your question, just remove the line where you create the InputStreamReader and replace all references to that Reader with references to the InputStream, e.g. `while(isr.read()!=-1)` becomes `while(is.read()!=-1)`. You will probably need to store the return value of read somewhere, though, because that is the byte that was read, e.g. `while((nextByte = is.read())!=-1)`
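A minimal sketch of the loop described in this comment, assuming the bytes only need to be turned into a hex string. The class name `RawByteHex` and the helper `toHex` are hypothetical, and a `ByteArrayInputStream` stands in for the socket stream from the question:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

public class RawByteHex {

    // Reads the stream byte by byte, with no Reader in between,
    // and returns the bytes as a lowercase hex string.
    static String toHex(InputStream in) throws IOException {
        StringBuilder hex = new StringBuilder();
        int nextByte;
        // read() returns the byte as an int in 0..255, or -1 at end of stream
        while ((nextByte = in.read()) != -1) {
            hex.append(String.format("%02x", nextByte));
        }
        return hex.toString();
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for: new BufferedInputStream(connection.getInputStream())
        InputStream is = new ByteArrayInputStream(
                new byte[] {0x23, 0x53, (byte) 0xBC, (byte) 0xFD});
        System.out.println(toHex(is)); // prints 2353bcfd
    }
}
```

Because no character decoding takes place, the byte 0xBC comes out as `bc` instead of `fffd`.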
Thomas Lötzer
+2  A: 

Java Strings are not as easy to "abuse" for handling transparent binary data as strings are in VB (or most other languages). VB treats strings internally as an array of bytes, while in Java, Strings are an ordered list of characters.

In your case, you wrap your InputStream with an InputStreamReader, which causes your platform's default character encoding to be used when converting the bytes delivered by the InputStream into characters delivered by the InputStreamReader. Some of the most commonly used ISO 8859-X character sets do not assign characters to bytes in the ranges 0x00 to 0x1f and 0x7f to 0x9f, so if you are using such an encoding and read a byte that it cannot map, the InputStreamReader will return the "replacement character" with code point 0xfffd to indicate an unknown character.
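The effect can be reproduced in isolation. The sketch below decodes the single byte 0xBC through an InputStreamReader with two explicitly chosen charsets. Which default encoding the asker's platform actually uses is not known, so UTF-8 is only an assumption here, and the class and method names are hypothetical:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ReplacementCharDemo {

    // Decodes one byte through an InputStreamReader and returns the resulting char value.
    static int decodeSingleByte(int value, Charset charset) throws IOException {
        byte[] data = {(byte) value};
        try (InputStreamReader reader =
                new InputStreamReader(new ByteArrayInputStream(data), charset)) {
            return reader.read();
        }
    }

    public static void main(String[] args) throws IOException {
        // A lone 0xBC is not valid UTF-8, so the decoder substitutes U+FFFD
        System.out.println(Integer.toHexString(
                decodeSingleByte(0xBC, StandardCharsets.UTF_8)));      // prints fffd
        // ISO-8859-1 maps every byte to the code point with the same value
        System.out.println(Integer.toHexString(
                decodeSingleByte(0xBC, StandardCharsets.ISO_8859_1))); // prints bc
    }
}
```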

The only "correct" way is to leave out the InputStreamReader and use byte arrays for the binary data.
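As a rough sketch of the byte-array approach, the stream can be read in chunks into a `byte[]` and converted to hex afterwards. The names `ByteArrayRead`, `readAll` and `toHex` and the 4096-byte buffer size are illustrative assumptions, not part of the original code:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

public class ByteArrayRead {

    // Collects everything the stream delivers until end of stream.
    static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream collected = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        int count;
        // read(byte[]) returns the number of bytes stored, or -1 at end of stream
        while ((count = in.read(buffer)) != -1) {
            collected.write(buffer, 0, count);
        }
        return collected.toByteArray();
    }

    // Java bytes are signed, so mask with 0xFF before formatting.
    static String toHex(byte[] data) {
        StringBuilder hex = new StringBuilder(data.length * 2);
        for (byte b : data) {
            hex.append(String.format("%02x", b & 0xFF));
        }
        return hex.toString();
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the socket's input stream
        InputStream is = new ByteArrayInputStream(new byte[] {(byte) 0xBC, 0x41, (byte) 0xE4});
        System.out.println(toHex(readAll(is))); // prints bc41e4
    }
}
```

On a real socket, `readAll` returns only once the remote host closes its side of the connection, so a protocol with message framing would need its own stop condition.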

jarnbjo
A: 

When converting bytes to chars with an InputStreamReader, the encoding makes a huge difference:

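  import java.io.ByteArrayInputStream;
  import java.io.IOException;
  import java.io.InputStreamReader;

  // Class name chosen for illustration only
  public class EncodingCheck {

    public static void main(String[] args) throws Exception {
      checkEncoding("ISO-8859-1");
      checkEncoding("ISO-8859-9");
      checkEncoding("Windows-1252");
      checkEncoding("UTF-8");
      checkEncoding("UTF-16BE");
      checkEncoding("Big5");
      checkEncoding("Shift-JIS");
    }

    // Decodes the byte values 0x00..0xFF with the given encoding and prints
    // every position whose decoded char differs from the original byte value.
    private static void checkEncoding(String encoding) throws IOException {
      byte[] all = new byte[256];
      for ( int i = 0; i < all.length; ++i ) all[i] = (byte) i;
      ByteArrayInputStream bais = new ByteArrayInputStream(all);
      InputStreamReader isr = new InputStreamReader(bais, encoding);
      char[] ca = new char[256];
      // Number of chars actually decoded; smaller than 256 for multi-byte encodings
      int read = isr.read(ca);
      System.out.println(encoding + ":" + read);
      for ( int i = 0; i < read; ++i ) {
        if ( ca[i] != i ) {
          System.out.println(Integer.toHexString(i) + "->" +
              Integer.toHexString(ca[i]));
        }
      }
    }
  }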

The only one that works "as expected" is ISO-8859-1, which is defined to be the first 256 chars in Unicode. ISO-8859-9 and Windows-1252 also produce chars 1-for-1; 8859-9 has a few different characters, but 1252 has several 0xFFFDs.

Because of the way the bytes are arranged, everything after 0x7F for UTF-8 is no good. Of course, you get half the chars for UTF-16, and the other multi-byte encodings are a mess.

Ken
A: 

For development purposes look at the one in Eclipse already for use with those web containers with server connectors.

Thorbjørn Ravn Andersen