I have a piece of test equipment, from which I can read data using an InputStream, which intersperses bytes and characters (organized into lines), e.g.:

TEST1
TEST2
500
{500 binary bytes follow here}
TEST3
TEST4
600
{600 binary bytes follow here}

I'd like to use BufferedReader so I can read a line at a time, but then switch to InputStream so I can read the binary bytes. But this neither seems to work nor seems like a good idea.

How can I do this? I can't get bytes from a BufferedReader, and if I use a BufferedReader on top of an InputStream, it seems like the BufferedReader "owns" the InputStream.

Edit: the alternative, just using an InputStream everywhere and having to convert bytes->characters and look for newlines, seems like it would definitely work but would also be a real pain.

+1  A: 

When using BufferedReader, you can just use String#getBytes() to get the bytes out of a String line. Don't forget to take character encoding into account. I recommend using UTF-8 all the time.

Just for your information: from the other side, if you only have bytes and you want to construct the characters, just use new String(bytes). Also don't forget to take the character encoding into account here.

[Edit] after all, it's a better idea to use BufferedInputStream and construct a byte buffer for a single line (fill until the byte matches the linebreak) and test if the character representation of its start matches with one of the predefined strings.
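
A minimal sketch of that byte-buffer approach, assuming the text lines use an encoding in which the byte 0x0A can only ever mean a newline (true for US-ASCII, ISO-8859-1 and UTF-8, but not for UTF-16). The class and method names `LineBytesDemo`, `readLineBytes` and `readExactly` are made up for illustration:

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

class LineBytesDemo {
    /** Collects bytes up to (not including) the next '\n'; returns null at end of stream. */
    static byte[] readLineBytes(InputStream in) throws IOException {
        ByteArrayOutputStream line = new ByteArrayOutputStream();
        int b;
        while ((b = in.read()) != -1) {
            if (b == '\n') {
                return line.toByteArray();
            }
            line.write(b);
        }
        return line.size() == 0 ? null : line.toByteArray();
    }

    /** Reads exactly len raw bytes, or fails if the stream ends early. */
    static byte[] readExactly(InputStream in, int len) throws IOException {
        byte[] buf = new byte[len];
        int off = 0;
        while (off < len) {
            int n = in.read(buf, off, len - off);
            if (n == -1) {
                throw new EOFException("stream ended after " + off + " of " + len + " bytes");
            }
            off += n;
        }
        return buf;
    }

    public static void main(String[] args) throws IOException {
        byte[] input = {'T', 'E', 'S', 'T', '1', '\n', '2', '\n', (byte) 0x80, (byte) 0xFF};
        InputStream in = new BufferedInputStream(new ByteArrayInputStream(input));

        // Assumption: the header lines are plain ASCII, so decoding them is lossless.
        String name = new String(readLineBytes(in), StandardCharsets.US_ASCII);
        int len = Integer.parseInt(new String(readLineBytes(in), StandardCharsets.US_ASCII));

        // The binary block is read from the same stream and never touches a decoder.
        byte[] payload = readExactly(in, len);
        System.out.println(name + ", read " + payload.length + " binary bytes");
    }
}
```

Only one stream object is ever used, so nothing can read ahead past the text and swallow part of the binary block.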

BalusC
that helps, but what if the byte string gets interpreted badly by bufferedReader? e.g. if BufferedReader uses 2-byte (16-bit) characters, and there's an odd # of bytes, and BufferedReader hangs because it's trying to read an extra byte that doesn't exist?
Jason S
Then construct InputStreamReader with proper encoding, like UTF-8.
BalusC
This is a really terrible idea. Do not treat arbitrary bytes as UTF-8 data (and how much data is read and decoded by the buffer?). Not all byte values are valid UTF-8. If a byte sequence is not valid, it will silently be replaced.
McDowell
As is mixing arbitrary bytes with characters and newlines.
BalusC
@BalusC - many binary formats do that (including the Java class format). But trying to treat binary data as character data is a mistake. Decoding operations (like those performed by a `Reader`) transform data, often irreversibly. I think we get a lot of this from C, where `char` and `octet` were often interchangeable.
McDowell
Truly a good point. Better would then be to use BufferedInputStream and construct a byte buffer for a single line and test if its start matches with one of the predefined strings.
BalusC
except how do you construct a byte buffer for a single line? You can't, you have to decode the input text character by character. If you just look at it as bytes and search for, say, a \n, there's the possibility of finding misaligned bytes that look like \n.
Jason S
A: 

I think I'm going to take a stab at using java.nio.ByteBuffer and ByteBuffer.asCharBuffer, which looks promising. Still have to look for newlines manually but at least it looks like it will handle the character translation properly.
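
A quick check of what `ByteBuffer.asCharBuffer` does with the bytes, as a sketch rather than a recommendation. It performs no charset decoding: each pair of bytes becomes one `char`, in the buffer's byte order (big-endian by default). The class and method names `AsCharBufferDemo` and `viewAsChars` are made up for illustration:

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

class AsCharBufferDemo {
    /** Views the bytes as chars via asCharBuffer; two bytes per char, no decoding. */
    static String viewAsChars(byte[] bytes) {
        return ByteBuffer.wrap(bytes).asCharBuffer().toString();
    }

    public static void main(String[] args) {
        byte[] ascii = "TEST".getBytes(StandardCharsets.US_ASCII);   // 4 bytes
        byte[] utf16 = "TEST".getBytes(StandardCharsets.UTF_16BE);   // 8 bytes

        // UTF-16BE input maps back to the original text.
        System.out.println(viewAsChars(utf16).equals("TEST"));

        // Single-byte text is paired up: 'T','E' becomes the one char U+5445.
        System.out.println(viewAsChars(ascii).length());
    }
}
```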

Jason S
Jason S
It would only have worked anyway if your text was UTF-16 encoded.
McDowell
A: 

BufferedReader has read(char[] cbuf, int off, int len). Can't you use that, convert the chars to bytes, and wrap the result in a ByteArrayInputStream?

EDIT: why would someone downvote that? Give a comment please. This works perfectly fine:

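    import java.io.BufferedReader;
    import java.io.ByteArrayInputStream;
    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.io.InputStreamReader;

    public class ReaderBytesTest {
        public static void main(String[] args) {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();

            try {
                // Build sample input: a name line, a length line, then that many raw bytes.
                bos.write("TEST1\n".getBytes());
                bos.write("10\n".getBytes());
                for (int i = 0; i < 10; i++)
                    bos.write(i);
                bos.write("TEST2\n".getBytes());
                bos.write("1\n".getBytes());
                bos.write(25);

                // Everything, including the binary part, is decoded with the default charset.
                ByteArrayInputStream bis = new ByteArrayInputStream(bos.toByteArray());
                BufferedReader br = new BufferedReader(new InputStreamReader(bis));

                while (br.ready()) {
                    String s = br.readLine();
                    String num = br.readLine();
                    int len = Integer.parseInt(num);
                    System.out.println(s + ", reading " + len + " bytes");
                    char[] cbuf = new char[len];
                    br.read(cbuf);
                    // Narrow each decoded char back to a byte.
                    byte[] bbuf = new byte[len];
                    for (int i = 0; i < len; i++)
                        bbuf[i] = (byte) cbuf[i];
                    for (byte b: bbuf)
                        System.out.print(b + " ");
                    System.out.println();
                }
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }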

Output:

TEST1, reading 10 bytes
0 1 2 3 4 5 6 7 8 9 
TEST2, reading 1 bytes
25
tulskiy
`Reader` classes transform data. All the data here passes through a `BufferedReader`. On an Ubuntu system (default charset UTF-8), the arbitrary byte sequence `c2 a3 c2 a3` would become the char values `00A3 00A3` (4 bytes become 2 chars). Decoding the bytes `80 81` using windows-1252 (default on English Windows) yields the char values `20ac fffd`. Even decoding as US-ASCII will lose data because ASCII only uses the first 7 bits of an octet.
McDowell
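
The data loss described in this comment can be checked with a small round trip: decode some arbitrary bytes to a `String`, encode them again with the same charset, and compare. This is only a sketch; `DecodeRoundTripDemo` and `roundTrip` are names invented for it, and it sticks to charsets every JVM is required to provide rather than windows-1252:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class DecodeRoundTripDemo {
    /** Decodes the bytes to a String, then encodes the String back with the same charset. */
    static byte[] roundTrip(byte[] bytes, Charset cs) {
        return new String(bytes, cs).getBytes(cs);
    }

    public static void main(String[] args) {
        // Two bytes that are not valid on their own in UTF-8 or US-ASCII.
        byte[] data = {(byte) 0x80, (byte) 0x81};
        Charset[] charsets = {
            StandardCharsets.UTF_8, StandardCharsets.US_ASCII, StandardCharsets.ISO_8859_1
        };
        for (Charset cs : charsets) {
            boolean same = Arrays.equals(data, roundTrip(data, cs));
            System.out.println(cs + " preserves the bytes: " + same);
        }

        // Valid UTF-8 still changes the count: four bytes decode to two chars.
        byte[] pounds = {(byte) 0xC2, (byte) 0xA3, (byte) 0xC2, (byte) 0xA3};
        System.out.println("chars decoded: " + new String(pounds, StandardCharsets.UTF_8).length());
    }
}
```

Of the three charsets tried here, only ISO-8859-1 hands back the original bytes, because it maps every byte value to exactly one char.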
OK, I see the problem. Choosing windows-1252 as the encoding corrupts bytes from 128 to 160.
tulskiy
Oops - in the interests of accuracy, what I should have said was that `InputStreamReader` transforms data. @Pilgrim - pretty much - the data is transformed to UTF-16; and windows-1252 values above 127 have different values in UTF-16.
McDowell
A: 

Take a look at the source code of LineNumberInputStream. The class itself has been deprecated, but it looks like this is exactly what you need here.

This class allows you to read byte lines and then use regular InputStream read methods.

If you don't want to drag deprecated code into your system just borrow some implementation details from it.

Alexander Pogrebnyak
"Deprecated. This class incorrectly assumes that bytes adequately represent characters."
Jason S
A: 

I don't have a good answer for the general case (so other answers are welcome), but if I assume input is ISO-8859-1 (8-bit chars) the following works for me, although I guess casting an 8-bit byte as char doesn't necessarily guarantee ISO-8859-1 either.

The existing InputStream.read(byte[] b) and InputStream.read(byte[] b, int ofs, int len) allow me to read bytes.

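import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class OctetCharStream extends InputStream {
    private final InputStream in;
    private static final Charset CHARSET = StandardCharsets.ISO_8859_1;

    public OctetCharStream(InputStream in)
    {
        this.in = in;
    }

    @Override public int read() throws IOException {
        return this.in.read();
    }

    /** Reads up to the next '\n'; returns null if the stream has already ended. */
    public String readLine() throws IOException
    {
        StringBuilder sb = new StringBuilder();
        while (true)
        {
            int b = read();
            // Check for end of stream before casting, otherwise the loop never ends.
            if (b == -1)
                return sb.length() == 0 ? null : sb.toString();
            if (b == '\n')
                break;
            /*
             *  cast from byte value to char:
             *  fine for 8-bit character sets
             *  but not good in general
             */
            sb.append((char) b);
        }
        return sb.toString();
    }

    /** Reads up to n bytes (fewer only at end of stream) and decodes them. */
    public String readCharacters(int n) throws IOException
    {
        byte[] b = new byte[n];
        int total = 0;
        // A single read() may return fewer bytes than requested, so keep going.
        while (total < n)
        {
            int i = read(b, total, n - total);
            if (i == -1)
                break;
            total += i;
        }
        return new String(b, 0, total, CHARSET);
    }
}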

Interestingly, when I tried using InputStreamReader alone rather than wrapping BufferedReader around it, the InputStreamReader.read() still buffers to some extent, by reading "greedily" more than one character even if you just want to pull out one character. So I couldn't use InputStreamReader to wrap an InputStream and try to use both the InputStream and InputStreamReader to read bytes/characters according to which one I needed at the moment.
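
The read-ahead is easy to observe with an in-memory stream, and the `InputStreamReader` Javadoc does say that more bytes may be read ahead from the underlying stream than the current read needs. A small sketch (the names `ReadAheadDemo` and `bytesLeftAfterOneChar` are made up, and how many bytes get consumed is an implementation detail):

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

class ReadAheadDemo {
    /** Reads one char through an InputStreamReader, then reports what is left underneath. */
    static int bytesLeftAfterOneChar(byte[] data) throws IOException {
        ByteArrayInputStream in = new ByteArrayInputStream(data);
        InputStreamReader reader = new InputStreamReader(in, StandardCharsets.ISO_8859_1);
        reader.read();            // ask for a single character
        return in.available();    // bytes the reader has not yet taken from the stream
    }

    public static void main(String[] args) throws IOException {
        byte[] data = new byte[100];
        System.out.println("total bytes: " + data.length);
        System.out.println("left after reading one char: " + bytesLeftAfterOneChar(data));
    }
}
```

If the reader consumed only what it needed, 99 bytes would remain; any smaller number means the rest is sitting in the reader's internal buffer, out of reach of the `InputStream`.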

Jason S
+1  A: 

Instead of using a Reader and InputStream and attempting to switch back and forth between the two, try using a callback interface with one method for binary data and another for character data. e.g.

interface MixedProcessor {
    void processBinaryData(byte[] bytes, int off, int len);
    void processText(String line);
}

Then have another "splitter" class that:

  • Decides which sections of the input are text and which are binary, and passes them to the corresponding processor method
  • Converts bytes to characters when required (with the help of a CharsetDecoder)

The splitter class might look something like this:

class Splitter {
    public Splitter(Charset charset) { /* ... */ }
    public void readFully(InputStream is, MixedProcessor processor) throws IOException  { /* ... */ }
}
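
One way the skeleton could be filled in, as a rough sketch only. It assumes the framing from the question (a text line containing nothing but a decimal count is followed by that many raw bytes) and a charset in which the byte 0x0A always means newline. For brevity it decodes each line with `new String(bytes, charset)` instead of a `CharsetDecoder`, and `SplitterDemo` is a made-up driver class:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

interface MixedProcessor {
    void processBinaryData(byte[] bytes, int off, int len);
    void processText(String line);
}

class Splitter {
    private final Charset charset;

    public Splitter(Charset charset) { this.charset = charset; }

    public void readFully(InputStream is, MixedProcessor processor) throws IOException {
        String line;
        while ((line = readLine(is)) != null) {
            processor.processText(line);
            // Assumed rule: a line of 1 to 9 digits announces a binary block of that size.
            if (line.matches("\\d{1,9}")) {
                byte[] block = new byte[Integer.parseInt(line)];
                int off = 0;
                while (off < block.length) {
                    int n = is.read(block, off, block.length - off);
                    if (n == -1) throw new EOFException("binary block truncated");
                    off += n;
                }
                processor.processBinaryData(block, 0, block.length);
            }
        }
    }

    /** Reads bytes up to '\n' and decodes just that line; null at end of stream. */
    private String readLine(InputStream is) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        int b;
        while ((b = is.read()) != -1 && b != '\n') buf.write(b);
        if (b == -1 && buf.size() == 0) return null;
        return new String(buf.toByteArray(), charset);
    }
}

class SplitterDemo {
    public static void main(String[] args) throws IOException {
        // The two-byte binary block deliberately contains 0x0A to show it is not split.
        byte[] input = {'T', 'E', 'S', 'T', '1', '\n', '2', '\n',
                        (byte) 0x80, (byte) 0x0A, 'T', 'E', 'S', 'T', '3', '\n'};
        new Splitter(StandardCharsets.ISO_8859_1).readFully(
            new ByteArrayInputStream(input), new MixedProcessor() {
                @Override public void processBinaryData(byte[] bytes, int off, int len) {
                    System.out.println("binary: " + len + " bytes");
                }
                @Override public void processText(String line) {
                    System.out.println("text: " + line);
                }
            });
    }
}
```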
finnw
hmm. interesting. this is like SAX instead of StAX for XML (push-processing vs. pull-processing). Callbacks would add significant amounts of complication to my specific application, but in general it might be useful.
Jason S