ansaurus

Question

Java: interspersing bytes and characters

Answer 1

+1 A:

When using BufferedReader, you can use String#getBytes() to get the bytes out of a String line. Don't forget to take character encoding into account: the no-argument getBytes() uses the platform default charset, so pass the encoding explicitly. I recommend using UTF-8 all the time.

Just for your information: going the other way, if you only have bytes and want to construct the characters, use new String(bytes). Again, take the character encoding into account and pass it explicitly rather than relying on the platform default.
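
A minimal sketch of the round trip described above, assuming UTF-8 on both sides. `StandardCharsets` requires Java 7 or later; the class and method names are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

class CharsetRoundTrip {
    // Encode a String to bytes with an explicit charset instead of the platform default.
    static byte[] encode(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    // Decode bytes back to a String with the same explicit charset.
    static String decode(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String line = "caf\u00e9";                 // 4 chars, the last one is non-ASCII
        byte[] bytes = encode(line);
        System.out.println(bytes.length);          // 5: U+00E9 takes two bytes in UTF-8
        System.out.println(decode(bytes).equals(line));
    }
}
```

This only round-trips cleanly when the bytes really are text in the chosen encoding; it says nothing about arbitrary binary data.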

[Edit] On reflection, it's a better idea to use BufferedInputStream and build a byte buffer for a single line (fill it until a byte matches the line break), then test whether the character representation of its start matches one of the predefined strings.
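
The byte-buffer approach from the edit could be sketched roughly as follows. It assumes the line break is the single byte 0x0A and that the predefined strings are plain ASCII, which holds for ASCII-compatible encodings such as UTF-8 or ISO-8859-1 but not for UTF-16. `ByteLineReader` and its methods are hypothetical names.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

class ByteLineReader {
    // Reads bytes up to (not including) the next '\n'. Returns null at end of stream.
    static byte[] readLineBytes(InputStream in) throws IOException {
        ByteArrayOutputStream line = new ByteArrayOutputStream();
        int b;
        while ((b = in.read()) != -1) {
            if (b == '\n') {
                return line.toByteArray();
            }
            line.write(b);
        }
        return line.size() > 0 ? line.toByteArray() : null;
    }

    // True if the line starts with the ASCII bytes of the given keyword.
    static boolean startsWith(byte[] line, String keyword) {
        byte[] prefix = keyword.getBytes(StandardCharsets.US_ASCII);
        if (line.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if (line[i] != prefix[i]) return false;
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = {'T', 'E', 'S', 'T', '1', '\n', (byte) 0x80, (byte) 0xFF};
        InputStream in = new BufferedInputStream(new ByteArrayInputStream(data));
        byte[] line = readLineBytes(in);
        System.out.println(startsWith(line, "TEST"));
        // The binary bytes after the line are still unread and untouched.
        System.out.println(in.read());
    }
}
```

Because no Reader is involved, the stream position after the line is exact, so the following binary section can be read from the same stream.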

BalusC 2009-11-04 19:03:05

that helps, but what if the byte string gets interpreted badly by bufferedReader? e.g. if BufferedReader uses 2-byte (16-bit) characters, and there's an odd # of bytes, and BufferedReader hangs because it's trying to read an extra byte that doesn't exist?

Jason S 2009-11-04 19:09:48

Then construct InputStreamReader with proper encoding, like UTF-8.

BalusC 2009-11-04 20:04:47

This is a really terrible idea. Do not treat arbitrary bytes as UTF-8 data (and consider how much data the buffer reads ahead and decodes). Not all byte sequences are valid UTF-8. If a byte sequence is not valid, it will silently be replaced with the replacement character U+FFFD.

McDowell 2009-11-04 20:06:08
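
To illustrate the silent replacement mentioned in the comment above, here is a small sketch that decodes two bytes that are not valid UTF-8 and encodes the result again. The class name is hypothetical; the behaviour shown is that of `new String(byte[], Charset)`, which substitutes malformed input rather than throwing.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class LossyDecodeDemo {
    // Decodes arbitrary bytes as UTF-8 and encodes the result again.
    static byte[] roundTrip(byte[] raw) {
        String decoded = new String(raw, StandardCharsets.UTF_8);
        return decoded.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] raw = {(byte) 0x80, (byte) 0x81};   // not a valid UTF-8 sequence
        String decoded = new String(raw, StandardCharsets.UTF_8);
        // The malformed bytes were replaced by U+FFFD during decoding.
        System.out.println(decoded.indexOf('\uFFFD') >= 0);
        // The original bytes cannot be recovered from the String.
        System.out.println(Arrays.equals(raw, roundTrip(raw)));
    }
}
```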

As is mixing arbitrary bytes with characters and newlines.

BalusC 2009-11-04 20:11:58

@BalusC - many binary formats do that (including the Java class format). But trying to treat binary data as character data is a mistake. Decoding operations (like those performed by a `Reader`) transform data, often irreversibly. I think we get a lot of this from C, where `char` and `octet` were often interchangeable.

McDowell 2009-11-04 20:27:05

Truly a good point. Better would then be to use BufferedInputStream and construct a byte buffer for a single line and test if its start matches with one of the predefined strings.

BalusC 2009-11-04 20:40:44

except how do you construct a byte buffer for a single line? You can't, you have to decode the input text character by character. If you just look at it as bytes and search for, say, a \n, there's the possibility of finding misaligned bytes that look like \n.

Jason S 2009-11-04 20:46:15

Answer 2

A:

I think I'm going to take a stab at using java.nio.ByteBuffer and ByteBuffer.asCharBuffer, which looks promising. Still have to look for newlines manually but at least it looks like it will handle the character translation properly.

Jason S 2009-11-04 19:12:24

Jason S 2009-11-04 19:38:06

It would only have worked anyway if your text was UTF-16 encoded.

McDowell 2009-11-04 20:00:56
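
A short sketch of why `ByteBuffer.asCharBuffer` only fits UTF-16 text: it views each pair of bytes as one 16-bit char (big-endian unless the buffer's order is changed) and performs no charset decoding. The class name is made up for illustration.

```java
import java.nio.ByteBuffer;

class AsCharBufferDemo {
    // Views the bytes as 16-bit chars using the buffer's default big-endian order.
    static String view(byte[] bytes) {
        return ByteBuffer.wrap(bytes).asCharBuffer().toString();
    }

    public static void main(String[] args) {
        byte[] utf16 = {0x00, 0x48, 0x00, 0x69};   // "Hi" encoded as UTF-16BE
        byte[] ascii = {0x48, 0x69};               // "Hi" encoded as ASCII / UTF-8
        System.out.println(view(utf16));           // Hi
        // The two ASCII bytes are fused into the single char U+4869.
        System.out.println(Integer.toHexString(view(ascii).charAt(0)));
    }
}
```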

Answer 3

A:

BufferedReader has read(char[] cbuf, int off, int len). Can't you use that, convert the chars to bytes, and wrap the result in a ByteArrayInputStream?

EDIT: why would someone downvote that? Give a comment please. This works perfectly fine:

    ByteArrayOutputStream bos = new ByteArrayOutputStream();

    try {
        bos.write("TEST1\n".getBytes());
        bos.write("10\n".getBytes());
        for (int i = 0; i < 10; i++)
            bos.write(i);
        bos.write("TEST2\n".getBytes());
        bos.write("1\n".getBytes());
        bos.write(25);

        ByteArrayInputStream bis = new ByteArrayInputStream(bos.toByteArray());
        BufferedReader br = new BufferedReader(new InputStreamReader(bis));

        while (br.ready()) {
            String s = br.readLine();
            String num = br.readLine();
            int len = Integer.valueOf(num);
            System.out.println(s + ", reading " + len + " bytes");
            char[] cbuf = new char[len];
            br.read(cbuf);
            byte[] bbuf = new byte[len];
            for (int i = 0; i < len; i++)
                bbuf[i] = (byte) cbuf[i];
            for (byte b: bbuf)
                System.out.print(b + " ");
            System.out.println();
        }
    } catch (IOException e) {
        e.printStackTrace();
    }

Output:

TEST1, reading 10 bytes
0 1 2 3 4 5 6 7 8 9 
TEST2, reading 1 bytes
25

tulskiy 2009-11-04 19:45:43

`Reader` classes transform data. All the data here passes through a `BufferedReader`. On an Ubuntu system (default charset UTF-8), the arbitrary byte sequence `c2 a3 c2 a3` would become the char values `00A3 00A3` (4 bytes become 2 chars). Decoding the bytes `80 81` using windows-1252 (the default on English Windows) yields the char values `20ac fffd`. Even decoding as US-ASCII will lose data, because ASCII only uses the first 7 bits of an octet.

McDowell 2009-11-04 22:55:11
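
The effect described in the comment above can be reproduced with a small sketch that pushes the same four bytes through an `InputStreamReader` under different charsets. Only charsets guaranteed by `StandardCharsets` are used here, so the windows-1252 case is left out; `ReaderTransformDemo` is a hypothetical name.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ReaderTransformDemo {
    // Reads all chars produced by an InputStreamReader over the given bytes.
    static String decodeAll(byte[] raw, Charset charset) throws IOException {
        Reader reader = new InputStreamReader(new ByteArrayInputStream(raw), charset);
        StringBuilder sb = new StringBuilder();
        int c;
        while ((c = reader.read()) != -1) {
            sb.append((char) c);
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        byte[] raw = {(byte) 0xC2, (byte) 0xA3, (byte) 0xC2, (byte) 0xA3};
        // UTF-8: each pair of bytes becomes one char, U+00A3.
        System.out.println(decodeAll(raw, StandardCharsets.UTF_8).length());
        // US-ASCII: bytes above 0x7F are unmappable and come out as U+FFFD.
        System.out.println(decodeAll(raw, StandardCharsets.US_ASCII).indexOf('\uFFFD') >= 0);
        // ISO-8859-1: one char per byte, values preserved.
        System.out.println(decodeAll(raw, StandardCharsets.ISO_8859_1).length());
    }
}
```

Casting the resulting chars back to bytes, as in the answer above, therefore only reproduces the input when the decoding happened to be a one-to-one mapping.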

OK, I see the problem. Choosing windows-1252 as the encoding corrupts bytes from 128 to 159.

tulskiy 2009-11-04 23:20:39

Oops - in the interests of accuracy, what I should have said was that `InputStreamReader` transforms data. @Pilgrim - pretty much - the data is transformed to UTF-16; and windows-1252 values above 127 have different values in UTF-16.

McDowell 2009-11-05 11:46:06

Answer 4

A:

Take a look at the source code of LineNumberInputStream. The class itself has been deprecated, but it looks like this is exactly what you need here.

This class allows you to read byte lines and then use regular InputStream read methods.

If you don't want to drag deprecated code into your system just borrow some implementation details from it.

Alexander Pogrebnyak 2009-11-04 19:50:23

"Deprecated. This class incorrectly assumes that bytes adequately represent characters."

Jason S 2009-11-04 20:47:13

Answer 5

A:

I don't have a good answer for the general case (so other answers are welcome), but if I assume input is ISO-8859-1 (8-bit chars) the following works for me, although I guess casting an 8-bit byte as char doesn't necessarily guarantee ISO-8859-1 either.

The existing InputStream.read(byte[] b) and InputStream.read(byte[] b, int ofs, int len) allow me to read bytes.

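import java.io.IOException;
import java.io.InputStream;

public class OctetCharStream extends InputStream {
    final private InputStream in;
    static final private String charSet = "ISO-8859-1";

    public OctetCharStream(InputStream in)
    {
        this.in = in;
    }

    @Override public int read() throws IOException {
        return this.in.read();
    }

    /** Reads up to the next '\n'; returns null if the stream is already at its end. */
    public String readLine() throws IOException
    {
        StringBuilder sb = new StringBuilder();
        while (true)
        {
            int b = read();
            if (b < 0)
            {
                // end of stream: return what was collected, or null if nothing was
                return sb.length() > 0 ? sb.toString() : null;
            }
            /*
             *  cast from byte value to char:
             *  fine for 8-bit character sets
             *  but not good in general
             */
            char c = (char) b;
            if (c == '\n')
                break;
            sb.append(c);
        }
        return sb.toString();
    }

    /** Reads up to n bytes and decodes them as ISO-8859-1. */
    public String readCharacters(int n) throws IOException
    {
        byte[] b = new byte[n];
        int i = read(b);
        if (i < 0)
            return "";  // end of stream, nothing to decode
        return new String(b, 0, i, charSet);
    }
}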
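
On the doubt raised above about whether the cast matches ISO-8859-1: a quick check, sketched below, suggests it does as long as the value being cast is the unsigned 0..255 result of `read()`. Casting a signed `byte` directly is a different matter, since it sign-extends. The class name is hypothetical.

```java
import java.nio.charset.StandardCharsets;

class Latin1CastCheck {
    // True if casting the unsigned byte value to char equals ISO-8859-1 decoding for all 256 values.
    static boolean castMatchesLatin1() {
        for (int value = 0; value < 256; value++) {
            String decoded = new String(new byte[] {(byte) value}, StandardCharsets.ISO_8859_1);
            if (decoded.length() != 1 || decoded.charAt(0) != (char) value) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(castMatchesLatin1());
        // Pitfall: a signed byte sign-extends when cast, so 0xE9 becomes U+FFE9, not U+00E9.
        byte signed = (byte) 0xE9;
        System.out.println(Integer.toHexString((char) signed));
        // Masking first gives the intended value.
        System.out.println(Integer.toHexString((char) (signed & 0xFF)));
    }
}
```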

Interestingly, when I tried using InputStreamReader alone rather than wrapping BufferedReader around it, the InputStreamReader.read() still buffers to some extent, by reading "greedily" more than one character even if you just want to pull out one character. So I couldn't use InputStreamReader to wrap an InputStream and try to use both the InputStream and InputStreamReader to read bytes/characters according to which one I needed at the moment.
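
The read-ahead behaviour described above can be observed with a sketch like the one below, which reads a single char through an `InputStreamReader` and then asks the underlying stream how many bytes are left. How much is consumed is an implementation detail of the JDK's decoder, so the exact number is not guaranteed; `GreedyReaderDemo` is a made-up name.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

class GreedyReaderDemo {
    // Returns how many bytes remain in the underlying stream after reading ONE char via a reader.
    static int remainingAfterOneChar(byte[] data) throws IOException {
        ByteArrayInputStream in = new ByteArrayInputStream(data);
        InputStreamReader reader = new InputStreamReader(in, StandardCharsets.ISO_8859_1);
        reader.read();            // ask for a single char only
        return in.available();    // bytes the reader has NOT yet pulled from the stream
    }

    public static void main(String[] args) throws IOException {
        byte[] data = new byte[100];
        // With one byte per char, a non-greedy reader would leave 99 bytes behind.
        System.out.println("remaining: " + remainingAfterOneChar(data));
    }
}
```

On common JDKs the reader drains far more than one byte, which is why the stream and the reader cannot be used alternately on the same input.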

Jason S 2009-11-05 15:58:58

Answer 6

+1 A:

Instead of using a Reader and InputStream and attempting to switch back and forth between the two, try using a callback interface with one method for binary data and another for character data. e.g.

interface MixedProcessor {
    void processBinaryData(byte[] bytes, int off, int len);
    void processText(String line);
}

Then have another "splitter" class that:

Decides which sections of the input are text and which are binary, and passes them to the corresponding processor method
Converts bytes to characters when required (with the help of a CharsetDecoder)

The splitter class might look something like this:

class Splitter {
    public Splitter(Charset charset) { /* ... */ }
    public void readFully(InputStream is, MixedProcessor processor) throws IOException  { /* ... */ }
}
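
One way the skeleton above might be filled in is sketched below. The input format is an assumption borrowed from the test data in Answer 3: a text header line, a text line holding a decimal byte count, then that many raw bytes, repeated until end of stream. Lines are found by scanning for the byte 0x0A, so the charset is assumed to be ASCII-compatible, and `new String(bytes, charset)` stands in for an explicit CharsetDecoder to keep the example short.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

interface MixedProcessor {
    void processBinaryData(byte[] bytes, int off, int len);
    void processText(String line);
}

class Splitter {
    private final Charset charset;

    public Splitter(Charset charset) { this.charset = charset; }

    // Assumed (hypothetical) format: header line, length line, then <length> raw bytes, repeated.
    public void readFully(InputStream is, MixedProcessor processor) throws IOException {
        DataInputStream in = new DataInputStream(is);
        String header;
        while ((header = readLine(in)) != null) {
            processor.processText(header);
            String lengthLine = readLine(in);
            if (lengthLine == null) {
                throw new EOFException("missing length line after: " + header);
            }
            byte[] payload = new byte[Integer.parseInt(lengthLine.trim())];
            in.readFully(payload);   // throws EOFException if the stream ends early
            processor.processBinaryData(payload, 0, payload.length);
        }
    }

    // Collects bytes up to '\n' and decodes just that line; returns null at end of stream.
    private String readLine(InputStream in) throws IOException {
        ByteArrayOutputStream line = new ByteArrayOutputStream();
        int b;
        while ((b = in.read()) != -1) {
            if (b == '\n') {
                return new String(line.toByteArray(), charset);
            }
            line.write(b);
        }
        return line.size() == 0 ? null : new String(line.toByteArray(), charset);
    }
}

class SplitterDemo {
    public static void main(String[] args) throws IOException {
        byte[] input = {'T', 'E', 'S', 'T', '1', '\n', '2', '\n', (byte) 0x80, (byte) 0x81};
        new Splitter(StandardCharsets.UTF_8).readFully(new ByteArrayInputStream(input),
            new MixedProcessor() {
                public void processBinaryData(byte[] bytes, int off, int len) {
                    System.out.println("binary section of " + len + " bytes");
                }
                public void processText(String line) {
                    System.out.println("text: " + line);
                }
            });
    }
}
```

Only the bytes of a text line are ever decoded, so the binary sections reach the processor unchanged.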

finnw 2009-11-05 19:30:59

hmm. interesting. this is like SAX instead of StAX for XML (push-processing vs. pull-processing). Callbacks would add significant amounts of complication to my specific application, but in general it might be useful.

Jason S 2009-11-05 19:34:38
