How can I read a file with both ASCII and another encoding in Java cleanly?


I have a custom image file where the first block of data is ASCII metadata. I need to be able to read this ASCII metadata part of the file with Java and know when it ends and when the 'raw image data' in another encoding starts.

I was thinking of reading the whole file into a byte[], then reading bytes out of it and converting them to ASCII until I hit the end of the ASCII metadata section, at which point I would store that data. Then I could just rearrange the raw binary data in a different order as-is (no decoding necessary). However, the only way I can think of to do this is to read the ASCII part byte by byte, look for newlines, concatenate everything before each newline, and check whether that is the tag that marks the beginning of the raw image data. There must be a better way: read the ASCII part of the file with readLine() and then immediately continue with the raw image binary, without needing to reopen the file in a new reader and skip to the line where the first reader found the 'begin image' tag.

Any ideas?

+1 A:

Open the file as a FileInputStream (wrapped in a BufferedInputStream).
Create a ByteArrayOutputStream.
Read the input stream byte by byte, looking for your "begin image" tag using a string searching algorithm. Cast individual bytes to char (that's using ASCII implicitly).
At the same time, write each byte you've looked at into the ByteArrayOutputStream.
Once you've found the tag, you can start reading the image data from the input stream.
Get the byte array from the ByteArrayOutputStream and convert it to a String using new String(array, "US-ASCII").

It might be possible to do the string searching easily by using a Scanner on the input stream, but you have to be careful which pattern you use to make sure it will find the tag without starting to read the image data (since you want to read that yourself from the underlying input stream you're keeping a separate reference to).

Edit: Unfortunately, it looks like Scanner implicitly uses a buffer as well, so the only option left is to implement the string search "manually".
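The steps above can be sketched roughly as follows. This is only an illustration, not a definitive implementation: the class name HeaderReader, the method readHeader and the sample data are made up for the example, and the {END} tag is the one mentioned in the comments below. Instead of a full string searching algorithm it keeps a sliding window of the last few bytes and compares that window with the tag after every byte, which is simple and good enough for short tags.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class HeaderReader {

    /**
     * Reads single bytes from the stream until the tag has been consumed.
     * Returns everything read so far (tag included) as an ASCII string and
     * leaves the stream positioned at the first byte after the tag.
     */
    static String readHeader(InputStream in, String tag) throws IOException {
        byte[] tagBytes = tag.getBytes(StandardCharsets.US_ASCII);
        if (tagBytes.length == 0) {
            throw new IllegalArgumentException("tag must not be empty");
        }
        byte[] window = new byte[tagBytes.length]; // the most recent bytes read
        ByteArrayOutputStream header = new ByteArrayOutputStream();
        int count = 0;
        int b;
        while ((b = in.read()) != -1) {
            header.write(b);
            // Shift the window left by one position and append the new byte.
            System.arraycopy(window, 1, window, 0, window.length - 1);
            window[window.length - 1] = (byte) b;
            count++;
            if (count >= window.length && Arrays.equals(window, tagBytes)) {
                return new String(header.toByteArray(), StandardCharsets.US_ASCII);
            }
        }
        throw new EOFException("tag not found: " + tag);
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for a real file: ASCII metadata, the tag, then three raw bytes.
        ByteArrayOutputStream file = new ByteArrayOutputStream();
        file.write("width=2\nheight=1\n{END}".getBytes(StandardCharsets.US_ASCII));
        file.write(new byte[] {0x00, (byte) 0xFF, 0x7F});

        // With a real file this would wrap a FileInputStream instead.
        InputStream in = new BufferedInputStream(new ByteArrayInputStream(file.toByteArray()));
        String header = readHeader(in, "{END}");
        System.out.print(header);
        System.out.println();

        // The same stream now delivers the image bytes, starting right after the tag.
        int b;
        while ((b = in.read()) != -1) {
            System.out.println("image byte: " + b);
        }
    }
}
```

The important design point is that the only buffering layer is the BufferedInputStream you read from yourself, so nothing consumes image bytes behind your back.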

Michael Borgwardt 2009-08-27 09:24:33

Will this work if my "begin image" tag is actually {END}? That would be 5 bytes; does this method let me search for strings that are multiple bytes long?

hatorade 2009-08-27 09:30:37

Yes, of course. It just makes the searching more complex. Look at the links to Wikipedia's page on string search algorithms, or use the Scanner class.

Michael Borgwardt 2009-08-27 11:23:07

@michael: I'm trying the 'scanner' route (well, BufferedReader, anyway). I'm having trouble getting the FileInputStream to start grabbing bytes where BufferedReader.readLine() leaves off (I read off the first line and then grab the next byte, but the next byte is not correct). Any idea what's wrong?

hatorade 2009-08-27 11:50:36

Yes. You're using BufferedReader, that's what's wrong. Don't. The readLine() functionality is secondary to the buffering, i.e. it reads from the underlying input stream in large chunks and thus makes it impossible to continue at the boundary between text and image data.
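A small experiment makes the read-ahead visible. This is a hypothetical demo (the class ReadAheadDemo and the sample bytes are invented for illustration): it reads one line through a BufferedReader and then asks the underlying stream how many bytes are still unread. With the default buffer sizes the reader typically drains a small input completely, so the count is expected to be well below the 10 bytes of "image" data that follow the first line.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public class ReadAheadDemo {

    /** Returns how many bytes remain in the raw stream after a single readLine() call. */
    static int bytesLeftAfterReadLine(byte[] data) throws IOException {
        InputStream raw = new ByteArrayInputStream(data);
        BufferedReader reader =
                new BufferedReader(new InputStreamReader(raw, StandardCharsets.US_ASCII));
        reader.readLine();      // only the first line is requested...
        return raw.available(); // ...but the reader has already pulled in more than that
    }

    public static void main(String[] args) throws IOException {
        // One text line followed by 10 bytes standing in for image data.
        byte[] data = "header line\nIMAGEBYTES".getBytes(StandardCharsets.US_ASCII);
        System.out.println("bytes left in raw stream: " + bytesLeftAfterReadLine(data));
    }
}
```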

Michael Borgwardt 2009-08-27 11:59:56

Changing it to Scanner (leaving the rest of the code the same) didn't work. I also tried getting rid of the BufferedInputStream, but that didn't work either. It's still printing the same bytes that were printed when I was using BufferedReader.

hatorade 2009-08-27 13:00:01

+1 A:

Not sure if you can decide the format yourself, but anyway:

An alternative strategy is to write an integer value at the very start of the file that contains the number of bytes used for the ASCII section. Then you can just read that number of bytes, and it is also easy to skip the ASCII and go directly to the binary blob.

This strategy is efficient, but you cannot change the amount of ASCII text without also updating the count.

By the way, make sure to sanitize your input: don't try to read more data than the file contains or allocate more memory than the machine can provide.

Personally I would also use the first couple of characters of the file to contain some magic code, so that you can have a minimal check that the file is using your data format, and what version of the data format.
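A minimal sketch of such a layout, assuming you control the file format: a 4-byte magic number, a 4-byte length, the ASCII text, then the raw image bytes. Everything named here is hypothetical (the class LengthPrefixedHeader, the magic value "IMG1" whose last character doubles as a version number, and the 1 MiB limit on the text section), so adjust it to your own format.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

public class LengthPrefixedHeader {
    static final int MAGIC = 0x494D4731;         // hypothetical magic: ASCII "IMG1"
    static final int MAX_HEADER_BYTES = 1 << 20; // arbitrary sanity limit (1 MiB)

    /** Builds a file image: magic, length of the ASCII text, the text, then the raw bytes. */
    static byte[] write(String meta, byte[] image) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        byte[] metaBytes = meta.getBytes(StandardCharsets.US_ASCII);
        out.writeInt(MAGIC);
        out.writeInt(metaBytes.length);
        out.write(metaBytes);
        out.write(image);
        out.flush();
        return bytes.toByteArray();
    }

    /** Reads and validates the header; afterwards the stream is positioned at the image data. */
    static String readMeta(DataInputStream in) throws IOException {
        if (in.readInt() != MAGIC) {
            throw new IOException("not in the expected file format");
        }
        int length = in.readInt();
        if (length < 0 || length > MAX_HEADER_BYTES) {
            throw new IOException("implausible header length: " + length);
        }
        byte[] meta = new byte[length];
        in.readFully(meta); // throws EOFException if the file is shorter than it claims
        return new String(meta, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) throws IOException {
        byte[] file = write("width=2\nheight=1\n", new byte[] {0x00, (byte) 0xFF});
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(file));
        System.out.print(readMeta(in));
        System.out.println("first image byte: " + in.read());
    }
}
```

Checking the length before allocating the array and using readFully covers both sanity checks mentioned above: no oversized allocation, and a clear error if the file ends early.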

Johan 2009-08-27 09:45:17
