What's an efficient way of splitting a String into chunks of 1024 bytes in Java? If there is more than one chunk, the header (a fixed-size string) needs to be repeated in all subsequent chunks.

A: 

Let's start with the fact that Strings are composed of 16-bit (two-byte) characters. So splitting them into bytes is meaningless.

Now: what is this "header" that you're talking about? Is it also 1024 characters? And if there's a header, should we presume that there is data following? And is each data record 1024 characters? Or do you plan to break records in the middle?

As for actually extracting the substrings ... well, String.substring(int start, int end) seems to be a pretty clear choice.

kdgregory
The need for this comes from the fact that the content needs to be sent over the network in chunks < 1024 bytes. The header is a substring of the original String that needs to be repeated in all subsequent chunks.
Assuming that this is record-formatted data, breaking at record boundaries versus fixed size seems to be the best solution. Then you need to ensure that both ends of the connection use the same encoding (UTF-8 would be best), and deal with the overhead of that.
kdgregory
This is a free form string so no assumptions can be made about the content.
+2  A: 

You have two ways, the fast and the memory conservative way. But first, you need to know what characters are in the String. ASCII? Are there umlauts (characters between 128 and 255) or even other Unicode characters (s.charAt(i) returns something > 255)? Depending on that, you will need to use a different encoding. If you have binary data, try "iso-8859-1" because it will preserve the data in the String. If you have Unicode, try "utf-8". I'll assume binary data:

String encoding = "iso-8859-1";

The fastest way:

ByteArrayInputStream in = new ByteArrayInputStream (string.getBytes(encoding));

Note that the String is Unicode, so every character needs two bytes in memory. You have to specify the encoding explicitly; don't rely on the "platform default", which will only cause pain later.

Now you can read it in 1024 chunks using

byte[] buffer = new byte[1024];
int len;
while ((len = in.read(buffer)) > 0) { ... }

This needs about three times as much RAM as the original String.
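A minimal sketch of the fast approach with the header added to every chunk. The class name FastChunker, the method split and the choice to pass the header as a separate argument are assumptions for illustration; the question does not say exactly how the header is delimited. Each chunk here is the header followed by as many body bytes as still fit into chunkSize:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class FastChunker {
    // Hypothetical helper: every returned chunk is header + body slice,
    // and no chunk is longer than chunkSize bytes.
    static List<byte[]> split(String header, String body, int chunkSize) {
        Charset encoding = StandardCharsets.ISO_8859_1; // assumes binary/Latin-1 data, as in the answer
        byte[] head = header.getBytes(encoding);
        int room = chunkSize - head.length;             // bytes left for body data per chunk
        if (room <= 0) {
            throw new IllegalArgumentException("header does not fit into one chunk");
        }
        ByteArrayInputStream in = new ByteArrayInputStream(body.getBytes(encoding));
        List<byte[]> chunks = new ArrayList<>();
        byte[] buffer = new byte[room];
        int len;
        while ((len = in.read(buffer, 0, buffer.length)) > 0) {
            ByteArrayOutputStream out = new ByteArrayOutputStream(head.length + len);
            out.write(head, 0, head.length);            // repeat the header
            out.write(buffer, 0, len);                  // then this slice of the body
            chunks.add(out.toByteArray());
        }
        return chunks;
    }
}
```

With a 4-byte header and a chunk size of 1024, each chunk carries at most 1020 body bytes. If the first chunk should not get an extra header, skip the header write on the first iteration.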

A more memory conservative way is to write a converter which takes a StringReader and an OutputStreamWriter (which wraps a ByteArrayOutputStream). Copy characters from the reader to the writer until the underlying buffer contains one chunk of data.

When it does, copy the data to the real output (prepending the header), copy the additional bytes (which the Unicode->byte conversion may have generated) to a temp buffer, call buffer.reset() and write the temp buffer to buffer.

Code looks like this (untested):

StringReader r = new StringReader (string);
ByteArrayOutputStream buffer = new ByteArrayOutputStream (1024*2); // Twice as large as necessary
OutputStreamWriter w = new OutputStreamWriter  (buffer, encoding);

char[] cbuf = new char[100];
byte[] tempBuf;
int len;
while ((len = r.read(cbuf, 0, cbuf.length)) > 0) {
    w.write(cbuf, 0, len);
    w.flush();
    if (buffer.size() >= 1024) {
        tempBuf = buffer.toByteArray();
        ... ready to process one chunk ...
        buffer.reset();
        if (tempBuf.length > 1024) {
            buffer.write(tempBuf, 1024, tempBuf.length - 1024);
        }
    }
}
... check if some data is left in buffer and process that, too ...

This only needs a couple of kilobytes of RAM.

[EDIT] There has been a lengthy discussion about binary data in Strings in the comments. First of all, it's perfectly safe to put binary data into a String as long as you are careful when creating it and storing it somewhere. To create such a String, take a byte[] array and:

String safe = new String (array, "iso-8859-1");

In Java, ISO-8859-1 (a.k.a. ISO-Latin1) is a 1:1 mapping. This means the bytes in the array will not be interpreted in any way. Now you can use substring() and the like on the data, search it with indexOf(), run regular expressions on it, etc. For example, find the position of a 0-byte:

int pos = safe.indexOf('\u0000');

This is especially useful if you don't know the encoding of the data and want to have a look at it before some codec messes with it.

To write the data somewhere, the reverse operation is:

byte[] data = safe.getBytes("iso-8859-1");

Never use the default methods new String(array) or String.getBytes()! One day, your code is going to be executed on a different platform and it will break.

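Now the problem of characters > 255 in the String. If you use this method, you won't ever have any such character in your Strings. That said, if there were any for some reason, getBytes() would not throw an exception: the getBytes(Charset) overload replaces every character that ISO-Latin1 cannot express with '?', and the Javadoc leaves the behavior of getBytes(String charsetName) unspecified for that case. So the code can fail silently unless you check for such characters yourself or encode with a CharsetEncoder, which reports unmappable characters by default.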
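A minimal sketch of a conversion that fails loudly on characters > 255. The class StrictLatin1 and the method encodeStrict are hypothetical names. For comparison, String.getBytes(StandardCharsets.ISO_8859_1) does not throw on such characters but substitutes '?', whereas an encoder from java.nio.charset can be told to report them:

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class StrictLatin1 {
    // Hypothetical helper: encodes to ISO-8859-1 and throws instead of
    // silently replacing characters that the charset cannot represent.
    static byte[] encodeStrict(String s) throws CharacterCodingException {
        ByteBuffer encoded = StandardCharsets.ISO_8859_1.newEncoder()
                .onMalformedInput(CodingErrorAction.REPORT)      // e.g. unpaired surrogates
                .onUnmappableCharacter(CodingErrorAction.REPORT) // e.g. characters > 255
                .encode(CharBuffer.wrap(s));
        byte[] out = new byte[encoded.remaining()];
        encoded.get(out);
        return out;
    }
}
```

Calling encodeStrict on a String that contains, say, a euro sign raises a CharacterCodingException, so the bad input is caught where the bytes are produced and not later at the receiver.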

Some might argue that this is not safe enough and you should never mix bytes and String. In this day and age, we don't have that luxury. A lot of data has no explicit encoding information (files, for example, don't have an "encoding" attribute in the same way as they have access permissions or a name). XML is one of the few formats which has explicit encoding information and there are editors like Emacs or jEdit which use comments to specify this vital information. This means that, when processing streams of bytes, you must always know in which encoding they are. As of now, it's not possible to write code which will always work, no matter where the data comes from.

Even with XML, you must read the header of the file as bytes to determine the encoding before you can decode the meat.

The important point is to sit down and figure out which encoding was used to generate the data stream you have to process. If you do that, you're good; if you don't, you're doomed. The confusion originates from the fact that most people are not aware that the same byte can mean different things depending on the encoding or even that there is more than one encoding. Also, it would have helped if Sun hadn't introduced the notion of "platform default encoding."

Important points for beginners:

  • There is more than one encoding (charset).
  • There are more characters than the English language uses. There are even several sets of digits (ASCII, full width, Arabic-Indic, Bengali).
  • You must know which encoding was used to generate the data which you are processing.
  • You must know which encoding you should use to write the data you are processing.
  • You must know the correct way to specify this encoding information so the next program can decode your output (XML header, HTML meta tag, special encoding comment, whatever).

The days of ASCII are over.

Aaron Digulla
Would this suffer from the problem that kdgregory was mentioning? That, depending on your platform default encoding, you may split a single character into two meaningless pieces
Please don't use "iso-8859-1". Use "utf8". UTF8 handles pretty much all of iso-8859-1 in a single byte, but can scale up to handle all characters. Yes, unknown, this could split a single character into two meaningless pieces...or throw them away, which is what iso-8859-1 would do.
Richard Campbell
No, because I'm specifying the encoding "iso-8859-1" (which is Latin-1, i.e. ASCII with umlauts). If your String contains other characters (code points above 255), you must use something else here, but Latin-1 is usually good because it doesn't change anything.
Aaron Digulla
Richard: My guess is that he has binary data in that String in which case iso-8859-1 is perfect (it won't change the data).
Aaron Digulla
I improved my answer with some info about the encodings.
Aaron Digulla
I don't have any binary data in the String. I was actually looking at java.nio.ByteBuffer. It looks promising.
If he has binary data in a String, then unless it's in Base64, he has corrupted data and may as well stop right there.
Michael Borgwardt
Nope, you can read binary data into a String without problems. A String can contain any character between 0 and 0xffff which covers all binary codes (0-255). Often, a string is more user friendly than a byte[] array. You just need to be a bit careful when you read/write it :)
Aaron Digulla
Nope, if you do that you'll almost certainly end up corrupting your data. It's a horrible abuse that nobody who considers themselves a professional programmer should ever contemplate. Seriously, it's just a very bad idea.
Michael Borgwardt
Putting binary data in a String can get you into trouble. Reminds me of a bug (actually more a design mistake) I had at work with COMP-3 binary COBOL fields in a copybook that were returned into an EBCDIC String, that got converted into ISO-8859-1 at the destination. Result: garbage.
eljenso
@Aaron: I wouldn't want to leave a time bomb in my program, personally; when you finally try to put a Japanese or Chinese string in that 1024 buffer, it's going to blow up and you might not remember why. I wouldn't store binary data in a String either. A short[] if I wanted to deal with unsigned.
Richard Campbell
See my edit. In short: While it is generally a good idea not to mix bytes and Unicode, sometimes, you have to. For example, when decoding XML in a parser, you must read the header as bytes to determine the encoding. Conclusion: If you don't know what you're doing, it's gonna break.
Aaron Digulla
And if you DO know what you're doing, the next guy to touch the code won't, and THEN it will break. This is very bad advice. People have trouble enough dealing with text; encouraging them to mix it with binary data is just plain irresponsible.
Alan Moore
If every developer would understand how binary data can be handled safely, we wouldn't have this discussion. I explain how it is done correctly and safely since I've never seen that anywhere else (which is probably why most people do it the wrong way which leads to discussions like this one).
Aaron Digulla
I understand that you are all afraid of this. Scared me as well. But things like this must be understood or we will never see the end to the errors about which you complain. Wrapping this in red tape won't improve the situation.
Aaron Digulla
So while in the general case, it is smart to use one of the Unicode encodings, that won't help the guy who asked the question because he needs bytes. He didn't say why or what for but if he's right, my answer is correct.
Aaron Digulla
A: 

I would do it via String.getBytes(). Then do a loop over the returned array and count up i. If i % 1024 == 0, chunk and add it to a List<byte[]>.
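A minimal sketch of that loop, stepping through the array one chunk at a time instead of testing i % 1024 for every byte. The class ByteChunks and the method chunk are hypothetical names, and the caller is assumed to have produced the byte array with an explicit encoding:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ByteChunks {
    // Hypothetical helper: cuts bytes into pieces of at most size bytes.
    // The last piece is trimmed to the number of bytes actually left.
    static List<byte[]> chunk(byte[] bytes, int size) {
        if (size <= 0) {
            throw new IllegalArgumentException("size must be positive");
        }
        List<byte[]> chunks = new ArrayList<>();
        for (int from = 0; from < bytes.length; from += size) {
            int to = Math.min(from + size, bytes.length);
            chunks.add(Arrays.copyOfRange(bytes, from, to)); // copies bytes[from..to)
        }
        return chunks;
    }
}
```

A call could look like ByteChunks.chunk(text.getBytes(StandardCharsets.UTF_8), 1024). Note that this cuts at byte positions, so with a multi-byte encoding a character can end up split across two chunks.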

furtelwart
Don't forget to specify the encoding of the String.
Aaron Digulla
Aaron: Thanks! The downvoter: No comment to the downvote? What a pity.
furtelwart
You have an unterminated String constant, the return type ArrayList is too specific, you don't handle the checked UnsupportedEncodingException, you don't increment your loop variable, you need to test i%1024==0 before setting tmpBytes[i%1024]=bytes[i], (continued)
eljenso
you end up with an empty first array since 0%1024==0, you don't resize the last tmpBytes array to its actual size, and guess what, I'm not the original downvoter. I would suggest you remove this piece of "code".
eljenso
+4  A: 

Strings and bytes are two completely different things, so wanting to split a String into bytes is as meaningless as wanting to split a painting into verses.

What is it that you actually want to do?

To convert between strings and bytes, you need to specify an encoding that can encode all the characters in the String. Depending on the encoding and the characters, some of them may span more than one byte.

You can either split the String into chunks of 1024 characters and encode those as bytes, but then each chunk may be more than 1024 bytes.

Or you can encode the original string into bytes and then split them into chunks of 1024, but then you have to make sure to append them as bytes before decoding the whole into a String again, or you may get garbled characters at the split points when a character spans more than 1 byte.
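The garbling at split points can be shown with a small sketch. The class SplitPointDemo and both method names are hypothetical, and UTF-8 is assumed as the encoding. decodePerChunk decodes every chunk on its own, while decodeJoined appends the chunks as bytes first and decodes once:

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

class SplitPointDemo {
    // Decodes each slice separately; a character cut in half at a slice
    // boundary turns into U+FFFD replacement characters.
    static String decodePerChunk(byte[] all, int size) {
        StringBuilder text = new StringBuilder();
        for (int from = 0; from < all.length; from += size) {
            int len = Math.min(size, all.length - from);
            text.append(new String(all, from, len, StandardCharsets.UTF_8));
        }
        return text.toString();
    }

    // Appends the slices as bytes and decodes only once at the end,
    // so split characters are whole again before decoding.
    static String decodeJoined(byte[] all, int size) {
        ByteArrayOutputStream joined = new ByteArrayOutputStream(all.length);
        for (int from = 0; from < all.length; from += size) {
            int len = Math.min(size, all.length - from);
            joined.write(all, from, len);
        }
        return new String(joined.toByteArray(), StandardCharsets.UTF_8);
    }
}
```

For a String of euro signs (three bytes each in UTF-8) and a size of 1024, the first boundary falls inside a character, so decodePerChunk returns text containing replacement characters while decodeJoined returns the original String.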

If you're worried about memory usage when the String can be very long, you should use streams (java.io package) to do the en/decoding and splitting, in order to avoid keeping the data in memory several times as copies. Ideally, you should avoid having the original String in one piece at all and instead use streams to read it in small chunks from wherever you get it from.
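One way to sketch such a streaming split is with a java.nio.charset.CharsetEncoder fed from a Reader. This is an illustration and not the only option: the class StreamingChunker, the method stream, the Consumer callback and the choice of UTF-8 are all assumptions. Because the encoder stops when the next character no longer fits, every chunk ends on a character boundary and stays within maxBytes:

```java
import java.io.IOException;
import java.io.Reader;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.function.Consumer;

class StreamingChunker {
    // Hypothetical helper: reads characters in small batches, encodes them
    // as UTF-8 and hands every full chunk (at most maxBytes long) to sink.
    static void stream(Reader reader, int maxBytes, Consumer<byte[]> sink) throws IOException {
        if (maxBytes < 4) {
            throw new IllegalArgumentException("maxBytes must hold one UTF-8 character (4 bytes)");
        }
        CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder();
        CharBuffer chars = CharBuffer.allocate(256);
        ByteBuffer bytes = ByteBuffer.allocate(maxBytes);
        boolean endOfInput = false;
        while (!endOfInput) {
            endOfInput = reader.read(chars) < 0;   // -1 signals the end of the reader
            chars.flip();
            CoderResult result;
            do {
                result = encoder.encode(chars, bytes, endOfInput);
                if (result.isError()) {
                    result.throwException();       // malformed input such as a lone surrogate
                }
                if (result.isOverflow()) {
                    emit(bytes, sink);             // chunk is full, next character did not fit
                }
            } while (result.isOverflow());
            chars.compact();                       // keep a pending half surrogate pair, if any
        }
        // UTF-8 keeps no pending encoder state, so encoder.flush() is skipped here.
        if (bytes.position() > 0) {
            emit(bytes, sink);                     // final, possibly shorter chunk
        }
    }

    private static void emit(ByteBuffer bytes, Consumer<byte[]> sink) {
        sink.accept(Arrays.copyOf(bytes.array(), bytes.position()));
        bytes.clear();
    }
}
```

Only the 256-character batch and one chunk are held in memory at a time. A header could be prepended inside the sink, with maxBytes reduced by the header length.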

Michael Borgwardt