What's an efficient way of splitting a String into chunks of 1024 bytes in Java? If there is more than one chunk, then the header (a fixed-size string) needs to be repeated in all subsequent chunks.
Let's start with the fact that Strings are composed of 16-bit (two-byte) characters. So splitting them into bytes is meaningless.
Now: what is this "header" that you're talking about? Is it also 1024 characters? And if there's a header, should we presume that there is data following? And is each data record 1024 characters? Or do you plan to break records in the middle?
As for actually extracting the substrings ... well, String.substring(int start, int end) seems to be a pretty clear choice.
You have two ways: the fast way and the memory-conservative way. But first, you need to know what characters are in the String. Plain ASCII? Are there umlauts (characters between 128 and 255) or even Unicode (s.charAt(i) returns something >= 256)? Depending on that, you will need to use a different encoding. If you have binary data, try "iso-8859-1" because it will preserve the data in the String. If you have Unicode, try "utf-8". I'll assume binary data:
String encoding = "iso-8859-1";
The fastest way:
ByteArrayInputStream in = new ByteArrayInputStream (string.getBytes(encoding));
Note that the String is Unicode, so every character needs two bytes. You will have to specify the encoding (don't rely on the "platform default"; that will only cause pain later).
Now you can read it in 1024-byte chunks using
byte[] buffer = new byte[1024];
int len;
while ((len = in.read(buffer)) > 0) { ... }
This needs about three times as much RAM as the original String.
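Put together, the fast way might look like the sketch below. The class name and the header-prepending convention are my assumptions, since the question doesn't spell out the header format; I assume the header should simply be prepended to every chunk after the first.

```java
import java.io.ByteArrayInputStream;
import java.io.UnsupportedEncodingException;
import java.util.ArrayList;
import java.util.List;

public class ChunkSplitter {
    // Splits the encoded bytes of 'payload' into chunks of at most
    // 'chunkSize' bytes, prepending 'header' to every chunk after the first.
    static List<String> split(String header, String payload, int chunkSize,
                              String encoding) throws UnsupportedEncodingException {
        ByteArrayInputStream in =
            new ByteArrayInputStream(payload.getBytes(encoding));
        List<String> chunks = new ArrayList<String>();
        byte[] buffer = new byte[chunkSize];
        int len;
        boolean first = true;
        while ((len = in.read(buffer, 0, buffer.length)) > 0) {
            // Decode with the same encoding used above so the bytes round-trip.
            String body = new String(buffer, 0, len, encoding);
            chunks.add(first ? body : header + body);
            first = false;
        }
        return chunks;
    }
}
```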
A more memory-conservative way is to write a converter which takes a StringReader and an OutputStreamWriter (which wraps a ByteArrayOutputStream). Copy characters from the reader to the writer until the underlying buffer contains one chunk of data. When it does, copy the data to the real output (prepending the header), copy the surplus bytes (which the Unicode-to-byte conversion may have produced beyond the chunk boundary) to a temp buffer, call buffer.reset(), and write the temp buffer back to buffer.
Code looks like this (untested):
StringReader r = new StringReader(string);
ByteArrayOutputStream buffer = new ByteArrayOutputStream(1024 * 2); // Twice as large as necessary
OutputStreamWriter w = new OutputStreamWriter(buffer, encoding);

char[] cbuf = new char[100];
byte[] tempBuf;
int len;
while ((len = r.read(cbuf, 0, cbuf.length)) > 0) {
    w.write(cbuf, 0, len);
    w.flush();
    if (buffer.size() >= 1024) {
        tempBuf = buffer.toByteArray();
        ... ready to process one chunk ...
        buffer.reset();
        if (tempBuf.length > 1024) {
            buffer.write(tempBuf, 1024, tempBuf.length - 1024);
        }
    }
}
... check if some data is left in buffer and process that, too ...
This only needs a couple of kilobytes of RAM.
[EDIT] There has been a lengthy discussion about binary data in Strings in the comments. First of all, it's perfectly safe to put binary data into a String as long as you are careful when creating it and storing it somewhere. To create such a String, take a byte[] array and:
String safe = new String (array, "iso-8859-1");
In Java, ISO-8859-1 (a.k.a. ISO-Latin-1) is a 1:1 mapping. This means the bytes in the array will not be interpreted in any way. Now you can use substring() and the like on the data, search it with indexOf(), run regexps on it, etc. For example, find the position of a 0-byte:
int pos = safe.indexOf('\u0000');
This is especially useful if you don't know the encoding of the data and want to have a look at it before some codec messes with it.
To write the data somewhere, the reverse operation is:
byte[] data = safe.getBytes("iso-8859-1");
Never use the default methods new String(array) or String.getBytes()! One day, your code is going to be executed on a different platform and it will break.
Now the problem of characters > 255 in the String. If you use this method, you won't ever have any such character in your Strings. That said, if one did sneak in for some reason, getBytes("iso-8859-1") would not fail loudly: it silently replaces every character that ISO-Latin-1 cannot express with '?'. If you want the conversion to fail instead of corrupting the data, use a CharsetEncoder configured to report errors.
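For the strict-failure variant, the java.nio.charset API lets you demand an exception on unmappable characters instead of the silent '?' substitution. A minimal sketch (the class and method names here are mine, not from the original answer):

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;

public class StrictEncoder {
    // Encodes 's' as ISO-8859-1, throwing CharacterCodingException if the
    // String contains any character above U+00FF, instead of replacing it
    // with '?' the way String.getBytes("iso-8859-1") does.
    static byte[] encodeStrict(String s) throws CharacterCodingException {
        CharsetEncoder encoder = Charset.forName("iso-8859-1").newEncoder()
            .onMalformedInput(CodingErrorAction.REPORT)
            .onUnmappableCharacter(CodingErrorAction.REPORT);
        ByteBuffer out = encoder.encode(CharBuffer.wrap(s));
        byte[] bytes = new byte[out.remaining()];
        out.get(bytes);
        return bytes;
    }
}
```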
Some might argue that this is not safe enough and you should never mix bytes and String. In this day and age, we don't have that luxury. A lot of data has no explicit encoding information (files, for example, don't have an "encoding" attribute in the same way as they have access permissions or a name). XML is one of the few formats which has explicit encoding information, and there are editors like Emacs or jEdit which use comments to specify this vital information. This means that, when processing streams of bytes, you must always know in which encoding they are. As of now, it's not possible to write code which will always work, no matter where the data comes from.
Even with XML, you must read the header of the file as bytes to determine the encoding before you can decode the meat.
The important point is to sit down and figure out which encoding was used to generate the data stream you have to process. If you do that, you're good; if you don't, you're doomed. The confusion originates from the fact that most people are not aware that the same byte can mean different things depending on the encoding, or even that there is more than one encoding. Also, it would have helped if Sun hadn't introduced the notion of "platform default encoding."
Important points for beginners:
- There is more than one encoding (charset).
- There are more characters than the English language uses. There are even several sets of digits (ASCII, full width, Arabic-Indic, Bengali).
- You must know which encoding was used to generate the data which you are processing.
- You must know which encoding you should use to write the data you are processing.
- You must know the correct way to specify this encoding information so the next program can decode your output (XML header, HTML meta tag, special encoding comment, whatever).
The days of ASCII are over.
I would do it via String.getBytes(). Then loop over the returned array, counting i up. Whenever i % 1024 == 0, cut a chunk and add it to a List<byte[]>.
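A sketch of that approach (class and method names are illustrative, and an explicit encoding is used instead of the platform default, for the reasons given in the other answer):

```java
import java.io.UnsupportedEncodingException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ByteChunker {
    // Encodes the String once, then slices the byte array into pieces of
    // at most 'size' bytes (the last piece may be shorter).
    static List<byte[]> chunk(String s, String encoding, int size)
            throws UnsupportedEncodingException {
        byte[] data = s.getBytes(encoding);
        List<byte[]> chunks = new ArrayList<byte[]>();
        for (int i = 0; i < data.length; i += size) {
            chunks.add(Arrays.copyOfRange(data, i, Math.min(i + size, data.length)));
        }
        return chunks;
    }
}
```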
Strings and bytes are two completely different things, so wanting to split a String into bytes is as meaningless as wanting to split a painting into verses.
What is it that you actually want to do?
To convert between strings and bytes, you need to specify an encoding that can encode all the characters in the String. Depending on the encoding and the characters, some of them may span more than one byte.
You can either split the String into chunks of 1024 characters and encode those as bytes, but then each chunk may be more than 1024 bytes.
Or you can encode the original string into bytes and then split them into chunks of 1024, but then you have to make sure to append them as bytes before decoding the whole into a String again, or you may get garbled characters at the split points when a character spans more than 1 byte.
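A small demonstration of that failure mode, assuming UTF-8 (the helper name decodeSlice is mine): '\u00e9' takes two bytes in UTF-8, so decoding the two halves of a badly placed split separately yields U+FFFD replacement characters instead of the original character.

```java
import java.io.UnsupportedEncodingException;
import java.util.Arrays;

public class SplitDemo {
    // Decodes the byte range [from, to) of the UTF-8 encoding of 's'.
    static String decodeSlice(String s, int from, int to)
            throws UnsupportedEncodingException {
        byte[] bytes = s.getBytes("utf-8");
        return new String(Arrays.copyOfRange(bytes, from, to), "utf-8");
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        // "caf\u00e9" encodes to five UTF-8 bytes; the last character takes two.
        // Splitting between those two bytes garbles both decoded halves:
        String broken = decodeSlice("caf\u00e9", 0, 4) + decodeSlice("caf\u00e9", 4, 5);
        System.out.println(broken.equals("caf\u00e9")); // false: the split point corrupted the data
    }
}
```

Rejoining the bytes first and decoding the whole array restores the original String, which is why the chunks must be concatenated as bytes before decoding.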
If you're worried about memory usage when the String can be very long, you should use streams (the java.io package) to do the encoding/decoding and the splitting, in order to avoid keeping several copies of the data in memory. Ideally, you should avoid having the original String in one piece at all and instead use streams to read it in small chunks from wherever you get it.