I am trying to read a UTF8 string via a java.nio.ByteBuffer. The size is an unsigned int, which, of course, Java doesn't have. I have read the value into a long so that I have the full value.

The next issue I have is that I cannot create an array of bytes with a long length, and casting the long back to an int will make it signed.

I also tried using limit() on the buffer, but again it works with int not long.

The specific thing I am doing is reading the UTF8 strings out of a class file, so the buffer has more in it than just the UTF8 string.

Any ideas on how to read a UTF8 string that has a potential length of an unsigned int from a ByteBuffer?

EDIT:

Here is an example of the issue.

SourceDebugExtension_attribute {
    u2 attribute_name_index;
    u4 attribute_length;
    u1 debug_extension[attribute_length];
}

attribute_name_index
    The value of the attribute_name_index item must be a valid index into the constant_pool table. The constant_pool entry at that index must be a CONSTANT_Utf8_info structure representing the string "SourceDebugExtension".

attribute_length
    The value of the attribute_length item indicates the length of the attribute, excluding the initial six bytes. The value of the attribute_length item is thus the number of bytes in the debug_extension[] item.

debug_extension[]
    The debug_extension array holds a string, which must be in UTF-8 format. There is no terminating zero byte.

    The string in the debug_extension item will be interpreted as extended debugging information. The content of this string has no semantic effect on the Java Virtual Machine.

So, from a technical point of view, it is possible to have a string in the class file that is the full u4 (unsigned, 4 bytes) in length.

This won't be an issue if there is a limit to the size of a UTF8 string (I am no UTF8 expert, so perhaps there is such a limit).

I could just punt on it and go with the reality that there is not going to be a String that long...

+5  A: 

Unless your array of bytes is more than 2GB (the largest positive value of a Java int), you won't have a problem with casting the long back into a signed int.

If your array of bytes needs to be more than 2GB in length, you're doing it wrong, not least because that's way more than the default maximum heapsize of the JVM...
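A minimal sketch of that approach, assuming the buffer is positioned at a u4 length followed by the UTF-8 bytes. The class name U4String and the method readU4String are hypothetical, chosen only for illustration. The length is widened to a long, checked against Integer.MAX_VALUE, and only then cast back to an int:

```java
import java.nio.BufferUnderflowException;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of any library.
final class U4String {
    // Reads a u4 (unsigned 32-bit, big-endian) length, then that many UTF-8 bytes.
    static String readU4String(ByteBuffer buf) {
        // Mask to widen the signed int into the unsigned range 0..4294967295.
        long length = buf.getInt() & 0xFFFFFFFFL;
        if (length > Integer.MAX_VALUE) {
            // No Java array can hold this many bytes, so report it instead of casting.
            throw new IllegalArgumentException("length " + length + " exceeds the maximum array size");
        }
        if (length > buf.remaining()) {
            // The declared length is larger than what the buffer actually holds.
            throw new BufferUnderflowException();
        }
        byte[] bytes = new byte[(int) length]; // safe: the range was checked above
        buf.get(bytes);
        return new String(bytes, StandardCharsets.UTF_8);
    }
}
```

The cast is only reached once the value is known to fit, so a length above 2^31-1 surfaces as an explicit error rather than a negative array size.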

Alnitak
Surely the array of bytes can be longer in the case where the encoding is >1 byte/character. String encapsulates a char[] array, not a byte[] array.
oxbow_lakes
sure, it can be longer. It's never going to hit 2 GB though.
Alnitak
see my edit... the size is not under my control at all...
TofuBeer
but no Java array (nor ByteBuffer, apparently) can be longer than 2 GB anyway. Seriously, just accept that this is your limit, and have your code give an error in the extremely unlikely circumstance that the long can't be cleanly cast to an int.
Alnitak
@Alnitak Sold! I just needed to be convinced! :-) Thanks
TofuBeer
No reason to fail if you can just do whatever you need to do with that String on parts of it one after another rather than keeping it all in memory at the same time.
Michael Borgwardt
I agree with the answer: if you find yourself juggling a monster 2GB String in memory, then you are almost definitely doing it wrong! Using something like a BufferedReader surely would be better then? Unless you want to implement a GiganticString class to handle this case? ;-)
KarstenF
+1  A: 

A signed int won't be your main problem. Say you had a String that was 4 billion characters long. You would need a ByteBuffer of at least 4 GB and a byte[] of at least 4 GB. When you convert this to a String, you need at least 8 GB (2 bytes per character) plus a StringBuilder of at least 8 GB to build it. All up, you need 24 GB to process one String. Even if you have a lot of memory, you won't get many Strings of this size.

Another approach is to treat the length as signed and, if it comes out negative (meaning the unsigned value exceeds 2^31-1), treat it as an error, since you won't have enough memory to process the String in any case. Even to handle a String of 2 billion characters (2^31-1) you would need 12 GB to convert it to a String this way.

Peter Lawrey
+1  A: 

Java arrays are indexed by a (Java, i.e. signed) int as per the language spec, so it's impossible to have a String (which is backed by a char array) longer than Integer.MAX_VALUE.

But even that much is way too much to be processing in one chunk - it'll totally kill performance and make your program fail with an OutOfMemoryError on most machines if a sufficiently large String is ever encountered.

What you should do is process any string in chunks of a sensible size, say a few megs at a time. Then there's no practical limit on the size you can deal with.
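One way this chunked processing could look, as a rough sketch rather than a definitive implementation. It assumes the bytes come from an InputStream instead of a single ByteBuffer, and the names ChunkedUtf8, decode, chunkSize and sink are made up for the example. A java.nio.charset.CharsetDecoder carries incomplete multi-byte sequences over from one chunk to the next, so the length can be a long:

```java
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.function.Consumer;

// Hypothetical helper, not part of any library.
final class ChunkedUtf8 {
    // Decodes byteLength UTF-8 bytes from in, handing each decoded piece to sink.
    static void decode(InputStream in, long byteLength, int chunkSize, Consumer<String> sink)
            throws IOException {
        if (chunkSize < 4) {
            // A UTF-8 sequence can be 4 bytes long; a smaller buffer could stall.
            throw new IllegalArgumentException("chunkSize must be at least 4");
        }
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        ByteBuffer bytes = ByteBuffer.allocate(chunkSize);
        // n UTF-8 bytes never decode to more than n chars, so this cannot overflow.
        CharBuffer chars = CharBuffer.allocate(chunkSize);
        long remaining = byteLength;
        while (remaining > 0) {
            int want = (int) Math.min(remaining, bytes.remaining());
            int n = in.read(bytes.array(), bytes.position(), want);
            if (n < 0) {
                throw new EOFException("stream ended with " + remaining + " bytes missing");
            }
            bytes.position(bytes.position() + n);
            remaining -= n;
            bytes.flip();
            boolean last = remaining == 0;
            CoderResult result = decoder.decode(bytes, chars, last);
            if (result.isError()) {
                result.throwException(); // malformed or truncated UTF-8
            }
            if (last) {
                decoder.flush(chars);
            }
            chars.flip();
            if (chars.hasRemaining()) {
                sink.accept(chars.toString());
            }
            chars.clear();
            bytes.compact(); // keep any incomplete trailing sequence for the next round
        }
    }
}
```

Memory use is bounded by chunkSize regardless of byteLength; whether that helps depends on the consumer being able to work on pieces instead of needing the whole String at once.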

Michael Borgwardt
A: 

I guess you could implement CharSequence on top of a ByteBuffer. That would allow you to keep your "String" from turning up on the heap, although most utilities that deal with characters actually expect a String. And even then, there is actually a limit on CharSequence as well. It expects the size to be returned as an int.

(You could theoretically create a new version of CharSequence that returns the size as a long, but then there's nothing in Java that would help you in dealing with that CharSequence. Perhaps it would be useful if you would implement subSequence(...) to return an ordinary CharSequence.)
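A rough sketch of the CharSequence idea, under one loud assumption: every byte is treated as a single character (ASCII or ISO-8859-1). Real UTF-8 is variable-width, so charAt(int) would need an index of character offsets, which this sketch does not attempt. The class name ByteBufferCharSequence is made up for the example:

```java
import java.nio.ByteBuffer;

// Hypothetical class; ASSUMES one byte per character (not general UTF-8).
final class ByteBufferCharSequence implements CharSequence {
    private final ByteBuffer buf;
    private final int start;
    private final int end;

    // Views the bytes between the buffer's current position and limit.
    ByteBufferCharSequence(ByteBuffer buf) {
        this(buf.duplicate(), buf.position(), buf.limit());
    }

    private ByteBufferCharSequence(ByteBuffer buf, int start, int end) {
        this.buf = buf;
        this.start = start;
        this.end = end;
    }

    @Override
    public int length() {
        return end - start; // still an int, as the interface requires
    }

    @Override
    public char charAt(int index) {
        if (index < 0 || index >= length()) {
            throw new IndexOutOfBoundsException("index " + index);
        }
        // Absolute get: does not move the buffer's position.
        return (char) (buf.get(start + index) & 0xFF);
    }

    @Override
    public CharSequence subSequence(int from, int to) {
        if (from < 0 || to > length() || from > to) {
            throw new IndexOutOfBoundsException(from + ".." + to);
        }
        // Shares the underlying bytes; nothing is copied.
        return new ByteBufferCharSequence(buf, start + from, start + to);
    }

    @Override
    public String toString() {
        StringBuilder sb = new StringBuilder(length());
        for (int i = 0; i < length(); i++) {
            sb.append(charAt(i));
        }
        return sb.toString();
    }
}
```

Only toString() copies the characters onto the heap; the other methods read straight from the buffer, which could be a direct or memory-mapped one.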

Wilfred Springer