prehistory: http://stackoverflow.com/questions/3262013/java-regular-expression-for-binary-string

I can extract the substring containing the binary data I need, but when I use

    String s = matcher.group(1);

the data seems to be corrupted. To be exact, only the chars from the extended ASCII range, probably 128 to 255, are corrupted; all other chars are left untouched. What I basically mean is that I need to transform this string s into a byte array, but this:

    String s2 = new String(s.getBytes(), "US-ASCII");

or this

    String s2 = new String(s.getBytes(), "ISO-8859-1");

and later,

    fileOutputStream.write(s2.getBytes());

replaces every char from the extended ASCII range with "?", while others such as \0 or 'A' are written uncorrupted.

How can I interpret a String as plain binary symbols in the range [0-255]?

PS: I solved it. One should use

    String encoding = "ISO-8859-1";

consistently in both directions, that is, in new String(bytes, encoding) as well as in s.getBytes(encoding), and then everything works perfectly.
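A minimal sketch of why this works, assuming the binary data is available as a byte array: ISO-8859-1 maps every byte value 0..255 to the char with the same code, so decoding and re-encoding with it is lossless and the regex can run on the intermediate String. The class name, the helper names, and the 0x02/0x03 frame markers are made up for illustration and are not from the original question.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class Latin1RoundTrip {
    // Decode raw bytes so that each byte 0..255 becomes the char with the same code.
    static String toLatin1String(byte[] data) {
        return new String(data, StandardCharsets.ISO_8859_1);
    }

    // Encode back: each char 0..255 becomes the byte with the same value.
    static byte[] toBytes(String s) {
        return s.getBytes(StandardCharsets.ISO_8859_1);
    }

    // Hypothetical framing: the payload sits between the marker bytes 0x02 and 0x03.
    // DOTALL lets '.' match line terminators, which binary data may contain.
    static byte[] extractPayload(byte[] data) {
        Matcher m = Pattern.compile("\\x02(.*?)\\x03", Pattern.DOTALL)
                           .matcher(toLatin1String(data));
        return m.find() ? toBytes(m.group(1)) : null;
    }

    public static void main(String[] args) {
        // All 256 byte values survive the round trip.
        byte[] all = new byte[256];
        for (int i = 0; i < 256; i++) all[i] = (byte) i;
        System.out.println(Arrays.equals(all, toBytes(toLatin1String(all)))); // true

        byte[] framed = {0x41, 0x02, (byte) 0x80, 0x00, (byte) 0xFF, 0x0A, 0x03, 0x42};
        System.out.println(Arrays.toString(extractPayload(framed))); // [-128, 0, -1, 10]
    }
}
```

The important detail is that the same charset is named on both sides; calling getBytes() without an argument falls back to the platform default, which is where the "?" substitutions can creep in.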

+1  A: 

Java only knows general Unicode Strings. Whenever you care about the underlying byte values of letters, you are dealing with bytes and should be using byte arrays. You can only convert a Java String to a byte array with a specific encoding (it may be an implicit default argument, but it's always there). You CANNOT use the String data type and expect your particular encoding to be preserved; you really must specify it each and every time you read data from outside Java or export it elsewhere (such as text field inputs or the file system).
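To illustrate the point about the encoding always being there, here is a small sketch showing that the same String yields different bytes depending on the charset passed to getBytes. The class and method names are invented for this example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class EncodingMatters {
    // Thin wrapper so the charset is always stated explicitly.
    static byte[] encode(String s, Charset charset) {
        return s.getBytes(charset);
    }

    public static void main(String[] args) {
        String s = "A\u00E9"; // 'A' followed by U+00E9 (e with acute accent)

        // US-ASCII cannot represent U+00E9, so it is replaced with '?' (63).
        System.out.println(Arrays.toString(encode(s, StandardCharsets.US_ASCII)));   // [65, 63]
        // ISO-8859-1 maps U+00E9 to the single byte 0xE9.
        System.out.println(Arrays.toString(encode(s, StandardCharsets.ISO_8859_1))); // [65, -23]
        // UTF-8 needs two bytes, 0xC3 0xA9, for the same char.
        System.out.println(Arrays.toString(encode(s, StandardCharsets.UTF_8)));      // [65, -61, -87]
    }
}
```

The first line of output matches the symptom described in the question: chars above 127 silently turn into "?" when the charset in use cannot represent them.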

Using byte arrays means that you cannot use Java's built-in support for regular expressions directly. That's kind of a pain, but as you have seen, it wouldn't give correct results anyway, and that's not an accident - it CANNOT work correctly for what you want to do. You really must use something else to manipulate byte streams, because Strings are encoding-agnostic, and always will be.
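As one possible "something else", a fixed byte sequence can be located directly in a byte array without going through String at all. This is only a naive sketch (the class and method are hypothetical, and it handles literal sequences, not patterns):

```java
class ByteSearch {
    // Returns the index of the first occurrence of needle in haystack
    // at or after position from, or -1 if there is none.
    static int indexOf(byte[] haystack, byte[] needle, int from) {
        outer:
        for (int i = Math.max(from, 0); i <= haystack.length - needle.length; i++) {
            for (int j = 0; j < needle.length; j++) {
                if (haystack[i + j] != needle[j]) {
                    continue outer; // mismatch, try the next start position
                }
            }
            return i;
        }
        return -1;
    }

    public static void main(String[] args) {
        byte[] data = {0x00, (byte) 0xFF, (byte) 0x80, 0x41, (byte) 0xFF, (byte) 0x80};
        byte[] marker = {(byte) 0xFF, (byte) 0x80};
        System.out.println(indexOf(data, marker, 0)); // 1
        System.out.println(indexOf(data, marker, 2)); // 4
    }
}
```

Because no charset is involved, byte values above 127 cannot be altered along the way.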

Kilian Foth
+1  A: 

What I basically mean is that I need to transform this string s into a byte array

Answering this directly:

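ByteBuffer buffer = Charset.forName("utf-8").encode(CharBuffer.wrap(s));
byte[] array = Arrays.copyOf(buffer.array(), buffer.limit()); // array() alone may include unused trailing bytes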

Edit:
String also has a helper method that does the same thing as above with a bit less code:

byte[] array = s.getBytes(Charset.forName("utf-8"));
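One caveat with the ByteBuffer route is that array() returns the whole backing array, which can be longer than the encoded data, so only the bytes up to the buffer's limit should be used. A small sketch comparing the two routes, with illustrative class and method names:

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class EncodeToBytes {
    // Copy only the bytes between position and limit out of the buffer.
    static byte[] viaByteBuffer(String s) {
        ByteBuffer buf = StandardCharsets.UTF_8.encode(s);
        byte[] out = new byte[buf.remaining()];
        buf.get(out);
        return out;
    }

    // String.getBytes(Charset) already returns an array of exactly the right size.
    static byte[] viaGetBytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String s = "binary \u00FF data";
        System.out.println(Arrays.equals(viaByteBuffer(s), viaGetBytes(s))); // true
    }
}
```

Swapping StandardCharsets.UTF_8 for StandardCharsets.ISO_8859_1 in both methods gives one byte per char for chars in the range 0..255.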
Gunslinger47