I want to do something like a regular expression in Java, but on a byte array instead of a String.

For example, let's say I want to delete from the array all contiguous runs of 0's longer than 3 bytes:

byte a[] = {1,2,3,0,1,2,3,0,0,0,0,4};
byte r[] = magic(a);
System.out.println(Arrays.toString(r));

result

{1,2,3,0,1,2,3,4}

Is there something that can help me built-in (or is there a good third party tool), or do I need to work from scratch?

  • Strings are UTF-16, so converting back and forth isn't a good idea? At least it's a lot of wasted overhead ... right?
A: 

Java regex operates on CharSequences. You could wrap your data in a CharBuffer and run the regex on that. Note that a byte[] cannot be cast to char[], so the bytes would first have to be copied or decoded into chars.
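A minimal sketch of the CharBuffer idea, assuming the bytes are decoded as ISO-8859-1 so that each byte becomes exactly one char with the same value. The class and method names (CharBufferRegexDemo, stripZeroRuns) are made up for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
final class CharBufferRegexDemo {
    // Four or more consecutive NUL chars, i.e. runs of zero bytes longer than 3.
    private static final Pattern ZERO_RUN = Pattern.compile("\\x00{4,}");

    static byte[] stripZeroRuns(byte[] input) {
        // Assumption: ISO-8859-1 decoding, one char per byte, same numeric value.
        CharBuffer chars = StandardCharsets.ISO_8859_1.decode(ByteBuffer.wrap(input));
        // CharBuffer implements CharSequence, so it can be matched directly.
        String stripped = ZERO_RUN.matcher(chars).replaceAll("");
        return stripped.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        byte[] a = {1, 2, 3, 0, 1, 2, 3, 0, 0, 0, 0, 4};
        System.out.println(Arrays.toString(stripZeroRuns(a))); // [1, 2, 3, 0, 1, 2, 3, 4]
    }
}
```

The decode step still copies the data once, so this saves no memory compared with building a String. It only shows that the regex engine accepts any CharSequence.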

Amber
+1  A: 

I don't see how regex would be useful to do what you want. One thing you can do is use Run Length Encoding to encode that byte array, replace every occurrence of "30" (read: three 0's) with the empty string, and decode the final string. Wikipedia has a simple Java implementation of it.
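A rough sketch of the run-length idea, assuming the runs are kept as {value, count} pairs in a list rather than encoded into a string. This is not the Wikipedia implementation, and the names (RunLengthDemo, encode, decode, dropLongZeroRuns) are invented here.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper illustrating run-length encoding on a byte array.
final class RunLengthDemo {
    /** Encodes the input as {value, count} pairs, one pair per run of equal bytes. */
    static List<int[]> encode(byte[] input) {
        List<int[]> runs = new ArrayList<>();
        for (byte b : input) {
            int[] last = runs.isEmpty() ? null : runs.get(runs.size() - 1);
            if (last != null && last[0] == b) {
                last[1]++;                      // extend the current run
            } else {
                runs.add(new int[] {b, 1});     // start a new run
            }
        }
        return runs;
    }

    /** Expands {value, count} pairs back into a byte array. */
    static byte[] decode(List<int[]> runs) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int[] run : runs) {
            for (int k = 0; k < run[1]; k++) {
                out.write(run[0]);              // write(int) stores the low 8 bits
            }
        }
        return out.toByteArray();
    }

    /** Drops every run of zero bytes longer than maxKept. */
    static byte[] dropLongZeroRuns(byte[] input, int maxKept) {
        List<int[]> runs = encode(input);
        runs.removeIf(run -> run[0] == 0 && run[1] > maxKept);
        return decode(runs);
    }

    public static void main(String[] args) {
        byte[] a = {1, 2, 3, 0, 1, 2, 3, 0, 0, 0, 0, 4};
        System.out.println(Arrays.toString(dropLongZeroRuns(a, 3))); // [1, 2, 3, 0, 1, 2, 3, 4]
    }
}
```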

JG
I thought the 3 0s was just an example.
Vinay Sajip
+2  A: 

Regex is not the tool for the job. You will need to implement this from scratch instead.
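For the example in the question, a from-scratch version could look roughly like this. It is only a sketch: the class name ZeroRunFilter, the method name, and the maxKept parameter are assumptions made for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;

// Hypothetical helper: a single pass over the array, no regex involved.
final class ZeroRunFilter {
    /** Removes every run of zero bytes whose length exceeds maxKept. */
    static byte[] dropLongZeroRuns(byte[] input, int maxKept) {
        ByteArrayOutputStream out = new ByteArrayOutputStream(input.length);
        int i = 0;
        while (i < input.length) {
            if (input[i] != 0) {
                out.write(input[i]);            // non-zero bytes are copied as-is
                i++;
                continue;
            }
            // Find the end of the run of zeros that starts at i.
            int runEnd = i;
            while (runEnd < input.length && input[runEnd] == 0) {
                runEnd++;
            }
            int runLength = runEnd - i;
            if (runLength <= maxKept) {
                out.write(input, i, runLength); // short runs are kept
            }
            i = runEnd;                         // long runs are skipped entirely
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] a = {1, 2, 3, 0, 1, 2, 3, 0, 0, 0, 0, 4};
        System.out.println(Arrays.toString(dropLongZeroRuns(a, 3))); // [1, 2, 3, 0, 1, 2, 3, 4]
    }
}
```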

objects
+1  A: 

Although there's a reasonable ByteString library floating around, nobody that I've seen has implemented a general regexp library on them.

I recommend solving your problem directly rather than implementing a regexp library :)

If you do convert to string and back, you probably won't find any existing encoding that gives you a round trip for your 0 bytes. If that's the case, you'd have to write your own byte array <-> string converters; not worth the trouble.

wrang-wrang
+3  A: 
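// Requires: import java.nio.charset.StandardCharsets; import java.util.Arrays;
byte[] a = {1,2,3,0,1,2,3,0,0,0,0,4};
// Using the Charset constant avoids the checked UnsupportedEncodingException.
String s0 = new String(a, StandardCharsets.ISO_8859_1);
String s1 = s0.replaceAll("\\x00{4,}", "");
byte[] r = s1.getBytes(StandardCharsets.ISO_8859_1);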

System.out.println(Arrays.toString(r)); // [1, 2, 3, 0, 1, 2, 3, 4]

I used ISO-8859-1 (latin1) because, unlike any other encoding,

  • every byte in the range 0x00..0xFF maps to a valid character, and

  • each of those characters has the same numeric value as its latin1 encoding.

That means the string is the same length as the original byte array, you can match any byte by its numeric value with the \xFF construct, and you can convert the resulting string back to a byte array without losing information.
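The round-trip property described above can be checked with a small test. This is a sketch with made-up names (Latin1RoundTrip, allByteValues, roundTrips); UTF-8 is included only as a contrast, since it does not survive the same trip.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical check, not part of the original answer.
final class Latin1RoundTrip {
    /** Returns an array holding every byte value 0x00..0xFF once. */
    static byte[] allByteValues() {
        byte[] all = new byte[256];
        for (int i = 0; i < all.length; i++) {
            all[i] = (byte) i;
        }
        return all;
    }

    /** True if decoding and re-encoding with cs gives back the same bytes. */
    static boolean roundTrips(byte[] data, Charset cs) {
        String s = new String(data, cs);
        return Arrays.equals(data, s.getBytes(cs));
    }

    public static void main(String[] args) {
        byte[] all = allByteValues();
        System.out.println(roundTrips(all, StandardCharsets.ISO_8859_1)); // true
        // Bytes 0x80..0xFF on their own are malformed UTF-8 and get replaced.
        System.out.println(roundTrips(all, StandardCharsets.UTF_8));      // false
    }
}
```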

I wouldn't try to display the data while it's in string form: although all the characters are valid, many of them are not printable. Also, avoid manipulating the data while it's in string form; you might accidentally do some escape-sequence substitutions or another encoding conversion without realizing it. In fact, I wouldn't recommend doing this kind of thing at all, but that isn't what you asked. :)

Also, be aware that this technique won't necessarily work in other programming languages or regex flavors. You would have to test each one individually.

Alan Moore
A: 

I'd suggest converting the byte array into a String, performing the regex, and then converting it back. Here's a working example:

public void testRegex() throws Exception {
 byte a[] = { 1, 2, 3, 0, 1, 2, 3, 0, 0, 0, 0, 4 };
 String s = btoa(a);
 String t = s.replaceAll("\u0000{4,}", "");
 byte b[] = atob(t);
 System.out.println(Arrays.toString(b));
}

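// Maps each char back to a byte; assumes every char is in the range 0x00..0xFF.
private byte[] atob(String t) {
 char[] array = t.toCharArray();
 byte[] b = new byte[array.length];
 for (int i = 0; i < array.length; i++) {
  b[i] = (byte) array[i];
 }
 return b;
}

// Maps each byte to the char with the same unsigned value (0x00..0xFF).
private String btoa(byte[] a) {
 StringBuilder sb = new StringBuilder(a.length);
 for (byte b : a) {
  // Mask to avoid sign extension; Character.toChars(b) throws for negative bytes.
  sb.append((char) (b & 0xFF));
 }
 return sb.toString();
}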

For more complicated transformations, I'd suggest using a Lexer. Both JavaCC and ANTLR have support for parsing/transforming binary files.

brianegge
+1  A: 

I would suggest you just implement a CharSequence wrapper on a byte array. Something like this (I just wrote this directly in, not compiled... but you get the idea).

public class ByteChars 
implements CharSequence

...

ByteChars(byte[] arr) {
    this(arr,0,arr.length);
    }

ByteChars(byte[] arr, int str, int end) {
    //check str and end are within range here
    strOfs=str;
    endOfs=end;
    bytes=arr;
    }

public char charAt(int idx) { 
    //check idx is within range here
    return (char)(bytes[strOfs+idx]&0xFF); 
    }

public int length() { 
    return (endOfs-strOfs); 
    }

public CharSequence subSequence(int str, int end) { 
    //check str and end are within range here
    return new ByteChars(bytes,strOfs+str,strOfs+end); 
    }

public String toString() { 
    return new String(bytes,strOfs,(endOfs-strOfs),java.nio.charset.StandardCharsets.ISO_8859_1);
    }

And, of course, you could easily make a reusable mutable variant so as not to have to create a new object for every byte array.
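For reference, a filled-in and compilable take on the wrapper above might look like the following. It is a sketch of the immutable variant only. The class name ByteCharsSketch, the field names, and the stripZeroRuns helper are my own assumptions, and the pattern is the one from the question's example.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.regex.Pattern;

// Hypothetical completion of the CharSequence-over-bytes idea.
final class ByteCharsSketch implements CharSequence {
    private final byte[] bytes;
    private final int start; // inclusive offset into bytes
    private final int end;   // exclusive offset into bytes

    ByteCharsSketch(byte[] bytes) {
        this(bytes, 0, bytes.length);
    }

    ByteCharsSketch(byte[] bytes, int start, int end) {
        if (start < 0 || end > bytes.length || start > end) {
            throw new IndexOutOfBoundsException();
        }
        this.bytes = bytes;
        this.start = start;
        this.end = end;
    }

    @Override
    public char charAt(int index) {
        if (index < 0 || index >= length()) {
            throw new IndexOutOfBoundsException();
        }
        return (char) (bytes[start + index] & 0xFF); // unsigned byte value as a char
    }

    @Override
    public int length() {
        return end - start;
    }

    @Override
    public CharSequence subSequence(int from, int to) {
        if (from < 0 || to > length() || from > to) {
            throw new IndexOutOfBoundsException();
        }
        return new ByteCharsSketch(bytes, start + from, start + to); // shares the array
    }

    @Override
    public String toString() {
        return new String(bytes, start, end - start, StandardCharsets.ISO_8859_1);
    }

    /** Runs the regex directly on the wrapped bytes, without building a String first. */
    static byte[] stripZeroRuns(byte[] input) {
        String stripped = Pattern.compile("\\x00{4,}")
                .matcher(new ByteCharsSketch(input))
                .replaceAll("");
        return stripped.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        byte[] a = {1, 2, 3, 0, 1, 2, 3, 0, 0, 0, 0, 4};
        System.out.println(Arrays.toString(stripZeroRuns(a))); // [1, 2, 3, 0, 1, 2, 3, 4]
    }
}
```

Matching reads the bytes in place through charAt, but replaceAll still builds a String for the result, so the saving is on the input side only.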

Software Monkey