This is specifically aimed at parsing hex bytes, but there's a more general question here.

Suppose I have a regexp r, e.g. `\\s*([0-9A-Fa-f]{2})\\s*` (optional whitespace, the 2 hex digits that I'm interested in, then optional whitespace).

If I want to parse a string s with this regexp such that:

  • If s can be divided into a sequence of blocks that each match r, I want to do something for each block. (e.g. `ff 7c 0903 02BB aC` could be divided in this way.)

  • If s cannot be divided accordingly, I want to detect this. (e.g. `00 01 02 hi there ab ff`, `9 0 2 1 0`, `Y0 DEADBEEF`, and `cafe BABE!` all fail.)

how could I do this with Java's regexp facilities?

+3  A: 

I believe this is a use case for java.util.Scanner. You could use either next(String) or next(Pattern) to discover whether the next token matched your regex.

I don't have a compiler handy, but I think it would go something like this:

Scanner myScanner = new Scanner(mySource);
// default delimiter is any whitespace, so you don't need to call useDelimiter()
Pattern myPattern = Pattern.compile("\\s*([0-9A-Fa-f]{2})\\s*");
// next(Pattern) throws instead of returning null, so test with hasNext(Pattern) first
while (myScanner.hasNext(myPattern)) {
    String s = myScanner.next(myPattern);
    // do something with the token
}
Michael Myers
interesting, ok, how can I make sure there's no non-matching input before/after/between tokens?
Jason S
Hmm... it's been a while, but I think you'd have to try `hasNext()` and `skip()`.
Michael Myers
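To make the `hasNext()` suggestion concrete, here is a minimal sketch of a Scanner loop that also rejects stray input between tokens. The class name `ScannerHexParser` and the token pattern are assumptions made for illustration, not part of the answer above. Because Scanner splits on whitespace, a run such as `0903` arrives as a single token, so the pattern accepts one or more hex pairs per token and the loop then cuts the token into pairs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;
import java.util.regex.Pattern;

// Hypothetical helper, named here only for illustration.
class ScannerHexParser {
    // A whitespace-delimited token made of one or more complete hex pairs.
    private static final Pattern TOKEN = Pattern.compile("(?:[0-9A-Fa-f]{2})+");

    /** Returns the hex bytes in s, or throws if any token is not made of hex pairs. */
    static List<String> parse(String s) {
        List<String> bytes = new ArrayList<>();
        try (Scanner scanner = new Scanner(s)) {
            // hasNext() is true while any token remains, matching or not.
            while (scanner.hasNext()) {
                // hasNext(Pattern) is true only if the whole next token matches.
                if (!scanner.hasNext(TOKEN)) {
                    throw new IllegalArgumentException("Not a hex block: " + scanner.next());
                }
                String token = scanner.next(TOKEN);
                // Split a run such as "0903" into "09" and "03".
                for (int i = 0; i < token.length(); i += 2) {
                    bytes.add(token.substring(i, i + 2));
                }
            }
        }
        return bytes;
    }

    public static void main(String[] args) {
        System.out.println(parse("ff 7c 0903 02BB aC"));
    }
}
```

With this sketch, `skip()` is not needed because the default delimiter already consumes the whitespace. Note that an empty or all-whitespace string yields an empty list rather than an error; whether that should count as valid is a judgment call the question leaves open.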
+2  A: 

Another option would be to use the regex matcher stuff and the lookingAt() method.

Something like:

Pattern p = Pattern.compile( "\\s*([0-9A-Fa-f]{2})" );
Matcher m = p.matcher( myString );
int lastEnd = 0;
while( m.lookingAt() ) {
    System.out.println( "Hex part:" + m.group(1) );
    lastEnd = m.end();
}   
if( lastEnd < myString.length() ) {
    System.err.println( "Encountered non-hex value at index:" + lastEnd );
}

...or whatever. lookingAt() has to start at the current position and so the matches must all be contiguous. The only error condition to catch is finishing early since that means non-hex-formatted data was encountered.

PSpeed
neat! I ended up doing this approach manually (checking the previous end() vs. the current start()), didn't know about lookingAt().
Jason S
That's not right. `lookingAt()` only matches at the beginning of the Matcher's region, which is the beginning of the string by default. You *could* make this approach work by constantly changing the starting bound of the region, but it's much easier just to prepend `\G` to the regex and use `find()`. As it is, your code just keeps matching the first two hex digits in an infinite loop (if it matches anything, that is).
Alan Moore
He's right. The code I wrote that used lookingAt() for a similar purpose was also chopping the string up each time... which is another option. `myString = myString.substring(lastEnd)` is nearly free. I forgot to include it.
PSpeed
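The `\G` plus `find()` variant suggested in the comments could look roughly like the sketch below. The class name `ContiguousHexParser` is an assumption for illustration. `\G` anchors each match at the end of the previous one (or at the start of the input for the first match), so `find()` cannot skip over non-matching text, and comparing the last `end()` with the string length detects an early stop.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, named here only for illustration.
class ContiguousHexParser {
    // \G forces every match to begin exactly where the previous match ended.
    private static final Pattern BLOCK =
            Pattern.compile("\\G\\s*([0-9A-Fa-f]{2})\\s*");

    /** Returns the hex bytes in s, or throws at the first position that does not fit. */
    static List<String> parse(String s) {
        List<String> bytes = new ArrayList<>();
        Matcher m = BLOCK.matcher(s);
        int lastEnd = 0;
        while (m.find()) {
            bytes.add(m.group(1));
            lastEnd = m.end();
        }
        // Stopping before the end of the string means something did not match.
        if (lastEnd < s.length()) {
            throw new IllegalArgumentException("Non-hex input at index " + lastEnd);
        }
        return bytes;
    }

    public static void main(String[] args) {
        System.out.println(parse("ff 7c 0903 02BB aC"));
    }
}
```

For `00 01 02 hi there ab ff` this sketch reports index 9, the position of the `h`. An empty string is accepted as zero bytes, while a whitespace-only string is rejected because no block ever consumes the whitespace; adjust that if a different policy is wanted.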
+2  A: 

You can check the complete input by adding anchors, or by using matches() instead of contains(). The regexp becomes:

^(\\s*([0-9A-Fa-f]{2}))+\\s*$

If this regexp matches, you can then proceed and iterate over the matches for:

\\s*([0-9A-Fa-f]{2})

to pick up the hex bytes.

rsp
wasn't planning on using 2 regexps, but this is certainly simple + straightforward.
Jason S
This is the best answer so far, but the other method you're thinking of is `Matcher#find()`; `contains()` is a String method that just does a literal text search.
Alan Moore
@Alan, thanks for your remark. I was referring to the Jakarta ORO methods `matches` and `contains`.
rsp
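For comparison, the two-regexp approach translated to `java.util.regex` might look like the following sketch. The class name `TwoPassHexParser` is an assumption for illustration. `Matcher#matches()` already requires the whole input to match, so the `^` and `$` anchors are left out, and `Matcher#find()` plays the role that `contains` has in Jakarta ORO.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, named here only for illustration.
class TwoPassHexParser {
    // Pass 1: the entire input must be hex pairs with optional whitespace around them.
    private static final Pattern WHOLE =
            Pattern.compile("(?:\\s*[0-9A-Fa-f]{2})+\\s*");
    // Pass 2: pick out each pair.
    private static final Pattern BLOCK =
            Pattern.compile("\\s*([0-9A-Fa-f]{2})");

    /** Validates s as a whole, then returns its hex bytes in order. */
    static List<String> parse(String s) {
        if (!WHOLE.matcher(s).matches()) {
            throw new IllegalArgumentException("Input is not a sequence of hex bytes");
        }
        List<String> bytes = new ArrayList<>();
        Matcher m = BLOCK.matcher(s);
        // Safe to use plain find() here: validation guarantees there is nothing to skip.
        while (m.find()) {
            bytes.add(m.group(1));
        }
        return bytes;
    }

    public static void main(String[] args) {
        System.out.println(parse("ff 7c 0903 02BB aC"));
    }
}
```

Because the validating pattern uses `+`, this version rejects an empty string. The trade-off is that a failure does not say where the bad input starts, and the string is scanned twice.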