ansaurus

Question

java: using regex to parse repeated substrings

Answer 1

+3 A:

I believe this is a use case for java.util.Scanner. You could use hasNext(Pattern) to discover whether the next token matches your regex, and next(Pattern) (or next(String)) to consume it.

I don't have a compiler handy, but I think it would go something like this:

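Scanner myScanner = new Scanner(mySource);
// default delimiter is any whitespace, so you don't need to call useDelimiter()
// and the pattern doesn't need to allow for surrounding whitespace
Pattern myPattern = Pattern.compile("[0-9A-Fa-f]{2}");
// next(Pattern) throws instead of returning null, so test with hasNext(Pattern) first
while (myScanner.hasNext(myPattern)) {
    String s = myScanner.next(myPattern);
    // do something with the token
}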

Michael Myers 2009-12-30 20:40:53

interesting, ok, how can I make sure there's no non-matching input before/after/between tokens?

Jason S 2009-12-30 20:55:29

Hmm... it's been a while, but I think you'd have to try `hasNext()` and `skip()`.

Michael Myers 2009-12-30 21:21:29
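
The `hasNext()` idea from the comment above might look roughly like the sketch below. It assumes the hex bytes are separated by whitespace, since Scanner only sees whitespace-delimited tokens; input such as "0a1B" would be rejected. The class name `HexScannerSketch` and the method `parse` are made up for illustration and are not from the thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;
import java.util.regex.Pattern;

// Hypothetical helper, not from the original answer.
class HexScannerSketch {
    // A complete token must be exactly two hex digits.
    private static final Pattern HEX_BYTE = Pattern.compile("[0-9A-Fa-f]{2}");

    // Returns every token, or throws as soon as a token is not a hex byte.
    static List<String> parse(String source) {
        List<String> tokens = new ArrayList<>();
        try (Scanner scanner = new Scanner(source)) {
            // hasNext() is true while any token remains, matching or not.
            while (scanner.hasNext()) {
                // hasNext(Pattern) is true only if the whole next token matches.
                if (!scanner.hasNext(HEX_BYTE)) {
                    throw new IllegalArgumentException("Not a hex byte: " + scanner.next());
                }
                tokens.add(scanner.next(HEX_BYTE));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(parse("0a 1B ff")); // prints [0a, 1B, ff]
    }
}
```

Checking `hasNext()` separately from `hasNext(Pattern)` is what distinguishes "input exhausted" from "next token is not hex", so stray tokens before, between, or after the hex bytes are reported rather than silently ending the loop.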

Answer 2

+2 A:

Another option would be to use the regex matcher stuff and the lookingAt() method.

Something like:

Pattern p = Pattern.compile( "\\s*([0-9A-Fa-f]{2})" );
Matcher m = p.matcher( myString );
int lastEnd = 0;
while( m.lookingAt() ) {
    System.out.println( "Hex part:" + m.group(1) );
    lastEnd = m.end();
}   
if( lastEnd < myString.length() ) {
    System.err.println( "Encountered non-hex value at index:" + lastEnd );
}

...or whatever. lookingAt() has to start at the current position and so the matches must all be contiguous. The only error condition to catch is finishing early since that means non-hex-formatted data was encountered.

PSpeed 2009-12-30 21:31:11

neat! I ended up doing this approach manually (checking the previous end() vs. the current start()), didn't know about lookingAt().

Jason S 2009-12-30 21:40:39

That's not right. `lookingAt()` only matches at the beginning of the Matcher's region, which is the beginning of the string by default. You *could* make this approach work by constantly changing the starting bound of the region, but it's much easier just to prepend `\G` to the regex and use `find()`. As it is, your code just keeps matching the first two hex digits in an infinite loop (if it matches anything, that is).

Alan Moore 2010-01-02 05:42:03

He's right. The code I've written that used lookingAt() for a similar purpose was also chopping the string up each time... which is another option. myString = myString.substring(lastEnd) is nearly free. I forgot to include it.

PSpeed 2010-01-02 10:52:24
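
For reference, the `\G` plus `find()` variant suggested in the comments could be sketched as follows. The class name `ContiguousHexParser` and the method `parse` are hypothetical, and tolerating trailing whitespace is an assumption of this sketch rather than something the answer specifies.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not from the original answer.
class ContiguousHexParser {
    // \G anchors each match at the end of the previous match
    // (or at the start of the input for the first match).
    private static final Pattern HEX_BYTE = Pattern.compile("\\G\\s*([0-9A-Fa-f]{2})");

    static List<String> parse(String input) {
        List<String> bytes = new ArrayList<>();
        Matcher m = HEX_BYTE.matcher(input);
        int lastEnd = 0;
        // find() advances through the input; \G keeps the matches contiguous.
        while (m.find()) {
            bytes.add(m.group(1));
            lastEnd = m.end();
        }
        // Assumption: trailing whitespace is acceptable, anything else is an error.
        if (!input.substring(lastEnd).trim().isEmpty()) {
            throw new IllegalArgumentException(
                    "Encountered non-hex value at or after index " + lastEnd);
        }
        return bytes;
    }

    public static void main(String[] args) {
        System.out.println(parse("0a1B ff")); // prints [0a, 1B, ff]
    }
}
```

Unlike `lookingAt()`, which always tests from the start of the matcher's region, `find()` moves forward after each match, so the loop terminates. Stopping early still signals bad input, exactly as in the original answer.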

Answer 3

+2 A:

You can check the complete input by adding anchors, or by using matches() instead of contains(). The regexp becomes:

^(\\s*([0-9A-Fa-f]{2}))+\\s*$

If this regexp matches, you can then proceed and iterate over the matches for:

\\s*([0-9A-Fa-f]{2})

to pick up the hex bytes.

rsp 2009-12-30 21:32:39

wasn't planning on using 2 regexps, but this is certainly simple + straightforward.

Jason S 2009-12-30 21:39:10

This is the best answer so far, but the other method you're thinking of is `Matcher#find()`; `contains()` is a String method that just does a literal text search.

Alan Moore 2010-01-02 05:48:32

@Alan, thanks for your remark, I was referring to the Jakarta ORO methods matches and contains.

rsp 2010-01-02 10:06:19
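
With java.util.regex rather than Jakarta ORO, the two-regexp approach from this answer might be sketched like this. The class name `ValidatedHexParser` and the method `parse` are hypothetical. Because `Matcher#matches()` already requires the whole input to match, the sketch leaves out the `^` and `$` anchors.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not from the original answer.
class ValidatedHexParser {
    // First regexp: the whole input must be hex bytes with optional whitespace.
    private static final Pattern WHOLE_INPUT =
            Pattern.compile("(\\s*[0-9A-Fa-f]{2})+\\s*");
    // Second regexp: picks out one hex byte at a time.
    private static final Pattern HEX_BYTE =
            Pattern.compile("\\s*([0-9A-Fa-f]{2})");

    static List<String> parse(String input) {
        // matches() succeeds only if the pattern covers the entire input.
        if (!WHOLE_INPUT.matcher(input).matches()) {
            throw new IllegalArgumentException("Input contains non-hex data");
        }
        List<String> bytes = new ArrayList<>();
        Matcher m = HEX_BYTE.matcher(input);
        // The input is already validated, so find() cannot skip over junk here.
        while (m.find()) {
            bytes.add(m.group(1));
        }
        return bytes;
    }

    public static void main(String[] args) {
        System.out.println(parse("0a 1B\tff ")); // prints [0a, 1B, ff]
    }
}
```

The cost of this design is scanning the input twice. In exchange, the extraction loop needs no position bookkeeping at all.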
