ansaurus

Question

What Java regular expression do I need to match this text?

Answer 1

A:

Hi

It depends a bit on your regex engine, but something like this:

^ZZ.*[^Z][^Z]$

works (at least for simple cases) in Emacs. ^ anchors the regex to the start of a line/record, and $ anchors it to the end.
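
Since the question asks about Java, here is a minimal sketch of how this pattern behaves under java.util.regex. The class name and the sample strings are illustrative assumptions, not part of the original question.

```java
import java.util.regex.Pattern;

class AnchoredNoTrailingZ {
    // Pattern from the answer, compiled with Java's default flags.
    static final Pattern P = Pattern.compile("^ZZ.*[^Z][^Z]$");

    // True if the input starts with ZZ and its last two characters are both non-Z.
    static boolean accepts(String s) {
        return P.matcher(s).find();
    }

    public static void main(String[] args) {
        // Hypothetical sample inputs.
        String[] samples = {"ZZfoo", "ZZfooZZ", "ZZfooZ", "ZZoneZZZZtwoZZZZthree"};
        for (String s : samples) {
            System.out.println(s + " -> " + accepts(s));
        }
    }
}
```

Two things to note: the pattern also rejects data that ends in a single Z (ZZfooZ), and because of the ^ anchor it matches the whole input from the first ZZ rather than only the final record.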

Regards

Mark

High Performance Mark 2009-08-14 14:23:20

Answer 2

+1 A:

I'd suggest something like...

/ZZ(.*?)(ZZ|$)/

This will match:

ZZ: the literal string
(.*?): any characters, as few as possible (lazy)
(ZZ|$): either another ZZ literal, or the end of the string
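
A rough Java sketch of this idea follows. The surrounding slashes are delimiters from other languages and are dropped in Java. The helper name `unterminated` and the check on group 2 are my own additions (assumptions): since the pattern matches every record, terminated or not, the sketch keeps only the matches that were closed by the end of the input.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LazyRecordScan {
    static final Pattern P = Pattern.compile("ZZ(.*?)(ZZ|$)");

    // Collects the payload (group 1) of every match whose closing group
    // matched the end of input (empty group 2) rather than a literal ZZ.
    static List<String> unterminated(String s) {
        List<String> out = new ArrayList<>();
        Matcher m = P.matcher(s);
        while (m.find()) {
            if (m.group(2).isEmpty()) {
                out.add(m.group(1));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(unterminated("ZZoneZZZZtwoZZZZthree"));   // [three]
        System.out.println(unterminated("ZZoneZZZZtwoZZZZthreeZZ")); // []
    }
}
```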

VoteyDisciple 2009-08-14 14:24:08

I think he wants specifically NOT to match the ZZ literal at the end of the record.

Platinum Azure 2009-08-14 14:26:26

@Platinum Azure: I only want to match a trailing record with no ZZ at the end.

DLauer 2009-08-14 14:33:23

Answer 3

+3 A:

(Edited after the post of the 3rd example)

Try:

(?!ZZZ)ZZ((?!ZZ).)++$

Demo:

import java.util.regex.*;

public class Main {
    public static void main(String[] args) {
        String[] tests = {
            "ZZoneZZZZtwoZZZZthree",
            "ZZoneZZZZtwoZZZZthreeZZ",
            "ZZoneZZZZtwoZZZZthreeZee"
        };
        Pattern p = Pattern.compile("(?!ZZZ)ZZ((?!ZZ).)++$");
        for(String tst : tests) {
            Matcher m = p.matcher(tst);
            System.out.println(tst+" -> "+(m.find() ? m.group() : "no!"));
        }
    }
}

Bart Kiers 2009-08-14 14:25:59

This is close and a good solution but it brings out 3 Z's.

skyfoot 2009-08-14 14:58:34

Based on the latest version of the spec, I think you should go back to the `(?!ZZ)` version. It's okay to match three Z's at the beginning, **if** they're preceded by two Z's or the beginning of the string: `(?<=[^Z]ZZ|^)`

Alan Moore 2009-08-14 16:29:13

Answer 4

+1 A:

^ZZ.*(?<!ZZ)$


Assert position at the beginning of the string «^»
Match the characters “ZZ” literally «ZZ»
Match any single character that is not a line break character «.*»
   Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
Assert that it is impossible to match the regex below with the match ending at this position (negative lookbehind) «(?<!ZZ)»
   Match the characters “ZZ” literally «ZZ»
Assert position at the end of the string (or before the line break at the end of the string, if any) «$»
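
To see the breakdown above in action, here is a small Java sketch. The class name and sample inputs are assumptions chosen for illustration.

```java
import java.util.regex.Pattern;

class LookbehindEnd {
    // Starts with ZZ, and the input does not end with ZZ (negative lookbehind).
    static final Pattern P = Pattern.compile("^ZZ.*(?<!ZZ)$");

    static boolean accepts(String s) {
        return P.matcher(s).find();
    }

    public static void main(String[] args) {
        // Hypothetical sample inputs.
        String[] samples = {"ZZthree", "ZZthreeZZ", "ZZthreeZ", "xZZthree"};
        for (String s : samples) {
            System.out.println(s + " -> " + accepts(s));
        }
    }
}
```

Unlike a character-class check on the last two characters, the lookbehind accepts data that ends in a single Z (ZZthreeZ). Like any pattern anchored with ^, it matches from the start of the input, not just the final record.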


Created with RegexBuddy

crono 2009-08-14 14:38:20

+1 for use of lookbehind to avoid matching "Z" as terminal string.

Alex Feinman 2009-08-14 15:35:29

RegexBuddy is pretty handy

pjp 2009-08-14 17:25:07

Answer 5

A:

There's one tricky part to this: The ZZ being both the start token and the end token.

There's one start case (ZZ, not followed by another ZZ which would signify that the first ZZ was actually an end token), and two end cases (ZZ end of string, ZZ followed by ZZ). The goal is to match the start case and NOT either of the end cases.

To that end, here's what I suggest:

/ZZ(?!ZZ)(.*?)(ZZ(?!(ZZ|$))|$)/

For string ZZfooZZZZbarZZbazZZ:

This will NOT match ZZfooZZ, a legitimate record: ZZ, not followed by ZZ, followed by any combination of characters (here "foo"), followed by ZZ, but that ZZ is followed by ZZ, which opens the next record.
The next part examined is the ZZ after foo. This fails because the ZZ cannot be followed by another ZZ, yet in this case it is. This is as we want because the ZZ right after foo does not start a new record anyway.
The ZZ right before bar is not followed by another ZZ, so it's a legitimate start of record. "bar" is consumed by the .*?. Then there is a ZZ, but it is NOT followed by another ZZ or the end of string, which means that the ZZbar token is no good.
- (It COULD be interpreted by a human as ZZbarZZ with bazZZ not being valid, but in either case there's something wrong, so I just wrote the regex to consider the wrongly-formatted record to occur here)
- So ZZbar will be caught/matched by the regex, as illegitimate.
The ZZ after the bar isn't followed by ZZ, is followed by baz, followed by a ZZ that fails the lookahead assertion stating it can't be followed by the end of the string. So ZZbazZZ is a legitimate record and is not captured in the regex.

One more case: For ZZfoo, the beginning ZZ is okay, the foo is captured, then the regex notes that it's the end of the string, and no ZZ has occurred. Thus, ZZfoo is captured as an illegitimate match.
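
The walkthrough is easiest to check by running the pattern. The sketch below (class and method names are assumptions) lists every match java.util.regex reports. One caveat worth knowing: `.*?` may itself consume Z characters, so the engine can find a match where the prose expects none. On the sample string the first reported match is ZZfooZZZ, because the lazy group absorbs one Z and the closing ZZ then overlaps the next record's opener. Treat the pattern as a starting point rather than a verified solution.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class StartEndTokenCheck {
    // Pattern from the answer, without the surrounding slashes.
    static final Pattern P = Pattern.compile("ZZ(?!ZZ)(.*?)(ZZ(?!(ZZ|$))|$)");

    // Returns the full text of each match, in order of appearance.
    static List<String> matches(String s) {
        List<String> out = new ArrayList<>();
        Matcher m = P.matcher(s);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(matches("ZZfoo"));               // [ZZfoo]
        System.out.println(matches("ZZfooZZZZbarZZbazZZ")); // first element: ZZfooZZZ
    }
}
```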

Let me know if this doesn't make sense, so I can make it more clear.

Platinum Azure 2009-08-14 14:47:42

Answer 6

A:

How about removing all matches of ZZallcharsZZ? What you have left is what you want.

ZZ.*?ZZ
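
In Java this could be sketched with String.replaceAll. The class and method names are assumptions, and the sketch assumes records contain no line breaks (the dot does not match line terminators by default) and no stray Z next to a delimiter.

```java
class StripTerminated {
    // Removes every ZZ...ZZ pair, scanning left to right; whatever is left
    // over is the unterminated tail, if there is one.
    static String leftover(String s) {
        return s.replaceAll("ZZ.*?ZZ", "");
    }

    public static void main(String[] args) {
        System.out.println(leftover("ZZoneZZZZtwoZZZZthree"));           // ZZthree
        System.out.println(leftover("ZZoneZZZZtwoZZZZthreeZZ").isEmpty()); // true
    }
}
```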

skyfoot 2009-08-14 14:52:59

Answer 7

+2 A:

To match only the final, unterminated record:

(?<=[^Z]ZZ|^)ZZ(?:(?!ZZ).)++$

The starting delimiter is two Z's, but there can be a third Z that's considered part of the data. The lookbehind ensures that you don't match a Z that's part of the previous record's ending delimiter (since an ending delimiter cannot be preceded by a non-delimiter Z). However, this assumes there will never be empty records (or records containing only a single Z), which could lead to eight or more Z's in a row:

ZZabcZZZZdefZZZZZZZZxyz

If that were possible, I would forget about trying to match the final record by itself, and instead match all of them from the beginning:

(?:ZZ(?:(?!ZZ).)*+ZZ)*+(ZZ(?:(?!ZZ).)++$)

The final, unterminated record is now captured in group #1.
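
Here is a sketch of both patterns side by side. The class and method names are assumptions, as is the use of lookingAt() for the second pattern: anchoring at the start of the input keeps the record boundaries counted from the first delimiter.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FinalRecord {
    // First pattern: finds the final record directly via a lookbehind.
    static final Pattern LOOKBEHIND =
        Pattern.compile("(?<=[^Z]ZZ|^)ZZ(?:(?!ZZ).)++$");
    // Second pattern: consumes all terminated records, then captures the last one.
    static final Pattern FULL_SCAN =
        Pattern.compile("(?:ZZ(?:(?!ZZ).)*+ZZ)*+(ZZ(?:(?!ZZ).)++$)");

    static Optional<String> lastByLookbehind(String s) {
        Matcher m = LOOKBEHIND.matcher(s);
        return m.find() ? Optional.of(m.group()) : Optional.empty();
    }

    static Optional<String> lastByFullScan(String s) {
        Matcher m = FULL_SCAN.matcher(s);
        // lookingAt() anchors the attempt at index 0 (an assumption of this sketch).
        return m.lookingAt() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        String plain = "ZZoneZZZZtwoZZZZthree";
        String withEmptyRecord = "ZZabcZZZZdefZZZZZZZZxyz";
        System.out.println(lastByLookbehind(plain));           // Optional[ZZthree]
        System.out.println(lastByLookbehind(withEmptyRecord)); // Optional.empty
        System.out.println(lastByFullScan(withEmptyRecord));   // Optional[ZZxyz]
    }
}
```

The second call shows the limitation described above: with an empty record in the input, the lookbehind version finds nothing, while the full scan still captures the final record.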

Alan Moore 2009-08-14 17:22:41

This solution is the one I'm now using. That's some regex magic!

DLauer 2009-08-14 17:39:39

"Magic" is a good word for regexes: capable of wondrous things, but temperamental, and never fully understood. :)

Alan Moore 2009-08-15 07:48:20
