Hi, I'm trying to match the following using a regular expression in Java. I have some data separated by the two characters 'ZZ': each record starts with 'ZZ' and finishes with 'ZZ'. I want to match a record that has no ending 'ZZ'. For example, I want to match the trailing 'ZZanychars' below (note: the *'s are not part of the string; they just mark the bit I want to match).

ZZanycharsZZZZanycharsZZ**ZZanychars**

But I don't want the following to match because the record has ended:

ZZanycharsZZZZanycharsZZZZanycharsZZ

EDIT: To clarify things - here are the 2 testcases I am using:

// This should match and in one of the groups should be 'ZZthree'
String testString1 = "ZZoneZZZZtwoZZZZthree";

// This should not match
String testString2 = "ZZoneZZZZtwoZZZZthreeZZ";

EDIT: Adding a third test:

// This should match and in one of the groups should be 'threeZee'
String testString3 = "ZZoneZZZZtwoZZZZthreeZee";
A: 

Hi

It depends a bit on what your regex engine is, but something like this:

^ZZ.*[^Z][^Z]$

works (at least for simple cases) in Emacs. ^ anchors the regex to the start of a line/record, and $ to the end.

Regards

Mark

High Performance Mark
+1  A: 

I'd suggest something like...

/ZZ(.*?)(ZZ|$)/

This will match:

  1. ZZ — the literal string
  2. (.*?) — anychars
  3. (ZZ|$) — either another ZZ literal, or the end of the string
VoteyDisciple
I think he wants specifically NOT to match the ZZ literal at the end of the record.
Platinum Azure
@Platinum Azure: I only want to match a trailing record with no ZZ at the end.
DLauer
+3  A: 

(Edited after the post of the 3rd example)

Try:

(?!ZZZ)ZZ((?!ZZ).)++$

Demo:

import java.util.regex.*;

public class Main {
    public static void main(String[] args) {
        String[] tests = {
            "ZZoneZZZZtwoZZZZthree",
            "ZZoneZZZZtwoZZZZthreeZZ",
            "ZZoneZZZZtwoZZZZthreeZee"
        };
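        // The possessive ++ never gives characters back, so a failed attempt fails fast.
        Pattern p = Pattern.compile("(?!ZZZ)ZZ((?!ZZ).)++$");
        for (String tst : tests) {
            Matcher m = p.matcher(tst);
            // Print the trailing unterminated record, or "no!" when every record is closed.
            System.out.println(tst + " -> " + (m.find() ? m.group() : "no!"));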
        }
    }
}
Bart Kiers
This is close and a good solution but it brings out 3 Z's.
skyfoot
Based on the latest version of the spec, I think you should go back to the `(?!ZZ)` version. It's okay to match three Z's at the beginning, **if** they're preceded by two Z's or the beginning of the string: `(?<=[^Z]ZZ|^)`
Alan Moore
+1  A: 
^ZZ.*(?<!ZZ)$


Assert position at the beginning of the string «^»
Match the characters “ZZ” literally «ZZ»
Match any single character that is not a line break character «.*»
   Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
Assert that it is impossible to match the regex below with the match ending at this position (negative lookbehind) «(?<!ZZ)»
   Match the characters “ZZ” literally «ZZ»
Assert position at the end of the string (or before the line break at the end of the string, if any) «$»


Created with RegexBuddy
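
For anyone who wants to try the pattern above from Java, here is a minimal sketch run against the three test strings from the question. The class name `LookbehindDemo` and the helper `match` are made up for illustration. Note that because of the leading `^` and the greedy `.*`, a successful match covers the whole input rather than only the trailing record.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookbehindDemo {
    // Pattern from the answer: starts with ZZ and does not end with ZZ.
    static final Pattern UNTERMINATED = Pattern.compile("^ZZ.*(?<!ZZ)$");

    // Hypothetical helper: returns the matched text, or null if there is no match.
    static String match(String input) {
        Matcher m = UNTERMINATED.matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String[] tests = {
            "ZZoneZZZZtwoZZZZthree",    // matches (whole string)
            "ZZoneZZZZtwoZZZZthreeZZ",  // no match: ends with ZZ
            "ZZoneZZZZtwoZZZZthreeZee"  // matches (whole string)
        };
        for (String t : tests) {
            System.out.println(t + " -> " + match(t));
        }
    }
}
```

So this works as a yes/no check for "the last record is unterminated", but it does not isolate 'ZZthree' on its own.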
crono
+1 for use of lookbehind to avoid matching "Z" as terminal string.
Alex Feinman
RegexBuddy is pretty handy
pjp
A: 

There's one tricky part to this: The ZZ being both the start token and the end token.

There's one start case (ZZ, not followed by another ZZ which would signify that the first ZZ was actually an end token), and two end cases (ZZ end of string, ZZ followed by ZZ). The goal is to match the start case and NOT either of the end cases.

To that end, here's what I suggest:

/ZZ(?!ZZ)(.*?)(ZZ(?!(ZZ|$))|$)/

For string ZZfooZZZZbarZZbazZZ:

  • This will NOT match ZZfooZZ, a legitimate record: ZZ, not followed by ZZ, followed by any combination of characters (here "foo"), followed by ZZ, but that ZZ is followed by ZZ, which opens the next record.
  • The next part examined is the ZZ after foo. This fails because the ZZ cannot be followed by another ZZ, yet in this case it is. This is as we want because the ZZ right after foo does not start a new record anyway.
  • The ZZ right before bar is not followed by another ZZ, so it's a legitimate start of record. "bar" is consumed by the .*?. Then there is a ZZ, but it is NOT followed by another ZZ or the end of string, which means that the ZZbar token is no good.
    • (It COULD be interpreted by a human as ZZbarZZ with bazZZ not being valid, but in either case there's something wrong, so I just wrote the regex to consider the wrongly-formatted record to occur here)
    • So ZZbar will be caught/matched by the regex, as illegitimate.
  • The ZZ after the bar isn't followed by ZZ, is followed by baz, followed by a ZZ that fails the lookahead assertion stating it can't be followed by the end of the string. So ZZbazZZ is a legitimate record and is not captured in the regex.

One more case: For ZZfoo, the beginning ZZ is okay, the foo is captured, then the regex notes that it's the end of the string, and no ZZ has occurred. Thus, ZZfoo is captured as an illegitimate match.

Let me know if this doesn't make sense, so I can make it clearer.

Platinum Azure
A: 

How about removing every match of ZZallcharsZZ? Whatever you have left is what you want.

ZZ.*?ZZ
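
A rough sketch of that idea in Java, using `String.replaceAll` with the pattern above. The class name `StripCompleteRecords` and the method `remainder` are only illustrative, and it assumes the data is on a single line (`.` does not match line breaks by default). Data that itself begins or ends with a Z may shift where the records get cut, so treat this as a starting point only.

```java
class StripCompleteRecords {
    // Hypothetical helper: delete every complete ZZ...ZZ record and
    // return whatever is left over (empty if all records were closed).
    static String remainder(String input) {
        return input.replaceAll("ZZ.*?ZZ", "");
    }

    public static void main(String[] args) {
        System.out.println(remainder("ZZoneZZZZtwoZZZZthree"));    // ZZthree
        System.out.println(remainder("ZZoneZZZZtwoZZZZthreeZZ"));  // (empty line)
        System.out.println(remainder("ZZoneZZZZtwoZZZZthreeZee")); // ZZthreeZee
    }
}
```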
skyfoot
+2  A: 

To match only the final, unterminated record:

(?<=[^Z]ZZ|^)ZZ(?:(?!ZZ).)++$

The starting delimiter is two Z's, but there can be a third Z that's considered part of the data. The lookbehind ensures that you don't match a Z that's part of the previous record's ending delimiter (since an ending delimiter cannot be preceded by a non-delimiter Z). However, this assumes there will never be empty records (or records containing only a single Z), which could lead to eight or more Z's in a row:

ZZabcZZZZdefZZZZZZZZxyz

If that were possible, I would forget about trying to match the final record by itself, and instead match all of them from the beginning:

(?:ZZ(?:(?!ZZ).)*+ZZ)*+(ZZ(?:(?!ZZ).)++$)

The final, unterminated record is now captured in group #1.
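
Here is a small sketch of how group #1 could be read out in Java. The class name `FinalRecord` and the method `lastRecord` are invented for the example, and using `matches()` (which requires the whole input to match) instead of `find()` is my own choice here, on the assumption that the string contains nothing but records.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FinalRecord {
    // Consume all complete records possessively, then capture the unterminated one.
    static final Pattern RECORDS =
        Pattern.compile("(?:ZZ(?:(?!ZZ).)*+ZZ)*+(ZZ(?:(?!ZZ).)++$)");

    // Hypothetical helper: returns the trailing unterminated record, or null if none.
    static String lastRecord(String input) {
        Matcher m = RECORDS.matcher(input);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(lastRecord("ZZoneZZZZtwoZZZZthree"));    // ZZthree
        System.out.println(lastRecord("ZZoneZZZZtwoZZZZthreeZZ"));  // null
        System.out.println(lastRecord("ZZoneZZZZtwoZZZZthreeZee")); // ZZthreeZee
        System.out.println(lastRecord("ZZabcZZZZdefZZZZZZZZxyz"));  // ZZxyz
    }
}
```

The last call is the eight-Z case from above: the empty record in the middle is consumed as ZZZZ, and the final record still comes out cleanly.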

Alan Moore
This solution is the one I'm now using. That's some regex magic!
DLauer
"Magic" is a good word for regexes: capable of wondrous things, but temperamental, and never fully understood. :)
Alan Moore