I've been able to stumble my way through regular expressions for quite some time, but alas, I cannot help a friend in need.

My "friend" is trying to match all lines in a text file that match the following criteria:

  1. Only a 7 to 10 digit number (0123456 or 0123456789)
  2. Only a 7 to 10 digit number, then a dash, then another two digits (0123456-01 or 0123456789-01)
  3. Match any of the above except where the word Code/code or Passcode/passcode appears before the number (such as "Access code: 16434629" or "Passcode 5253443-12")
  4. EDIT: Only need the numbers that match, nothing else.

Here is the nastiest regex I have ever seen that "he" gave me:

^(?=.*?[^=/%:]\b\d{7,10}((\d?\d?)|(-\d\d))?\b)((?!Passcode|passcode|Code|code).)*$

...

Question: Is there a way to use a short regex to find all lines that meet the above criteria?

Assume PCRE. My friend thanks you in advance. ;-)

BTW - I have not been able to find any other question on stackoverflow.com or superuser.com that answers this accurately.

EDIT: I'm using Kodos Python Regex Debugger to validate and test the regex.

+2  A: 
(?<!(?:[Pp]asscode|[Cc]ode).*)[0-9]{7,10}(?:-[0-9]{2})?

Commented version:

(?<!                 # Begin zero-width negative lookbehind (the match fails if the enclosed pattern matches just before this position)
(?:                  # Begin non-capturing group
[Pp]asscode          # Either Passcode or passcode
|                    # OR
[Cc]ode              # Either Code or code
)                    # End non-capturing group
.*                   # Any characters
)                    # End lookbehind
[0-9]{7,10}          # 7 to 10 digits
(?:                  # Begin non-capturing group
-[0-9]{2}            # Dash followed by 2 digits
)                    # End non-capturing group
?                    # Make last group optional
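
A note for anyone testing the pattern above in a Python-based tool such as Kodos: Python's `re` module only accepts fixed-width lookbehind, so it should refuse to compile a lookbehind that contains `.*`. Here is a minimal sketch to check that. The helper name `compiles` is made up for illustration and is not part of any library.

```python
import re

# The lookbehind version from this answer, unchanged.
LOOKBEHIND = r"(?<!(?:[Pp]asscode|[Cc]ode).*)[0-9]{7,10}(?:-[0-9]{2})?"


def compiles(pattern):
    """Return True if Python's re module accepts the pattern (hypothetical helper)."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        # Raised for invalid patterns, including variable-width lookbehind.
        return False


print(compiles(LOOKBEHIND))               # the .* makes the lookbehind variable-width
print(compiles(r"(?<!code)[0-9]{7,10}"))  # a fixed-width lookbehind is accepted
```

The first call should print `False` and the second `True`, which is in line with the "Variable length lookbehind" complaint in the comments.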

Edit: final version after comment discussion -

/^(?!\D*(?:[Pp]asscode|[Cc]ode))\D*([0-9]{7,10}(?:-[0-9]{2})?)/

(result in first capture buffer)
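
For readers testing in Python (the question mentions Kodos), here is a rough sketch of the final pattern written with `re.VERBOSE` and read through capture group 1. The names `FINAL` and `first_number` are assumptions made for this example, not part of the original answer.

```python
import re

# Same pattern as the final version above, spread out with re.VERBOSE.
FINAL = re.compile(r"""
    ^                                # start of the line
    (?!\D*(?:[Pp]asscode|[Cc]ode))   # fail if Passcode/Code appears before the first digit
    \D*                              # skip any leading non-digits
    (                                # capture group 1: the number itself
        [0-9]{7,10}                  # 7 to 10 digits
        (?:-[0-9]{2})?               # optional dash followed by 2 digits
    )
""", re.VERBOSE)


def first_number(line):
    """Return the captured number, or None if the line does not match (hypothetical helper)."""
    m = FINAL.match(line)
    return m.group(1) if m else None


for line in ["0123456", "ID 0123456789-01", "Access code: 16434629",
             "Passcode 5253443-12", "level1: 01234567"]:
    print(repr(line), "->", first_number(line))
```

The first two lines yield the number, and the two code/passcode lines yield `None`. The last line also yields `None`, because `\D*` cannot skip past the stray digit in `level1`. That is the limitation raised in the comments.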

Amber
Nicely done! Only thing I would add is `:?` after `(?:[Pp]asscode|[Cc]ode)`.
Matthew
Nice on the commented version. The `//x` modifier is *always* your friend (though I would condense it down a little - the "begin/end non-matching group"s seem a little excessive).
Anon.
@Dav: When I use your regex in Perl as `if(m{(?<!(?:[Pp]asscode|[Cc]ode).*)[0-9]{7,10}(?:-[0-9]{2})?})` I get: `Variable length lookbehind not implemented in regex`. Am I missing something?
codaddict
The "excessive commenting" is mostly just due to posting on SO. Not the kind of commenting I'd use in my own code. :) But I figure for SO, more information is better than less, since there's no assumption on what any particular reader might know.
Amber
Oh, bzabhi - you might need to modify the `.*` in the lookbehind; your definition of "comes before" for the passphrase bit was a bit vague.
Amber
Dude, your commenting is perfect for us wobbly regex wannabes! I love the delimiting breakdown of what each part does! Sadly though, this solution isn't working for me. :-(
Murdoch Ripper
Oh right, lookbehind with variable quantifiers is tricky. You want to re-write it with a lookahead instead: `/(?!\D*(?:[Pp]asscode|[Cc]ode))\D*[0-9]{7,10}(?:-[0-9]{2})?/`
Anon.
Erm, anchor that pattern at the start.
Anon.
The lookahead assertion works better; however, it now matches the numbers and everything BEFORE them as well. I just need to match the numbers. How can I place boundaries around the numbers?
Murdoch Ripper
Lookahead with word boundaries: `(?!\D*(?:[Pp]asscode|[Cc]ode))\b\D?[0-9]{7,10}(?:-[0-9]{2})?\b`
Murdoch Ripper
The question says you just need to match all lines that meet the requirements. If you need to extract the number itself with the same regex, use a capturing group around the parts you want: `/^(?!\D*(?:[Pp]asscode|[Cc]ode))\D*([0-9]{7,10}(?:-[0-9]{2})?)/` Then the numbers themselves will be in the first capture buffer.
Anon.
Just put a capture group around the numbers by adding parentheses around the part you want to match, and then look at the captured group text instead of the entire match text.
Amber
Beautimus! This works for me!!
Murdoch Ripper
Although, if the tool you're looking at doesn't allow you to look at capture buffers, you might have an issue there. How well-defined is the location of your passphrase text?
Amber
Great! Glad to hear things worked out.
Amber
@Murdoch Ripper: Can you tell us which solution works? The best I could find on this thread fails for 'level1: 01234567' (it doesn't match, but it should).
Mark Byers
The regex I posted will fail if there are any digits not part of the number preceding it. You could probably adjust it a little to use word boundaries and `.` instead of `\D`, which would solve this.
Anon.
@Anon: I still don't think that it would work in all cases. I think using a variable width lookahead is not a suitable approach. You risk looking ahead too far and giving a false negative in the special case I mentioned in the comments. I've already provided a solution that works below.
Mark Byers
+1  A: 

You can get by with a nasty regex you have to get help with ...

... or you can use two simple regexes. One that matches what you want, and one that filters what you don't want. Simpler and more readable.

Which one would you like to read?

$foo =~ /(?<!(?:[Pp]asscode|[Cc]ode).*)[0-9]{7,10}(?:-[0-9]{2})?/

or

$foo =~ /\d{7,10}(-\d{2})?/ and $foo !~ /(access |pass)code/i;

Edit: case-insensitivity.
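
The same two-step idea can be sketched in Python. This is only a rough illustration under some assumptions: the names `NUMBER`, `BLOCKED` and `extract_numbers` are invented here, the filter is `(?:pass)?code` rather than the Perl filter above, and "before the number" is taken to mean anywhere earlier on the same line.

```python
import re

# Step 1: what we want. A 7 to 10 digit number, optionally followed by -NN.
NUMBER = re.compile(r"\b\d{7,10}(?:-\d{2})?\b")
# Step 2: what we don't want. "code" or "passcode" in any letter case.
BLOCKED = re.compile(r"(?:pass)?code", re.IGNORECASE)


def extract_numbers(line):
    """Return the numbers on a line that are not preceded by code/passcode."""
    results = []
    for m in NUMBER.finditer(line):
        # Search only the text before this number (pos=0, endpos=m.start()).
        if BLOCKED.search(line, 0, m.start()):
            continue
        results.append(m.group(0))
    return results


print(extract_numbers("0123456789-01"))          # ['0123456789-01']
print(extract_numbers("level1: 01234567"))       # ['01234567']
print(extract_numbers("Access code: 16434629"))  # []
print(extract_numbers("Passcode 5253443-12"))    # []
```

Keeping the filter separate means each pattern stays short, and the word boundaries keep an 11-digit run from being reported as a 10-digit number.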

Alex Brasetvik
Thanks for the comment. I'm stuck with the nasty solution, I suppose. You're right that having two is better in this case, but it won't appease those higher beings called "shareholders": the software we use does not accept a separate "filter" regex. Do you have an example? I could give it a shot, but in testing much simpler cases thus far, it hasn't worked well, if at all.
Murdoch Ripper
Your example is what I was asking for. The term "it" = using two simple regexes.
Murdoch Ripper
The first version isn't PCRE and the second version doesn't do what he wants.
Mark Byers