tags:

views:

218

answers:

6

I'm trying to come up with a regular expression that will match a fraction (1/2) but not a date (5/5/2005) within a string. Any help at all would be great, all I've been able to come up with is (\d+)\/(\d+) which finds matches in both strings. Thanks in advance for the help.

A: 

if its line input you can try

^(\d+)\/(\d+)$

otherwise use this perhaps

^(\d+)\/(\d+)[^\\]*.
GrayWizardx
+6  A: 

Use negative lookahead and lookbehind.

/(?<![\/\d])(?:\d+)\/(?:\d+)(?![\/\d])/

EDIT: I've fixed my answer to trap for the backtracking bug identified by @j_random_hacker. As proof, I offer the following quick and dirty php script:

<?php
$subject = "The match should include 1/2 but not 12/34/56 but 11/23, now that's ok.";
$matches = array();
preg_match_all('/(?<![\/\d])(?:\d+)\/(?:\d+)(?![\/\d])/', $subject, $matches);
var_dump($matches);
?>

which outputs:

array(1) {
  [0]=>
  array(2) {
    [0]=>
    string(3) "1/2"
    [1]=>
    string(5) "11/23"
  }
}
Asaph
+1 for the 42 seconds you beat me by.
Chris Lutz
@Asaph: It turns out there's a tricky bug here caused by regex backtracking -- see Chris Lutz's answer. -1ing to get your attention, will reimburse you when it's fixed :)
j_random_hacker
@j_random_hacker: I don't believe my answer suffers from the bug you mentioned because my approach is different in a way that maybe you didn't notice. I've updated my answer. Please look at it and if you agree with my evidence, kindly remove your downvote. Eagerly awaiting your response...
Asaph
It does have the bug, but it is only revealed if the denominator has more than one digit. Consider the input text "12/34/56". The `\d+` for the denominator part will initially attempt to match "34", but the match later fails because the next character is "/" which violates the negative lookahead clause. What happens next is that that `\d+` part backtracks to match just "3", which now succeeds because the next character, "4", is not "/". End result: "12/3" is matched.
j_random_hacker
@j_random_hacker: Ah! I see it now. Just needed that "12/34/56" test case to make it jump out at me. I'll fix my answer...
Asaph
@j_random_hacker: Fixed! How does it look now?
Asaph
Looks good to me, +1. It's a tricky one isn't it! :)
j_random_hacker
@j_random_hacker: Thanks. I would +1 your answer but it looks like I've already done so 2 days ago. :) Yes. That was tricky.
Asaph
A: 

this will work: (?<![/]{1})\d+/\d+(?![/]{1})

Michael Morton
Why are you using `[/]{1}` when you could just use `/` (or `\/` if you have to escape it).
Chris Lutz
Habit. It does not, however, change the evaluation at all.
Michael Morton
@Michael: True, it doesn't change the evaluation, but it does complicate it, without introducing any benefit. Why not get rid of them in your answer? Alternatively, if you feel it's OK to leave them there, why not write `[/]{1,1}` instead, or even `(?:[/]{1}){1}` etc.?
j_random_hacker
+8  A: 

Assuming PCRE, use negative lookahead and lookbehind:

(?<![\/\d])(\d+)\/(\d+)(?![\/\d])

A lookahead (a (?=) group) says "match this stuff if it's followed by this other stuff." The contents of the lookahead aren't matched. We negate it (the (?!) group) so that it doesn't match stuff after our fraction - that way, we don't match the group in what follows.

The complement to a lookahead is a lookbehind (a (?<=) group) does the opposite - it matches stuff if it's preceeded by this other stuff, and just like the lookahead, we can negate it (the (?<!) group) so that we can match things that don't follow something.

Together, they ensure that our fraction doesn't have other parts of fractions before or after it. It places no other arbitrary requirements on the input data. It will match the fraction 2/3 in the string "te2/3xt", unlike most of the other examples provided.

If your regex flavor uses //s to delimit regular expressions, you'll have to escape the slashes in that, or use a different delimiter (Perl's m{} would be a good choice here).


Edit: Apparently, none of these regexes work because the regex engine is backtracking and matching fewer numbers in order to satisfy the requirements of the regex. When I've been working on one regex for this long, I sit back and decide that maybe one giant regex is not the answer, and I write a function that uses a regex and a few other tools to do it for me. You've said you're using Ruby. This works for me:

>> def get_fraction(s)
>>   if s =~ /(\d+)\/(\d+)(\/\d+)?/
>>     if $3 == nil
>>       return $1, $2
>>     end
>>   end
>>   return nil
>> end
=> nil
>> get_fraction("1/2")
=> ["1", "2"]
>> get_fraction("1/2/3")
=> nil

This function returns the two parts of the fraction, but returns nil if it's a date (or if there's no fraction). It fails for "1/2/3 and 4/5" but I don't know if you want (or need) that to pass. In any case, I recommend that, in the future, when you ask on Stack Overflow, "How do I make a regex to match this?" you should step back first and see if you can do it using a regex and a little extra. Regular expressions are a great tool and can do a lot, but they don't always need to be used alone.


EDIT 2:

I figured out how to solve the problem without resorting to non-regex code, and updated the regex. It should work as expected now, though I haven't tested it. I also went ahead and escaped the /s since you're going to have to do it anyway.

EDIT 3:

I just fixed the bug j_random_hacker pointed out in my lookahead and lookbehind. I continue to see the amount of effort being put into this regex as proof that a pure regex solution was not necessarily the optimal solution to this problem.

Chris Lutz
+1 for taking an extra 42 seconds to write a paragraph explaining the concepts whereas I lazily provided a link.
Asaph
I +1ed but then noticed that you have escaped one of the slashes (the "centre" one) but not the others -- either they should all be escaped, or all not, correct?
j_random_hacker
@j_random_hacker - My bad. I copied the middle part from his question and then wrote the rest myself. Went ahead and added the escaping disclaimer to the answer too. Thanks for the catch. I'd +1 your answer but I already did that a while ago.
Chris Lutz
Wow, thanks a lot for such a detailed explanation! I'm doing this in ruby which doesn't have look behind so I modified the regex based on j_random_hacker (+1ed it, too bad you can't have multiple correct answers) to be (^|[^\/])(\d+)\/(\d+)(?!\/\d) . Thanks again everyone.
John Duff
Glad to help. Too bad Ruby doesn't have lookbehinds. I wonder why Ruby doesn't just use the PCRE library?
Chris Lutz
hmmm...I started playing around with this in some tests with a few more strings, it's matching 11/12/2007 which is wrong. Any ideas?
John Duff
Silly me. It's backtracking. When "11/12" doesn't match (because it's followed by "/2") it matches "11/1" (which is followed by "2/2" which isn't "\/\d+") so it passes. At this point I'm stumped, because both my answer and j_random_hacker's answers fail to account for this. If this were Perl we could use supergreedy `++` (which is greedy but _doesn't_ backtrack if it fails) but Ruby appears not to have this. When I spend this much time on a regex, I decide that the solution is to use more than one regex. I'll put some code in my answer.
Chris Lutz
Thanks for all the help you've given me with this Chris. I was actually fiddling with the regex a little bit before seeing your latest update and came up with (\s|^)(\d+)\/(\d+)(\s|$) which seems to work for all my test cases so far. If I come across something that doesn't then mixing in some code, as you suggested, should help with things. Thanks again for all the help.
John Duff
That should work. You might also consider the `\D` escape, which is anything _but_ a digit. Using `(^|[^\d\/])` would ensure that the character before the `\d+` isn't a slash _or_ a digit. That would probably solve it. Updating my answer yet again.
Chris Lutz
Tricky stuff this backtracking! I'll update mine to fix this too. Also I think you should remove the initial and final `\d` elements from your regex -- as I understand it, all they currently do is *allow* weirdness like "12/345/67" to match on the "12/34" part (at 1st, the denominator `\d+` part matches "3456" but this later fails; the `\d+` backtracks 1 to match "345" which is now followed by "5/" which does not match `[\/\d]\d` => success). Not that that's likely to be a problem in practice.
j_random_hacker
A: 

Depending on the language you're working with you might try negative-look-ahead or look-behind assertions: in perl (?!pattern) asserts that /pattern/ can't follow the matched string.

Or, again, depending on the language, and anything you know about the context, a word-boundary match (\b in perl) might be appropriate.

+4  A: 

Lookahead is great if you're using Perl or PCRE, but if they are unavailable in the regex engine you're using, you can use:

(^|[^/\d])(\d+)/(\d+)($|[^/\d])

The 2nd and 3rd captured segments will be the numerator and denominator.

If you do use the above in a Perl regex, remember to escape the /s -- or use a different delimiter, e.g.:

m!(?:^|[^/])(\d+)/(\d+)(?:$|[^/])!

In this case, you can use (?:...) to avoid saving the uninteresting parenthesised parts.

EDIT 18/12/2009: Chris Lutz noticed a tricky bug caused by backtracking that plagues most of these answers -- I believe this is now fixed in mine.

j_random_hacker