I have a series of text strings that contain mixed numbers (i.e. a whole part and a fractional part). The problem is that the text is full of human-coded sloppiness:

  1. The whole part may or may not exist (ex: "10")
  2. The fractional part may or may not exist (ex: "1/3")
  3. The two parts may be separated by spaces and/or hyphens (ex: "10 1/3", "10-1/3", "10 - 1/3").
  4. The fraction itself may or may not have spaces between the number and the slash (ex: "1 /3", "1/ 3", "1 / 3").
  5. There may be other text after the fraction that needs to be ignored.

I need a regex that can parse these elements so that I can create a proper number out of this mess.

+3  A: 

Here's a regex that will handle all of the data I can throw at it:

(\d++(?! */))? *-? *(?:(\d+) */ *(\d+))?.*$

This will put the digits into the following groups:

  1. The whole part of the mixed number, if it exists
  2. The numerator, if a fraction exists
  3. The denominator, if a fraction exists
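
As a rough illustration of how the three groups could be combined into a number, here is a minimal Perl sketch. The helper name `mixed_to_number` is hypothetical, the leading `^\s*` anchor is an added assumption (so that leading spaces are skipped), and the trailing `.*$` is left out because it captures nothing.

```perl
use strict;
use warnings;
use 5.010;   # possessive quantifiers (\d++) need Perl 5.10 or later

# Hypothetical helper, not part of the original answer.
sub mixed_to_number {
    my ($text) = @_;

    # Same three groups as the regex above; the leading ^\s* is an added
    # assumption and the trailing .*$ is left out.
    return unless $text =~ m{^\s*(\d++(?! */))? *-? *(?:(\d+) */ *(\d+))?};
    my ($whole, $num, $den) = ($1, $2, $3);

    return unless defined $whole || defined $num;   # no number found
    return if defined $den && $den == 0;            # refuse to divide by zero

    my $value = defined $whole ? $whole : 0;
    $value += $num / $den if defined $num;
    return $value;
}

printf "%-14s => %s\n", $_, mixed_to_number($_) // 'no match'
    for '10', '1/3', '10 1/3', '10-1/3', '10 - 1/3', '1 / 3 cups';
```

Groups that did not take part in the match come back undefined, which is why each one is checked with `defined` before it is used.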

Also, here's the RegexBuddy explanation for the elements (which helped me immensely when constructing it):

Match the regular expression below and capture its match into backreference number 1 «(\d++(?! */))?»
   Between zero and one times, as many times as possible, giving back as needed (greedy) «?»
   Match a single digit 0..9 «\d++»
      Between one and unlimited times, as many times as possible, without giving back (possessive) «++»
   Assert that it is impossible to match the regex below starting at this position (negative lookahead) «(?! */)»
      Match the character “ ” literally « *»
         Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
      Match the character “/” literally «/»
Match the character “ ” literally « *»
   Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
Match the character “-” literally «-?»
   Between zero and one times, as many times as possible, giving back as needed (greedy) «?»
Match the character “ ” literally « *»
   Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
Match the regular expression below «(?:(\d+) */ *(\d+))?»
   Between zero and one times, as many times as possible, giving back as needed (greedy) «?»
   Match the regular expression below and capture its match into backreference number 2 «(\d+)»
      Match a single digit 0..9 «\d+»
         Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
   Match the character “ ” literally « *»
      Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
   Match the character “/” literally «/»
   Match the character “ ” literally « *»
      Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
   Match the regular expression below and capture its match into backreference number 3 «(\d+)»
      Match a single digit 0..9 «\d+»
         Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
Match any single character that is not a line break character «.*»
   Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
Assert position at the end of the string (or before the line break at the end of the string, if any) «$»
Craig Walker
The `.*$` part is pointless; it just throws away what it matches. Just remove it. Other than that, it doesn't look too bad.
Brad Gilbert
I was looking for a similar solution and this is the one that ended up working well for me: http://regexlib.com/REDetails.aspx?regexp_id=2127
DavGarcia
+1  A: 

I think it may be easier to tackle the different cases (full mixed, fraction only, number only) separately from each other. For example:

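sub parse_mixed {
  my($mixed) = @_;

  # Full mixed number, e.g. "10 1/3", "10-1/3" or "10 - 1/3"
  if($mixed =~ /^ *(\d+)[- ]+(\d+) *\/ *(\d+)(\D.*)?$/) {
    return $1+$2/$3;
  # Fraction only, e.g. "1/3" or "1 / 3"
  } elsif($mixed =~ /^ *(\d+) *\/ *(\d+)(\D.*)?$/) {
    return $1/$2;
  # Whole number only, e.g. "10"
  } elsif($mixed =~ /^ *(\d+)(\D.*)?$/) {
    return $1;
  }
  return;  # nothing recognisable in the input
}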

print parse_mixed("10"), "\n";
print parse_mixed("1/3"), "\n";
print parse_mixed("1 / 3"), "\n";
print parse_mixed("10 1/3"), "\n";
print parse_mixed("10-1/3"), "\n";
print parse_mixed("10 - 1/3"), "\n";
Glomek
+1  A: 

If you are using Perl 5.10 or later (which added named captures and possessive quantifiers), this is how I would write it.

m{
  ^
  \s*       # skip leading spaces

  (?'whole'
   \d++
   (?! \s*[\/] )   # there should not be a slash immediately following a whole number
  )?        # the whole part is optional, so a bare fraction such as "1/3" still matches

  \s*

  (?:    # the rest should fail or succeed as a group

    -?        # ignore a possible hyphen between the whole part and the fraction
    \s*

    (?'numerator'
     \d+
    )

    \s*
    [\/]
    \s*

    (?'denominator'
     \d+
    )
  )?
}x

Then you can access the values from the %+ variable like this:

$+{whole};
$+{numerator};
$+{denominator};
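
To show how those named captures might be turned into a single value, here is a small self-contained sketch. The helper name `named_to_number` is hypothetical, and the pattern is a compact restatement of the one above with the whole part treated as optional (an assumption, so that a bare fraction also matches).

```perl
use strict;
use warnings;
use 5.010;   # named captures and %+ need Perl 5.10 or later

# Assumption: the whole part is optional here so that "1/3" alone matches.
my $mixed_re = qr{
  ^ \s*
  (?<whole> \d++ (?! \s* / ) )?    # digits not followed by a slash
  \s* -? \s*                       # optional hyphen separator
  (?:
    (?<numerator>   \d+ ) \s* / \s*
    (?<denominator> \d+ )
  )?
}x;

# Hypothetical helper, not part of the original answer.
sub named_to_number {
    my ($text) = @_;
    return unless $text =~ $mixed_re;

    # Groups that did not take part in the match are undefined in %+.
    my $whole = $+{whole};
    my $num   = $+{numerator};
    my $den   = $+{denominator};

    return unless defined $whole || defined $num;   # no number found
    return if defined $den && $den == 0;            # refuse to divide by zero

    my $value = defined $whole ? $whole : 0;
    $value += $num / $den if defined $num;
    return $value;
}

print named_to_number('10 - 1/3 cups'), "\n";
```

The values are copied out of `%+` straight after the match because any later successful match would overwrite them.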
Brad Gilbert