tags:

views:

602

answers:

6

Here's a problem I ran into recently. I have attributes strings of the form

"x=1 and y=abc and z=c4g and ..."

Some attributes have numeric values, some have alpha values, some have mixed, some have dates, etc.

Every string is supposed to have "x=someval and y=anotherval" at the beginning, but some don't. I have three things I need to do.

  1. Validate the strings to be certain that they have x and y.
  2. Actually parse the values for x and y.
  3. Get the rest of the string.

Given the example at the top, this would result in the following variables:

$x = 1;
$y = "abc";
$remainder = "z=c4g and ..."

My question is: Is there a (reasonably) simple way to parse these and validate with a single regular expression? i.e.:

if ($str =~ /someexpression/)
{
    $x = $1;
    $y = $2;
    $remainder = $3;
}

Note that the string may consist of only x and y attributes. This is a valid string.

I'll post my solution as an answer, but it doesn't meet my single-regex preference.

A: 

Here's basically what I did to solve this:

($x_str, $y_str, $remainder) = split(/ and /, $str, 3);

if ($x_str !~ /x=(.*)/)
{
    # error
}

$x = $1;

if ($y_str !~ /y=(.*)/)
{
    # error
}

$y = $1;

I've omitted some additional validation and error handling. This technique works, but it's not as concise or pretty as I would have liked. I'm hoping someone will have a better suggestion for me.

Derek Park
This looks to me simpler and more maintainable than any of the "one regexp to rule them all" solutions. I would maybe just add a ^ at the beginning of theregexps to match x= and y= to avoid the case not_x=... or similar. Why do you want a single regexp?
mirod
+2  A: 

I'm not the best at regular expressions, but this seems pretty close to what you're looking for:

/x=(.+) and y=([^ ]+)( and (.*))?/

Except you use $1, $2, and $4. In use:

my @strs = ("x=1 and y=abc and z=c4g and w=v4l",
            "x=yes and y=no",
            "z=nox and w=noy");

foreach (@strs) {
    if ($_ =~ /x=(.+) and y=([^ ]+)( and (.*))?/) {
        $x = $1;
        $y = $2;
        $remainder = $4;
        print "x: $x; y: $y; remainder: $remainder\n";
    } else {
        print "Failed.\n";
    }
}

Output:

x: 1; y: abc; remainder: z=c4g and w=v4l
x: yes; y: no; remainder: 
Failed.

This of course leaves out plenty of error checking, and I don't know everything about your inputs, but this seems to work.

Rudd Zwolinski
A: 

@Rudd, nice! That's definitely more concise than what I had.

Derek Park
Come on. You were here early on. You know that we didn't always have proper comments. I'm not going to run around the site deleting all my old "answers" that were written as comments.
Derek Park
+2  A: 

As a fairly simple modification to Rudd's version,

/^x=(.+) and y=([^ ]+)(?: and (.*))?/

will allow you to use $1, $2 and $3 (the ?: makes it a noncapturing group), and will ensure that the string starts with "x=" rather than allowing a "not_x=" to match

If you have better knowledge of what the x and y values will be, this should be used to tighten the regex further:

my @strs = ("x=1 and y=abc and z=c4g and w=v4l",
        "x=yes and y=no",
        "z=nox and w=noy",
        "not-x=nox and y=present",
        "x=yes and w='there is no and y=something arg here'");

foreach (@strs) {
    if ($_ =~ /^x=(.+) and y=([^ ]+)(?: and (.*))?/) {
        $x = $1;
        $y = $2;
        $remainder = $3;
        print "x: {$x}; y: {$y}; remainder: {$remainder}\n";
    } else {
        print "$_ Failed.\n";
    }
}

Output:

x: {1}; y: {abc}; remainder: {z=c4g and w=v4l}
x: {yes}; y: {no}; remainder: {}
z=nox and w=noy Failed.
not-x=nox and y=present Failed.
x: {yes and w='there is no}; y: {something}; remainder: {}

Note that the missing part of the last test is due to the current version of the y test requiring no spaces, if the x test had the same restriction that string would have failed.

Cebjyre
+1  A: 

Assuming you also want to do something with the other name=value pairs this is how I would do it ( using Perl version 5.10 ):

use 5.10.0;
use strict;
use warnings;

my %hash;
while(
    $string =~ m{
       (?: ^ | \G )    # start of string or previous match
       \s*

       (?<key>   \w+ ) # word characters
       =
       (?<value> \S+ ) # non spaces

       \s*             # get to the start of the next match
       (?: and )?
    }xgi
){
    $hash{$+{key}} = $+{value};
}

# to make sure that x & y exist
die unless exists $hash{x} and exists $hash{y};

On older Perls ( at least Perl 5.6 );

use strict;
use warnings;

my %hash;
while(
    $string =~ m{
       (?: ^ | \G )   # start of string or previous match
       \s*

       ( \w+ ) = ( \S+ )

       \s*            # get to the start of the next match
       (?: and )?
    }xgi
){
    $hash{$1} = $2;
}

# to make sure that x & y exist
die unless exists $hash{x} and exists $hash{y};

These have the added benefit of continuing to work if you need to work with more data.

Brad Gilbert
+1  A: 

Rudd and Cebjyre have gotten you most of the way there but they both have certain problems:

Rudd suggested:

/x=(.+) and y=([^ ]+)( and (.*))?/

Cebjyre modified it to:

/^x=(.+) and y=([^ ]+)(?: and (.*))?/

The second version is better because it will not confuse "not_x=foo" with "x=foo" but will accept things such as "x=foo z=bar y=baz" and set $1 = "foo z=bar" which is undesirable.

This is probably what you are looking for:

/^x=(\w+) and y=(\w+)(?: and (.*))?/

This disallows anything between the x= and y= options, places and allows and optional " and..." which will be in $3

Frosty