Before you say "oh no, not again", let me state my case. I'm parsing part of some HTML output, and the only thing I'm interested in is the name and value attributes of each <input/> tag. The HTML is actually a fragment and may not be well-formed. I don't have a DOM or HTML parser available, and I'm not trying to parse nested elements anyway. The problem is that I don't know the order or number of attributes, so a tag could be <input name="foo" value="boo"/>, or <input type="hidden" name=foo>, or <input id=blah value='boo' src="image.png" name="foo" type="img"/>.

Is there a single regular expression that would give me the values of the name and value attributes in a predictable order? I wouldn't be asking if I could assume that the name attribute always precedes value, but unfortunately that is not the case.

A: 

Here is a solution using .NET's regular expression syntax:

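using System;
using System.Text.RegularExpressions;

// `input` is the HTML fragment to scan (a string).
var regex = new Regex(@"
        <input
            (                           # group 1: one attribute, repeated
                \s*
                (?<name>[^\s=/>]+)      # attribute name (cannot run past the tag)
                =
                (['""])                 # group 2: opening quote
                (?<value>.*?)           # attribute value, up to the matching quote
                \2                      # same quote character as group 2
            )*
        \s*/?>
    ", RegexOptions.IgnorePatternWhitespace | RegexOptions.IgnoreCase);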

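foreach (Match m in regex.Matches(input))
{
    // Each repetition of the attribute group adds one capture to these
    // groups, so names.Captures[i] pairs with values.Captures[i].
    var names = m.Groups["name"];
    var values = m.Groups["value"];

    for (int i = 0; i < names.Captures.Count; i++)
    {
        Console.WriteLine("Name = {0} Value = {1}",
                names.Captures[i].Value, values.Captures[i].Value);
    }
}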

For an input string like:

blah blah <input name="hi" value="world" test='foo' /> blah blah

This will output:

Name = name Value = hi
Name = value Value = world
Name = test Value = foo

It doesn't handle unquoted values such as name=value, but support for that shouldn't be too hard to add.
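
For readers outside .NET, here is a rough Python sketch of the same idea that also accepts unquoted values. Python's re module keeps only the last capture of a repeated group, so this swaps in a two-step approach: one pattern finds each <input> tag, and a second walks its attributes. The names TAG_RE, ATTR_RE and input_attributes are made up for this illustration, and it assumes no attribute value contains a literal ">".

```python
import re

# Step 1: find each <input ...> tag and capture its raw attribute text.
# Assumption: no attribute value contains a literal '>'.
TAG_RE = re.compile(r"<input\b([^>]*)>", re.IGNORECASE)

# Step 2: one attribute is a name, '=', then a double-quoted,
# single-quoted, or bare (unquoted) value.
ATTR_RE = re.compile(r"""([^\s=/]+)\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s"'>]+))""")


def input_attributes(html):
    """Return one dict of attribute name -> value per <input> tag."""
    results = []
    for tag in TAG_RE.finditer(html):
        attrs = {}
        for m in ATTR_RE.finditer(tag.group(1)):
            # Exactly one of the three value groups takes part in a match.
            value = next(g for g in m.groups()[1:] if g is not None)
            attrs[m.group(1).lower()] = value
        results.append(attrs)
    return results
```

Looking up attrs.get("name") and attrs.get("value") on each dict then gives both values in a fixed order. One known gap: a bare value directly followed by "/>" (as in name=foo/>) will pick up the slash.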

Dean Harding
A: 

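To get the values of name and value into the same capturing groups regardless of order (name in group 1, value in group 2), you could try

<input\b(?=[^>]*\sname=["']([^'"]*)|)(?=[^>]*\svalue=["']([^'"]*)|)

if your regex implementation supports lookaheads. The pattern matches <input without consuming the whitespace after it, so the lookaheads can still see the space in front of the first attribute. The empty alternative in each lookahead lets the match succeed when an attribute is missing; its group is then simply left unset. This assumes that the values are quoted.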
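
As a quick check of the lookahead idea, here is a minimal Python sketch (Python's re module supports lookaheads). The names INPUT_RE and extract_inputs are made up for this example. The pattern starts with <input\b rather than <input followed by a space, so that the whitespace before the first attribute is still there for the lookaheads to match.

```python
import re

# Two independent lookaheads scan the tag from the same starting point,
# so name always lands in group 1 and value in group 2, whatever their
# order in the tag. The trailing '|' in each lookahead is an empty
# alternative that lets the match succeed when the attribute is absent.
INPUT_RE = re.compile(
    r"""<input\b(?=[^>]*\sname=["']([^'"]*)|)(?=[^>]*\svalue=["']([^'"]*)|)""",
    re.IGNORECASE,
)


def extract_inputs(html):
    """Return a (name, value) tuple per <input> tag; None where missing."""
    return [(m.group(1), m.group(2)) for m in INPUT_RE.finditer(html)]
```

A missing or unquoted attribute comes back as None, so <input type="hidden" name=foo> yields (None, None) under this pattern.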

Jens