Can someone recommend a regex that returns the value whether an option is selected or unselected, as in the two examples below?

<option value="32_1002_ACCT1001" selected="selected">ACCT1001 -- Accounting 1a</option>
<option value="32_1002_ACCT1002">ACCT1002 -- Accounting 1b</option>

My current regex, shown below, only matches the unselected option.

(<option value="([^"]+)">([^<]+)<\/option>)

EDIT:

Thanks for the great responses, guys. However, I should have been more detailed and specific.

I am using it in a screen-scraper extractor pattern as follows:

<option value="~@COURSE_ID@~">~@COURSE_CODE@~ -- ~@COURSE_NAME@~</option>

where ~@COURSE_ID@~ is defined by the following regex:

([^"]+)

This works for every option tag except the first one, which is already selected.

I am testing your suggestions at the moment, but if anyone wants to jump in with a surefire solution, that would be great.

I'm really struggling with this one, nothing seems to be working!

+1  A: 

First, it's a bad idea to use regex for parsing HTML. Use an HTML parser instead. (I am tired of writing this, but I put it first because people tend to downvote immediately without this statement :) )

Anyway, just modify your regex to allow for any attributes, like this:

(<option[^>]*?>([^<]+)<\/option>)

I'm not saying it's optimal; it's just your regex with minimal modifications.
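As a rough illustration only (this is not the pattern posted above), here is a Python sketch of a variant that keeps the value in its own capture group and lets `[^>]*` absorb any extra attributes such as `selected="selected"`. The pattern and the name `OPTION_RE` are assumptions made for this example, and it still presumes that `value` is the first attribute and is double-quoted.

```python
import re

# Assumed variant: group 1 is the value, group 2 is the option text.
# [^>]* skips any further attributes (e.g. selected="selected") before the ">".
OPTION_RE = re.compile(r'<option value="([^"]+)"[^>]*>([^<]+)</option>')

html = (
    '<option value="32_1002_ACCT1001" selected="selected">ACCT1001 -- Accounting 1a</option>\n'
    '<option value="32_1002_ACCT1002">ACCT1002 -- Accounting 1b</option>\n'
)

# findall returns one (value, text) tuple per matching <option> element.
matches = OPTION_RE.findall(html)
for value, label in matches:
    print(value, "|", label)
```

Both the selected and the unselected option should match, since the only difference between them sits inside the span covered by `[^>]*`.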

Gopi
Testing this in regexpal (http://regexpal.com/) doesn't match the unselected item.
reckoner
I'm using regex to parse HTML because I am using screen-scraper (I have edited the question; I should have mentioned it earlier!)
reckoner
+1  A: 

Here's an alternative way to load these values in C# using the Html Agility Pack:

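using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;
// Fetch the page, then read the value attribute of every <option> that has one.
HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load("http://jsbin.com/unasu/");
HtmlNodeCollection options = doc.DocumentNode.SelectNodes("//option[@value]");
IEnumerable<string> values = options.Select(o => o.Attributes["value"].Value);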

Loading a local file, for completeness, is done using:

HtmlDocument doc = new HtmlDocument();
doc.Load(@"c:\file.html");

This solution is a lot more robust than a regex: it copes with most real-world markup and doesn't care about attribute order, quote style (single, double, or none), or many other common variations.
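The same parser-based idea can be sketched in Python with the standard library's `html.parser`. This is only an illustration of the approach, not a port of the Html Agility Pack code; the class name `OptionValueCollector`, the helper `option_values`, and the third sample option (ACCT1003, made up to show an unquoted value) are hypothetical.

```python
from html.parser import HTMLParser


class OptionValueCollector(HTMLParser):
    """Collects the value attribute of every <option> start tag."""

    def __init__(self):
        super().__init__()
        self.values = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; order and quoting do not matter.
        if tag == "option":
            value = dict(attrs).get("value")
            if value is not None:
                self.values.append(value)


def option_values(html):
    parser = OptionValueCollector()
    parser.feed(html)
    parser.close()
    return parser.values


# The third option is invented for this sketch: unquoted value, attributes reordered.
sample = (
    '<option value="32_1002_ACCT1001" selected="selected">ACCT1001 -- Accounting 1a</option>'
    "<option value='32_1002_ACCT1002'>ACCT1002 -- Accounting 1b</option>"
    "<option selected value=32_1002_ACCT1003>ACCT1003</option>"
)
values = option_values(sample)
print(values)
```

Because the parser hands over attributes as name/value pairs, the `selected` attribute and the quoting style never need to appear in the matching logic.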

Kobi
Apologies, Kobi. I should have mentioned earlier that I am using screen-scraper, so I am limited to a single regex string (see the edited question).
reckoner
+3  A: 

I agree with Kobi, but if you really want to use a regex, here is a solution in Perl:

#!/usr/bin/perl
use strict;
use warnings;

while (<DATA>) {
    print $_;
    if (/^(<option value="([^"]+).*?(?:selected="selected")?.*)$/) {
        print "match\t value=$2\n";
    } else {
        print "NOT match\n";
    }
}

__DATA__
<option value="32_1002_ACCT1001" selected="selected">ACCT1001 -- Accounting 1a</option>
<option value="32_1002_ACCT1002">ACCT1002 -- Accounting 1b</option>

Output:

<option value="32_1002_ACCT1001" selected="selected">ACCT1001 -- Accounting 1a</option>
match    value=32_1002_ACCT1001
<option value="32_1002_ACCT1002">ACCT1002 -- Accounting 1b</option>
match    value=32_1002_ACCT1002
M42
First, thanks. It is my understanding that the OP wants to capture all `<option>` tags from an HTML file, and get their values. The `selected` attribute gets in the way of the posted regex. Your solution is pretty close though, it looks better than Gopi's, who removed the relevant capturing group...
Kobi
Kobi is right, it is just the selected attribute that gets in the way. I couldn't get M42's regex query to work in http://regexpal.com/ either.
reckoner
@reckoner: Sorry, I misunderstood your needs. I've modified the regex to match all options, with or without `selected`. You will find the whole line in `$1` and the value in `$2`.
M42
Thanks, M42, that works great. screen-scraper uses Perl-compatible regex too.
reckoner
@reckoner: You're welcome
M42