ansaurus

Question

Regex matching HTML option tags which are unselected and also selected

Answer 1

+1 A:

First, it's a bad idea to use regex for parsing HTML; use an HTML parser instead. (I am tired of writing this, but I put it first because people tend to downvote immediately without this statement :) )

Anyway, just modify your regex to account for all attributes, like this:

(<option[^>]*?>([^<]+)<\/option>)

I'm not saying it's optimal; it's just your regex with minimal modifications.
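
As a quick check of what this pattern captures, here is a minimal sketch using Python's `re` module. Python is an assumption made only for illustration; the regex engine in screen-scraper or regexpal may behave differently. The two sample rows are the option tags quoted elsewhere in this thread.

```python
import re

# The pattern from this answer; the escaped slash is harmless in Python.
OPTION_RE = re.compile(r'(<option[^>]*?>([^<]+)<\/option>)')

html = (
    '<option value="32_1002_ACCT1001" selected="selected">ACCT1001 -- Accounting 1a</option>\n'
    '<option value="32_1002_ACCT1002">ACCT1002 -- Accounting 1b</option>\n'
)

# Group 1 is the whole element; group 2 is the visible text, not the value attribute.
labels = [m.group(2) for m in OPTION_RE.finditer(html)]
print(labels)
```

In this sketch both rows match, but group 2 holds the option's label, so capturing the `value` attribute would need a group of its own.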

Gopi 2010-09-01 04:47:01

Testing this in regexpal (http://regexpal.com/) doesn't match the unselected item.

reckoner 2010-09-02 10:29:32

I'm using regex to parse HTML because I am using screen-scraper (I have edited the question; I should have mentioned it earlier!)

reckoner 2010-09-02 10:30:26

Answer 2

+1 A:

Here's an alternative way to load these values in C# using the Html Agility Pack:

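using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;
// Download the page, select every <option> that has a value attribute, and read the values.
HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load("http://jsbin.com/unasu/");
HtmlNodeCollection options = doc.DocumentNode.SelectNodes("//option[@value]");
IEnumerable<string> values = options.Select(o => o.Attributes["value"].Value);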

Loading a local file, for completeness, is done using:

HtmlDocument doc = new HtmlDocument();
doc.Load(@"c:\file.html");

As you can see, this solution is a lot more robust than a regex: it copes with most real-world markup and doesn't care about attribute order, quote style (single, double, or none), or many other common variations.
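
To illustrate the point about attribute order and quote style, here is a small sketch using Python's standard-library `html.parser` instead of the Html Agility Pack. The class name `OptionValues`, the helper `option_values`, and the sample markup are made up for this example.

```python
from html.parser import HTMLParser


class OptionValues(HTMLParser):
    """Collect the value attribute of every <option> start tag."""

    def __init__(self):
        super().__init__()
        self.values = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs with lowercased names.
        if tag == "option":
            value = dict(attrs).get("value")
            if value is not None:
                self.values.append(value)


def option_values(html):
    parser = OptionValues()
    parser.feed(html)
    parser.close()
    return parser.values


# Hypothetical markup: attribute order and quoting differ on every row.
sample = (
    '<select>'
    '<option selected="selected" value="a">A</option>'
    "<option value='b'>B</option>"
    '<option value=c>C</option>'
    '</select>'
)
print(option_values(sample))
```

The same handler deals with `selected` before or after `value`, and with double, single, or missing quotes, without any change to the extraction logic.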

Kobi 2010-09-01 05:19:51

Apologies Kobi, I should have mentioned earlier that I am using screen-scraper, so I am limited to a single regex string. See the edited question.

reckoner 2010-09-02 10:31:35

Answer 3

+3 A:

I agree with Kobi, but if you really want to use a regex, here is a solution in Perl:

#!/usr/bin/perl
use strict;
use warnings;

while (<DATA>) {
    print $_;
    if (/^(<option value="([^"]+).*?(?:selected="selected")?.*)$/) {
        print "match\t value=$2\n";
    } else {
        print "NOT match\n";
    }
}

__DATA__
<option value="32_1002_ACCT1001" selected="selected">ACCT1001 -- Accounting 1a</option>
<option value="32_1002_ACCT1002">ACCT1002 -- Accounting 1b</option>

Output:

<option value="32_1002_ACCT1001" selected="selected">ACCT1001 -- Accounting 1a</option>
match    value=32_1002_ACCT1001
<option value="32_1002_ACCT1002">ACCT1002 -- Accounting 1b</option>
match    value=32_1002_ACCT1002
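
For readers who want to try the pattern outside Perl, the following is a rough port to Python's `re` module. Python is an assumption chosen for illustration; the pattern itself is unchanged, and the names `LINE_RE`, `lines`, and `values` are made up for this sketch.

```python
import re

# Same pattern as the Perl script: group 1 is the whole line, group 2 is the value.
LINE_RE = re.compile(r'^(<option value="([^"]+).*?(?:selected="selected")?.*)$')

lines = [
    '<option value="32_1002_ACCT1001" selected="selected">ACCT1001 -- Accounting 1a</option>',
    '<option value="32_1002_ACCT1002">ACCT1002 -- Accounting 1b</option>',
]

values = []
for line in lines:
    m = LINE_RE.match(line)
    if m:
        values.append(m.group(2))

print(values)
```

Two things are worth noting about this pattern: it expects `value` to be the first attribute of the tag, and the optional `selected` group does not change the result, because the surrounding `.*` already accepts whatever follows the value.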

M42 2010-09-01 08:41:53

First, thanks. It is my understanding that the OP wants to capture all `<option>` tags from an HTML file, and get their values. The `selected` attribute gets in the way of the posted regex. Your solution is pretty close though, it looks better than Gopi's, who removed the relevant capturing group...

Kobi 2010-09-01 09:41:44

Kobi is right, it is just the selected attribute that gets in the way. I couldn't get M42's regex query to work in http://regexpal.com/ either.

reckoner 2010-09-02 10:34:28

@reckoner: Sorry, I misunderstood your needs. I've modified the regex to match all options with or without `selected`. You will find the whole line in `$1` and the value in `$2`.

M42 2010-09-02 11:52:35

Thanks M42, works great. screen-scraper uses Perl-compatible regex too.

reckoner 2010-09-02 16:06:40

@reckoner: You're welcome

M42 2010-09-02 17:52:44
