ansaurus

Question

Answer 1

A:

The easiest way would be to write a regex that picks up the <a .... > part, and then write two more regexes to pull out the class and the title. You could probably do it with a single regex, but it would be very complicated and probably a lot more error-prone.

With a single regex you would need something like

<a[^>]*((class="([^"]*)")|(title="([^"]*)"))?((title="([^"]*)")|(class="([^"]*)"))?[^>]*>

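That is just an off-the-cuff guess; I haven't checked whether it's even valid. (As written, the greedy [^>]* would probably swallow the attributes before the optional groups ever get to capture them.) It's much easier to just divide and conquer the problem.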
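
A minimal Ruby sketch of the divide-and-conquer idea described above, not a definitive implementation. The sample markup is an assumption reconstructed from the patterns quoted elsewhere in this thread, and the helper name link_attributes is hypothetical.

```ruby
# Sample input: the same link with its attributes in both orders
# (assumed markup, reconstructed from the regexes quoted in this thread).
html = <<~HTML
  <a href="home.php" class="link" title="Home">Home</a>
  <a href="home.php" title="Home" class="link">Home</a>
HTML

# Hypothetical helper: step 1 grabs each opening <a ...> tag,
# step 2 runs one small regex per attribute on that tag alone.
def link_attributes(html)
  html.scan(/<a\b[^>]*>/).map do |tag|
    {
      class: tag[/\bclass="([^"]*)"/, 1],
      title: tag[/\btitle="([^"]*)"/, 1]
    }
  end
end

results = link_attributes(html)
results.each { |attrs| puts attrs.inspect }
```

Because each attribute gets its own small regex, the order in which the attributes appear inside the tag stops mattering. A missing attribute simply comes back as nil.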

Kibbee 2009-03-31 01:35:49

Enumerating all permutations might be feasible for two, maybe for three attributes, but because the number of permutations grows factorially (n! orderings for n attributes), this solution becomes a huge problem very quickly.
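
To put rough numbers on that growth, here is a small Ruby sketch that tabulates how many orderings a single enumerating regex would have to spell out. The method name orderings is hypothetical.

```ruby
# The number of distinct orderings (permutations) of n attributes is n!.
def orderings(n)
  (1..n).reduce(1, :*)
end

(1..6).each { |n| puts "#{n} attributes -> #{orderings(n)} orderings" }
```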

Daniel Brückner 2009-03-31 01:47:43

Answer 2

A:

A first ad hoc solution might be the following.

((class|title)="[^"]*?" *)+

This is far from perfect because it allows every attribute to occur more than once. I could imagine that this might be solvable with assertions. But if you just want to extract the attributes, this might already be sufficient.
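
If extraction is all that is needed, something along these lines might do in Ruby. This is only a sketch: the sample tag is assumed, and the duplicate check is one possible way to cover the weakness mentioned above in ordinary code rather than in the regex.

```ruby
tag = '<a href="home.php" title="Home" class="link">'

# The pattern from above matches the whole run of class/title attributes...
run = tag[/((class|title)="[^"]*?" *)+/]
puts run

# ...and scanning for single attributes pulls out each name/value pair,
# whatever order they appear in.
pairs = tag.scan(/(class|title)="([^"]*)"/)
puts pairs.inspect

# Nothing in the regex stops an attribute from repeating, so check in code.
names = pairs.map(&:first)
duplicated = names.uniq.size != names.size
puts "duplicate attributes? #{duplicated}"
```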

Daniel Brückner 2009-03-31 01:38:33

Answer 3

+6 A:

No, I believe the best way to do it with a single RE is exactly as you describe. Unfortunately, it'll get very messy when your XML can have 5 different attributes, giving you 120 (that's 5!, though it's been quite a while since I did permutations and combinations :-) different REs to check.

On the other hand, I wouldn't be doing this with an RE at all since they're not meant to be programming languages. What's wrong with the old-fashioned approach of using an XML processing library?

If you're required to use an RE, this answer probably won't help much, but I believe in using the right tools for the job.

paxdiablo 2009-03-31 01:40:41

Most HTML isn't valid XML, so you'd actually need an HTML parsing library. And depending on why you are trying to pull this information out, it may not warrant writing an application around some library. Maybe it's just a one-off thing where you want to get some rough information.

Kibbee 2009-03-31 01:44:58

Unfortunately, I think I have to weigh the value of being able to parse non-valid XML against a ridiculous number of permutations. At a certain point, the regex won't be as trivial. It's not just a one-off project, but I think that I'll have to end up using a library.

VirtuosiMedia 2009-03-31 02:23:48

A few regexes might not be a terrible idea, but it's best not to do everything in one. First, use a regex to get stuff inside <brackets>, then use another to extract elements and such, and process them accordingly. It's much more readable, and easier to write.

Chris Lutz 2009-03-31 02:49:32

+1 trying to parse XML using regex is a fool's game. Proper XML parsers are widely available for all platforms; use them.

bobince 2009-03-31 16:24:35

Parsing XML for just specific attributes isn't always "a fool's game". For some things it's really not that complicated if you use proper procedure (tokenizing first, etc.). Maybe it's not the best option for efficiency, but if you are just trying to get something specific, it's not as huge a task as you make it out to be, and it may be faster than finding a decent parser and learning its syntax just to do something simple.

Rick 2010-08-19 00:04:59

Answer 4

+4 A:

This is one of the many reasons regexes are not suited to parsing XML or HTML.

Chas. Owens 2009-03-31 01:41:06

Regex isn't a programming language; you have to have things like what @Josh Bush described. It's not supposed to be a magic tool that can just parse things for you without any programming to control it.

Rick 2010-08-19 00:09:04

@Rick When you finally get a set of regexes and controlling code to the point where it can correctly handle HTML or XML, you will have a parser. Why write a new parser when we already have so many good ones?

Chas. Owens 2010-08-19 00:48:56

Answer 5

+1 A:

You could use named groups to pull the attributes out of the tag. Run the regex and then loop over the groups, running whatever tests you need.

Something like this (untested, using .NET regex syntax, with \w for word characters and \s for whitespace; in .NET each named group's Captures collection keeps every key and value matched by the repeated group):

<a\s+(?:(?<key>\w+)\s*=\s*['"](?<value>[^'"]*)['"]\s*)+/?>
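
Ruby's regex engine keeps only the last capture of a repeated group, so a rough Ruby equivalent scans for key/value pairs instead of repeating the group. A minimal sketch, assuming the sample tag and the required values shown here (both are illustrative, not from the question):

```ruby
tag = %q(<a href="home.php" title='Home' class="link">)

# Named groups for readability. String#scan returns the captures of each
# match in group order, here [key, value].
attribute = /(?<key>\w+)\s*=\s*['"](?<value>[^'"]*)['"]/

attributes = tag.scan(attribute).to_h
puts attributes.inspect

# "Loop over the groups, running whatever tests you need":
required = { "class" => "link", "title" => "Home" }
matches = required.all? { |key, value| attributes[key] == value }
puts "tag has the required attributes? #{matches}"
```

Collecting the pairs into a hash first means the test itself never has to care about attribute order.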

Josh Bush 2009-03-31 01:48:04

This is probably the most sensible solution if you just want to use regex (instead of a pre-built CSS parser).

Rick 2010-08-19 00:08:21

Answer 6

A:

If you want to match a permutation of a set of elements, you could use a combination of backreferences and zero-width negative lookahead.

Say you want to match any one of these six lines:

123-abc-456-def-789-ghi-0AB
123-abc-456-ghi-789-def-0AB
123-def-456-abc-789-ghi-0AB
123-def-456-ghi-789-abc-0AB
123-ghi-456-abc-789-def-0AB
123-ghi-456-def-789-abc-0AB

You can do this with the following regex:

/123-(abc|def|ghi)-456-(?!\1)(abc|def|ghi)-789-(?!\1|\2)(abc|def|ghi)-0AB/

The backreferences (\1, \2) let you refer to your previous matches, and the zero-width negative lookahead ((?!...)) lets you negate a positional match, saying "don't match if the contained pattern matches at this position". Combining the two makes sure that your match is a legitimate permutation of the given elements, with each possibility occurring only once.

So, for example, in ruby:

input = <<LINES
123-abc-456-abc-789-abc-0AB
123-abc-456-abc-789-def-0AB
123-abc-456-abc-789-ghi-0AB
123-abc-456-def-789-abc-0AB
123-abc-456-def-789-def-0AB
123-abc-456-def-789-ghi-0AB
123-abc-456-ghi-789-abc-0AB
123-abc-456-ghi-789-def-0AB
123-abc-456-ghi-789-ghi-0AB
123-def-456-abc-789-abc-0AB
123-def-456-abc-789-def-0AB
123-def-456-abc-789-ghi-0AB
123-def-456-def-789-abc-0AB
123-def-456-def-789-def-0AB
123-def-456-def-789-ghi-0AB
123-def-456-ghi-789-abc-0AB
123-def-456-ghi-789-def-0AB
123-def-456-ghi-789-ghi-0AB
123-ghi-456-abc-789-abc-0AB
123-ghi-456-abc-789-def-0AB
123-ghi-456-abc-789-ghi-0AB
123-ghi-456-def-789-abc-0AB
123-ghi-456-def-789-def-0AB
123-ghi-456-def-789-ghi-0AB
123-ghi-456-ghi-789-abc-0AB
123-ghi-456-ghi-789-def-0AB
123-ghi-456-ghi-789-ghi-0AB
LINES

# outputs only the permutations
# (String#grep went away in Ruby 1.9, so grep over the lines instead)
puts input.lines.grep(/123-(abc|def|ghi)-456-(?!\1)(abc|def|ghi)-789-(?!\1|\2)(abc|def|ghi)-0AB/)

For a permutation of five elements, it would be:

/1-(abc|def|ghi|jkl|mno)-
 2-(?!\1)(abc|def|ghi|jkl|mno)-
 3-(?!\1|\2)(abc|def|ghi|jkl|mno)-
 4-(?!\1|\2|\3)(abc|def|ghi|jkl|mno)-
 5-(?!\1|\2|\3|\4)(abc|def|ghi|jkl|mno)-6/x
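
Writing these out by hand gets tedious as the set grows, so the pattern could also be generated. The sketch below builds the same shape of regex for any list of tokens. The helper name permutation_regex is hypothetical, it drops the numeric prefixes used above and joins the slots with a plain separator, and it assumes no token is a prefix of another (otherwise a lookahead could reject a valid token).

```ruby
# Hypothetical helper: build a regex that matches the given tokens in any
# order, each exactly once, joined by `separator`.
def permutation_regex(tokens, separator: "-")
  alternation = tokens.map { |t| Regexp.escape(t) }.join("|")
  slots = tokens.each_index.map do |i|
    # Forbid repeating any earlier capture: (?!\1|\2|...).
    # The first slot has nothing to forbid.
    backrefs = (1..i).map { |n| "\\#{n}" }.join("|")
    guard = i.zero? ? "" : "(?!#{backrefs})"
    "#{guard}(#{alternation})"
  end
  Regexp.new("\\A#{slots.join(Regexp.escape(separator))}\\z")
end

regex = permutation_regex(%w[abc def ghi jkl mno])
puts regex.source
puts regex.match?("mno-abc-jkl-def-ghi")  # every token used once
puts regex.match?("mno-abc-jkl-abc-ghi")  # "abc" used twice
```

The regex grows linearly with the number of tokens per slot instead of needing one alternative per ordering.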

For your example, the regex would be

/<a href="home.php" (class="link"|title="Home") (?!\1)(class="link"|title="Home")>Home<\/a>/

rampion 2009-03-31 02:35:39

Answer 7

+2 A:

You can create a lookahead for each of the attributes and plug them into a regex for the whole tag. For example, the regex for the tag could be

<a\b[^<>]*>

If you're using this on XML you'll probably need something more elaborate. By itself, this base regex will match a tag with zero or more attributes. Then you add a lookahead for each of the attributes you want to match:

(?=[^<>]*\s+class="link")
(?=[^<>]*\s+title="Home")

The [^<>]* lets it scan ahead for the attribute, but won't let it look beyond the closing angle bracket. Matching the leading whitespace here in the lookahead serves two purposes: it's more flexible than matching it in the base regex, and it ensures that we're matching a whole attribute name. Combining them we get:

<a\b(?=[^<>]*\s+class="link")(?=[^<>]*\s+title="Home")[^<>]+>[^<>]+</a>

Of course, I've made some simplifying assumptions for the sake of clarity. I didn't allow for whitespace around the equals signs, for single-quotes or no quotes around the attribute values, or for angle brackets in the attribute values (which I hear is legal, but I've never seen it done). Plugging those leaks (if you need to) will make the regex uglier, but won't require changes to the basic structure.
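
For what it's worth, here is a quick way to try the combined regex in Ruby. The sample tags are assumptions made up for illustration. The last one is there to show why the leading \s+ in each lookahead matters.

```ruby
# One lookahead per required attribute; each one scans ahead inside the
# tag only, so the attributes may appear in any order.
link = %r{<a\b(?=[^<>]*\s+class="link")(?=[^<>]*\s+title="Home")[^<>]+>[^<>]+</a>}

samples = [
  '<a href="home.php" class="link" title="Home">Home</a>',    # class first
  '<a href="home.php" title="Home" class="link">Home</a>',    # title first
  '<a href="home.php" class="link">Home</a>',                 # title missing
  '<a href="home.php" subtitle="Home" class="link">Home</a>'  # not a whole attribute name
]

samples.each { |html| puts "#{link.match?(html)}  #{html}" }
```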

Alan Moore 2009-03-31 06:22:36

Answer 8

+3 A:

Have you considered XPath? (Attribute order doesn't matter there.)

//a[@class and @title]

This will select both <a> nodes as valid matches. The only caveat is that the input must be XHTML (well-formed XML).
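
A small sketch of that query in Ruby, assuming the REXML library is available (it ships with standard Ruby installs, as a bundled gem since Ruby 3.0). The wrapper <div> and the third link are made up for illustration.

```ruby
require "rexml/document"

# Well-formed sample input; the third link lacks a title attribute.
xhtml = <<~XML
  <div>
    <a href="home.php" class="link" title="Home">Home</a>
    <a href="home.php" title="Home" class="link">Home</a>
    <a href="about.php" class="link">About</a>
  </div>
XML

doc = REXML::Document.new(xhtml)

# Select <a> elements that carry both attributes, in any order.
links = REXML::XPath.match(doc, "//a[@class and @title]")
links.each { |a| puts "#{a.attributes['class']} / #{a.attributes['title']}" }
```

To require particular values as well, the predicate can compare them, for example //a[@class='link' and @title='Home'].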

carwash 2009-03-31 13:44:25

I am using a bit of XPath now; good suggestion.

VirtuosiMedia 2009-03-31 19:21:33

Variable order regex syntax