ansaurus

Question

Regex Enforcing match

Answer 1

A:

^[\w\s]+="\w+"|^[\w\s]+='\w+'

michid 2010-07-28 22:02:58

Answer 2

+3 A:

Read about backreferences.

^[\w\s]+=(["'])\w+?\1

Note that you want to put a ? after the second + or else it will be greedy. However, in general this is not the right way to parse HTML. Use Beautiful Soup.
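As a quick illustration of the backreference, here is a minimal Python sketch using the pattern above. The helper name `quotes_match` and the sample strings are made up for the demo. Python's `re` module is assumed here only because it supports `\1` the same way.

```python
import re

# Pattern from the answer: group 1 captures the opening quote,
# and \1 requires the very same character as the closing quote.
ATTR = re.compile(r"""^[\w\s]+=(["'])\w+?\1""")


def quotes_match(text):
    """Return True if text starts with name=<quote>word<same quote>."""
    return ATTR.match(text) is not None


print(quotes_match('a href="foo"'))   # double quotes on both sides
print(quotes_match("a href='foo'"))   # single quotes on both sides
print(quotes_match('a href="foo\''))  # mixed quotes: should be rejected
```

The third call is the interesting one: the opening `"` is remembered, so a closing `'` does not satisfy `\1`.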

katrielalex 2010-07-28 22:03:05

I have seen \1 used in JavaScript but not in other languages, particularly PHP. Can it be used in other languages such as PHP?

slier 2010-07-28 22:08:19

Yes. It's part of regex.

katrielalex 2010-07-28 22:09:22

Yes, in PHP it works.

Wrikken 2010-07-28 22:09:58

Glad you solved my problem, and thanks for the link too.

slier 2010-07-28 22:13:07

`\w` never matches `["']`, so `(["'])\w+?\1` is the same as `(["'])\w+\1`.

taw 2010-07-28 22:39:42

True, but I assume this is a sample of a larger HTML page; what about e.g. `a href="foo" target="_blank" id="bar"`...?

katrielalex 2010-07-29 08:13:39

Answer 3

A:

I am afraid you will have to do it the long way:

^[\w\s]+=("\w+"|'\w+')

More technically, ensuring correct matching/nesting of quotes is not something a regular grammar can do, so for more complex problems you would have to use a proper parser (or Perl 6-style extended regular expressions, though technically they do not count as regular expressions).

ternaryOperator 2010-07-28 22:04:39

Not true. You can capture the first quote and then backreference it.

katrielalex 2010-07-28 22:08:17

Yes, but if you do that, your regular expression is technically not a regular expression, so my statement holds (although it is a perfectly good approach).

ternaryOperator 2010-07-28 22:12:23

Answer 4

A:

Capture the opening quote in a group and replace the closing ['"] with \1, a backreference to that capture group:

^[\w\s]+=(["'])\w+\1

AllenG 2010-07-28 22:07:51

Answer 5

A:

What exactly do you want to match? It sounds like you want to match:

word (tagname)
mandatory whitespace
word (attr name)
optional whitespace
=
optional whitespace
either single quoted or double quoted anything (attr value)

That would be: ^(\w+)\s+(\w+)\s*=\s*(?:'([^']*)'|"([^"]*)")

This will allow matches like:

a href='' - empty attr
a href='Hello world' - spaces and other non-word characters in quoted part
a href="one 'n two" - quotes of different kind in quoted part
a href = 'google' - spaces on both sides of =

And disallow things like these that your original regexp allows:

a b c href='google' - extra words
='google' - only spaces on the left
href='google' - only attr on the left
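The accept/reject lists above can be checked mechanically. This is a small Python sketch (the answer itself targets Perl and Ruby) and the variable names are invented for the demo. The "only spaces on the left" case is written with one leading space.

```python
import re

# The alternation-based pattern from the answer, unchanged.
TAG_ATTR = re.compile(r"""^(\w+)\s+(\w+)\s*=\s*(?:'([^']*)'|"([^"]*)")""")

# Samples taken from the two lists above.
accepted = [
    "a href=''",
    "a href='Hello world'",
    'a href="one \'n two"',
    "a href = 'google'",
]
rejected = [
    "a b c href='google'",
    " ='google'",
    "href='google'",
]

for text in accepted + rejected:
    verdict = "match" if TAG_ATTR.match(text) else "no match"
    print(f"{text!r}: {verdict}")
```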

It still doesn't sound exactly right - you're trying to match a tag with exactly one attribute?

With this regexp, tag name will be in $1, attr name in $2, and attr value in either $3 or $4 (the other being nil - most languages distinguish group not taken with nil vs group taken but empty with "" if you need it).
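In Python the "group not taken" case shows up as `None` rather than nil. This sketch assumes a hypothetical helper `parse` that folds groups 3 and 4 into a single value.

```python
import re

TAG_ATTR = re.compile(r"""^(\w+)\s+(\w+)\s*=\s*(?:'([^']*)'|"([^"]*)")""")


def parse(text):
    """Return (tag, attr, value), or None if text does not match."""
    m = TAG_ATTR.match(text)
    if m is None:
        return None
    tag, attr, single, double = m.groups()
    # Exactly one of the two value groups took part in the match;
    # the other is None, which is distinct from an empty string.
    value = single if single is not None else double
    return tag, attr, value


print(parse("a href=''"))
print(parse('a href="one \'n two"'))
```

Testing with `is not None` rather than truthiness matters here, because an empty attr value (`''`) is a legitimate match.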

A regexp that ensures the attr value always lands in the same group is messier if you want to allow single quotes in a double-quoted attr value and vice versa - something like ^(\w+)\s+(\w+)\s*=\s*(['"])((?:(?!\3).)*)\3 ((?!...) is a zero-width negative look-ahead - (?:(?!\3).) means something like [^\3], except the latter isn't supported).
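A short Python sketch of the look-ahead variant, to show that the value ends up in group 4 whichever quote is used. Python's `re` is assumed here because it accepts a backreference inside a look-ahead. The name `UNIFIED` and the sample inputs are illustrative only.

```python
import re

# Group 3 is the opening quote; (?:(?!\3).)* consumes any character
# that is not that same quote; the final \3 is the closing quote.
UNIFIED = re.compile(r"""^(\w+)\s+(\w+)\s*=\s*(['"])((?:(?!\3).)*)\3""")

samples = [
    'a href="one \'n two"',   # single quote inside double quotes
    "a href='say \"hi\"'",    # double quotes inside single quotes
    'a href="oops\'',         # mismatched quotes
]

for text in samples:
    m = UNIFIED.match(text)
    print(repr(text), "->", m.groups() if m else None)
```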

If you don't care about this, ^(\w+)\s+(\w+)\s*=\s*(['"])([^'"]*)\3 will do just fine (for both, $3 will be the quote type and $4 the attr value).

By the way re (["'])\w+?\1 above - \w doesn't match quotes, so this ? doesn't change anything.

Having said all that, use a real HTML parser ;-)

These regexps will work in Perl and Ruby. Other languages usually copy Perl's regexp system, but often introduce minor changes so some adjustments might be necessary. Especially the one with negative look-aheads might be unsupported.

taw 2010-07-28 22:38:14
