What exactly do you want to match? It sounds like you want to match:
- word (tagname)
- mandatory whitespace
- word (attr name)
- optional whitespace
- =
- optional whitespace
- either single quoted or double quoted anything (attr value)
That would be: ^(\w+)\s+(\w+)\s*=\s*(?:'([^']*)'|"([^"]*)")
This will allow matches like:
a href=''
- empty attr
a href='Hello world'
- spaces and other non-word characters in quoted part
a href="one 'n two"
- quotes of different kind in quoted part
a href = 'google'
- spaces on both sides of =
And disallow things like these that your original regexp allows:
a b c href='google'
- extra words
='google'
- only spaces on the left
href='google'
- only attr on the left
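As a quick sanity check, the lists above can be run through the pattern. This is only a sketch in Ruby; the variable names are made up for illustration, and the `='google'` case is assumed to start with a space, as its description suggests.

```ruby
# The pattern from above: tag name, attr name, then a single- or double-quoted value.
attr_re = /^(\w+)\s+(\w+)\s*=\s*(?:'([^']*)'|"([^"]*)")/

# Inputs the pattern is meant to accept.
allowed = [
  "a href=''",
  "a href='Hello world'",
  %q(a href="one 'n two"),
  "a href = 'google'"
]

# Inputs the pattern is meant to reject.
rejected = [
  "a b c href='google'",
  " ='google'",
  "href='google'"
]

(allowed + rejected).each do |s|
  puts "#{s.inspect} matches: #{!attr_re.match(s).nil?}"
end
```

Note that in Ruby `^` anchors at the start of a line rather than the start of the string, so `\A` may be the safer anchor for multi-line input.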
It still doesn't sound exactly right - you're trying to match a tag with exactly one attribute?
With this regexp, tag name will be in $1, attr name in $2, and attr value in either $3 or $4 (the other being nil - most languages distinguish group not taken with nil vs group taken but empty with "" if you need it).
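To illustrate the nil vs "" distinction, here is a small Ruby sketch (variable names are illustrative only). Exactly one of the two value groups is set, so `single || double` picks whichever branch matched.

```ruby
attr_re = /^(\w+)\s+(\w+)\s*=\s*(?:'([^']*)'|"([^"]*)")/

# Double-quoted value: group 3 is not taken (nil), group 4 holds the value.
m = attr_re.match(%q(a href="one 'n two"))
tag, name, single, double = m.captures
value = single || double
puts "tag=#{tag} attr=#{name} value=#{value.inspect}"

# Empty single-quoted value: group 3 is taken but empty, group 4 is nil.
empty = attr_re.match("a href=''")
puts "group 3: #{empty[3].inspect}, group 4: #{empty[4].inspect}"
```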
Regexp that would ensure attr value gets in the same group would be messier if you wanted to allow single quotes in doubly quoted attr value and vice versa - something like ^(\w+)\s+(\w+)\s*=\s*(['"])((?:(?!\3).)*)\3
Here (?!) is a zero-width negative look-ahead, so (?:(?!\3).) means something like [^\3], except that the latter isn't supported.
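A minimal Ruby sketch of the back-reference variant, assuming the engine supports a back-reference inside a look-ahead (Ruby's does). The sample strings are made up for illustration; in both cases the quote type lands in group 3 and the value in group 4.

```ruby
# Group 3 captures the opening quote; the look-ahead stops the value
# at the next occurrence of that same quote character.
same_group_re = /^(\w+)\s+(\w+)\s*=\s*(['"])((?:(?!\3).)*)\3/

m1 = same_group_re.match(%q(a href="one 'n two"))
m2 = same_group_re.match(%q(a href='say "hi"'))

[m1, m2].each do |m|
  puts "quote=#{m[3]} value=#{m[4].inspect}"
end
```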
If you don't care about this, ^(\w+)\s+(\w+)\s*=\s*(['"])([^'"]*)\3 will do just fine (for both, $3 will be the quote type and $4 the attr value).
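A short Ruby sketch of the simpler variant, assuming the value part is meant to be the negated class `[^'"]*` (any run of non-quote characters). It shows the trade-off: a value containing the other kind of quote no longer matches.

```ruby
# Assumption: the value class is the negated [^'"]*, i.e. no quotes of either kind.
simple_re = /^(\w+)\s+(\w+)\s*=\s*(['"])([^'"]*)\3/

ok = simple_re.match("a href = 'google'")
puts "quote=#{ok[3]} value=#{ok[4].inspect}"

# A value containing the other quote type is rejected by this variant.
mixed = simple_re.match(%q(a href="one 'n two"))
puts "mixed quotes match: #{!mixed.nil?}"
```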
By the way, re (["'])\w+?\1 above - \w doesn't match quotes, so this ? doesn't change anything.
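One way to convince yourself is to compare the lazy and greedy forms on a few inputs. This is a Ruby sketch and the sample strings are made up. Since `\w` can never run past the closing quote, both forms find the same match, or both find none.

```ruby
lazy   = /(["'])\w+?\1/
greedy = /(["'])\w+\1/

samples = [
  %q(x='google' y="foo"),
  %q("abc"),
  %q('a b'),
  %q(say "hi" and 'bye')
]

# String#[] with a regexp returns the first matched substring, or nil.
samples.each do |s|
  puts "#{s.inspect}: lazy=#{s[lazy].inspect} greedy=#{s[greedy].inspect}"
end
```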
Having said all that, use a real HTML parser ;-)
These regexps will work in Perl and Ruby. Other languages usually copy Perl's regexp system, but often introduce minor changes so some adjustments might be necessary. Especially the one with negative look-aheads might be unsupported.