OK, I have this regex:

^[\w\s]+=["']\w+['"]

Now the regex will match:

a href='google'

a href="google"

and also

a href='google"

How can I force the regex to close with the same kind of quote it opened with?
If the first quote is a single quote, how can I make the last quote also a single quote rather than a double quote?

A: 

^[\w\s]+="\w+"|^[\w\s]+='\w+'

michid
+3  A: 

Read about backreferences.

^[\w\s]+=(["'])\w+?\1

Note that you want to put a ? after the second + or else it will be greedy. However, in general this is not the right way to parse HTML. Use Beautiful Soup.
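
For anyone who wants to try the backreference quickly, here is a minimal sketch in Python (my choice of language, the question does not name one). The names `pattern`, `samples` and `results` are just for illustration.

```python
import re

# (["']) captures the opening quote; \1 must then match that same character.
pattern = re.compile(r"""^[\w\s]+=(["'])\w+\1""")

samples = ["a href='google'", 'a href="google"', "a href='google\""]

# Map each sample to whether the pattern matches it from the start.
results = {s: pattern.match(s) is not None for s in samples}

for text, ok in results.items():
    print(f"{text!r}: {ok}")
```

The first two samples should match and the mixed-quote one should not.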

katrielalex
I have seen \1 used in JavaScript, but not in other languages, particularly PHP. Can it be used in other languages such as PHP?
slier
Yes. It's part of regex.
katrielalex
Yes, in PHP it works.
Wrikken
Glad you solved my problem, and thanks for the link too.
slier
`\w` never matches `["']`, so `(["'])\w+?\1` is the same as `(["'])\w+\1`.
taw
True, but I assume this is a sample of a larger HTML page; what about e.g. `a href="foo" target="_blank" id="bar"`...?
katrielalex
A: 

I am afraid you will have to do it the long way:

^[\w\s]+=("\w+"|'\w+')

More technically, ensuring correct matching / nesting of quotes in general is not something a regular grammar can express, so for more complex problems you would have to use a proper parser (or Perl 6 style extended regular expressions, but technically those do not count as regular expressions).
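
As a rough check of the long way, this Python sketch (Python is my assumption, and `alternation` and `is_consistently_quoted` are made-up names) spells out each quote style as its own branch, so no backreference is needed.

```python
import re

# Each quote style is written out as a separate alternative.
alternation = re.compile(r"""^[\w\s]+=("\w+"|'\w+')""")


def is_consistently_quoted(text):
    """Return True if the value opens and closes with the same quote."""
    return alternation.match(text) is not None


print(is_consistently_quoted("a href='google'"))
print(is_consistently_quoted('a href="google"'))
print(is_consistently_quoted("a href='google\""))
```

The cost is that the value pattern has to be repeated once per quote style.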

ternaryOperator
Not true. You can capture the first quote and then backreference it.
katrielalex
Yes, but if you do that, your regular expression is technically no longer a regular expression, so my statement holds (although it is a perfectly good approach).
ternaryOperator
A: 

Wrap the first ["'] in a capture group and replace the second one with \1 to use a back reference:

^[\w\s]+=(["'])\w+\1

AllenG
A: 

What exactly do you want to match? It sounds like you want to match:

  • word (tagname)
  • mandatory whitespace
  • word (attr name)
  • optional whitespace
  • =
  • optional whitespace
  • either single quoted or double quoted anything (attr value)

That would be: ^(\w+)\s+(\w+)\s*=\s*(?:'([^']*)'|"([^"]*)")

This will allow matches like:

  • a href='' - empty attr
  • a href='Hello world' - spaces and other non-word characters in quoted part
  • a href="one 'n two" - quotes of different kind in quoted part
  • a href = 'google' - spaces on both sides of =

And disallow things like these that your original regexp allows:

  • a b c href='google' - extra words
  • " ='google'" (with a leading space) - only spaces on the left
  • href='google' - only attr on the left
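
The two lists above can be checked mechanically. This is only a sketch in Python (the answer itself targets Perl and Ruby); `ATTR`, `should_match` and `should_fail` are names I made up, and the leading space in the second rejected sample is my reading of "only spaces on the left".

```python
import re

# tag name, attr name, then a single-quoted or a double-quoted value
ATTR = re.compile(r"""^(\w+)\s+(\w+)\s*=\s*(?:'([^']*)'|"([^"]*)")""")

should_match = [
    "a href=''",
    "a href='Hello world'",
    'a href="one \'n two"',
    "a href = 'google'",
]
should_fail = [
    "a b c href='google'",
    " ='google'",
    "href='google'",
]

# Collect the samples that behave as the lists above claim.
matched = [s for s in should_match if ATTR.match(s)]
rejected = [s for s in should_fail if not ATTR.match(s)]

print(len(matched), "of", len(should_match), "accepted")
print(len(rejected), "of", len(should_fail), "rejected")
```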

It still doesn't sound exactly right - you're trying to match a tag with exactly one attribute?

With this regexp, the tag name will be in $1, the attr name in $2, and the attr value in either $3 or $4 (the other being nil; most languages distinguish a group that was not taken, nil, from a group that was taken but empty, "", if you need that).
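
To illustrate the "$3 or $4" point, here is a small Python sketch (Python uses None where Perl and Ruby have undef/nil). The helper `parse_attr` is hypothetical, not something from a library.

```python
import re

ATTR = re.compile(r"""^(\w+)\s+(\w+)\s*=\s*(?:'([^']*)'|"([^"]*)")""")


def parse_attr(text):
    """Return (tag, attr, value), or None if the text does not match."""
    m = ATTR.match(text)
    if m is None:
        return None
    tag, attr, single, double = m.groups()
    # Exactly one of the two value groups took part in the match; the
    # other is None. An empty value is "" rather than None, so compare
    # against None instead of relying on truthiness.
    value = single if single is not None else double
    return tag, attr, value


print(parse_attr("a href=''"))
print(parse_attr('a href="google"'))
```

Testing with `is not None` rather than `or` matters here, because an empty attribute value is a valid result.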

A regexp that ensures the attr value always lands in the same group would be messier if you wanted to allow single quotes in a double-quoted attr value and vice versa - something like ^(\w+)\s+(\w+)\s*=\s*(['"])((?:(?!\3).)*)\3 ((?!...) is a zero-width negative look-ahead; (?:(?!\3).) means something like [^\3], except that the latter isn't supported).
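
The look-ahead version seems to work unchanged in Python's `re` module as well. A quick sketch, where `SAME_GROUP` is an illustrative name of mine:

```python
import re

# Group 3 is the opening quote. (?:(?!\3).)* consumes characters one at
# a time as long as the next character is not that same quote.
SAME_GROUP = re.compile(r"""^(\w+)\s+(\w+)\s*=\s*(['"])((?:(?!\3).)*)\3""")

double_quoted = SAME_GROUP.match('a href="one \'n two"')
single_quoted = SAME_GROUP.match("a href='say \"hi\"'")
mixed = SAME_GROUP.match("a href='google\"")

print(double_quoted.group(3), double_quoted.group(4))
print(single_quoted.group(3), single_quoted.group(4))
print(mixed)
```

Either way the value ends up in group 4 and the quote type in group 3.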

If you don't care about this, ^(\w+)\s+(\w+)\s*=\s*(['"])([^'"]*)\3 will do just fine (for both, $3 will be the quote type and $4 the attr value).

By the way, regarding (["'])\w+?\1 above: \w doesn't match quotes, so the ? doesn't change anything.

Having said all that, use a real HTML parser ;-)
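
For comparison, a parser-based sketch in Python. Beautiful Soup was suggested earlier in the thread; I use the standard library's `html.parser` here only so the example has no dependencies, and the `AttrCollector` class is my own illustration.

```python
from html.parser import HTMLParser


class AttrCollector(HTMLParser):
    """Collect (tag, attribute, value) triples from start tags."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs with the quotes removed.
        for name, value in attrs:
            self.found.append((tag, name, value))


collector = AttrCollector()
collector.feed("""<a href="foo" target='_blank' id="bar">link</a>""")
collector.close()

print(collector.found)
```

The parser copes with several attributes and mixed quote styles without any hand-written pattern.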

These regexps will work in Perl and Ruby. Other languages usually copy Perl's regexp system, but often introduce minor changes so some adjustments might be necessary. Especially the one with negative look-aheads might be unsupported.

taw