views:

102

answers:

5

Hi All,

I need to parse and return the tagname and the attributes in our PHP code files:

<ct:tagname attr="attr1" attr="attr2">

For this purpose the following regular expression has been constructed:

(\<ct:([^\s\>]*)([^\>]*)\>)

This expression works as expected, but it breaks when the following code is parsed:

<ct:form/input type="attr1" value="$item->field">

The original regular expression breaks because of the > character in $item->field. I need a regular expression that ignores the > in -> or => but still stops at a standalone >.
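To make the failure concrete, here is a minimal sketch in Python. Python is an assumption on my part, since the question is about PHP, but this particular pattern behaves the same way in Python's `re` module. The names `ORIGINAL` and `sample` are only illustrative.

```python
import re

# The pattern from the question, unchanged.
ORIGINAL = re.compile(r'(\<ct:([^\s\>]*)([^\>]*)\>)')

sample = '<ct:form/input type="attr1" value="$item->field">'

m = ORIGINAL.search(sample)
whole, tagname, attributes = m.group(1), m.group(2), m.group(3)

# [^\>]* stops at the first '>' it meets, which is the one inside "$item->field".
print(whole)       # <ct:form/input type="attr1" value="$item->
print(tagname)     # form/input
print(attributes)  #  type="attr1" value="$item-
```

The tag name is still captured correctly; only the attribute part is cut short.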

I am open to any suggestions... Thanks for your help in advance.

A: 

I think what you want is not to special-case -> and =>, but to ignore everything between pairs of quotes.

I think it can be done by inserting

("[^"]*")*

at the appropriate place.
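One possible reading of "the right place", sketched in Python (an assumption; the thread is about PHP, but the syntax used here is the same in PCRE): let the attribute part be a run of either a single character that is neither a quote nor `>`, or a complete double-quoted string. The names `QUOTE_AWARE` and `parse_tag` are made up for this example, and single-quoted values are not handled.

```python
import re

# Attribute text = any mix of
#   [^>"]     one character that is neither '>' nor a double quote
#   "[^"]*"   one complete double-quoted string (may contain '>')
QUOTE_AWARE = re.compile(r'<ct:([^\s>]*)((?:[^>"]|"[^"]*")*)>')

def parse_tag(text):
    """Return (tagname, raw_attribute_text) for the first <ct:...> tag, or None."""
    m = QUOTE_AWARE.search(text)
    return (m.group(1), m.group(2)) if m else None

print(parse_tag('<ct:form/input type="attr1" value="$item->field">'))
```

Because the two alternatives can never start on the same character, the regex engine does not have to try both at any position.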

Jonas Kölker
+1  A: 

In general, any parsing problem rapidly runs into language constructs that are context-free but not regular. It may be a better[1] solution to write a context-free parser, ignoring everything except the elements you're interested in.

[1] "better" as seen from a viewpoint of Being The Right Thing, not necessarily a return on investment one.
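As a rough illustration of that direction, here is a small hand-written scanner in Python (an assumption; the same loop could be written in PHP). It is not a full context-free parser, just a sketch that walks the text character by character and skips quoted values whole. The function name `scan_ct_tags` is hypothetical, and it assumes there is no whitespace around `=`.

```python
def scan_ct_tags(source):
    """Yield (tagname, [(attr_name, attr_value), ...]) for each <ct:...> tag."""
    n = len(source)
    pos = 0
    while True:
        start = source.find("<ct:", pos)
        if start == -1:
            return
        i = start + 4
        # Tag name: everything up to whitespace or '>'.
        name_start = i
        while i < n and source[i] != ">" and not source[i].isspace():
            i += 1
        name = source[name_start:i]
        attrs = []
        # Attributes: repeat until the '>' that really closes the tag.
        while i < n and source[i] != ">":
            if source[i].isspace():
                i += 1
                continue
            key_start = i
            while i < n and source[i] not in "=>" and not source[i].isspace():
                i += 1
            key = source[key_start:i]
            value = None
            if i < n and source[i] == "=":
                i += 1
                if i < n and source[i] in "\"'":
                    # Quoted value: jump to the matching quote, so any '>'
                    # inside the quotes is never inspected.
                    quote = source[i]
                    end = source.find(quote, i + 1)
                    if end == -1:
                        end = n
                    value = source[i + 1:end]
                    i = min(end + 1, n)
                else:
                    # Unquoted value: read up to whitespace or '>'.
                    value_start = i
                    while i < n and source[i] != ">" and not source[i].isspace():
                        i += 1
                    value = source[value_start:i]
            attrs.append((key, value))
        yield name, attrs
        pos = i + 1

print(list(scan_ct_tags('<ct:form/input type="attr1" value="$item->field">')))
```

It is longer than any of the regular expressions, which is the return-on-investment trade-off mentioned in the footnote.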

Jonas Kölker
+1  A: 

You could try using a negative lookbehind like this:

(\<ct:([^\s\>]*)(.*?)(?<!-|=)\>)

Matches:

<ct:tagname attr="attr1" attr="attr2">
<ct:form/input type="attr1" value="$item->field">

Not sure it is the best-suited solution for your case, but it respects the constraints.
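A quick check of this pattern, sketched in Python (an assumption; the pattern is written for PCRE, but Python's `re` accepts this lookbehind too because both alternatives have the same width). The names `LOOKBEHIND` and `first_tag` are only illustrative.

```python
import re

# The pattern from this answer, unchanged.
LOOKBEHIND = re.compile(r'(\<ct:([^\s\>]*)(.*?)(?<!-|=)\>)')

def first_tag(text):
    """Return the text of the first <ct:...> tag the pattern finds, or None."""
    m = LOOKBEHIND.search(text)
    return m.group(1) if m else None

# The '>' in '->' is skipped because it is preceded by '-'.
print(first_tag('<ct:form/input type="attr1" value="$item->field">'))

# Limitation: a quoted '>' that does not follow '-' or '=' still ends the match.
print(first_tag('<ct:x title="a > b">'))  # <ct:x title="a >
```

So it covers the -> and => cases from the question, but not an arbitrary > inside an attribute value.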

madgnome
Thanks, this one is working well for me!
gyurisc
But it causes avoidable backtracking.
Gumbo
+2  A: 

Try this:

<ct:([^\s\>]*)((?:\s+\w+\s*=\s*(?:"[^"]*"|'[^']*')\s*)*)>

But if that’s XML, you had better use an XML parser.
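For what it is worth, here is how that pattern could be used to get the tag name and the individual attributes, sketched in Python (an assumption; the thread is about PHP, where `preg_match_all` would play the same role). `TAG` is the pattern above unchanged; `ATTR` and `parse_tags` are hypothetical helpers added for the example.

```python
import re

# The pattern from this answer, unchanged.
TAG = re.compile(r'''<ct:([^\s\>]*)((?:\s+\w+\s*=\s*(?:"[^"]*"|'[^']*')\s*)*)>''')

# Second pass over the captured attribute text: name, then a double- or single-quoted value.
ATTR = re.compile(r'''(\w+)\s*=\s*(?:"([^"]*)"|'([^']*)')''')

def parse_tags(source):
    """Return [(tagname, [(attr_name, attr_value), ...]), ...] for every match."""
    result = []
    for tag in TAG.finditer(source):
        # findall gives '' for the quote style that did not take part in the match.
        attrs = [(name, dq or sq) for name, dq, sq in ATTR.findall(tag.group(2))]
        result.append((tag.group(1), attrs))
    return result

print(parse_tags('<ct:form/input type="attr1" value="$item->field">'))
```

A list of pairs is used instead of a dict because the first example in the question repeats the attribute name. Note that tags with unquoted values or a trailing `/` before `>` will not match this pattern.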

Gumbo
+1 for ‘use an XML parser’. You can't parse XML reliably with regex, full stop.
bobince
A: 

My suggestion is to match the attributes in the same expression.

\<ct:([^\s\>]*)((\s+([a-z0-9]+)=\"([^\"]*)\")*)\s*\>

edit: removed part about > not being valid xml in attribute values.

phq
‘>’ in an attribute value is perfectly well-formed XML.
bobince