I have the following possible strings that I need to turn into arrays so I can feed them into an HTML generator. I am not starting with HTML or XML; I am trying to create a shorthand that will let me populate my HTML objects more simply and quickly, with more readable code.

id='moo'
id = "foo" type= doo    value ='do\"o'
on_click='monkeys("bobo")'

I need to pull out the attribs and their corresponding values. These attrib strings are not associated with an HTML or XML tag, and I would like to do it with 1 to 3 regular expressions.

  • The value may be encapsulated by either single or double quotes
  • If the value is encapsulated by quotes, it may also contain whitespace, quotes different from the encapsulating quotes, or escaped quotes that are the same as the encapsulating quotes.
  • There may or may not be whitespace between the attrib and the =, and between the = and the value.

The eventual results should look like:

array(1) {
  [id] => moo
}
array(3) {
  [id] => foo
  [type] => doo
  [value] => do"o
}
array(1) {
  [on_click] => monkeys("bobo")
}

but if it turns out like:

array(2) {
  [0] => id
  [1] => moo
}
array(6) {
  [0] => id
  [1] => foo
  [2] => type
  [3] => doo
  [4] => value
  [5] => do"o
}

array(2) {
  [0] => on_click
  [1] => monkeys("bobo")
}

I can re-arrange it from there.
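
For illustration, the re-arranging step mentioned above could look roughly like this in PHP. This is a minimal sketch: the function name `pairs_to_assoc` is hypothetical, and it assumes the flat array strictly alternates name, value, name, value.

```php
<?php
// Hypothetical helper: turns [name, value, name, value, ...] into [name => value, ...].
function pairs_to_assoc(array $flat): array
{
    $assoc = [];
    // array_chunk() groups the flat list into consecutive [name, value] pairs.
    foreach (array_chunk($flat, 2) as $pair) {
        // A trailing name without a value is skipped rather than guessed at.
        if (count($pair) === 2) {
            [$name, $value] = $pair;
            $assoc[$name] = $value;
        }
    }
    return $assoc;
}

print_r(pairs_to_assoc(['id', 'foo', 'type', 'doo', 'value', 'do"o']));
```

If the same name appears twice, the later value overwrites the earlier one; whether that is the desired behaviour is a design choice the post does not settle.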

Some regexes I have tried so far, and their issues:

  • /[\s]+/ - Returns attrib/value pairs only if there is no whitespace around the =.
  • /(?<==)(\".*\"|'.*'|.*)$/ - Returns the value including the encapsulating quotes. It does ignore escaped quotes within the value, though.
  • /^[^=]*/ - Returns the attribute just fine, regardless of whitespace between the attrib and the =.
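
As a point of comparison with the attempts listed above, here is a sketch of a single pattern that tries to cover all three requirements at once. It assumes PHP with PCRE; the constant `ATTRIB_PATTERN` and the function `parse_attribs` are hypothetical names, and the choice to strip the backslash from any escaped quote (not only the encapsulating kind) is an assumption based on the `do\"o` sample.

```php
<?php
// Nowdoc keeps the regex free of PHP string escaping. The x flag allows
// whitespace and # comments inside the pattern.
const ATTRIB_PATTERN = <<<'REGEX'
/
  ([\w:-]+)                 # 1: attribute name
  \s*=\s*                   # equals sign, optional whitespace on both sides
  (?|                       # branch reset: the value is always group 2
      "((?:[^"\\]|\\.)*)"   # double-quoted value, backslash escapes allowed
    | '((?:[^'\\]|\\.)*)'   # single-quoted value, backslash escapes allowed
    | ([^\s"']+)            # bare value with no quotes
  )
/xs
REGEX;

// Hypothetical helper: returns [name => value, ...] for one attribute string.
function parse_attribs(string $input): array
{
    $attribs = [];
    preg_match_all(ATTRIB_PATTERN, $input, $matches, PREG_SET_ORDER);
    foreach ($matches as $m) {
        // Drop the backslash in front of escaped quotes; other backslashes stay.
        $attribs[$m[1]] = strtr($m[2], ['\\"' => '"', "\\'" => "'"]);
    }
    return $attribs;
}

print_r(parse_attribs('on_click=\'monkeys("bobo")\''));
```

Text that does not fit the name=value shape is silently skipped by `preg_match_all`, so malformed input is ignored rather than reported.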
A: 

Any particular reason you want to use regex specifically here? Seems like a token-based parser might work better for you, as you need to keep more state than is comfortable to do in a regex.

zigdon
Any suggestions on how I should go about this?
Tyson of the Northwest
State machine, parsing "tokens", and knowing what to expect. Start by looking for an identifier, then (skipping spaces) an '='. Then either a ' or a " followed by a word token and the same closing quote, OR just a word token without quotes. Repeat as needed.
zigdon
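
The steps described in the comment above could be sketched as a small hand-written scanner. This is only one possible reading of that description, written in PHP on the assumption that this is the asker's language; `tokenize_attribs` is a hypothetical name, and throwing on malformed input is a choice made for this sketch.

```php
<?php
// Hypothetical scanner: identifier, '=', then a quoted or bare value, repeated.
function tokenize_attribs(string $input): array
{
    $ws = " \t\r\n";
    $attribs = [];
    $len = strlen($input);
    $i = 0;

    while ($i < $len) {
        $i += strspn($input, $ws, $i);              // skip spaces between pairs
        if ($i >= $len) {
            break;
        }

        // Identifier: everything up to whitespace or '='.
        $nameLen = strcspn($input, $ws . '=', $i);
        if ($nameLen === 0) {
            throw new InvalidArgumentException("Expected attribute name at offset $i");
        }
        $name = substr($input, $i, $nameLen);
        $i += $nameLen;

        // Optional spaces, a mandatory '=', optional spaces.
        $i += strspn($input, $ws, $i);
        if ($i >= $len || $input[$i] !== '=') {
            throw new InvalidArgumentException("Expected '=' after '$name'");
        }
        $i++;
        $i += strspn($input, $ws, $i);

        $value = '';
        if ($i < $len && ($input[$i] === '"' || $input[$i] === "'")) {
            // Quoted value: read until the same quote, honouring backslash escapes.
            $quote = $input[$i++];
            $closed = false;
            while ($i < $len) {
                $ch = $input[$i++];
                if ($ch === '\\' && $i < $len) {
                    $next = $input[$i++];
                    // Drop the backslash before a quote; keep other escapes verbatim.
                    $value .= ($next === '"' || $next === "'") ? $next : '\\' . $next;
                } elseif ($ch === $quote) {
                    $closed = true;
                    break;
                } else {
                    $value .= $ch;
                }
            }
            if (!$closed) {
                throw new InvalidArgumentException("Unterminated quote in '$name'");
            }
        } else {
            // Bare value: everything up to the next whitespace.
            $valueLen = strcspn($input, $ws, $i);
            $value = substr($input, $i, $valueLen);
            $i += $valueLen;
        }
        $attribs[$name] = $value;
    }
    return $attribs;
}
```

Compared with a regex, this form makes it straightforward to report where the input went wrong, at the cost of more code.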
A: 

Tyson,

It appears that you have already done some parsing to remove the XML/HTML elements, and are now trying to process the remaining attributes. In general, regular expressions are not sufficient for parsing XML/HTML.

If you have access to the XML/HTML, you should consider using a DOM processing library / extension to PHP to read in the XML/HTML, and iterate/parse the elements and attributes.

Here is an example reference:

bsterne
Unfortunately no. I am putting a framework over the DOM to facilitate faster generation of XML content. I am familiar with the DOM, and I am trying to make a tool that will parse an attribute string into an array so I can feed it into my DOM objects.
Tyson of the Northwest
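
To make the last step concrete, here is a hedged sketch of feeding an already-parsed attribute array into a DOM element using PHP's DOM extension. The function name `apply_attribs` and the `input` element are illustrative assumptions; the asker's actual framework is not shown in the thread.

```php
<?php
// Hypothetical helper: copies [name => value] pairs onto an existing element.
function apply_attribs(DOMElement $element, array $attribs): void
{
    foreach ($attribs as $name => $value) {
        // setAttribute() takes the raw value; escaping happens on serialization.
        $element->setAttribute($name, $value);
    }
}

$doc = new DOMDocument('1.0', 'UTF-8');
$el = $doc->createElement('input');      // element name chosen for illustration
$doc->appendChild($el);

apply_attribs($el, ['id' => 'foo', 'type' => 'doo', 'value' => 'do"o']);

echo $doc->saveXML($el), "\n";
```

Because the DOM serializer handles quoting on output, the parsed values can be stored unescaped in the array.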