I found this useful regex code here while looking to parse HTML tag attributes:

(\S+)=["']?((?:.(?!["']?\s+(?:\S+)=|[>"']))+.)["']?

It works great, but it's missing one key element that I need. Some attributes are event handlers that contain inline JavaScript, like this:

onclick="doSomething(this, 'foo', 'bar');return false;"

Or:

onclick='doSomething(this, "foo", "bar");return false;'

I can't figure out how to make the original expression ignore the quotes inside the JS (single or double) when they are nested within the quotes that delimit the attribute's value.

I should add that this is not being used to parse an entire HTML document. It's used on an argument of an older "array to select menu" function that I've updated. One of the arguments is a string of extra HTML attributes to append to the form element.

I've made an improved function and am deprecating the old... but in case somewhere in the code is a call to the old function, I need it to parse these into the new array format. Example:

// Old Function
function create_form_element($array, $type, $selected="", $append_att="") { ... }
// Old Call
create_form_element($array, SELECT, $selected_value, "onchange=\"something(this, '444');\"");

The new version takes an array of attr => value pairs to create the extra attributes.

create_select($array, $selected_value, array('style' => 'width:250px;', 'onchange' => "doSomething('foo', 'bar')"));

This is merely a backwards compatibility issue where all calls to the OLD function are routed to the new one, but the $append_att argument in the old function needs to be made into an array for the new one, hence my need to use regex to parse small HTML snippets. If there is a better, lightweight way to accomplish this, I'm open to suggestions.

+1  A: 

The problem with your regular expression is that it tries to handle single and double quotes at the same time, so it can't support attribute values that contain the other kind of quote. This regex handles each quoting style in its own alternative and will work better:

(\w+)=("[^<>"]*"|'[^<>']*'|\w+)
Jan Goyvaerts
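
As a rough sketch of how the regex above might be used for the backwards-compatibility case in the question, the snippet below converts an old-style attribute string into an attr => value array with preg_match_all. The function name parse_attribute_string is hypothetical and not part of the original code; stripping the surrounding quotes from each value is also an assumption about what the new function expects.

```php
<?php
// Hypothetical helper (not from the original post): turns a string of HTML
// attributes into an associative array of name => value pairs.
function parse_attribute_string(string $attributes): array
{
    // Same pattern as in the answer: a name, "=", then a double-quoted,
    // single-quoted, or bare (unquoted) value.
    $pattern = '/(\w+)=("[^<>"]*"|\'[^<>\']*\'|\w+)/';

    $result = [];
    if (preg_match_all($pattern, $attributes, $matches, PREG_SET_ORDER)) {
        foreach ($matches as $match) {
            $value = $match[2];
            // Assumption: the caller wants the value without its delimiters,
            // so strip one pair of matching surrounding quotes if present.
            $first = $value[0];
            if (($first === '"' || $first === "'") && substr($value, -1) === $first) {
                $value = substr($value, 1, -1);
            }
            $result[$match[1]] = $value;
        }
    }
    return $result;
}

// Example input modeled on the old call in the question.
$old = "onchange=\"something(this, '444');\" style='width:250px;' size=5";
print_r(parse_attribute_string($old));
```

With this input the array has the keys onchange, style, and size, and the single quotes inside the onchange value are left untouched. Two limits of the pattern are worth keeping in mind: \w+ does not match hyphens, so a name such as data-id would be captured only as id, and a quoted value containing < or > is not matched.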