ansaurus

Question

Answer 1

A:

I would break this problem into a few smaller ones. That is easier to write and easier to maintain, at the cost of a few more lines of code. The problem with one huge regex is that there are so many gotchas, and invalid input is hard to handle in a single big pattern.

/<link\b([^>]*)>/i
-> extract attributes:
   /([\w-]+)\s*=\s*"([^"]*)"/

#<style\b[^>]*>(.+?)</style>#is
-> extract inline styles

And finally merge the results into an array as if preg_match_all produced it.
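
As a rough illustration of the split described above, here is a minimal PHP sketch. The function name extract_stylesheets and the shape of the returned array are assumptions made for this example, not part of the original answer, and the attribute pattern only handles double-quoted values.

```php
<?php
// Hypothetical helper: collects <link> attributes and inline <style> bodies.
function extract_stylesheets(string $html): array
{
    $results = [];

    // Step 1: find each <link ...> tag, then pull its attributes apart.
    preg_match_all('/<link\b([^>]*)>/i', $html, $links, PREG_SET_ORDER);
    foreach ($links as $link) {
        $attrs = [];
        preg_match_all('/([\w-]+)\s*=\s*"([^"]*)"/', $link[1], $pairs, PREG_SET_ORDER);
        foreach ($pairs as $pair) {
            // Attribute names are case-insensitive in HTML, so normalize them.
            $attrs[strtolower($pair[1])] = $pair[2];
        }
        $results[] = ['tag' => 'link'] + $attrs;
    }

    // Step 2: grab the body of each inline <style> block.
    // '#' is used as the delimiter so that "</style>" needs no escaping.
    preg_match_all('#<style\b[^>]*>(.*?)</style>#is', $html, $styles, PREG_SET_ORDER);
    foreach ($styles as $style) {
        $results[] = ['tag' => 'style', 'css' => trim($style[1])];
    }

    return $results;
}
```

Each step can be tested and adjusted on its own, which is the main point of splitting the work.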

galambalazs 2010-06-30 16:06:22

Answer 2

A:

If I were doing this with regular expressions, e.g. because I needed to handle invalid HTML, which is often difficult with a proper parser, I would use separate regular expressions. Use one or two regexes to get the style and link tags, and use another set of regexes to get the various attributes from each tag.

Your regex tries to do everything at once by using lookahead to scan the opening tag repeatedly to get all the elements. That's a neat trick in a situation where one regex is all you can use, but not something to be recommended when writing your own code.

I have made some improvements to your regex. I replaced the .*? and .+? with negated character classes where possible for efficiency. The reason why your regex didn't work is that it doesn't correctly try to match the closing tag or correctly handle link tags that have no closing tag. I fixed that.

The regex:

<(link|style)(?=[^<>]*?(?:type="(text/css)"|>))(?=[^<>]*?(?:media="([^<>"]*)"|>))(?=[^<>]*?(?:href="(.*?)"|>))(?=[^<>]*?(?:rel="([^<>"]*)"|>))(?:.*?</\1>|[^<>]*>)

PHP:

$pattern = '%<(link|style)(?=[^<>]*?(?:type="(text/css)"|>))(?=[^<>]*?(?:media="([^<>"]*)"|>))(?=[^<>]*?(?:href="(.*?)"|>))(?=[^<>]*?(?:rel="([^<>"]*)"|>))(?:.*?</\1>|[^<>]*>)%si';
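
To show how the capture groups line up, here is a small usage sketch with preg_match_all. The sample HTML and the $found array are assumptions made for illustration. The rel lookahead is written with a lazy `[^<>]*?` here: a greedy `[^<>]*` would run to the closing `>` and satisfy the lookahead through the `>` alternative, leaving the rel group empty.

```php
<?php
// Groups: 1 = tag name, 2 = type, 3 = media, 4 = href, 5 = rel.
$pattern = '%<(link|style)(?=[^<>]*?(?:type="(text/css)"|>))(?=[^<>]*?(?:media="([^<>"]*)"|>))(?=[^<>]*?(?:href="(.*?)"|>))(?=[^<>]*?(?:rel="([^<>"]*)"|>))(?:.*?</\1>|[^<>]*>)%si';

// Hypothetical sample input.
$html = '<link rel="stylesheet" type="text/css" href="main.css" media="screen" />' . "\n"
      . '<style type="text/css">body { margin: 0; }</style>';

preg_match_all($pattern, $html, $matches, PREG_SET_ORDER);

$found = [];
foreach ($matches as $m) {
    // Groups that did not take part in the match may be missing, hence the ?? fallbacks.
    $found[] = [
        'tag'   => strtolower($m[1]),
        'type'  => $m[2] ?? '',
        'media' => $m[3] ?? '',
        'href'  => $m[4] ?? '',
        'rel'   => $m[5] ?? '',
    ];
}

print_r($found);
```

Because every attribute sits in its own lookahead, the attributes may appear in any order inside the tag.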

Jan Goyvaerts 2010-07-04 06:25:58

Answer 3

A:

Thanks to all for your answers, but I finally rewrote that bit using the DOM extension. That should make it much more robust.
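
The rewritten code is not shown in the thread, so the following is only a sketch of what a DOM-based version might look like. The function name stylesheets_via_dom, the returned array shape, and the filter on rel="stylesheet" are all assumptions made for this example.

```php
<?php
// Hypothetical helper; not the author's actual code.
function stylesheets_via_dom(string $html): array
{
    $doc = new DOMDocument();

    // Real-world markup is often invalid; collect libxml warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $results = [];

    // External stylesheets: <link rel="stylesheet" href="...">.
    foreach ($doc->getElementsByTagName('link') as $link) {
        if (strtolower($link->getAttribute('rel')) !== 'stylesheet') {
            continue; // assumption: skip icons, feeds, and other link types
        }
        $results[] = [
            'tag'   => 'link',
            'href'  => $link->getAttribute('href'),
            'media' => $link->getAttribute('media'), // empty string when absent
        ];
    }

    // Inline stylesheets: the text content of each <style> element.
    foreach ($doc->getElementsByTagName('style') as $style) {
        $results[] = ['tag' => 'style', 'css' => trim($style->textContent)];
    }

    return $results;
}
```

The parser takes care of attribute order, quoting style, and missing closing tags, which are the cases the regex answers have to handle by hand.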

Max 2010-07-06 13:24:06

extract stylesheets via regex