Hi guys,

How can I match a block of

<li>item 1</li>
<li>item 2</li>

regardless of whether there is a blank line before or after the block, and enclose it in <ul></ul> tags using PHP's preg_* functions?

Thank you for your answers.

A: 

Use an XML parser to do this, not a regex. PHP has one built in.

Rich Bradshaw
As much as I agree with this, I think it should be a comment unless you add an example that shows how to do it.
Gordon
The code sample is not valid XML: it doesn't have a single top-level element. Use an HTML parser to parse HTML, and I think PHP has one of these built in too. (If not, I linked one in a comment to the question.)
Peter Boughton
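A minimal sketch of the parser-based approach suggested above, assuming PHP's DOM extension (DOMDocument) is available. The sample input and variable names are illustrative only, and the exact whitespace of the serialized output may differ between libxml versions.

```php
<?php
// Sketch only: wrap stray <li> elements in a new <ul> using the DOM extension.
$html = "<li>item 1</li>\n<li>item 2</li>";

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$doc->loadHTML($html);
libxml_clear_errors();

// Copy the node list into a plain array before moving nodes around.
$items = iterator_to_array($doc->getElementsByTagName('li'));

// Create the <ul>, place it where the first <li> currently sits,
// then move every <li> into it (appendChild relocates an existing node).
$ul = $doc->createElement('ul');
$items[0]->parentNode->insertBefore($ul, $items[0]);
foreach ($items as $li) {
    $ul->appendChild($li);
}

// Serialize only the new <ul> element, not the implied <html>/<body> wrapper.
$wrapped = $doc->saveHTML($ul);
echo $wrapped, "\n";
```

This only covers the simple case where every LI in the input belongs to the same list; input with several separate runs of items would need extra grouping logic.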
+1  A: 

David already stated that PHP has XML and HTML parsers. However, if you really want to use a regex, it would probably be something like:

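// Lazy quantifier: .*? stops at the first closing </li>; $matches[1] holds the item text
preg_match('#<li>(.*?)</li>#', $string, $matches);
// Same thing: the U modifier inverts the greediness of .*
preg_match('#<li>(.*)</li>#U', $string, $matches);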
faileN
Do you really need to escape the `/` when using `#` as regex boundaries?
nikc
Good point. Just tested it and it also works without escaping. I removed the slashes. Thanks :)
faileN
This works only if each `<li>..</li>` is on a line by itself. If there are multiple list elements on one line, it will match the whole line instead of each element individually.
jigfox
No, it works even if there are several <li> tags on one line. That's why the `?` follows the `*`; in the second version I used the `U` modifier to invert the greediness. If you left that out, things would go as you say and `item1</li><li>item2` would be matched. That's not the case here.
faileN
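To make the greediness point concrete, here is a small sketch comparing the three variants on a single line containing two items. The sample string and the variable names are made up for illustration; preg_match_all is used instead of preg_match so that every match is collected.

```php
<?php
// Two list items on a single line, as discussed in the comments.
$string = '<li>item1</li><li>item2</li>';

// Lazy quantifier: each <li>...</li> pair is matched separately.
preg_match_all('#<li>(.*?)</li>#', $string, $lazy);

// U modifier: inverts greediness, so .* behaves like .*? here.
preg_match_all('#<li>(.*)</li>#U', $string, $ungreedy);

// Greedy quantifier without U: a single match spanning both items.
preg_match_all('#<li>(.*)</li>#', $string, $greedy);

print_r($lazy[1]);      // two captures: item1 and item2
print_r($ungreedy[1]);  // the same two captures
print_r($greedy[1]);    // one capture: item1</li><li>item2
```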
Sorry, I didn't notice that. You're right!
jigfox
There is always one LI pair per line. Thank you very much, faileN :)
ivan73
You might want to add the `s` modifier to allow the dot to match newlines.
kemp
ivan73, does this answer solve your problem? If so, mark it as the solution; if not, explain how it fails.
Peter Boughton
OK, I tried that and it doesn't work, so I'll try to explain again. Say you have text like this:
----
some pure text line
<li>item1</li>
<li>item2</li>
some pure text line
---
I need to match those items as a whole block and, using preg_replace, get something like this:
--
some pure text line
<ul>
<li>item1</li>
<li>item2</li>
</ul>
some pure text line
---
ivan73
Look at the answer I've added then - that should do what you're after.
Peter Boughton
Thanks Peter, but simple_html_dom.php has 36 KB of code.
ivan73
No, my *answer*, not that comment (which I'll go remove since I've linked to the core implementation in my answer). Here: http://stackoverflow.com/questions/3108812/how-to-match-a-block-of-li-li-using-regexp/3109512#3109512
Peter Boughton
Ok, you made my day :)
ivan73
How about you thank him by accepting his answer?
c0mrade
How about looking above?
ivan73
+2  A: 

If this is safe, controlled input, and you just have LIs with missing parent ULs, you can do:

preg_replace('#\s*(?:<li>.*</li>\s*)+#', '<ul>$0</ul>', $input);

(You may want to add some \n to the replacement string before or after the UL.)
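As a worked sketch of that suggestion, the snippet below applies the same pattern to a sample modeled on the text from the question's comments, with a newline added on each side of the UL. The variable names and the exact sample string are assumptions for illustration.

```php
<?php
// Sample input: two list items between two lines of plain text.
$input = "some pure text line\n"
       . "<li>item1</li>\n"
       . "<li>item2</li>\n"
       . "some pure text line\n";

// Same pattern as above. The match includes the whitespace around the
// block, so the replacement puts a newline on each side of the <ul>.
// \$0 keeps the backreference literal inside the double-quoted string.
$output = preg_replace(
    '#\s*(?:<li>.*</li>\s*)+#',
    "\n<ul>\$0</ul>\n",
    $input
);

echo $output;
```

With this input, the two LI lines end up between a `<ul>` line and a `</ul>` line, and the surrounding text lines are left untouched.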

NOTE: This will fail if:

  • There are any existing UL/OL lists in the content.
  • There is anything other than whitespace between consecutive list items.
  • Any of the LIs span multiple lines (the . excludes newline by default).
  • There are any attributes on the LIs.
  • Possibly some things I haven't considered.

Some of these can relatively easily be catered for, but I'm not going to - if you haven't got known specific content, you should be using a real HTML parser instead.

The 'Regular' in Regular Expressions has a specific meaning, and full HTML is not a Regular language, so trying to handle all the intricacies of HTML with simple regex is liable to fail.
If you use a bad regex on user-supplied HTML, you may be introducing HTML injection vulnerabilities into your code.

Peter Boughton
Point #3 is trivial to fix; the rest falls under "controlled input".
kemp
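One possible way to handle point #3, sketched under the same "controlled input" assumption: add the `s` modifier so the dot matches newlines, and make the quantifier lazy so it still stops at the nearest closing tag. Allowing attributes on the opening tag via `[^>]*` is an extra assumption, not something taken from the answer, and the sample input is hypothetical.

```php
<?php
// Hypothetical sample: the first item has an attribute and spans two lines.
$input = "some pure text line\n"
       . "<li class=\"first\">item\n1</li>\n"
       . "<li>item2</li>\n"
       . "some pure text line\n";

// s modifier: the dot also matches newlines.
// .*? is lazy so that, with s enabled, it stops at the nearest </li>.
// <li\b[^>]*> accepts optional attributes on the opening tag.
$pattern = '#\s*(?:<li\b[^>]*>.*?</li>\s*)+#s';

$output = preg_replace($pattern, "\n<ul>\$0</ul>\n", $input);

echo $output;
```

The other caveats from the list still apply: existing UL/OL lists or non-whitespace text between items are not handled by this variant.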
It works :) There is no security issue for me, since the input is fully controlled. Thanks for your answer and links, Peter.
ivan73