I have an HTML document with multiple commented-out PHP arrays, e.g.:

<!-- Array
(
[key] => 0
)
-->

Using PHP, I need to somehow parse the HTML for only these comments (there are other comments that will need to be ignored) and extract the contents. I've been trying to use preg_match_all but my regex skills aren't up to much. Could anyone point me in the right direction?

Any help is much appreciated!

+2  A: 

How about using an HTML parser that lets you access comments (for example, Simple HTML DOM) and then checking each comment for newlines using strpos?

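// Assumes the Simple HTML DOM library file is available on the include path.
include 'simple_html_dom.php';

$html = str_get_html('...HTML HERE...');
// 'comment' selects the comment nodes in the parsed document.
$comments = $html->find('comment');
foreach ( $comments as $comment ){
    // Multi-line comments are the candidates for the dumped arrays.
    if ( strpos($comment, "\n") !== false ){
        //process comment
    }
}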
Yacoby
Thanks - I wonder if there is a way to do something similar through DOMDocument?
Ben
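
For what it's worth, something along these lines might work with DOMDocument. This is only a sketch: the sample markup is invented for illustration, and it assumes the comments of interest always begin with "Array". It uses DOMXPath with the `//comment()` expression to collect comment nodes and then filters them by their text.

```php
<?php
// Hypothetical sample markup, made up for this sketch.
$html = <<<'HTML'
<html><body>
<!-- an ordinary comment -->
<!-- Array
(
[key] => 0
)
-->
<p>Some content</p>
</body></html>
HTML;

// Collect libxml warnings internally instead of printing them.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$arrays = [];
// "//comment()" selects every comment node in the document.
foreach ($xpath->query('//comment()') as $comment) {
    $text = trim($comment->nodeValue);
    // Assumption: the dumped arrays are the comments whose text starts with "Array".
    if (strpos($text, 'Array') === 0) {
        $arrays[] = $text;
    }
}

print_r($arrays);
```

Since the document is said to be XHTML, `loadXML` may be an option instead of `loadHTML`, provided the markup is well-formed.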
A: 

Don't parse HTML with regular expressions. Ever.

Williham Totland
@Yacoby: Did you read the link?
Williham Totland
@Williham Yes. I wouldn't go as far as to say ever. There are situations where it is easier and works just fine. It is like the people who say "never ever use goto" and then come up with the most convoluted method ever for breaking out of nested loops.
Yacoby
I always go so far as to say Ever. And Never. And Always. I love those little guys.
Williham Totland
I can see the rationale for general use, but in this case I know the exact string I'm searching for...
Ben
A: 

Three facts come into play here

  1. there is no place in an HTML document where a literal "<!--" can show up and not mean a comment (everywhere else it would be escaped as "&lt;!--")
  2. you don't seem to want to change the document contents, only find bits in it (search-and-replace has a high probability of breaking the document, search alone has not)
  3. comments cannot be nested in HTML (contrary to normal HTML tags) - this makes all the difference

The above combination means that (lo and behold) regular expressions can be used to identify HTML comments.

Try this regex: <!-- Array([\s\S]*?)-->. Match group one will contain everything after "Array" up to the closing sequence of the comment.

You can apply further sanity checking to the found bits to make sure they are in fact what you are looking for.
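
As a rough sketch of how such a pattern could be used from PHP with preg_match_all: the sample HTML below is invented for illustration, and the quantifier sits inside the capturing group so that group one holds the whole comment body rather than a single character.

```php
<?php
// Hypothetical sample document, made up for this sketch.
$html = <<<'HTML'
<p>intro</p>
<!-- a comment to ignore -->
<!-- Array
(
[key] => 0
)
-->
<div>more</div>
<!-- Array
(
[key] => 1
)
-->
HTML;

// Group 1 captures everything between "<!-- Array" and the closing "-->".
// The lazy quantifier stops at the first "-->" after each opening sequence.
preg_match_all('/<!-- Array([\s\S]*?)-->/', $html, $matches);

// Trim surrounding whitespace from each captured body.
$bodies = array_map('trim', $matches[1]);

print_r($bodies);
```

Each entry of `$bodies` should then be the parenthesised part of one dump, which is where any further sanity checks would go.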

Tomalak
2. Incorrect: `<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd"> <title>Comments where you don't expect</title> <h1>Comments where you don't expect</h1> <!-- This is a comment --> <div> <img src="http://sstatic.net/so/img/logo.png" alt="<!-- this is just alt text"> </div>`
David Dorward
Just for clarity, the document I'm dealing with is XHTML 1.0 Strict
Ben
@David: Yes, that is the edge case (+1 the comment). My remark would be that it is bad, bad style to use unescaped pointy brackets *anywhere* in the document except for tags (and attribute values are the only place where the `<` is… erm… tolerated). But I admit it might happen somewhere, and of course you need to know if it can happen in your data.
Tomalak
Thank you - this is doing the trick. Sorry to be thick, but is there a way to remove the HTML comment tags and just include the contents?
Ben
@Ben: That you can match them means you can replace them, doesn't it? *Disclaimer: Even though this answer may look like the opposite, I highly dis-recommend the use of regex to process HTML. There are cases where regex may be an acceptable shortcut, but these are rare, far apart and spotting them is not trivial. Finding comments is one such case, but bear in mind that David Dorward's objection above is correct and needs consideration. Proceed at your own risk.*
Tomalak
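
To illustrate the "match means replace" point, here is a minimal sketch using preg_replace. The sample string is invented for illustration, and the pattern is a slight variation that keeps the word "Array" inside the captured group; it is an assumption about what "just the contents" should look like, not a definitive answer.

```php
<?php
// Hypothetical sample input, made up for this sketch.
$html = "<p>before</p>\n<!-- Array\n(\n[key] => 0\n)\n-->\n<!-- keep me -->\n<p>after</p>";

// Replace each "<!-- Array ... -->" comment with its inner text ($1),
// dropping the "<!--" and "-->" delimiters and the whitespace next to them.
// Comments that do not start with "Array" are left untouched.
$result = preg_replace('/<!--\s*(Array[\s\S]*?)\s*-->/', '$1', $html);

echo $result;
```

If the document itself should stay unchanged, reading group one from preg_match_all gives the same contents without rewriting anything.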
The counter argument is that `alt="<!--"` is rather more readable (and therefore better style) than `alt="&lt;!--"`
David Dorward
Tomalak
(a) So what? They have special meaning, you can't avoid it (well, you can — so long as the next character is a space or another character in a list I don't have to hand). (b) You can, actually. (c) The SGML specification says otherwise. You might not like it, you might have designed it differently, but it doesn't change that fact.
David Dorward
@David: **a)** HTML is not about avoiding awkward character sequences as much as possible because they hinder reading. Human consumption is not the primary function of HTML; correctly transporting markup *and* data to a user agent is. **b)** I'm not sure about this. You can because HTML parsers are lenient and forgiving. **c)** Show me the part in the spec that allows it. ;) Bottom line is - "it's possible" != "you can". Example: *It's possible* to do this in PHP: `preg_replace("/\d/", "", $s)`, but it's still *wrong* because it must be `preg_replace("/\\d/", "", $s)`. Correct escaping is key.
Tomalak
(a) SGML was designed (AFAIK) to be convenient to write. It has a lot of short cuts in it. (b) The validator is not "lenient" or "forgiving". (c) I would, but I don't have a copy of the SGML specification handy and I don't care enough to pay for one.
David Dorward
@David: Pay for one? I was under the impression the SGML spec had to be free? Hm. Anyway. :-) **a)** It seems you are right about the `alt="<!--"` thing for HTML 4. The W3C parser did not complain. However, it barfed big time when switched to XHTML ("Unescaped '<' not allowed in attributes values"). Lucky for me the document the question is about is XHTML. :-D **b)** I didn't say so. ;) **c)** Thanks for actually discussing this instead of silently down-voting and walking away. Much appreciated!
Tomalak