I want to get an array that contains the text of every h1 tag in a string.

For example, if this were the given input string:

<h1>hello</h1>
<p>random text</p>
<h1>title number two!</h1>

I need to receive an array containing this:

titles[0] = 'hello',
titles[1] = 'title number two!'

I already figured out how to get the first h1 value of the string but I need all the values of all the h1 tags in the given string.

I'm currently using this to receive the first tag:

function getTextBetweenTags($string, $tagname) 
 {
  $pattern = "/<$tagname ?.*>(.*)<\/$tagname>/";
  preg_match($pattern, $string, $matches);
  return $matches[1];
 }

I pass it the string I want parsed, with "h1" as $tagname. I didn't write it myself, though. I've been trying to edit the code to do what I want, but nothing really works.

I was hoping someone could help me out.

Thanks in advance.

+14  A: 

You could use simplehtmldom:

function getTextBetweenTags($string, $tagname) {
    // Create DOM from string
    $html = str_get_html($string);

    $titles = array();
    // Find all tags and collect their plain text
    foreach($html->find($tagname) as $element) {
        $titles[] = $element->plaintext;
    }
    // Return the collected texts to the caller
    return $titles;
}
kgb
Oooh I didn't know you could do that!
Rimian
+1 for DomParser, handy tool
DavidYell
Is simplehtmldom any faster than DOMDocument, or is it just for those occasions where DOMDocument doesn't exist (although it's enabled by default)?
Wrikken
Thank you for your fast reply and excellent solution! :)
Pieter888
@Wrikken it is userland code, so I doubt it is faster. Dunno why people are so fascinated with it (must be the *simple* in the name), especially because there are also [Zend_Dom](http://framework.zend.com/manual/en/zend.dom.query.html), [phpquery](http://code.google.com/p/phpquery/) and [FluentDom](http://nightly.fluentdom.org/documentation) as alternatives.
Gordon
@Wrikken it isn't faster (almost the same), but it handles invalid HTML better. Also far fewer problems with non-UTF encodings...
kgb
@kgb you [claimed this before](http://stackoverflow.com/questions/3220076/php-domdocument-how-can-i-print-an-attribute-of-an-element/3220084#3220084) but I still refute it. DOM handles HTML fine.
Gordon
I had a better example, but right now I could only find this: http://stackoverflow.com/questions/1183482/cant-separate-cells-properly-with-simplehtmldom. The point is that simplehtmldom works even if the HTML is not a valid tree, and I don't like putting "@" in to suppress warnings ;)
kgb
@kgb DOM can load invalid HTML fine if you load it with loadHTML. The only thing not working then is getElementById, and that is solely due to the fallback to the HTML 4.0 DTD. You can still very much query nodes by ID via XPath then. Also, you do not have to suppress the errors with @ at all. You can use libxml_use_internal_errors and handle any errors with custom error handlers. SimpleHTMLDom isn't more suitable for HTML. It doesn't even use libxml but parses the HTML with string functions.
Gordon
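To make the comment above concrete, here is a minimal sketch of loading slightly broken HTML through `libxml_use_internal_errors` instead of `@`, then looking a node up by ID with XPath. The helper name `loadHtmlQuietly`, the `id="main"` attribute and the stray `</div>` in the sample input are assumptions made up for illustration; they are not from the question.

```php
<?php
// Hypothetical helper: parse possibly invalid HTML without "@" suppression.
function loadHtmlQuietly(string $html): DOMDocument
{
    // Collect libxml parse errors internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    $doc->loadHTML($html);

    // Inspect libxml_get_errors() here if the errors matter; this sketch discards them.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $doc;
}

// Sample input with a stray closing tag (assumed, for illustration only).
$doc = loadHtmlQuietly('<h1 id="main">hello</h1></div><p>random text</p>');

// Query by ID through XPath rather than getElementById.
$xpath = new DOMXPath($doc);
$node = $xpath->query('//*[@id="main"]')->item(0);

echo $node->textContent, PHP_EOL;
```

Restoring the previous error-handling setting keeps the helper from changing behaviour for other libxml calls in the same script.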
point taken, thanks for clarification..
kgb
I'd take 'can report errors but is configurable to keep quiet' above 'will not tell you when something is up' any time of the week :)
Wrikken
-1 for not using the built-in C extension to do the exact same thing (Seriously, why do things in PHP if the exact same thing is built into the PHP core?)... Use `DOMDocument` instead...
ircmaxell
+6  A: 
function getTextBetweenTags($string, $tagname){
    $d = new DOMDocument();
    $d->loadHTML($string);
    $return = array();
    foreach($d->getElementsByTagName($tagname) as $item){
        $return[] = $item->textContent;
    }
    return $return;
}
Wrikken
Oh, you beat me to it. Didn't see.
Gordon
+1  A: 
 function getTextBetweenH1($string)
 {
    $pattern = "/<h1>(.*?)<\/h1>/";
    preg_match_all($pattern, $string, $matches);
    return ($matches[1]);
 }
dejavu
Please do not parse HTML with regular expressions! http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454
Török Gábor
Using regex is quite fine here. He isn't parsing HTML. He is matching stuff between `<h1>` and `</h1>`, which is inherently regular. Matching a regular language with regular expressions is quite fine. Drop the mindless "OMG regex cannot be used for anything if there is HTML involved" crap that everybody seems to be hyping. It's not like he is trying to match all of HTML, only a very small subset of the language which happens to be regular.
Daniel Egeberg
@Daniel what if there are attributes on the `<h1>`? What if the headings contain child elements?
Gordon
@Gordon: The attribute problem can be solved using this regex: `#<h1(?:"(?:[^\\\"]|\\\.)*"|\'(?:[^\\\\\']|\\\.)*\'|[^\'">])*>(.*?)</h1>#i` (which I believe still describes a regular language and thus can be represented using a finite state machine). The problem with child elements is non-existent because there cannot be an `<h1>` within another `<h1>` anyways. Edit: The regex is written for a single-quoted PHP string.
Daniel Egeberg
@Daniel you have to admit that this is completely unreadable :) Also, there can be inline elements in an h1. What about spans? strongs? ems? The h1 of this very page has a link inside. Regex has no concept of TextNodes. It just knows Strings.
Gordon
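For anyone following this exchange, a small side-by-side sketch may help. It runs the simple `<h1>(.*?)</h1>` pattern from the answer and a DOM lookup over the same input. The sample markup (a `class` attribute and an `<em>` child) is made up for illustration and is not from the question.

```php
<?php
// Assumed sample input: one h1 with an attribute and a child element, one plain h1.
$html = '<h1 class="top">hello <em>world</em></h1>'
      . '<p>random text</p>'
      . '<h1>title number two!</h1>';

// Simple pattern from the answer: only matches a bare <h1> with no attributes.
preg_match_all('/<h1>(.*?)<\/h1>/', $html, $matches);
$viaRegex = $matches[1];

// DOM approach: matches every h1 element and returns its text nodes only.
libxml_use_internal_errors(true); // keep any parser warnings out of the output
$doc = new DOMDocument();
$doc->loadHTML($html);

$viaDom = array();
foreach ($doc->getElementsByTagName('h1') as $h1) {
    $viaDom[] = $h1->textContent;
}

print_r($viaRegex);
print_r($viaDom);
```

With this input the simple pattern should skip the heading that carries an attribute, while the DOM version returns both headings with the `<em>` markup stripped. The longer regex from the comment above handles attributes, but it would still capture child markup as raw text.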
+2  A: 

Alternative to DOM. Use when memory is an issue.

$html = <<< HTML
<html>
<h1>hello<span>world</span></h1>
<p>random text</p>
<h1>title number two!</h1>
</html>
HTML;

$reader = new XMLReader;
$reader->xml($html);
while($reader->read() !== FALSE) {
    if($reader->name === 'h1' && $reader->nodeType === XMLReader::ELEMENT) {
        echo $reader->readString();
    }
}
Gordon
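Since the question asks for an array rather than echoed output, the same XMLReader loop can be wrapped in a function. This is only a sketch: the function name `getH1TextsWithReader` is hypothetical, and it assumes the input is well-formed XML, which XMLReader requires.

```php
<?php
// Hypothetical wrapper around the XMLReader loop above; returns an array instead of echoing.
function getH1TextsWithReader(string $xml): array
{
    $reader = new XMLReader();
    $reader->XML($xml);

    $titles = array();
    while ($reader->read() !== false) {
        // Only react to opening <h1> tags, not to their closing counterparts.
        if ($reader->name === 'h1' && $reader->nodeType === XMLReader::ELEMENT) {
            // readString() returns the text of the element including its children.
            $titles[] = $reader->readString();
        }
    }
    $reader->close();

    return $titles;
}

$titles = getH1TextsWithReader(
    '<html><h1>hello<span>world</span></h1><p>random text</p><h1>title number two!</h1></html>'
);
print_r($titles);
```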
thanks, I'm still using the DOM method though. Still thank you for taking your time answering :)
Pieter888
@Pieter yup, I would have supplied the DOM solution myself if Wrikken hadn't already done so.
Gordon