Does anyone with more knowledge than me about regular expressions know how to split up HTML code so that all tags and all words are separated? For example:

<p>Some content <a href="www.test.com">A link</a></p>

is separated like this:

array = { [0]=>"<p>",
          [1]=>"Some",
          [2]=>"content",
          [3]=>"<a href='www.test.com'>",
          [4]=>"A",
          [5]=>"link",
          [6]=>"</a>",
          [7]=>"</p>" }

I've been using preg_split so far and have managed either to split the string by whitespace or to split it by tags - but then all the content ends up in one array element, when I need that to be split too.

Can anyone help me out?

+1  A: 

Give Simple HTML Dom Parser a try. HTML is too irregular for regular expressions.

Mike B
+2  A: 

You could check out Simple HTML DOM Parser

Or look at the DOM parser in PHP
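
As a rough illustration of the second suggestion, the sketch below walks a parsed fragment with PHP's built-in DOM extension and flattens it into a simple array of tags and words. The helper names tokenizeHtml() and flattenNode() are made up for this example, and it assumes the DOM extension is enabled and the input is a short, mostly ASCII fragment. Opening tags are rebuilt from the parsed attributes, so they may not match the source text character for character.

```php
<?php
// Sketch only: tokenizeHtml() and flattenNode() are hypothetical helper names.
// Assumes a short, mostly ASCII fragment; loadHTML() needs a charset hint for
// other encodings, and void elements such as <br> would get a closing token.

function flattenNode(DOMNode $node, array &$tokens): void
{
    if ($node instanceof DOMText) {
        // Split the text on whitespace and keep each non-empty word.
        foreach (preg_split('/\s+/', $node->nodeValue, -1, PREG_SPLIT_NO_EMPTY) as $word) {
            $tokens[] = $word;
        }
    } elseif ($node instanceof DOMElement) {
        // Rebuild the opening tag from the element name and its attributes.
        $open = '<' . $node->nodeName;
        foreach ($node->attributes as $attr) {
            $open .= ' ' . $attr->nodeName . '="' . htmlspecialchars($attr->nodeValue) . '"';
        }
        $tokens[] = $open . '>';
        foreach ($node->childNodes as $child) {
            flattenNode($child, $tokens);
        }
        $tokens[] = '</' . $node->nodeName . '>';
    }
}

function tokenizeHtml(string $html): array
{
    $tokens = [];
    if (trim($html) === '') {
        return $tokens;
    }
    $previous = libxml_use_internal_errors(true); // silence parser warnings
    $doc = new DOMDocument();
    $doc->loadHTML($html); // libxml wraps the fragment in <html><body>
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $body = $doc->getElementsByTagName('body')->item(0);
    if ($body !== null) {
        foreach ($body->childNodes as $child) {
            flattenNode($child, $tokens);
        }
    }
    return $tokens;
}

foreach (tokenizeHtml('<p>Some content <a href="www.test.com">A link</a></p>') as $i => $token) {
    echo "[$i] $token\n";
}
```

The result is a plain array, so it can be looped over with an ordinary foreach, while the parsing itself is left to libxml rather than to a regular expression.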

wkw
Thanks for this! However, I need to easily loop through each element - which is why I was going for a simple array. As far as I can tell there's not an easy way to do this with a DOM parser?
FlimFlam
I don't have enough rep to comment below, but re: Helen's mention of "splitting into 3 evenish chunks while keeping tags closed...": I'd ask whether you're talking about full pages or snippets of HTML? Because of course you've got the <body> tag/block which covers all the content, and there could be div tags that more or less enclose the entire body.
wkw
A: 

I currently use Simple HTML DOM Parser in several applications and find it to be an excellent tool, even when compared against other HTML parsers written in other languages.

Why exactly are you splitting up HTML into the string of tokens you described? Is not a tree-like structure of DOM elements a better approach for your specific application?

jkndrkn
No - the reason I'm splitting it up so much is because I need to split some html code into 3 evenish chunks - whilst making sure all tags are closed at the point of splitting. So my plan is to loop through word by word - storing what tags are open whilst I go. I'm open to suggestions on better ways of doing this
FlimFlam
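
The plan in the comment above (walk the tokens word by word, remember which tags are open, and close them at each split point) could be sketched roughly as follows. splitHtmlIntoChunks() is a hypothetical helper written for this illustration, not part of any library. It assumes well-formed, properly nested markup with no stray < or > in the text, and it rejoins tokens with single spaces, so the original whitespace is not preserved.

```php
<?php
// Sketch only: splitHtmlIntoChunks() is a hypothetical helper, not a library call.
// Assumes well-formed, properly nested markup and no stray < or > in the text.

function splitHtmlIntoChunks(string $html, int $parts = 3): array
{
    // Tokenize into tags and words.
    preg_match_all('/<[^>]+>|[^<>\s]+/', $html, $matches);
    $tokens = $matches[0];

    $wordCount = 0;
    foreach ($tokens as $token) {
        if ($token[0] !== '<') {
            $wordCount++;
        }
    }

    // Word positions after which a chunk ends, spaced as evenly as possible.
    $splitAfter = [];
    for ($k = 1; $k < $parts; $k++) {
        $splitAfter[(int) round($k * $wordCount / $parts)] = true;
    }

    $void = ['br', 'hr', 'img', 'input', 'meta', 'link']; // never pushed on the stack
    $chunks = [];
    $current = []; // tokens of the chunk being built
    $open = [];    // stack of opening tags that are not closed yet
    $seen = 0;     // words consumed so far

    foreach ($tokens as $token) {
        $current[] = $token;
        if ($token[0] === '<') {
            if ($token[1] === '/') {
                array_pop($open);
            } elseif (preg_match('/^<([a-z][a-z0-9]*)/i', $token, $m)
                && substr($token, -2) !== '/>'
                && !in_array(strtolower($m[1]), $void, true)) {
                $open[] = ['name' => $m[1], 'tag' => $token];
            }
            continue;
        }
        $seen++;
        if (isset($splitAfter[$seen]) && $seen < $wordCount) {
            // Close whatever is open, innermost first, and finish the chunk.
            foreach (array_reverse($open) as $el) {
                $current[] = '</' . $el['name'] . '>';
            }
            $chunks[] = implode(' ', $current);
            // Reopen the same tags at the start of the next chunk.
            $current = array_column($open, 'tag');
        }
    }
    if ($current !== []) {
        $chunks[] = implode(' ', $current);
    }
    return $chunks;
}

print_r(splitHtmlIntoChunks('<p>Some content <a href="www.test.com">A link</a></p>'));
```

For the four-word example from the question this yields three chunks of one, two and one words, each with its own balanced <p> (and, where needed, <a>) tags.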
Actually - I'm going to approach this in a much cleaner way: splitting things by <p> tags. It'll still be in chunks of 3 or 2, but very uneven - not perfect, but it also avoids any nasty accidents. Thanks for everyone's replies!
FlimFlam
Well, Simple HTML DOM Parser could be used to return objects that represent every <p> tag, giving you the option of inspecting their contents as needed. The following example loads HTML stored in $string into a simple_html_dom object, extracts all <p> tags from within and then prints the contents of each <p> tag.

$html = str_get_html($string);
$p_tags = $html->find('p');
foreach ($p_tags as $p_tag) {
    echo $p_tag->innertext;
}
jkndrkn
Whoops, sorry about the formatting :[ Still new to the site.
jkndrkn
A: 

preg_split shouldn't be used in that case. Try preg_match_all:

$text = '<p>Some content <a href="www.test.com">A link</a></p>';
preg_match_all('/<[^>]++>|[^<>\s]++/', $text, $tokens);
print_r($tokens);

output:

Array
(
    [0] => Array
        (
            [0] => <p>
            [1] => Some
            [2] => content
            [3] => <a href="www.test.com">
            [4] => A
            [5] => link
            [6] => </a>
            [7] => </p>
        )

)

I assume you forgot to include the 'A' in 'A link' in your example.

Be aware that if your HTML contains a < or > that is not meant as the start or end of a tag (inside an attribute value or a comment, for example), this regex will tokenize it incorrectly. Hence the warnings in the other answers.
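
To make that caveat concrete, here is a small sketch using a made-up input in which an attribute value contains a literal >. The input string is an assumption chosen for illustration; the pattern is the same one as above.

```php
<?php
// Hypothetical input: the title attribute contains a ">" that is not a tag end.
$text = '<a title="a > b" href="#">x</a>';
preg_match_all('/<[^>]++>|[^<>\s]++/', $text, $tokens);
print_r($tokens[0]);
```

The opening tag is cut off at the first >, so the result is the five tokens `<a title="a >`, `b"`, `href="#"`, `x` and `</a>` instead of the three you would want.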

Bart Kiers
Brill - might not use this any more but it's certainly what I was looking for. Thanks!
FlimFlam
You're welcome, Helen.
Bart Kiers