ansaurus

Question

Splitting up html code tags and content

Answer 1

+1 A:

Give Simple HTML Dom Parser a try. HTML is too irregular for regular expressions.

Mike B 2009-11-07 15:32:03

Answer 2

+2 A:

You could check out Simple HTML DOM Parser

Or look at the DOM parser in PHP
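As a rough sketch of what looping over every element could look like with PHP's built-in DOM extension: the helper name flatten_nodes is made up for illustration, attributes are dropped from the emitted tags, and it assumes ext-dom is enabled.

```php
<?php
// Hypothetical helper: walk the DOM tree and collect tags and words
// into one flat array. Attributes are omitted from the emitted tags.
function flatten_nodes(DOMNode $node, array &$out): void
{
    foreach ($node->childNodes as $child) {
        if ($child instanceof DOMText) {
            // Split text nodes into words, skipping empty pieces.
            foreach (preg_split('/\s+/', $child->nodeValue, -1, PREG_SPLIT_NO_EMPTY) as $word) {
                $out[] = $word;
            }
        } elseif ($child instanceof DOMElement) {
            $out[] = '<' . $child->nodeName . '>';
            flatten_nodes($child, $out);
            $out[] = '</' . $child->nodeName . '>';
        }
    }
}

libxml_use_internal_errors(true); // keep warnings about sloppy markup quiet
$doc = new DOMDocument();
$doc->loadHTML('<p>Some content <a href="www.test.com">A link</a></p>');
libxml_clear_errors();

// loadHTML wraps a fragment in <html><body>, so start from <body>.
$body = $doc->getElementsByTagName('body')->item(0);
$tokens = [];
flatten_nodes($body, $tokens);
print_r($tokens); // <p>, Some, content, <a>, A, link, </a>, </p>
```

The result is a simple array, but it is built from a real parse, so stray < or > characters in the text do not confuse it.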

wkw 2009-11-07 15:32:03

Thanks for this! However, I need to loop through each element easily, which is why I was going for a simple array. As far as I can tell there's no easy way to do this with the DOM parser?

FlimFlam 2009-11-07 16:06:54

I don't have enough rep to comment below, but re: Helen's mention of "splitting into 3 evenish chunks while keeping tags closed...", I'd ask whether you're talking about full pages or snippets of HTML. A full page has the <body> tag/block covering all the content, and there could be div tags that more or less enclose the entire body.

wkw 2009-11-07 20:32:16

Answer 3

A:

I currently use Simple HTML DOM Parser in several applications and find it to be an excellent tool, even when compared against other HTML parsers written in other languages.

Why exactly are you splitting up HTML into the string of tokens you described? Wouldn't a tree-like structure of DOM elements be a better approach for your specific application?

jkndrkn 2009-11-07 15:58:52

No - the reason I'm splitting it up so much is because I need to split some html code into 3 evenish chunks - whilst making sure all tags are closed at the point of splitting. So my plan is to loop through word by word - storing what tags are open whilst I go. I'm open to suggestions on better ways of doing this
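A minimal sketch of that word-by-word plan, for illustration only. The names split_html_chunks, tag_name and VOID_TAGS are made up here, and the code assumes well-nested markup with no stray < or > characters in the text.

```php
<?php
// Hypothetical sketch: split an HTML snippet into up to $parts chunks of
// roughly equal word count, closing all open tags at each cut and
// reopening them at the start of the next chunk.
const VOID_TAGS = ['br', 'hr', 'img', 'input', 'link', 'meta'];

function tag_name(string $tag): string
{
    preg_match('~^</?([a-zA-Z][a-zA-Z0-9]*)~', $tag, $m);
    return strtolower($m[1] ?? '');
}

function split_html_chunks(string $html, int $parts = 3): array
{
    // Tokenize into tags and words.
    preg_match_all('/<[^>]+>|[^<>\s]+/', $html, $m);
    $tokens = $m[0];

    $wordCount = 0;
    foreach ($tokens as $t) {
        if ($t[0] !== '<') {
            $wordCount++;
        }
    }
    $perChunk = max(1, (int) ceil($wordCount / $parts));

    $chunks = [];
    $current = [];
    $open = [];   // stack of opening tags that are still unclosed
    $words = 0;   // words in the current chunk

    foreach ($tokens as $token) {
        $isTag = $token[0] === '<';
        $isClosing = $isTag && $token[1] === '/';

        // Cut before a word or opening tag once the chunk is full.
        if (!$isClosing && $words >= $perChunk && count($chunks) < $parts - 1) {
            foreach (array_reverse($open) as $tag) {
                $current[] = '</' . tag_name($tag) . '>';
            }
            $chunks[] = implode(' ', $current);
            $current = $open; // reopen the same tags in the next chunk
            $words = 0;
        }

        $current[] = $token;
        if (!$isTag) {
            $words++;
        } elseif ($isClosing) {
            array_pop($open);
        } else {
            $name = tag_name($token);
            $selfClosing = substr($token, -2) === '/>';
            if ($name !== '' && !$selfClosing && !in_array($name, VOID_TAGS, true)) {
                $open[] = $token;
            }
        }
    }
    if ($current) {
        $chunks[] = implode(' ', $current);
    }
    return $chunks;
}

print_r(split_html_chunks('<p>one two <b>three four</b> five six</p>', 3));
// <p> one two </p>
// <p> <b> three four </b> </p>
// <p> five six </p>
```

Tokens are rejoined with single spaces, so the original whitespace is not preserved; a real implementation would probably want to keep it.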

FlimFlam 2009-11-07 16:16:04

Actually - I'm going to approach this in a much cleaner way: splitting things by <p> tags. It'll still be in chunks of 3 or 2 paragraphs, but very uneven - not perfect, but it also avoids any nasty accidents. Thanks for everyone's replies!
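One way the split-by-<p> idea might look, sketched with PHP's built-in DOMDocument rather than any particular library. The function name chunk_paragraphs is hypothetical, and content outside <p> tags is simply ignored.

```php
<?php
// Hypothetical sketch: collect every <p> element and spread them over
// $parts groups whose sizes differ by at most one paragraph.
function chunk_paragraphs(string $html, int $parts = 3): array
{
    if (trim($html) === '') {
        return [];
    }
    libxml_use_internal_errors(true); // keep warnings about sloppy markup quiet
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();

    $paragraphs = [];
    foreach ($doc->getElementsByTagName('p') as $p) {
        $paragraphs[] = $doc->saveHTML($p); // outer HTML of this <p>
    }

    $n = count($paragraphs);
    $chunks = [];
    for ($i = 0; $i < $parts; $i++) {
        $start = intdiv($i * $n, $parts);
        $end = intdiv(($i + 1) * $n, $parts);
        if ($end > $start) {
            $chunks[] = implode("\n", array_slice($paragraphs, $start, $end - $start));
        }
    }
    return $chunks;
}

$chunks = chunk_paragraphs('<p>a</p><p>b</p><p>c</p><p>d</p><p>e</p>', 3);
print_r($chunks); // groups of 1, 2 and 2 paragraphs
```

Because each chunk is made of whole paragraphs, every tag is closed at the split points by construction.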

FlimFlam 2009-11-07 16:24:47

Well, Simple HTML DOM Parser could be used to return objects that represent every <p> tag, giving you the option of inspecting their contents as needed. The following example loads HTML stored in $string into a simple_html_dom object, extracts all <p> tags from within and then prints the contents of each <p> tag.

$html = str_get_html($string);
$p_tags = $html->find('p');
foreach ($p_tags as $p_tag) {
    echo $p_tag->innertext;
}

jkndrkn 2009-11-07 17:11:14

Whoops, sorry about the formatting :[ Still new to the site.

jkndrkn 2009-11-07 17:12:10

Answer 4

A:

preg_split shouldn't be used in that case. Try preg_match_all:

$text = '<p>Some content <a href="www.test.com">A link</a></p>';
preg_match_all('/<[^>]++>|[^<>\s]++/', $text, $tokens);
print_r($tokens);

output:

Array
(
    [0] => Array
        (
            [0] => <p>
            [1] => Some
            [2] => content
            [3] => <a href="www.test.com">
            [4] => A
            [5] => link
            [6] => </a>
            [7] => </p>
        )

)

I assume you forgot to include the 'A' in 'A link' in your example.

Be aware that if your HTML contains < or > characters that are not meant as the start or end of a tag, the regex will produce badly wrong tokens (hence the warnings in the other answers).
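A small illustration of that failure mode, using the same pattern on a made-up input with a literal < in the text:

```php
<?php
// The stray "<" is treated as the start of a tag, so it swallows
// everything up to the next ">", including the real closing tag.
$text = '<p>1 < 2</p>';
preg_match_all('/<[^>]++>|[^<>\s]++/', $text, $tokens);
print_r($tokens[0]);
// [0] => <p>
// [1] => 1
// [2] => < 2</p>
```

The closing </p> never appears as its own token, so any code that tracks open tags would think the paragraph is still open.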

Bart Kiers 2009-11-07 16:43:40

Brill - might not use this any more but it's certainly what I was looking for. Thanks!

FlimFlam 2009-11-07 16:58:38

You're welcome Helen.

Bart Kiers 2009-11-07 17:10:49
