I'm currently writing a function that parses some HTML and adds tags where necessary. Basically, I have a piece of HTML like this:

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Suspendisse feugiat, nunc at vestibulum egestas.

<script type="c">
    #include &lt;stdio.h&gt; 
    #define debug(var) printf(#var &quot; = %d\n&quot;, var)
    int main(void)
    {
     int x = 12;
     debug(x)
     return 0;
    }
</script>

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Suspendisse feugiat, nunc at vestibulum egestas.

<h3>Test Heading</h3>
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Cras ultricies luctus metus ut cursus.

<ol>
    <li>One</li>
    <li>Two</li>
    <li>Three</li>
</ol>

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Cras ultricies luctus metus ut cursus.

Notice that there are no <p> tags around the paragraphs. I would like to parse this HTML and wrap each paragraph of text in the correct tags. Whatever parser is used must also leave the rest of the (already valid) HTML untouched; for example, the headings and the list should not be altered.

I've hacked together a solution in PHP, and although it works, it's neither fast nor pretty to look at.

What is the best way to accomplish this?
Is there a nice PHP- or JavaScript-based parser I could use for this?

As far as I can tell, I need to break the HTML down into elements, add the tags, and write the assembled HTML back to the page.

A: 

Sure, there is one: PHP Simple HTML DOM Parser, http://simplehtmldom.sourceforge.net/

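// Load the library (adjust the path to wherever simple_html_dom.php lives)
include 'simple_html_dom.php';

// Create DOM from string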

$html = str_get_html('Lorem ipsum dolor sit amet, consectetur adipiscing elit. Suspendisse feugiat, nunc at vestibulum egestas.

<h3>Test Heading</h3>
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Cras ultricies luctus metus ut cursus.

<ol>
    <li>One</li>
    <li>Two</li>
    <li>Three</li>
</ol>

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Cras ultricies luctus metus ut cursus.
');

// find('text') returns an array of text nodes, not a single string
$es = $html->find('text');

foreach ($es as $e) {
    echo $e; // e.g. Lorem ipsum dolor sit amet, consectetur adipiscing elit. Cras ultricies luctus metus ut cursus.
}

Now you have to do something with that text. For example, put echo "<p>$e</p>"; inside the loop, and each piece of text ends up in a <p> tag.

streetparade
Not a bad idea, but using the 'text' selector in the find method also returns the code from inside the script tag, and you can't get the tag name from the returned text. I need to differentiate each element so that the p tag is applied only to plain text.
Gary Willoughby
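One possible way around the problem raised in this comment is to look only at text nodes that sit directly at the top level of the fragment, so that text nested inside <script>, <h3> or <li> is never touched. The sketch below is an assumption-laden illustration, not the answerer's method: it swaps Simple HTML DOM for PHP's built-in DOMDocument, the function name wrapLooseText is hypothetical, and the input is wrapped in a temporary <div> purely to give all top-level nodes a common parent.

```php
<?php
// Hypothetical helper: wrap top-level text in <p>, leave all elements as they are.
function wrapLooseText(string $html): string
{
    $doc = new DOMDocument();

    // Silence parser warnings about unusual markup.
    libxml_use_internal_errors(true);
    // Temporary container so every top-level node shares one parent.
    // Note: non-ASCII input may additionally need an encoding hint.
    $doc->loadHTML('<div>' . $html . '</div>');
    libxml_clear_errors();

    // The container is the first <div> in document order.
    $wrapper = $doc->getElementsByTagName('div')->item(0);

    // Copy the child list first, because the tree is modified inside the loop.
    foreach (iterator_to_array($wrapper->childNodes) as $node) {
        if ($node->nodeType !== XML_TEXT_NODE) {
            continue; // elements such as <script>, <h3>, <ol> are skipped
        }
        $text = trim($node->nodeValue);
        if ($text === '') {
            continue; // whitespace between elements stays as it is
        }
        $p = $doc->createElement('p');
        $p->appendChild($doc->createTextNode($text));
        $wrapper->replaceChild($p, $node);
    }

    // Serialise the children only, dropping the temporary container.
    $out = '';
    foreach ($wrapper->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

echo wrapLooseText("Lorem ipsum.\n<h3>Test Heading</h3>\nDolor sit amet.");
```

This treats each top-level text node as one paragraph, so inline elements such as <b> or <a> sitting loose between pieces of text would split a paragraph and would need extra handling.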
A: 

My suggestion is to use HTML Tidy instead of hacking it together yourself.

$output = tidy_repair_string($input);

See the HTML Tidy configuration options for the full list. For what you need, the default behaviour is probably fine.
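If the defaults turn out not to be enough, options can be passed as the second argument. The following is only a sketch: it assumes the tidy extension is enabled, and it picks two Tidy options, show-body-only and enclose-text, that look relevant to this question. The sample input is made up, and it is worth checking how Tidy treats the <script> block in the real markup.

```php
<?php
// Stop early if the tidy extension (ext-tidy) is not available.
if (!extension_loaded('tidy')) {
    exit("The tidy extension is not available.\n");
}

// Made-up sample input with loose text around a heading.
$input = "Lorem ipsum.\n\n<h3>Test Heading</h3>\nDolor sit amet.";

$config = [
    'show-body-only' => true, // output only the body content, no <html>/<head> wrapper
    'enclose-text'   => true, // enclose text found directly in the body in <p>
];

// Third argument is the encoding name as Tidy spells it.
$output = tidy_repair_string($input, $config, 'utf8');

echo $output;
```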

cletus