ansaurus

Question

Puzzle: Splitting An HTML String Correctly

Answer 1

+2 A:

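function stripmore($in)
{
    // Split once at the marker; if it is missing, return the input untouched.
    $parts = explode("read_more()",$in,2);
    if ( count($parts) < 2 ) return $in;
    list($p1,$p2) = $parts;

    // After the marker: drop text between tags, then text before the first tag.
    $pass1 = preg_replace("~>[^<>]+<~","><",$p2);
    $pass2 = preg_replace("~^[^<>]+~","",$pass1);

    // Repeatedly remove empty pairs such as <b></b> until nothing changes.
    do
    {
        $pass3 = $pass2;
        $pass2 = preg_replace("~<([^<>]+)></\\1>~","",$pass3);
    } while ( $pass2 !== $pass3 );

    return $p1."read_more()".$pass3;
}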

This strips any non-HTML text after the read_more() mark and reduces what remains to the minimum by removing matched empty tag pairs, while keeping the closing tag of any element that starts before the mark and ends after it:

<p>Some text here. read_more()</p>
      ==> <p>Some text here. read_more()</p>

<p>Some <b>text</b> read_more() <b>here</b>.</p>
      ==> <p>Some <b>text</b> read_more()</p>

<p>Some <b>text read_more() here</b>.</p>
      ==> <p>Some <b>text read_more()</b></p>

mvds 2010-08-01 01:41:04

I'm testing this out right now, mvds.

VirtuosiMedia 2010-08-01 01:56:40

Thanks, mvds, this works well. Is it okay if I use your function and if so, how would you like to be credited in the code?

VirtuosiMedia 2010-08-01 02:08:15

Use it as you see fit, and as for credit, preferably none at all. By the way, you need to strip `~[^<>]+$~` as well (everything after the last tag), and maybe tags like `~<img[^<>]*>~` too.

mvds 2010-08-01 02:12:01
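
For readers following along, here is a minimal sketch of how the two extra passes from the comment above might be folded into the function. The name stripmore_extended and the $marker parameter are illustrative assumptions, not part of the original answer, and the case-insensitive flag on the image pattern is likewise an addition.

```php
<?php
// Sketch only: stripmore() plus the two extra passes suggested in the comment.
// stripmore_extended and $marker are hypothetical names used for illustration.
function stripmore_extended($in, $marker = 'read_more()')
{
    $parts = explode($marker, $in, 2);
    if (count($parts) < 2) {
        return $in; // no marker: leave the input unchanged
    }
    list($head, $tail) = $parts;

    $tail = preg_replace('~>[^<>]+<~', '><', $tail);   // text between tags
    $tail = preg_replace('~^[^<>]+~', '', $tail);      // text before the first tag
    $tail = preg_replace('~[^<>]+$~', '', $tail);      // text after the last tag
    $tail = preg_replace('~<img[^<>]*>~i', '', $tail); // standalone <img> tags

    // Repeatedly drop empty pairs such as <b></b> until nothing changes.
    do {
        $before = $tail;
        $tail = preg_replace('~<([^<>]+)></\1>~', '', $tail);
    } while ($tail !== $before);

    return $head . $marker . $tail;
}

echo stripmore_extended('<p>Intro read_more() more</p> tail'), "\n";
```

One limitation worth keeping in mind: the empty-pair pattern requires the closing tag to repeat the full content of the opening tag, so an emptied element that carries attributes (for example an anchor with an href) would not be removed by this sketch.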

Thanks, I really appreciate the help.

VirtuosiMedia 2010-08-01 02:33:01

Answer 2

A:

Why not use two textareas, one above and one below the cut? That should make it obvious to the user what's going on, and eliminate the headache for you.

If you do want to use a token, you should choose something a bit more distinctive. Maybe:  which you can be somewhat more sure isn't actually content being mistaken for a token.

Anyhow, if you want to split the string on the token, you just need to figure out where your token is using strpos() and then use substr() to chop off the first part. Something like:

$intro = substr($text, 0, strpos($text, $token));

Following that, run your $intro through tidy (PHP extension) to clean up the syntax and then strip off the extra crap it adds in there. (I think you can str_replace() the extras with an empty string.)

banzaimonkey 2010-08-01 01:41:52
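
A small sketch of the strpos()/substr() split described in this answer, with a guard for the case where the token is absent (strpos() returns false then, which the one-liner does not check for). The helper name split_intro and the `[[more]]` token in the usage line are placeholders chosen for illustration only.

```php
<?php
// Sketch only: split_intro() is a hypothetical helper name.
function split_intro($text, $token)
{
    $pos = strpos($text, $token);
    if ($pos === false) {
        return $text; // token absent: treat the whole text as the intro
    }
    return substr($text, 0, $pos); // everything before the token
}

// '[[more]]' is a placeholder token, not one proposed in the thread.
echo split_intro('<p>Hello [[more]] world</p>', '[[more]]'), "\n";
```

The result still ends with an unclosed tag, which is the part the answer proposes handing to Tidy.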

Tidy, unfortunately, doesn't seem like a valid option because it might not be installed or enabled on all PHP hosts (this project will be distributed). However, I'm not sure of the extent of Tidy's availability, so feel free to correct me if I'm wrong. Two textareas would definitely solve the problem, but I'm trying to keep the user interface light, if possible, so I'd like to explore other options first.

VirtuosiMedia 2010-08-01 01:53:18

Answer 3

+1 A:

The only correct option I currently see is writing your own context-free grammar HTML parser in PHP which will allow you to close the tags appropriately (simply by popping the stack when reaching read more() and for each pop adding a closing tag).

This is, however, a lot of work and this might work well for you:

$stripped = strip_tags($input);
list($preview) = explode("read more()", $stripped);

You lose the HTML markup but it's dead easy to implement. And no possible XSS on your front page :)

dark_charlie 2010-08-01 01:41:57
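
The stack idea from the first paragraph of this answer can be sketched roughly as follows. This is not a full HTML parser: it assumes reasonably well-formed markup, ignores comments and attribute values containing angle brackets, and uses a short, incomplete list of void elements. The function name close_open_tags and its parameters are hypothetical.

```php
<?php
// Sketch only: close_open_tags() and its void-element list are illustrative.
function close_open_tags($html, $marker)
{
    $pos  = strpos($html, $marker);
    $head = ($pos === false) ? $html : substr($html, 0, $pos);

    $void  = array('br', 'hr', 'img', 'input', 'meta', 'link');
    $stack = array();

    // Collect every opening or closing tag in the text before the marker.
    preg_match_all('~<(/?)([a-zA-Z][a-zA-Z0-9]*)[^<>]*?(/?)>~', $head, $tags, PREG_SET_ORDER);
    foreach ($tags as $tag) {
        $name = strtolower($tag[2]);
        if ($tag[3] === '/' || in_array($name, $void, true)) {
            continue; // self-closing or void: never pushed on the stack
        }
        if ($tag[1] === '/') {
            // Closing tag: pop back to the matching open tag, if there is one.
            if (in_array($name, $stack, true)) {
                while (array_pop($stack) !== $name) {
                    // discard unclosed inner tags
                }
            }
        } else {
            $stack[] = $name; // opening tag
        }
    }

    // Close whatever is still open, innermost first.
    $closing = '';
    while ($stack) {
        $closing .= '</' . array_pop($stack) . '>';
    }
    return $head . $closing;
}

// Marker spelled as in the first answer; pass whatever token is actually used.
echo close_open_tags('<p>Some <b>text read_more() here</b>.</p>', 'read_more()'), "\n";
// prints: <p>Some <b>text </b></p>
```

Unlike the strip_tags() shortcut, this keeps the markup in the preview, at the cost of trusting the input to be close to valid HTML.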

Losing the HTML markup is a non-option, but thanks for the suggestion.

VirtuosiMedia 2010-08-01 01:45:56

+1 for the first paragraph about writing a parser - that's what I did for my own blog. It basically goes through the text from the beginning and keeps a stack of the currently open HTML tags, then once it determines where to break the text, it appends whatever closing tags are necessary. Mine's a little more complicated because I don't have an explicit token to mark the split - and it's in Python - but if you like, I would be willing to share the code.

David Zaslavsky 2010-08-01 02:14:48

ah, never mind, I see you got something better

David Zaslavsky 2010-08-01 02:17:43

Thanks for the offer, David.

VirtuosiMedia 2010-08-01 03:22:41

Answer 4

+1 A:

Instead of using full HTML, why not use one of the many markup languages that can generate HTML but don't require you to close tags, etc.? It would be easier to train your users, and would avoid all of the possibilities for XSS attacks that accepting raw HTML allows.

PHP Markdown would seem an obvious fit, particularly in light of your desire to avoid the GNU GPL.

Craig Trader 2010-08-01 01:50:29

It's for the admin section of a CMS, so I'd prefer to have as little of a learning curve as possible. I chose CKEditor because it's a bit more feature rich than markdown editors and it allows non-technical users something closer to Word. I am filtering the input. Thanks for the suggestion, though.

VirtuosiMedia 2010-08-01 01:59:53

So ... given the availability of WordPress, Drupal, Joomla, and a score of other Open Source CMSystems, why are you writing another? Just curious.

Craig Trader 2010-08-01 10:57:44

Answer 5

+1 A:

In order to answer a comment to my comment I decided to have it be an answer, so I can take advantage of the markup options.

Why can't you just use trim() on the resulting string, find the missing open or close element and append that appropriately, to make it valid HTML?

Just traverse forward and back to find the next open/close element, and fix your HTML.

So, you can just walk forward and back in the string to get the next < and >, and if that is an HTML element then stop there, otherwise keep going.

Ideally you should only need to process this once per submission, so you don't keep paying the price for this operation.

UPDATE:

I forgot to include a link to help with strpos:

http://tuxradar.com/practicalphp/4/7/5

James Black 2010-08-01 02:02:11
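
One way to read the "walk forward and back" suggestion is sketched below: given the split position, find the last complete tag before it and the first complete tag after it. The helper name nearest_tags is hypothetical, and the sketch only locates the neighbouring tags; deciding which closing tags to append is left to the caller.

```php
<?php
// Sketch only: nearest_tags() is a hypothetical helper.
// Returns array($previousTag, $nextTag); either entry is null if none is found.
function nearest_tags($html, $pos)
{
    $before = substr($html, 0, $pos);
    $after  = (string) substr($html, $pos);

    // Greedy .* pushes the capture to the last complete tag before $pos.
    $prev = null;
    if (preg_match('~.*(<[^<>]+>)~s', $before, $m)) {
        $prev = $m[1];
    }

    // First complete tag at or after $pos.
    $next = null;
    if (preg_match('~<[^<>]+>~', $after, $m)) {
        $next = $m[0];
    }
    return array($prev, $next);
}

$html = '<p>Some <b>text read_more() here</b>.</p>';
list($prev, $next) = nearest_tags($html, strpos($html, 'read_more()'));
echo $prev, ' ', $next, "\n";
```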
