I'm trying to split an HTML string by a token in order to create a blog preview without displaying the full post. It's a little harder than I first thought. Here are the problems:

  • A user will be creating the HTML through a WYSIWYG editor (CKEditor). The markup isn't guaranteed to be pretty or consistent.
  • The token, read_more(), can be placed anywhere in the string, including being nested within a paragraph tag.
  • The resulting first split string needs to be valid HTML for all reasonable uses of the token.

Examples of possible uses:

<p>Some text here. read_more()</p>

<p>Some text read_more() here.</p>

<p>read_more()</p>

<p>  read_more()</p>

read_more()

So far, I've tried just splitting the string on the token, but it leaves invalid HTML. Regex is perhaps another option. What strategy would you use to solve this and make it as bulletproof as possible? Any code snippets or hints would also be appreciated (I'm using PHP).

+2  A: 
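function stripmore($in)
{
    // Split once on the token; if the token is absent, return the input unchanged.
    $parts = explode("read_more()", $in, 2);
    if (count($parts) < 2) return $in;
    list($p1, $p2) = $parts;

    // Drop the text between tags, then any text before the first tag.
    $pass1 = preg_replace("~>[^<>]+<~", "><", $p2);
    $pass2 = preg_replace("~^[^<>]+~", "", $pass1);

    // Repeatedly remove empty open/close tag pairs until nothing changes.
    $pass3 = null;
    while ( $pass3 !== $pass2 )
    {
        if ( $pass3 !== null ) $pass2 = $pass3;
        $pass3 = preg_replace("~<([^<>]+)></\\1>~", "", $pass2);
    }

    return $p1."read_more()".$pass3;
}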

This strips any non-HTML text after the read_more() mark and reduces what remains to the minimum by removing matched tag pairs, while keeping any tag that opens before the mark and closes after it:

<p>Some text here. read_more()</p>
      ==> <p>Some text here. read_more()</p>

<p>Some <b>text</b> read_more() <b>here</b>.</p>
      ==> <p>Some <b>text</b> read_more()</p>

<p>Some <b>text read_more() here</b>.</p>
      ==> <p>Some <b>text read_more()</b></p>
mvds
I'm testing this out right now, mvds.
VirtuosiMedia
Thanks, mvds, this works well. Is it okay if I use your function and if so, how would you like to be credited in the code?
VirtuosiMedia
Use it as you see fit, and as for credits, preferably none at all. By the way, you need to strip `~[^<>]+$~` as well (everything after the last tag), and maybe tags like `~<img[^<>]*>~` too.
mvds
Thanks, I really appreciate the help.
VirtuosiMedia
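The two extra passes suggested in the comments above could look roughly like this. This is a minimal sketch, not part of the original answer: the helper name strip_tail_extras() is hypothetical, and it assumes the passes are applied to the markup that follows the token (for example to $pass2 before the pair-removal loop).

```php
<?php
// Hypothetical helper: extra cleanup for the markup that follows the token.
function strip_tail_extras($tail)
{
    // Remove <img ...> tags, which never have a closing partner to pair with.
    $tail = preg_replace("~<img[^<>]*>~i", "", $tail);
    // Remove any text left over after the last tag.
    $tail = preg_replace("~[^<>]+$~", "", $tail);
    return $tail;
}
```

Other void elements such as br or hr would presumably need the same treatment as img.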
A: 

Why not use two textareas, one above the cut and one below? That should make it obvious to the user what's going on, and eliminate the headache for you.

If you do want to use a token, you should choose something a bit more distinctive. Maybe: <!--full body cut--> which you can be somewhat more sure isn't actually content being mistaken for a token.

Anyhow, if you want to split the string on the token, you just need to figure out where your token is using strpos() and then use substr() to chop off the first part. Something like:

$intro = substr($text, 0, strpos($text, $token));

Following that, run your $intro through tidy (PHP extension) to clean up the syntax and then strip off the extra crap it adds in there. (I think you can str_replace() the extras with an empty string.)
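One thing to watch with the strpos()/substr() approach: strpos() returns false when the token is missing, and substr() would then return an empty string. A minimal sketch that guards against this, where the helper name split_intro() is made up for illustration:

```php
<?php
// Hypothetical helper: return everything before the token, or the whole text
// when the token is missing.
function split_intro($text, $token)
{
    $pos = strpos($text, $token);
    // strpos() returns false when there is no match. Position 0 is a valid
    // match, so the comparison has to be strict.
    if ($pos === false) {
        return $text;
    }
    return substr($text, 0, $pos);
}

$intro = split_intro('<p>Intro.</p><!--full body cut--><p>Rest.</p>', '<!--full body cut-->');
// $intro is now '<p>Intro.</p>'
```

When the token sits inside an element, the result still has unclosed tags, which is where the cleanup step comes in.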

banzaimonkey
Tidy, unfortunately, doesn't seem like a valid option, because it might not be installed or enabled on all PHP hosts (this project will be distributed). However, I'm not sure of the extent of Tidy's availability, so feel free to correct me if I'm wrong. Two textareas would definitely solve the problem, but I'm trying to keep the user interface light, if possible, so I'd like to explore other options first.
VirtuosiMedia
+1  A: 

The only correct option I currently see is writing your own context-free-grammar HTML parser in PHP, which will allow you to close the tags appropriately (simply by popping the stack when reaching read_more() and adding a closing tag for each pop).

That is, however, a lot of work, and this might work well enough for you:

$stripped = strip_tags($input);
list($preview) = explode("read_more()", $stripped);

You lose the HTML markup but it's dead easy to implement. And no possible XSS on your front page :)
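For comparison, the stack idea from the first paragraph of this answer can be approximated without a full parser. The following is a rough sketch under several assumptions, not a real HTML parser: it uses a regex to find tags, assumes attribute values never contain < or >, treats a fixed list of elements as void, and drops the token from the output. The function name preview_with_closed_tags() is hypothetical.

```php
<?php
// Hypothetical helper: cut $html at $token and close whatever tags are still open.
function preview_with_closed_tags($html, $token = "read_more()")
{
    $pos = strpos($html, $token);
    if ($pos === false) {
        return $html; // no token: nothing to cut
    }
    $intro = substr($html, 0, $pos);

    // Assumption: these void elements never take a closing tag.
    $void = array('br', 'hr', 'img', 'input', 'meta', 'link');

    // Find every opening or closing tag before the cut, in document order.
    preg_match_all('~<(/?)([a-zA-Z][a-zA-Z0-9]*)[^<>]*?(/?)>~', $intro, $tags, PREG_SET_ORDER);

    $stack = array();
    foreach ($tags as $tag) {
        $name = strtolower($tag[2]);
        if (in_array($name, $void) || $tag[3] === '/') {
            continue; // void or self-closing: never pushed on the stack
        }
        if ($tag[1] === '/') {
            // Closing tag: pop back to the most recent matching opener, if any.
            $idx = array_search($name, array_reverse($stack, true), true);
            if ($idx !== false) {
                $stack = array_slice($stack, 0, $idx);
            }
        } else {
            $stack[] = $name;
        }
    }

    // Close the tags that are still open, innermost first.
    while ($stack) {
        $intro .= '</' . array_pop($stack) . '>';
    }
    return $intro;
}
```

With this sketch, '<p>Some <b>text read_more() here</b>.</p>' becomes '<p>Some <b>text </b></p>'. A token that is alone in its paragraph leaves an empty '<p></p>' behind, so a caller may want to remove empty pairs afterwards.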

dark_charlie
Losing the HTML markup is a non-option, but thanks for the suggestion.
VirtuosiMedia
+1 for the first paragraph about writing a parser - that's what I did for my own blog. It basically goes through the text from the beginning and keeps a stack of the currently open HTML tags, then once it determines where to break the text, it appends whatever closing tags are necessary. Mine's a little more complicated because I don't have an explicit token to mark the split - and it's in Python - but if you like, I would be willing to share the code.
David Zaslavsky
ah, never mind, I see you got something better
David Zaslavsky
Thanks for the offer, David.
VirtuosiMedia
+1  A: 

Instead of using full HTML, why not use one of the many markup languages that can generate HTML but don't require you to close tags, etc.? It would be easier to train your users, and would avoid all of the possibilities for XSS attacks that accepting raw HTML allows.

PHP Markdown would seem an obvious fit, particularly in light of your desire to avoid the GNU GPL.

Craig Trader
It's for the admin section of a CMS, so I'd prefer to have as little of a learning curve as possible. I chose CKEditor because it's a bit more feature rich than markdown editors and it allows non-technical users something closer to Word. I am filtering the input. Thanks for the suggestion, though.
VirtuosiMedia
So ... given the availability of WordPress, Drupal, Joomla, and a score of other Open Source CMSystems, why are you writing another? Just curious.
Craig Trader
+1  A: 

In order to answer a comment to my comment I decided to have it be an answer, so I can take advantage of the markup options.

Why can't you just use trim() on the resulting string, find the missing open or close element and append that appropriately, to make it valid HTML?

Just traverse forward and back to find the next open/close element, and fix your HTML.

So, you can just walk forward and back in the string to get the next < and >, and if that is an HTML element then stop there, otherwise keep going.

Ideally you should only need to process this once per submission, so you don't keep paying the price for this operation.

UPDATE:

I forgot to include a link to help with strpos:

http://tuxradar.com/practicalphp/4/7/5

James Black