Hi,

I have a string which contains the text of an article. It is sprinkled with BBCodes (tags between square brackets). I need to grab the first, say, 200 characters of the article without cutting it off in the middle of a BBCode, so I need an index where it is safe to cut. This will give me the article summary.

  • The summary must be minimum 200 characters but can be longer to 'escape' out of a bbcode. (this length value will actually be a parameter to a function).
  • It must not give me a point inside a stand alone bbcode (see the pipe) like so: [lis|t].
  • It must not give me a point between a start and end bbcode like so: [url="http://www.google.com"]Go To Goo|gle[/url].
  • It must not give me a point inside either the start or end bbcode or in-between them, in the above example.

It should give me the "safe" index which is after 200 and is not cutting off any BBCodes.

Hope this makes sense. I have been struggling with this for a while. My regex skills are only moderate. Thanks for any help!

+2  A: 

Well, the obvious easy answer is to present your "summary" without any bbcode-driven markup at all (regex below taken from here)

$summary = substr( preg_replace( '|[[\/\!]*?[^\[\]]*?]|si', '', $article ), 0, 200 );

However, to do the job you explicitly describe is going to require more than just a regex. A lexer/parser would do the trick, but that's a moderately complicated topic. I'll see if I can come up with something.

EDIT

Here's a pretty crude version of a lexer, but for this example it works. It converts an input string into bbcode tokens.

<?php

class SimpleBBCodeLexer
{
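  protected
      $tokens = array()
    , $fragment = ''   // text accumulated since the last emitted token
    , $patterns = array(
        self::TOKEN_OPEN_TAG  => "/\\[[a-z].*?\\]/"
      , self::TOKEN_CLOSE_TAG => "/\\[\\/[a-z].*?\\]/"
    );

  const TOKEN_TEXT      = 'TEXT';
  const TOKEN_OPEN_TAG  = 'OPEN_TAG';
  const TOKEN_CLOSE_TAG = 'CLOSE_TAG';

  public function __construct( $input )
  {
    // Feed the lexer one character at a time
    for ( $i = 0, $l = strlen( $input ); $i < $l; $i++ )
    {
      $this->processChar( $input[$i] );
    }
    // A call without a character flushes whatever is left
    $this->processChar();
  }

  protected function processChar( $char=null )
  {
    // The fragment lives on the instance, so separate lexers
    // never share leftover text
    $this->fragment = $this->processTokenFragment( $this->fragment );
    if ( is_null( $char ) )
    {
      // End of input: emit the remaining text, if there is any
      if ( $this->fragment !== '' )
      {
        $this->addToken( $this->fragment );
        $this->fragment = '';
      }
    } else {
      $this->fragment .= $char;
    }
  }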

  protected function processTokenFragment( $tokenFragment )
  {
    foreach ( $this->patterns as $type => $pattern )
    {
      if ( preg_match( $pattern, $tokenFragment, $matches ) )
      {
        if ( $matches[0] != $tokenFragment )
        {
          $this->addToken( substr( $tokenFragment, 0, -( strlen( $matches[0] ) ) ) );
        }
        $this->addToken( $matches[0], $type );
        return '';
      }
    }
    return $tokenFragment;
  }

  protected function addToken( $token, $type=self::TOKEN_TEXT )
  {
    $this->tokens[] = array( $type => $token );
  }

  public function getTokens()
  {
    return $this->tokens;
  }
}

$l = new SimpleBBCodeLexer( 'some [b]sample[/b] bbcode that [i] should [url="http://www.google.com"]support[/url] what [/i] you need.' );

echo '<pre>';
print_r( $l->getTokens() );
echo '</pre>';

The next step would be to create a parser that loops over these tokens and takes action as it encounters each type. Maybe I'll have time to make it later...
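
As a rough illustration of that next step, here is a minimal sketch, written in Python rather than PHP. It assumes the lexer's output has been converted to a list of (type, text) pairs, and the name safe_cut_index is made up for this example. It walks the tokens, tracks how many tags are open, and returns the first index at or past the minimum length that is outside every tag pair.

```python
def safe_cut_index(tokens, min_len=200):
    """Return the first offset >= min_len that is outside any tag or tag pair.

    `tokens` is a hypothetical list of (type, text) pairs using the lexer's
    type names: 'TEXT', 'OPEN_TAG' and 'CLOSE_TAG'.
    """
    pos = 0    # offset into the original string
    depth = 0  # number of tags currently open
    for kind, text in tokens:
        if kind == 'TEXT' and depth == 0 and pos + len(text) >= min_len:
            # Top-level text: any offset inside it is safe to cut at.
            return max(pos, min_len)
        pos += len(text)
        if kind == 'OPEN_TAG':
            depth += 1
        elif kind == 'CLOSE_TAG':
            depth = max(depth - 1, 0)
            if depth == 0 and pos >= min_len:
                # The outermost tag just closed past the minimum length.
                return pos
    return pos  # ran out of input: the whole string is the summary
```

With the tokens for 'some [b]sample[/b] bbcode' and a minimum of 10, this returns 18, the position right after [/b]. The sketch assumes every opening tag has a closing tag; a stand-alone tag such as [list] would need to be treated as its own token type.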

Peter Bailey
Added the lexer example
Peter Bailey
+1  A: 

This does not sound like a job for (only) regex. "Plain programming" logic is a better option:

  • grab a character other than a '[', increase a counter;
  • if you encounter an opening tag, keep advancing until you reach the closing tag (don't increase the counter!);
  • stop grabbing text when your counter has reached 200.
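
The steps above might look something like this in Python. This is only a sketch under a few assumptions: "closing tag" is read as the matching [/name] tag, so text between a tag pair is skipped without counting; tags are assumed to look like [name], [name=...] or [/name]; and the names TAG and cut_index are invented for the example.

```python
import re

# Assumed tag shape: [name], [name=...] or [/name]
TAG = re.compile(r'\[(/?)[a-zA-Z][^\]]*\]')

def cut_index(text, limit=200):
    """Index at which `limit` characters outside any tag pair have been seen."""
    counter = 0  # characters seen outside all tags
    depth = 0    # opening tags that have not been closed yet
    i = 0
    while i < len(text):
        if depth == 0 and counter >= limit:
            break  # enough text, and we are not inside a tag pair
        tag = TAG.match(text, i)
        if tag:
            # Jump over the tag itself; the counter is left alone.
            depth = max(depth - 1, 0) if tag.group(1) else depth + 1
            i = tag.end()
        else:
            if depth == 0:
                counter += 1
            i += 1
    return i
```

For example, cut_index('ab[b]cd[/b]ef', 3) returns 12, so the summary is 'ab[b]cd[/b]e'. A stand-alone tag with no closing partner would keep the depth above zero, so it would have to be special-cased.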
Bart Kiers
+4  A: 

First off, I would suggest considering what you will do with a post that is entirely wrapped in BBcodes, as is often true in the case of a font tag. In other words, a solution to the problem as stated will easily lead to 'summaries' containing the entire article. It may be more valuable to identify which tags are still open and append the necessary BBcodes to close them. Of course in cases of a link, it will require additional work to ensure you don't break it.
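
A minimal sketch of that idea in Python, assuming the text has already been cut at a point that is not inside a tag, and that tags look like [name], [name=...] or [/name]. The names TAG and close_open_tags are made up for this example.

```python
import re

# Assumed tag shape: [name], [name=...] or [/name]
TAG = re.compile(r'\[(/?)([a-zA-Z]+)[^\]]*\]')

def close_open_tags(fragment):
    """Append closing tags for every tag still open at the end of `fragment`."""
    stack = []  # names of the tags currently open, innermost last
    for match in TAG.finditer(fragment):
        closing, name = match.group(1), match.group(2).lower()
        if not closing:
            stack.append(name)
        elif name in stack:
            # Drop the matching opener and anything opened after it.
            while stack.pop() != name:
                pass
    # Close the innermost tag first.
    return fragment + ''.join('[/%s]' % name for name in reversed(stack))
```

For example, close_open_tags('[font]Hello [i]wor') gives '[font]Hello [i]wor[/i][/font]'. A truncated link still gets its [/url] appended, but whether shortened link text is acceptable is a separate decision.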

krdluzni
Oooo, that is a very, very good point, about what if the whole thing is wrapped in a bbcode. I may have to rethink this.
Sherri
A: 

Here is a start. I don't have access to PHP at the moment, so you might need some tweaking to get it to run. Also, this will not ensure that tags are closed (i.e. the string could have [url] without [/url]). Also, if a string is invalid (i.e. not all square brackets are matched) it might not return what you want.

function getIndex($str, $minLen = 200)
{
  //on short input, return the whole string
  if(strlen($str) <= $minLen)
    return strlen($str);

  //get first minLen characters
  $substr = substr($str, 0, $minLen);

  //does it have a '[' that is not closed?
  if(preg_match('/\[[^\]]*$/', $substr))
  {
    //find the next ']', if there is one
    $pos = strpos($str, ']', $minLen);

    //now, make the substr go all the way to that ']'
    if($pos !== false)
      $substr = substr($str, 0, $pos+1);
  }

  //now, it may be better to return $substr, but you specifically
  //asked for the index, which is the length of this substring.
  return strlen($substr);
}
Kip
A: 

I wrote this function, which should do just what you want. It counts n characters (not counting those inside tags) and then closes any tags that still need to be closed. An example of its use is included in the code. The code is in Python, but it should be really easy to port to other languages, such as PHP.

def limit(input, length):
  """Splits a text after (length) characters, preserving bbcode"""

  stack = []
  counter = 0
  output = ""
  tag = ""
  insideTag = 0           # 0 = Outside tag, 1 = Opening tag, 2 = Closing tag, 3 = Opening tag, parameters section

  for i in input:
    if counter >= length: # If we have reached the max length (add " and i == ' '") to not make it split in a word
      break
    elif i == '[':        # If we have reached a tag
      insideTag = 1
    elif i == '/' and insideTag == 1:  # A slash inside an opening tag makes it a closing tag
      insideTag = 2
    elif i == '=' and insideTag >= 1:  # An equals sign inside a tag starts the parameters
      insideTag = 3
    elif i == ']':        # If we have reached the closing of a tag
      if insideTag == 2:  # If we are in a closing tag
        if stack:         # Ignore a closing tag that has no matching opener
          stack.pop()     # Pop the last tag, we closed it
      elif insideTag >= 1:# If we are in a tag, parameters or not
        stack.append(tag) # Add current tag to the tag-stack
      insideTag = 0       # Either way, we are outside a tag again
      tag = ""
    elif insideTag == 0:  # If we are not in a tag
      counter += 1
    elif insideTag <= 2:  # If we are in a tag and not among the parameters
      tag += i
    output += i

  while len(stack) > 0:
    output += '[/'+stack.pop()+']'   # Add the remaining tags

  return output

cutText = limit('[font]This should be easy:[img]yippee.png[/img][i][u][url="http://www.stackoverflow.com"]Check out this site[/url][/u]Should be cut here somewhere [/i][/font]', 60)
print(cutText)
Håkon