ansaurus

Question

Finding a point in a string that is not inside BBCodes.

Answer 1

+2 A:

Well, the obvious easy answer is to present your "summary" without any bbcode-driven markup at all (the regex below was taken from another source):

$summary = substr( preg_replace( '|\[[\/\!]*?[^\[\]]*?\]|si', '', $article ), 0, 200 );

However, to do the job you explicitly describe is going to require more than just a regex. A lexer/parser would do the trick, but that's a moderately complicated topic. I'll see if I can come up with something.

EDIT

Here's a pretty crude version of a lexer, but for this example it works. This converts an input string into bbcode tokens.

<?php

class SimpleBBCodeLexer
{
  protected
      $tokens = array()
    , $patterns = array(
        self::TOKEN_OPEN_TAG  => "/\\[[a-z].*?\\]/"
      , self::TOKEN_CLOSE_TAG => "/\\[\\/[a-z].*?\\]/"
    );

  const TOKEN_TEXT      = 'TEXT';
  const TOKEN_OPEN_TAG  = 'OPEN_TAG';
  const TOKEN_CLOSE_TAG = 'CLOSE_TAG';

  public function __construct( $input )
  {
    for ( $i = 0, $l = strlen( $input ); $i < $l; $i++ )
    {
      $this->processChar( $input[$i] );
    }
    $this->processChar();
  }

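  protected function processChar( $char=null )
  {
    // NOTE: a static local is shared by every instance of this class,
    // so it is reset once the input has been fully consumed.
    static $tokenFragment = '';
    $tokenFragment = $this->processTokenFragment( $tokenFragment );
    if ( is_null( $char ) )
    {
      // End of input: flush any trailing text, then reset the buffer.
      if ( $tokenFragment !== '' )
      {
        $this->addToken( $tokenFragment );
      }
      $tokenFragment = '';
    } else {
      $tokenFragment .= $char;
    }
  }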

  protected function processTokenFragment( $tokenFragment )
  {
    foreach ( $this->patterns as $type => $pattern )
    {
      if ( preg_match( $pattern, $tokenFragment, $matches ) )
      {
        if ( $matches[0] != $tokenFragment )
        {
          $this->addToken( substr( $tokenFragment, 0, -( strlen( $matches[0] ) ) ) );
        }
        $this->addToken( $matches[0], $type );
        return '';
      }
    }
    return $tokenFragment;
  }

  protected function addToken( $token, $type=self::TOKEN_TEXT )
  {
    $this->tokens[] = array( $type => $token );
  }

  public function getTokens()
  {
    return $this->tokens;
  }
}

$l = new SimpleBBCodeLexer( 'some [b]sample[/b] bbcode that [i] should [url="http://www.google.com"]support[/url] what [/i] you need.' );

echo '<pre>';
print_r( $l->getTokens() );
echo '</pre>';

The next step would be to create a parser that loops over these tokens and takes action as it encounters each type. Maybe I'll have time to make it later...
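For what it's worth, here is a rough idea of what such a token-consuming step might look like, sketched in Python rather than PHP. It assumes the tokens arrive as (type, text) pairs using the same TEXT / OPEN_TAG / CLOSE_TAG names as the lexer; the function name `summarize_tokens` is made up for illustration, and it assumes the tags are properly nested.

```python
import re

def summarize_tokens(tokens, limit=200):
    """Build a summary from (type, text) tokens, cutting only inside TEXT tokens."""
    out = []
    open_tags = []      # names of tags opened but not yet closed
    remaining = limit   # visible characters we may still emit
    for kind, text in tokens:
        if kind == "TEXT":
            piece = text[:remaining]
            out.append(piece)
            remaining -= len(piece)
            if remaining == 0:
                break
        elif kind == "OPEN_TAG":
            # Assumption: an open tag looks like [name] or [name=...]
            name = re.match(r"\[([a-z]+)", text, re.IGNORECASE).group(1)
            open_tags.append(name)
            out.append(text)
        elif kind == "CLOSE_TAG":
            if open_tags:
                open_tags.pop()
            out.append(text)
    # Close whatever is still open, innermost first.
    out.extend("[/%s]" % name for name in reversed(open_tags))
    return "".join(out)

tokens = [
    ("TEXT", "some "), ("OPEN_TAG", "[b]"), ("TEXT", "sample"),
    ("CLOSE_TAG", "[/b]"), ("TEXT", " bbcode"),
]
print(summarize_tokens(tokens, 8))
```

Only TEXT tokens count toward the limit, so the cut can never land in the middle of a tag, and any tag left open at the cut gets a matching closer appended.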

Peter Bailey 2009-07-28 20:38:58

Added the lexer example

Peter Bailey 2009-07-28 22:03:10

Answer 2

+1 A:

This does not sound like a job for (only) regex. "Plain programming" logic is a better option:

grab a character other than a '[', increase a counter;
if you encounter an opening tag, keep advancing until you reach the closing tag (don't increase the counter!);
stop grabbing text when your counter has reached 200.
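
The steps above could be sketched roughly like this in Python. This is only one reading of them: it assumes "closing tag" means the matching [/tag] (so nothing between a tag pair is counted), that tags are well-formed, and the names `TAG` and `find_cut_point` are made up for the example.

```python
import re

# An opening tag such as [b] or [url="..."], or a closing tag such as [/b].
TAG = re.compile(r"\[(/?)([a-z]+)[^\]]*\]", re.IGNORECASE)

def find_cut_point(text, limit=200):
    """Return an index that lies outside every tag pair, after `limit` plain characters."""
    depth = 0     # how many tags are currently open
    counted = 0   # characters seen outside every tag pair
    i = 0
    while i < len(text):
        if depth == 0 and counted >= limit:
            return i
        m = TAG.match(text, i)
        if m:
            if m.group(1):                 # closing tag
                depth = max(depth - 1, 0)
            else:                          # opening tag
                depth += 1
            i = m.end()                    # skip the tag, do not count it
            continue
        if depth == 0:
            counted += 1
        i += 1
    return len(text)

sample = "ab[b]cd[/b]ef"
print(sample[:find_cut_point(sample, 3)])
```

Tracking a depth instead of a simple flag lets nested tags such as [i][b]...[/b][/i] be skipped as one unit.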

Bart Kiers 2009-07-28 20:39:59

Answer 3

+4 A:

First off, I would suggest considering what you will do with a post that is entirely wrapped in BBcodes, as is often true in the case of a font tag. In other words, a solution to the problem as stated will easily lead to 'summaries' containing the entire article. It may be more valuable to identify which tags are still open and append the necessary BBcodes to close them. Of course, in the case of a link, it will require additional work to ensure you don't break it.
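
As a minimal sketch of the "close what is still open" idea (in Python; the names `TAG` and `close_open_tags` are hypothetical), assuming the text has already been cut at a point that is not in the middle of a tag and that every tag used has a closing form:

```python
import re

# An opening tag such as [b] or [url="..."], or a closing tag such as [/b].
TAG = re.compile(r"\[(/?)([a-z]+)[^\]]*\]", re.IGNORECASE)

def close_open_tags(fragment):
    """Append closing tags for every tag still open at the end of `fragment`."""
    stack = []
    for m in TAG.finditer(fragment):
        name = m.group(2).lower()
        if m.group(1):
            # Closing tag: unwind to its opener, ignoring strays.
            if name in stack:
                while stack.pop() != name:
                    pass
        else:
            stack.append(name)
    # Innermost tag is closed first.
    return fragment + "".join("[/%s]" % name for name in reversed(stack))

print(close_open_tags("[font]Hello [b]wor"))
```

This does nothing clever for links: a [url] cut short is simply closed, which keeps the markup balanced but may leave truncated link text.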

krdluzni 2009-07-28 20:40:10

Oooo, that is a very, very good point, about what if the whole thing is wrapped in a bbcode. I may have to rethink this.

Sherri 2009-07-29 16:31:06

Answer 4

A:

Here is a start. I don't have access to PHP at the moment, so it might need some tweaking to get it to run. Also, this will not ensure that tags are closed (e.g. the string could have [url] without [/url]). Also, if a string is invalid (i.e. not all square brackets are matched) it might not return what you want.

function getIndex($str, $minLen = 200)
{
  //on short input, return the whole string
  if(strlen($str) <= $minLen)
    return strlen($str);

  //get first minLen characters
  $substr = substr($str, 0, $minLen);

  //does it have a '[' that is not closed?
  if(preg_match('/\[[^\]]*$/', $substr))
  {
    //find the next ']', if there is one
    $pos = strpos($str, ']', $minLen);

    //now, make the substr go all the way to that ']'
    if($pos !== false)
      $substr = substr($str, 0, $pos+1);
  }

  //now, it may be better to return $substr, but you specifically
  //asked for the index, which is the length of this substring.
  return strlen($substr);
}

Kip 2009-07-28 20:40:41

Answer 5

A:

I wrote this function, which should do just what you want. It counts n characters (except those in tags) and then closes any tags that still need to be closed. An example use is included in the code. The code is in Python, but should be really easy to port to other languages, such as PHP.

def limit(input, length):
  """Splits a text after (length) characters, preserving bbcode"""

  stack = []
  counter = 0
  output = ""
  tag = ""
  insideTag = 0           # 0 = Outside tag, 1 = Opening tag, 2 = Closing tag, 3 = Opening tag, parameters section

  for i in input:
    if counter >= length: # If we have reached the max length (add " and i == ' '") to not make it split in a word
      break
    elif i == '[':        # If we have reached a tag
      insideTag = 1
    elif i == '/' and insideTag == 1:   # A slash inside an opening tag makes it a closing tag
      insideTag = 2
    elif i == '=' and insideTag >= 1:   # '=' inside a tag starts the parameters section
      insideTag = 3
    elif i == ']':        # If we have reached the closing of a tag
      if insideTag == 2:  # If we are in a closing tag
        if stack:         # Guard against a stray closing tag
          stack.pop()     # Pop the last tag, we closed it
      elif insideTag >= 1:# If we are in a tag, parameters or not
        stack.append(tag) # Add current tag to the tag-stack
      insideTag = 0       # Either way, we are outside a tag again
      tag = ""
    elif insideTag == 0:  # If we are not in a tag
      counter += 1
    elif insideTag <= 2:  # If we are in a tag and not among the parameters
      tag += i
    output += i

  while len(stack) > 0:
    output += '[/'+stack.pop()+']'   # Add the remaining tags

  return output

cutText = limit('[font]This should be easy:[img]yippee.png[/img][i][u][url="http://www.stackoverflow.com"]Check out this site[/url][/u]Should be cut here somewhere [/i][/font]', 60)
print(cutText)

Håkon 2009-08-15 11:17:35
