In my PHP application, users can enter tags (as here when asking a question). I assume this is done with a regexp, and I used one, mb_split('\W+', $text), to split on non-word characters.

But I want to allow users to enter characters like "-", "_", "+", "#", etc., which are common in tags and can appear in URLs.

Are there existing solutions for this, or maybe best practices?

Thanks.

+9  A: 

Split by whitespace \s+ instead.

Zach Scrivena
+3  A: 

Split on \s+ (whitespace) instead of \W+ (non-alphanumeric).
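
As a minimal sketch of whitespace splitting, assuming the raw input is already in a string (the sample value of $input below is made up), preg_split() with PREG_SPLIT_NO_EMPTY also discards the empty items that leading or trailing whitespace would otherwise produce:

```php
<?php
// Hypothetical sample input with extra spaces and a tab between tags.
$input = "  php  tag-2 c++\tc#  ";

// Split on runs of whitespace; PREG_SPLIT_NO_EMPTY drops empty pieces.
$tags = preg_split('/\s+/', $input, -1, PREG_SPLIT_NO_EMPTY);

print_r($tags);
```

Characters such as "-", "+" and "#" stay inside the tags because only whitespace acts as a separator.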

chaos
+23  A: 

Use the explode() function and separate by either spaces or commas. Example:

$string = 'tag1 tag-2 tag#3';
$tags = explode(' ', $string); //Tags will be an array
VirtuosiMedia
What if the user inputs a "?" character? It will break the URL.
waney
It shouldn't break the url.
VirtuosiMedia
Don't forget to urlencode() ANY user input. This will prevent '#' and '?' from breaking the URL.
sirlancelot
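
The encoding advice in the comment above could look roughly like this. It is only a sketch: the tag_url() helper and the /tags/ path are hypothetical, and rawurlencode() is used here instead of urlencode() because the tag is placed in the path part of the URL, where a space should become %20 rather than "+".

```php
<?php
// Hypothetical helper: build a link for one tag.
// The /tags/ route is an assumption made for illustration only.
function tag_url($tag) {
    // Percent-encode everything except letters, digits, "-", "_", "." and "~".
    return '/tags/' . rawurlencode($tag);
}

echo tag_url('c#'), "\n";    // /tags/c%23
echo tag_url('what?'), "\n"; // /tags/what%3F
echo tag_url('c++'), "\n";   // /tags/c%2B%2B
```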
+2  A: 

I suppose you could first try to clean up the string before splitting it into tags:

# List characters that you would want to exclude from your tags and clean the string
$exclude = array( '/[?&\/]/', '/\s+/');
$replacements = array('', ' '); 
$tags = preg_replace($exclude, $replacements,  $tags);

# Now split:
$tagsArray = explode(' ', $tags);

You could probably adopt a white list approach to this as well, and rather have characters you accept listed in your pattern.

mike
+2  A: 

You said you wanted it to work like the stackoverflow tagger. This tagger splits them by the whitespace character " ".

If you would like this to be your behavior as well, simply use:

mb_split('\s+', $text)

instead of:

mb_split('\W+', $text)

Good luck!

Willem
A: 

Use preg_match_all.

$tags = array();
// $text holds the raw user input; \S+ matches each run of non-whitespace characters
if (preg_match_all('/\S+/', $text, $matches)) $tags = $matches[0];
// now $tags holds an array of tags

If the tags are in UTF-8, add the u modifier to the regexp.
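
A small sketch of the UTF-8 case, assuming the input string is valid UTF-8 (the sample text is made up). With the u modifier the pattern is matched character by character, so multi-byte letters are never cut apart:

```php
<?php
// Hypothetical UTF-8 input containing accented letters.
$text = 'café naïve-tag c#';

// The u modifier makes PCRE treat both pattern and subject as UTF-8.
$tags = array();
if (preg_match_all('/\S+/u', $text, $matches)) {
    $tags = $matches[0]; // index 0 holds the full matches, one per tag
}

print_r($tags);
```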

vartec
+1  A: 

I use this smart_explode () function to parse tags in my app:

function smart_explode ($exploder, $string, $sort = '') {
  if (trim ($string) != '') {
    $string = explode ($exploder, $string);
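    foreach ($string as $i => $k) {
      $string[$i] = trim ($k);
      // Test the trimmed value so whitespace-only items are dropped too
      if ($string[$i] == '') unset ($string[$i]);
    }
    // array_values() reindexes the keys left behind by unset() and array_unique()
    $u = array_values (array_unique ($string));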
    if ('sort' == $sort) sort ($u);
    return $u;
  } else {
    return array ();
  }
}

It explodes a $string into an array using $exploder as the separator (usually a comma), trims the spaces around tags, removes empty items and duplicates, and even sorts the tags for you if $sort is 'sort'. It returns an empty array when $string is empty or contains only whitespace.

The usage is like:

$mytaglist = smart_explode (',', '  PHP,  ,,regEx ,PHP');

The above will return:

array ('PHP', 'regEx')

To filter out the characters you don’t like, run

 $input = str_replace (array ('?', '$', '%'), '_', $input);

on the raw input string (called $input here) before smart_exploding it. Every “bad” character listed in the array is replaced with an underscore.

Ilya Birman
+1  A: 

The correct approach to handling tags depends on your preferences on processing input: You can either remove the invalid tags entirely, or try and clean the tags so they become valid.

A whitelist of valid characters should be used when cleaning the input: there are simply too many problematic characters to blacklist.

mb_internal_encoding('utf8');

$tags= 'to# do!"¤ fix-this str&ing';
$allowedLetters='\w';
// Note that the hyphen must be first or last in a character class pattern,
// to match hyphens, instead of specifying a character set range
$allowedSpecials='_+#-';

The first approach removes invalid tags entirely:

// The first way: Ignoring invalid tags

$tagArray = mb_split(' ', $tags);

$pattern = '^[' . $allowedLetters . $allowedSpecials . ']+$';

$validTags = array();
foreach($tagArray as $tag)
{
    $tag = trim($tag);
    $isValid = mb_ereg_match($pattern, $tag);
    if ($isValid)
        $validTags[] = $tag;
}

The second approach tries to clean the tags:

// The second way: Cleaning up the tag input

// Remove non-whitelisted characters
$pattern = '[^' . $allowedLetters . $allowedSpecials .']';

$cleanTags = mb_ereg_replace($pattern, ' ', $tags);

// Trim multiple white spaces.
$pattern = '\s+';
$cleanTags = mb_ereg_replace($pattern, ' ', $cleanTags);

$tags = mb_split(' ',$cleanTags);

Replacing illegal characters with whitespace leads to problems sometimes - for example the above "str&ing" is converted to "str ing". Removing the illegal characters entirely would result in "string", which is more useful in some cases.
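
A rough sketch of that removal variant follows. It swaps in preg_replace() and preg_split() with the u modifier in place of the mb_ereg functions used above, and it assumes the input is valid UTF-8. Whitespace is kept in the whitelist so the tags stay separated:

```php
<?php
$tags = 'to# do!"¤ fix-this str&ing';

// Delete every character that is not a word character, whitespace,
// or one of the allowed specials (+, #, -). The hyphen comes last in
// the class so it is taken literally; "_" is already covered by \w.
$cleanTags = preg_replace('/[^\w\s+#-]/u', '', $tags);

// Split on whitespace and drop any empty pieces.
$tagArray = preg_split('/\s+/u', $cleanTags, -1, PREG_SPLIT_NO_EMPTY);

print_r($tagArray);
```

Here "str&ing" becomes the single tag "string" instead of the two tags "str" and "ing".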

Jukka Dahlbom