The right way to handle tags depends on how you want to process the input: you can either discard invalid tags entirely or try to clean them so that they become valid.
Use a whitelist to define the valid characters when cleaning the input; there are simply too many problematic characters to blacklist.
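mb_internal_encoding('UTF-8');
// The mb_ereg_* functions work with the regex encoding, so set it explicitly as well
mb_regex_encoding('UTF-8');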
$tags = 'to# do!"¤ fix-this str&ing';
$allowedLetters = '\w';
// The hyphen must come first or last in a character class to match a
// literal hyphen; anywhere else it defines a character range
$allowedSpecials = '_+#-';
The first approach removes invalid tags entirely:
// The first way: Ignoring invalid tags
$tagArray = mb_split(' ', $tags);
$pattern = '^[' . $allowedLetters . $allowedSpecials . ']+$';
$validTags = array();
foreach ($tagArray as $tag)
{
    $tag = trim($tag);
    $isValid = mb_ereg_match($pattern, $tag);
    if ($isValid)
    {
        $validTags[] = $tag;
    }
}
The second approach tries to clean the tags:
// The second way: Cleaning up the tag input
// Remove non-whitelisted characters
$pattern = '[^' . $allowedLetters . $allowedSpecials .']';
$cleanTags = mb_ereg_replace($pattern, ' ', $tags);
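// Collapse runs of whitespace into a single space
$pattern = '\s+';
$cleanTags = mb_ereg_replace($pattern, ' ', $cleanTags);
// Trim the ends so that leading or trailing spaces do not produce empty tags
$tags = mb_split(' ', trim($cleanTags));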
Replacing illegal characters with whitespace can cause problems:
in the example above, "str&ing" becomes "str ing" and is split into two tags.
Removing the illegal characters entirely would instead produce "string", which
is more useful in some cases.
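As a rough sketch of that alternative, the cleanup step could delete illegal characters instead of replacing them with spaces. The `\s` added to the character class and the `$tagList` variable are assumptions made for illustration and are not part of the code above; whitespace is kept only because it still separates the tags.

```php
<?php
// Sketch only: assumes the mbstring extension is loaded and this file is saved as UTF-8.
mb_internal_encoding('UTF-8');
mb_regex_encoding('UTF-8');

$tags = 'to# do!"¤ fix-this str&ing';
$allowedLetters = '\w';
$allowedSpecials = '_+#-'; // hyphen last, so it matches a literal hyphen

// Delete every character that is neither whitelisted nor whitespace.
// \s goes first so the hyphen stays at the end of the character class.
$pattern = '[^\s' . $allowedLetters . $allowedSpecials . ']';
$cleanTags = mb_ereg_replace($pattern, '', $tags);

// Collapse runs of whitespace and trim the ends before splitting.
$cleanTags = trim(mb_ereg_replace('\s+', ' ', $cleanTags));

// $tagList is a hypothetical name; an empty input yields an empty list.
$tagList = ($cleanTags === '') ? array() : mb_split(' ', $cleanTags);
```

With the sample input, "str&ing" should come out as the single tag "string" rather than two fragments. The trade-off is that an accidental separator such as "foo,bar" would be merged into "foobar", so which behavior is preferable depends on the kind of input you expect.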