I've got a bunch of data that can contain a mix of ordinary characters, special characters, accented characters, and so on.

I've been using PHP's iconv with //TRANSLIT, but noticed today that a bullet point gets converted to 'bull'. I don't know which other characters like this are neither converted nor deleted. Characters such as $, *, and % do get removed.

Basically what I'm trying to do is keep letters, but remove just the 'non-language' bits.

This is the code I've been using:


    $slugIt = @iconv('UTF-8', 'ASCII//TRANSLIT', $slugIt);
    $slugIt = preg_replace("/[^a-zA-Z0-9 -]/", "", $slugIt);

Of course, if I move the preg_replace above the iconv call, the accented characters are removed before they can be transliterated, so that doesn't work either.

Any ideas on this? Or on which non-letter characters TRANSLIT misses?

---------------------Edited---------------------------------

Strangely, it doesn't appear to be TRANSLIT that changes the bullet to 'bull'. When I commented out the preg_replace, the 'bull' went back to being a bullet point. Unfortunately, I'm trying to use this to create readable URLs, as well as a few other things, so I would still need to do URL encoding.
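
One possible explanation for the stray 'bull', offered as an assumption rather than something confirmed in this thread: the input may contain the HTML entity `&bull;` instead of a literal bullet character. The character class then strips `&` and `;` and leaves the letters `bull` behind. A minimal sketch, with a hypothetical helper name `stripNonAlnum` and made-up sample input:

```php
<?php
// Hypothetical helper: the same character filter used in the question.
function stripNonAlnum(string $text): string
{
    return preg_replace('/[^a-zA-Z0-9 -]/', '', $text);
}

// Assumed sample input: the bullet arrives as an HTML entity.
$raw = '&bull; First item';

// Filtering the entity directly removes "&" and ";" but keeps "bull".
$filtered = stripNonAlnum($raw);

// Decoding entities first turns "&bull;" into a real bullet character,
// which the filter then removes completely.
$decoded = html_entity_decode($raw, ENT_QUOTES | ENT_HTML5, 'UTF-8');
$cleaned = stripNonAlnum($decoded);

echo $filtered, "\n"; // bull First item
echo $cleaned, "\n";  // " First item" (the leading space remains)
```

If this is what is happening, decoding entities before the iconv and preg_replace steps would be one way to avoid the leftover letters.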

A: 

Try adding the /u modifier to preg_replace. See Pattern Modifiers.
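
As a rough illustration of what `/u` changes: with the modifier, the pattern and subject are treated as UTF-8, so Unicode property classes such as `\p{L}` (letters) and `\p{N}` (numbers) can be used to keep accented letters while dropping symbols. Note that this swaps the question's ASCII-only class for Unicode properties, and the sample string is made up:

```php
<?php
// Assumed sample input, encoded as UTF-8.
$slugIt = 'Café • 100% crème';

// Keep any Unicode letter or number, plus spaces and hyphens.
$kept = preg_replace('/[^\p{L}\p{N} -]/u', '', $slugIt);

echo $kept, "\n"; // Café  100 crème (two spaces where the bullet was)
```

Accented letters survive this step, so a transliteration pass would still be needed for an ASCII-only slug.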

Glass Robot
I've tried /u, but I'm not sure I'm using it properly. This is what I have now:

    $slugIt = @iconv('UTF-8', 'ASCII//TRANSLIT', $slugIt);
    $slugIt = preg_replace("/[^a-zA-Z0-9 -]/u", "", $slugIt);

I'm still getting the 'bull'. I've also tried putting preg_replace above the iconv, but no joy.
pedalpete
A: 

Hi, you can try using a POSIX regex:

    $slugIt = ereg_replace('[^[:alnum:] -]', '', $slugIt);
    $slugIt = @iconv('UTF-8', 'ASCII//TRANSLIT', $slugIt);

[:alnum:] will match any alphanumeric character (including the ones with accents).
Take a look at http://php.net/manual/en/book.regex.php for more information on PHP's POSIX implementation.

Nathan
POSIX regexes and the ereg_* functions are deprecated and not recommended.
zildjohn01
Thanks zildjohn, I never would have known that, or even thought to look.
pedalpete
A: 

In the end this turned out to be a combination of the wrong character set going in AND how Windows handles iconv.

First of all, I had an ISO-8859 character set going in, and even though I was declaring UTF-8 in the head of the document, PHP was still treating the input as ISO.

Secondly, when using iconv on Windows, you apparently cannot combine ASCII//TRANSLIT//IGNORE, which thankfully you can do on Linux.

Now on Linux, all accented characters are transliterated to their base character, and non-alphanumerics are removed.

Here's the new code:

    $slugIt = @iconv('iso-8859-1', 'ASCII//TRANSLIT//IGNORE', $slugIt);
    $slugIt = preg_replace("/[^a-zA-Z0-9]/", "", $slugIt);
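
A hedged sketch of the same idea that does not hard-code the source encoding: check whether the input is already valid UTF-8 and convert from ISO-8859-1 only when it is not. The function name `makeSlugSafe` is hypothetical, the mbstring extension is assumed to be available, and the exact transliteration result depends on the platform's iconv implementation and locale, so only an ASCII example is shown with its output.

```php
<?php
// Hypothetical helper; assumes the mbstring and iconv extensions are loaded.
function makeSlugSafe(string $text): string
{
    // Assumption: anything that is not valid UTF-8 is ISO-8859-1.
    if (!mb_check_encoding($text, 'UTF-8')) {
        $text = mb_convert_encoding($text, 'UTF-8', 'ISO-8859-1');
    }

    // Transliteration output varies by platform and locale.
    $ascii = @iconv('UTF-8', 'ASCII//TRANSLIT//IGNORE', $text);
    if ($ascii === false) {
        // Fall back to the unconverted text; the filter below still applies.
        $ascii = $text;
    }

    // Remove everything that is not an ASCII letter or digit.
    return preg_replace('/[^a-zA-Z0-9]/', '', $ascii);
}

echo makeSlugSafe('Hello, World! 2024'), "\n"; // HelloWorld2024
```

Whatever iconv does with a given accented character, the final filter means the result should contain only ASCII letters and digits.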
pedalpete