ansaurus

Question

PHP utf8 problem

Answer 1

+3 A:

First of all, and I'll get to UTF-8 later if nobody else answers, iterating like you are is a very bad way to search through an array. PHP has built-in functions just for that:

http://fr.php.net/array_search

So you might want to give that a try and see if it helps with your problem. Also make sure that the PHP file you're writing is itself encoded in UTF-8!

UPDATE:

Try the following code, which works just fine on my server. If it doesn't work check that PHP is configured to work with UTF-8 by default, or add the necessary ini_set calls.

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN">
<html>
<head><title>Norwegian UTF-8 test</title>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
</head>

<body>

<?php

function isSpecial($char) {
    $special_chars = array("æ", "ø", "å", "か");
    return (array_search($char, $special_chars) !== false);
}

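if (isset($_REQUEST["char"])) {
    // Escape the user input before echoing it back into the page.
    $char = $_REQUEST["char"];
    echo htmlspecialchars($char, ENT_QUOTES, "UTF-8").(isSpecial($char)?" (true)":" (false)");
}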


?>

<form  method="POST" accept-charset="UTF-8">
<input type="text" name="char">
<input type="submit" value="submit">
</form>


</body>
</html>

Gilles 2008-10-03 12:45:54

Thanks for the answer. I did use array search, but it did not see that the array ø was the same as the UTF-8 ø, so I made my own loop to test different encodings and stuff. The PHP file is in UTF-8.

Christoffer 2008-10-03 12:49:42

@Christoffer I'll write a test file and give it a try on my server. I use UTF-8 extensively, so I know that I have everything configured properly UTF-8-wise.

Gilles 2008-10-03 13:04:58

Thank you so much:)

Christoffer 2008-10-03 13:08:35

@Christoffer : code added

Gilles 2008-10-03 13:34:17

Thanks! Your example works fine on my computer. The error has to be somewhere else. I will update the question with some more information.

Christoffer 2008-10-03 13:40:40

@Gilles... I've not seen !== false used like that in a return. Could you explain it please? Also, you do not need parentheses around your return arguments. Since return is a language construct, parentheses actually slow down the processing.

gaoshan88 2008-10-03 14:00:01

@gaoshan88 from php.net's article on array_search: This function may return Boolean FALSE, but may also return a non-Boolean value which evaluates to FALSE, such as 0 or "". Please read the section on Booleans for more information. Use the === operator for testing the return value of this function.

Gilles 2008-10-03 15:13:20

and obviously ==='s evil twin is !==
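To make the point concrete, here is a minimal sketch of why the strict comparison matters. The three-element array is only an illustration; the key detail is that a match at index 0 is falsy under a loose check.

```php
<?php
// Illustrative list; the first element sits at index 0.
$special_chars = array("æ", "ø", "å");

// array_search returns the key of the match, here int(0).
$index = array_search("æ", $special_chars);

// Loose check: 0 is falsy, so a real match looks like a miss.
$looseFound = (bool) $index;

// Strict check: only the boolean false means "not found".
$strictFound = ($index !== false);

var_dump($looseFound);  // bool(false)
var_dump($strictFound); // bool(true)
```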

Gilles 2008-10-03 15:13:51

Cool, thanks Gilles.

gaoshan88 2008-10-03 15:33:03

Answer 2

A:

See if you have mbstring extension installed

Mote 2008-10-03 12:50:57

I do have it installed.

Christoffer 2008-10-03 12:52:30

Answer 3

+3 A:

If your PHP script file has an ANSI encoding instead of UTF-8, then on the byte level those Norwegian characters will be different from what they would be if they were encoded in UTF-8. Since PHP is a byte-processing language, not a text-processing language, it duly compares the byte sequences and concludes they don't match.

To resolve this, you can either make sure that your PHP script has the same encoding as the character set you're comparing against, or you can use the iconv or mbstring libraries to convert to appropriate character sets.
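A minimal sketch of the conversion route, assuming the mismatched value is ISO-8859-1 (Latin-1) and the mbstring extension is loaded. The characters are written as explicit byte escapes so the example does not depend on how this file itself is saved.

```php
<?php
// "ø" in two encodings, written as raw bytes.
$utf8   = "\xC3\xB8"; // UTF-8: two bytes
$latin1 = "\xF8";     // ISO-8859-1: one byte

// Byte-wise comparison: same character, different bytes.
$sameBefore = ($utf8 === $latin1);

// Convert the Latin-1 value to UTF-8 before comparing.
// Argument order is (string, to_encoding, from_encoding).
$converted = mb_convert_encoding($latin1, "UTF-8", "ISO-8859-1");
$sameAfter = ($utf8 === $converted);

var_dump($sameBefore); // bool(false)
var_dump($sameAfter);  // bool(true)
```

With iconv instead of mbstring, the equivalent call would be iconv("ISO-8859-1", "UTF-8", $latin1).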

Also, if you haven't read it, read this: http://www.joelonsoftware.com/articles/Unicode.html

Update:
Another point to take into account: make sure that what you're passing into this function is what you think it is. If you're looping across a string one character at a time with the array indexing operator, it won't work, because a UTF-8 string uses two or more bytes (two or more index positions) to store each non-ASCII character. mbstring has functions that extract text from strings based on character positions, not byte positions.
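A small sketch of the difference between byte positions and character positions, assuming mbstring is available. The sample word "blåbær" is just an illustration and is written with byte escapes so the result does not depend on the file encoding.

```php
<?php
// "blåbær" as explicit UTF-8 bytes; å and æ take two bytes each.
$word = "bl\xC3\xA5b\xC3\xA6r";

$byteCount = strlen($word);             // counts bytes
$charCount = mb_strlen($word, "UTF-8"); // counts characters

// Byte indexing returns only the first half of the two-byte sequence for å.
$brokenThird = $word[2];

// mb_substr works on character positions and returns the whole character.
$third = mb_substr($word, 2, 1, "UTF-8");

// Walk the string one character at a time.
$chars = array();
for ($i = 0; $i < $charCount; $i++) {
    $chars[] = mb_substr($word, $i, 1, "UTF-8");
}

var_dump($byteCount, $charCount); // int(8) int(6)
```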

Joeri Sebrechts 2008-10-03 12:54:14

Thanks for the answer. My PHP script is saved as UTF-8. mbstring says that the input char is UTF-8, but the array values are ASCII.

Christoffer 2008-10-03 12:58:32

ASCII doesn't support Norwegian characters, so I'm assuming you mean ANSI latin1. In your case I would just output the characters you're trying to compare and look at their byte values.
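One way to look at the byte values is sketched below. dumpBytes is a hypothetical helper name, and the two sample values are assumptions standing in for the input character and the array element.

```php
<?php
// Hypothetical helper: show a string's bytes as space-separated hex pairs.
function dumpBytes($s) {
    return implode(" ", str_split(bin2hex($s), 2));
}

$fromInput = "\xC3\xB8"; // "ø" as UTF-8
$fromArray = "\xF8";     // "ø" as ISO-8859-1 (Latin-1)

echo dumpBytes($fromInput), "\n"; // c3 b8
echo dumpBytes($fromArray), "\n"; // f8

// mb_check_encoding reports whether the bytes form valid UTF-8.
var_dump(mb_check_encoding($fromInput, "UTF-8")); // bool(true)
var_dump(mb_check_encoding($fromArray, "UTF-8")); // bool(false)
```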

Joeri Sebrechts 2008-10-03 13:55:33

Answer 4

A:

From what I know, your best bet is to install the mbstring (http://www.php.net/manual/en/ref.mbstring.php) extension if you have access to the webserver.

Benny Wong 2008-10-03 12:56:10

Answer 5

A:

Try using the UTF-8 encoding and decoding functions (utf8_encode and utf8_decode). That might help.

Mote 2008-10-03 13:01:44

Answer 6

+1 A:

I finally figured it out. It might not be a nice way to do it, but it works.

It seems like the array I was working with was in a different charset than the input character. I solved this by making a string of all the array elements and then using mb_strpos to search for the characters. So the only change to the code is the isNorwegianChar function. The new function looks like this:

function isNorwegianChar($Char)
{
    $sNorwegianChars = "'aAbBcCdDeEfFgGhHiIjJkKlLmMnNoOpPqQrRsStTuUvVwWxXyYzZæÆøØåÅ=() -,";

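    // mb_strpos returns 0 for a match at the first position,
    // so compare strictly against false instead of relying on truthiness.
    return mb_strpos($sNorwegianChars, $Char, 0, "UTF-8") !== false;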
}

Thanks for all the help!

Christoffer 2008-10-03 15:43:05

Answer 7

A:

As the problem is to separate Norwegian word(s) from Chinese ones, why don't you use an explicit glyph to do so (I personally like "¶"), instead of relying on an algorithm?

impulsiv¶形衝動的

Then use mb_split, or mb_substr combined with mb_strpos.

You can easily replace it with a space if you need to output the string!
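A minimal sketch of the mb_strpos plus mb_substr variant, assuming the data and this file are both UTF-8 and that every line contains the separator. The variable names are illustrative only.

```php
<?php
// The example line, with the pilcrow as separator.
$line = "impulsiv¶形衝動的";

// Character position of the separator (not the byte position).
$pos = mb_strpos($line, "¶", 0, "UTF-8");

if ($pos === false) {
    // No separator found: treat the whole line as the Norwegian part.
    $norwegian = $line;
    $chinese   = "";
} else {
    $norwegian = mb_substr($line, 0, $pos, "UTF-8");
    $chinese   = mb_substr($line, $pos + 1, null, "UTF-8");
}

// Swap the separator for a space when the string is displayed.
$display = str_replace("¶", " ", $line);
```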

Sadly, PCRE in PHP doesn't allow us to use \p with Unicode block names (script names such as \p{Han} do work with the /u modifier).

(look for "InMusicalSymbols" in regexp.reference, in § "Unicode character properties", to understand what I mean)
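For what it's worth, script-name properties (as opposed to block names like "InMusicalSymbols") are accepted by PCRE builds with Unicode property support. A hedged sketch, assuming such a build, UTF-8 data, and a line with no separator at all; the variable names are illustrative:

```php
<?php
// Illustrative line without any separator: Norwegian word, then Chinese text.
$line = "impulsiv形衝動的";

// Lazily capture everything before the first Han character, then the rest.
// The /u modifier makes PCRE treat pattern and subject as UTF-8.
if (preg_match('/^(.*?)(\p{Han}.*)$/u', $line, $m)) {
    $norwegian = $m[1];
    $chinese   = $m[2];
} else {
    // No Han characters found: the whole line is the Norwegian part.
    $norwegian = $line;
    $chinese   = "";
}
```

This only detects Han ideographs; other scripts would need their own properties.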

2008-10-05 17:57:15

Thanks for the suggestion! The reason for not using a symbol and splitting the string on that symbol is that the file containing the string contains 22 000 lines. And I don't want to edit 22k lines manually.

Christoffer 2008-10-06 08:20:10
