+3  Q: 

PHP utf8 problem

I am having trouble comparing a UTF-8 character against an array of Norwegian characters.

All characters except the special Norwegian characters (æ, ø, å) work fine.

function isNorwegianChar($Char)
{
    $aNorwegianChars = array('a', 'A', 'b', 'B', 'c', 'C', 'd', 'D', 'e', 'E', 'f', 'F', 'g', 'G', 'h', 'H', 'i', 'I', 'j', 'J', 'k', 'K', 'l', 'L', 'm', 'M', 'n', 'N', 'o', 'O', 'p', 'P', 'q', 'Q', 'r', 'R', 's', 'S', 't', 'T', 'u', 'U', 'v', 'V', 'w', 'W', 'x', 'X', 'y', 'Y', 'z', 'Z', 'æ', 'Æ', 'ø', 'Ø', 'å', 'Å', '=', '(', ')', ' ', '-');
    $iArrayLength = count($aNorwegianChars);

    for($iCount = 0; $iCount < $iArrayLength; $iCount++)
    {
     if($aNorwegianChars[$iCount] == $Char)
     {
      return true;
     }
    }

    return false;

}

If anyone has any idea what I can do, please let me know.

Update:

The reason I need this is that I'm trying to parse a text file that contains lines with Norwegian and Chinese words, like a dictionary. I want to split each line into two strings, one containing the Norwegian word and one containing the Chinese. These will later be inserted into a database. Example lines:

impulsiv 形 衝動的

imøtegå 動 反對,反駁

imøtekomme 動 符合

alkoholmisbruk(er) 名 濫用酒精 (名 濫用酒精的人)

alkoholpåvirket 形 受酒精影響的

alkotest 名 呼吸性酒精測試

alkymi(st) 名 煉金術 (名 煉金術士)

all, alt, alle, 形 全部, 所有

As you can see, there may be spaces within the words, so I cannot use something easy like explode to split between the Chinese and Norwegian words. What I do is use isNorwegianChar and loop through the line until I find a character that is not in the array.

The problem is that æ, ø and å are not recognized as Norwegian characters, so the code thinks the Chinese word has started.

Here is the code:

// Open file.
$rFile = fopen("norsk-kinesisk.txt", "r");

// Loop through the file.
$Count = 0;
while(!feof($rFile))
{
    if(40== $Count)
    {
     break;
    }

    $sLine = fgets($rFile);

    if(0 == $Count)
    {
     $sLine = mb_substr($sLine, 3);
    }

    $iLineLength   = strlen($sLine);
    $bChineseHasStarted = false;
    $sNorwegianWord  = '';
    $sChineseWord   = '';
    for($iCount2 = 0; $iCount2 < $iLineLength; $iCount2++)
    {
     $char = mb_substr($sLine, $iCount2, 1);

     if(($bChineseHasStarted === false) && (false == isNorwegianChar($char)))
     {
      $bChineseHasStarted = true;
     }

     if(false === $bChineseHasStarted)
     {
      $sNorwegianWord .= $char;
     }
     else
     {
      $sChineseWord .= $char;
     }

     //echo $char;
    }

    $sNorwegianWord = trim($sNorwegianWord);
    $sChineseWord = trim($sChineseWord);

    $Count++;
}

fclose($rFile);
+3  A: 

First of all, and I'll get to UTF-8 later if nobody else answers, iterating like you are is a very bad way to search through an array. PHP has built-in functions just for that:

http://fr.php.net/array_search

So you might want to give that a try and see if it helps with your problem. Also make sure that the PHP file you're writing is also encoded in UTF-8!

UPDATE:

Try the following code, which works just fine on my server. If it doesn't work check that PHP is configured to work with UTF-8 by default, or add the necessary ini_set calls.

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN">
<html>
<head><title>Norwegian UTF-8 test</title>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
</head>

<body>

<?php

function isSpecial($char) {
    $special_chars = array("æ", "ø", "å", "か");
    return (array_search($char, $special_chars) !== false);
}

if (isset($_REQUEST["char"])) {
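    // Escape the submitted value before echoing it back into the page.
    echo htmlspecialchars($_REQUEST["char"], ENT_QUOTES, "UTF-8").(isSpecial($_REQUEST["char"])?" (true)":" (false)");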
}


?>

<form  method="POST" accept-charset="UTF-8">
<input type="text" name="char">
<input type="submit" value="submit">
</form>


</body>
</html>
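
The configuration check mentioned above could look roughly like this. It is only a sketch: which settings actually matter depends on the PHP version and on whether mbstring is loaded, and forcing them per request is just one option next to editing php.ini.

```php
<?php
// Report what mbstring and PHP currently assume (values depend on the server).
echo mb_internal_encoding(), "\n";
echo ini_get('default_charset'), "\n";

// Force UTF-8 for the rest of this request.
mb_internal_encoding('UTF-8');
ini_set('default_charset', 'UTF-8');
```

With the internal encoding set, mbstring functions called without an explicit encoding argument treat their input as UTF-8.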
Gilles
Thanks for the answer. I did use array search, but it did not see that the array ø was the same as the UTF-8 ø, so I made my own loop to test different encodings and stuff. The PHP file is in UTF-8.
Christoffer
@Christoffer I'll write a test file and give it a try on my server. I use UTF-8 extensively so I know that I have everything configured properly UTF-8-wise.
Gilles
Thank you so much:)
Christoffer
@Christoffer : code added
Gilles
Thanks! Your example works fine on my computer. The error has to be somewhere else. I will update the question with some more information.
Christoffer
@Gilles... I've not seen !== false used like that in a return. Could you explain it please? Also, you do not need parentheses around your return args. Since return is a language construct, parentheses actually slow down the processing.
gaoshan88
@gaoshan88 from php.net's article on array_search: This function may return Boolean FALSE, but may also return a non-Boolean value which evaluates to FALSE, such as 0 or "". Please read the section on Booleans for more information. Use the === operator for testing the return value of this function.
Gilles
and obviously ==='s evil twin is !==
Gilles
Cool, thanks Gilles.
gaoshan88
A: 

Check whether you have the mbstring extension installed.

Mote
I do have it installed.
Christoffer
+3  A: 

If your PHP script file has an ANSI encoding (such as Latin-1) instead of UTF-8, then at the byte level those Norwegian characters will differ from their UTF-8 encodings. Since PHP is a byte-processing language, not a text-processing language, it duly compares the byte sequences and concludes they don't match.

To resolve this, you can either make sure that your PHP script has the same encoding as the character set you're comparing against, or you can use the iconv or mbstring libraries to convert to appropriate character sets.
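
A small sketch of the byte-level difference and of the conversion route, assuming the mbstring extension is available. The letter is written as byte escapes so the result does not depend on how the script file itself is saved.

```php
<?php
// "ø" as UTF-8 is the two bytes C3 B8.
$utf8 = "\xC3\xB8";

// The same letter converted to ISO-8859-1 (Latin-1) is the single byte F8.
$latin1 = mb_convert_encoding($utf8, 'ISO-8859-1', 'UTF-8');

echo bin2hex($utf8), "\n";    // c3b8
echo bin2hex($latin1), "\n";  // f8
var_dump($utf8 === $latin1);  // bool(false): different bytes, so no match

// Converting both sides to one encoding makes the comparison succeed.
$roundTrip = mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');
var_dump($utf8 === $roundTrip); // bool(true)
```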

Also, if you haven't read it, read this: http://www.joelonsoftware.com/articles/Unicode.html

Update:
Another point to take into account is to make sure that what you're passing into this function is what you think it is. If you're looping across a string one character at a time with the array indexing operator, it won't work, because a UTF-8 string uses two or more bytes (two or more index positions) to store each non-ASCII character. There are functions in mbstring that copy text out of strings based on character positions, not byte positions.
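
To illustrate the byte/character distinction, here is a sketch assuming mbstring is loaded. The encoding is passed explicitly so the result does not depend on the server's internal encoding. Note that the loop in the question bounds itself with strlen but reads with mb_substr; whether that mismatch is the actual cause is not confirmed in this thread.

```php
<?php
// "imøtegå" with ø and å written as UTF-8 byte escapes.
$word = "im\xC3\xB8teg\xC3\xA5";

echo strlen($word), "\n";             // 9 (bytes)
echo mb_strlen($word, 'UTF-8'), "\n"; // 7 (characters)

// Byte indexing cuts the two-byte ø in half.
echo bin2hex($word[2]), "\n";                        // c3
// mb_substr with an explicit encoding returns the whole character.
echo bin2hex(mb_substr($word, 2, 1, 'UTF-8')), "\n"; // c3b8

// Walk the string by characters, using a character count as the bound.
$chars = array();
$length = mb_strlen($word, 'UTF-8');
for ($i = 0; $i < $length; $i++) {
    $chars[] = mb_substr($word, $i, 1, 'UTF-8');
}
echo count($chars), "\n"; // 7
```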

Joeri Sebrechts
Thanks for the answer. My PHP script is saved as UTF-8, and mbstring says that the input char is UTF-8 but the array values are ASCII.
Christoffer
ASCII doesn't support Norwegian characters, so I'm assuming you mean ANSI Latin-1. In your case I would just output the characters you're trying to compare and look at their byte values.
Joeri Sebrechts
A: 

From what I know, your best bet is to install the mbstring extension (http://www.php.net/manual/en/ref.mbstring.php) if you have access to the web server.

Benny Wong
A: 

Try using the functions for UTF-8 encoding and decoding (utf8_encode and utf8_decode). They might help.

Mote
+1  A: 

I finally figured it out. It might not be a nice way to do it, but it works.

It seems the array I was working with was in a different charset than the input character. I solved this by putting all the array elements into one string and then using mb_strpos to search for the character. So the only change to the code is the isNorwegianChar function. The new function looks like this:

function isNorwegianChar($Char)
{
    $sNorwegianChars = "'aAbBcCdDeEfFgGhHiIjJkKlLmMnNoOpPqQrRsStTuUvVwWxXyYzZæÆøØåÅ=() -,";

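    // mb_strpos returns 0 for a match at the first position and false for
    // no match, so compare against false explicitly. An empty needle is
    // rejected up front because its handling differs between PHP versions.
    return $Char !== '' && mb_strpos($sNorwegianChars, $Char) !== false;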
}

Thanks for all the help!

Christoffer
A: 

As the problem is to separate Norwegian word(s) from Chinese ones, why don't you use an explicit glyph to do so (I personally like "¶"), instead of relying on an algorithm?

impulsiv¶形 衝動的

Then use mb_split, or mb_substr combined with mb_strpos.

You can easily replace it with a space if you need to output the string!

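Sadly, PCRE in PHP doesn't allow us to use \p with Unicode block names such as InMusicalSymbols. Script names such as \p{Han} are supported when the pattern uses the u modifier.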

(look for "InMusicalSymbols" in regexp.reference, in § "Unicode character properties", to understand what I mean)
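
For what it's worth, PCRE builds with Unicode support do accept script names such as \p{Han} under the u modifier, which would allow splitting without editing the file. A rough sketch, assuming the input is valid UTF-8 and this script is saved as UTF-8; splitAtFirstHan is a hypothetical helper name, not something from the question.

```php
<?php
// Hypothetical helper: split a dictionary line at the first Han character.
function splitAtFirstHan($line)
{
    // u: treat pattern and subject as UTF-8; s: let . match the trailing newline.
    // \P{Han} is any character outside the Han script, \p{Han} is a Han character.
    if (preg_match('/^(\P{Han}*)(\p{Han}.*)$/us', $line, $m)) {
        return array(trim($m[1]), trim($m[2]));
    }
    return null; // no Han character found, or the subject is not valid UTF-8
}

list($norwegian, $chinese) = splitAtFirstHan("alkoholmisbruk(er) 名 濫用酒精 (名 濫用酒精的人)\n");
echo $norwegian, "\n"; // alkoholmisbruk(er)
echo $chinese, "\n";   // 名 濫用酒精 (名 濫用酒精的人)
```

A byte order mark at the start of the first line is not a Han character, so it would end up in the Norwegian part and still needs stripping separately.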

Thanks for the suggestion! The reason for not using a symbol and splitting the string on that symbol is that the file containing the string contains 22 000 lines. And I don't want to edit 22k lines manually.
Christoffer