Sorry, this is a rookie question, but I don't think I've ever had to handle this problem before.

How can I get an HTML entity out of something like \u00e4, which stands for &auml; (ä)?

Additional information (why I want to do that ^^):

I have backslashes in the string for escaping reasons. When I call stripslashes() I get something like "u00e4", which is not easy to find afterwards :-( (I have to call stripslashes() to be able to store the string in the session and restore it.)

Thanks in advance.

+1  A: 

What do you mean by having problems reloading it? Do you output it to an HTML page? In that case, you might have set the wrong charset. As for using entities, check this out: htmlentities

phant0m
no thats not the problem.
helle
Why the downvote? It's a perfectly reasonable answer given the question.
Álvaro G. Vicario
Then what is the problem? As Álvaro already said, $_SESSION does not care about encodings. Concerning your edit: if stripslashes removes the slash, it means that the string \00e4 was interpreted literally. You seem to be using the octal notation. See Álvaro's post on how to do it with hex.
phant0m
Downvote, because it does not solve the problem (the question has nothing to do with HTML output) but had an upvote. If somebody else has the same problem, he would think this is a good answer, and it's not! Sorry, don't take it personally ;-)
helle
For your original question, it was a perfectly reasonable answer. Excuse me for not possessing the power of foresight to discover your actual problem :)
phant0m
... what ever ...
helle
+2  A: 

With htmlentities():

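<?php

// "\xE4" is ä in ISO-8859-1, so name that encoding explicitly;
// with the UTF-8 default of PHP 5.4+ this lone byte is invalid
// and htmlentities() would return an empty string.
echo htmlentities("\xE4", ENT_QUOTES, 'ISO-8859-1'); // &auml;

?>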

However, it's worth noting that:

  1. Sessions do not care about character encoding.
  2. HTML entities are not required in HTML Unicode documents (except for chars with a special meaning in HTML such as < and >).

So this won't fix your problem, it will just hide it ;-)
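To illustrate point 2, here is a minimal sketch, assuming the script file is saved as UTF-8 (the variable names are made up for the example). htmlspecialchars() escapes only the characters with a special meaning in HTML, while htmlentities() also converts ä:

```php
<?php
// Assumption: this file is saved as UTF-8, so 'ä' is the bytes c3 a4.
$text = '<b>ä</b>';

// Escapes only the HTML special characters; ä is left untouched.
$escaped_special = htmlspecialchars($text, ENT_QUOTES, 'UTF-8');

// Also converts every character that has a named entity.
$escaped_all = htmlentities($text, ENT_QUOTES, 'UTF-8');

echo $escaped_special, PHP_EOL; // &lt;b&gt;ä&lt;/b&gt;
echo $escaped_all, PHP_EOL;     // &lt;b&gt;&auml;&lt;/b&gt;
```

In a document served as UTF-8, the first form is usually enough.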

Update

I had overlooked the reference to \00e4 in the original question. The ä character corresponds to the U+00E4 Unicode code point. However, PHP string literals do not support Unicode code point escapes (PHP 7.0 later added the "\u{e4}" syntax). If you need to type it in your PHP code and your keyboard does not have that symbol, you can save the document as UTF-8 and then provide the UTF-8 bytes (c3 a4) with the double quote syntax:

<?php
// \[0-7]{1,3} or \x[0-9A-Fa-f]{1,2}
echo "\xc3\xa4";
?>

Still, this has no relation to sessions or HTML. I can't understand what your exact problem is.
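As a quick check of the byte sequence mentioned above, the following sketch compares the hand-written UTF-8 bytes with what json_decode() produces for the JSON escape. It assumes the json extension is available; the variable names are only illustrative:

```php
<?php
// UTF-8 encoding of U+00E4, written with hexadecimal escapes.
$a_umlaut = "\xc3\xa4";

// Single quotes keep \u00e4 literal, so json_decode() sees a JSON string.
$from_json = json_decode('"\u00e4"');

echo bin2hex($a_umlaut), PHP_EOL;      // c3a4
echo strlen($a_umlaut), PHP_EOL;       // 2 (bytes, not characters)
var_dump($from_json === $a_umlaut);    // bool(true)
```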

Second update

So serialize() cannot handle associative arrays and json_decode() cannot be fed with json_encode()'s output...

<?php

$associative_array = array(
    'foo' => 'ä',
    'bar' => 33,
    'gee' => array(10, 20, 30),
);

var_dump($associative_array);
echo PHP_EOL;
var_dump(serialize($associative_array));
echo PHP_EOL;
var_dump(unserialize(serialize($associative_array)));
echo PHP_EOL;

var_dump(json_encode($associative_array));
echo PHP_EOL;
var_dump(json_decode(json_encode($associative_array)));
echo PHP_EOL;

?>

...

array(3) {
  ["foo"]=>
  string(2) "ä"
  ["bar"]=>
  int(33)
  ["gee"]=>
  array(3) {
    [0]=>
    int(10)
    [1]=>
    int(20)
    [2]=>
    int(30)
  }
}

string(83) "a:3:{s:3:"foo";s:2:"ä";s:3:"bar";i:33;s:3:"gee";a:3:{i:0;i:10;i:1;i:20;i:2;i:30;}}"

array(3) {
  ["foo"]=>
  string(2) "ä"
  ["bar"]=>
  int(33)
  ["gee"]=>
  array(3) {
    [0]=>
    int(10)
    [1]=>
    int(20)
    [2]=>
    int(30)
  }
}

string(42) "{"foo":"\u00e4","bar":33,"gee":[10,20,30]}"

object(stdClass)#1 (3) {
  ["foo"]=>
  string(2) "ä"
  ["bar"]=>
  int(33)
  ["gee"]=>
  array(3) {
    [0]=>
    int(10)
    [1]=>
    int(20)
    [2]=>
    int(30)
  }
}

It appears to me that you are adding several layers of complexity to a simple script because you are making assumptions about how some PHP functions work instead of checking the manual or testing yourself. At this point, the information provided hardly resembles the original question and we still haven't seen a single line of code.

My advice so far is that you try to stop debugging your app as a whole, divide it into smaller pieces and use var_dump() to find out what each of these parts actually generates. Don't assume things: test stuff yourself. Also, take into account that PHP doesn't support Unicode natively as other languages do. Every single task that involves multi-byte string handling must be carefully implemented with the appropriate multi-byte functions, which often require you to hard-code the character encoding.
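One small test of that kind, as a sketch: passing true as the second argument of json_decode() returns nested associative arrays instead of stdClass objects, so a whole nested array can make the round trip without walking through every value. The sample data is invented for the example and the file is assumed to be saved as UTF-8:

```php
<?php
// Hypothetical nested data; assumes this file is saved as UTF-8.
$data = array(
    'foo'    => 'ä',
    'nested' => array('bar' => array('ö', 'ü'), 'count' => 2),
);

$json     = json_encode($data);       // umlauts become \u00e4 etc.
$restored = json_decode($json, true); // true => associative arrays

echo $json, PHP_EOL;
var_dump($restored === $data);        // bool(true)
```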

Álvaro G. Vicario
So how do you get an entity from \00e4? I don't want to parse too much; if I parse \00e4 to \xE4 I can just as well parse it to &auml; ...
helle
Actually, `\00e4` means nothing in PHP. It's not a valid octal number because it contains an `e`. How do you get such a string in the first place?
Álvaro G. Vicario
It comes from the json_encode function. I thought it was Unicode. So what is this? Any ideas? Thanks
helle
It's json's way of denoting Unicode characters. In that case, you should be able to use json_decode.
phant0m
If you use json_encode() you should be getting `\u00e4`, not `\00e4`. Unlike PHP, JavaScript does support Unicode code points. If you want the original string back, just use json_decode()!
Álvaro G. Vicario
json_decode will work for decoding a string, but doesn't work in my case, because I json_encode an associative array to be able to store it in the session (FYI, serialize didn't work the way I needed it to in this special case!). I don't want to walk through every value, because the array is huge and nested.
helle
A: 

I'm not sure whether this helps you, but take a look at WordPress's sanitize_title function, where you can find some huge character tables.

fabrik
i am not using wordpress
helle
You don't need to use WordPress. Use some code from the linked source if it helps.
fabrik
A: 

As you can see in the discussions and answers, this is a problem that PHP can't handle natively (or at least nobody here knows how so far).

I suggest using this very heavy function ... I mean, this is my solution so far, which I don't like very much.

function parse_umlaut($string){

        $string = str_replace('u00c4', 'Ä', $string);
        $string = str_replace('u00e4', 'ä', $string);
        $string = str_replace('u00d6', 'Ö', $string);
        $string = str_replace('u00f6', 'ö', $string);
        $string = str_replace('u00dc', 'Ü', $string);
        $string = str_replace('u00fc', 'ü', $string);
        $string = str_replace('u00df', 'ß', $string);

        return $string;
 }
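
For comparison, a more generic variant could be sketched with a regular expression instead of a lookup table. This is only an illustration under stated assumptions: the function name unicode_escapes_to_entities is made up, the input must still contain the backslash (\u00e4, not u00e4), and characters outside the Basic Multilingual Plane, which JSON encodes as surrogate pairs, are not handled.

```php
<?php
// Hypothetical helper: turns JSON-style \uXXXX escapes into
// numeric HTML character references such as &#x00e4;.
function unicode_escapes_to_entities($string)
{
    return preg_replace_callback(
        '/\\\\u([0-9a-fA-F]{4})/', // a literal backslash, "u", four hex digits
        function ($match) {
            return '&#x' . strtolower($match[1]) . ';';
        },
        $string
    );
}

// Single quotes keep the backslash literal.
echo unicode_escapes_to_entities('M\u00e4dchen'), PHP_EOL; // M&#x00e4;dchen
```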
helle
why downvote this?
helle
I thought it was a humorous answer... There are more than 100,000 Unicode code points--are you planning to replace them all one by one or just add new lines as your code crashes with new user input? Not to mention that it has a typo (it omits the backslash) and it fixes the wrong problem.
Álvaro G. Vicario
... what ever ...
helle
Why can't you just post the code that gives you these umlaut strings? If we could look at that, maybe we could help you fix your problem.
phant0m