ansaurus

Question

How to get the length of a string containing character references while counting the character references as one single character?

Answer 1

A:

Have a look at mb_strlen

Benjamin Cremer 2010-10-14 12:34:30

Answer 2

A:

mb_strlen('string', 'UTF-8');

Alexander.Plutov 2010-10-14 12:36:13

His strings aren't UTF-8, they literally contain &#00f9

Paul Dixon 2010-10-14 12:37:06

@Paul Dixon -- the question has been edited. It didn't contain entities when the question was originally asked; the entities have been edited in, and I don't believe they are what the questioner intended. (I presume it was you that downvoted all the mb_strlen() answers?)

Spudley 2010-10-14 13:23:48

Answer 3

A:

As you seem to already know, strlen() will return the wrong value when you use it with Unicode strings.

But there is a Unicode-safe alternative called mb_strlen().

See the manual for more info: http://uk3.php.net/manual/en/function.mb-strlen.php
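A minimal sketch of the difference, assuming the string is UTF-8 encoded and the mbstring extension is loaded (the sample string is only an illustration):

```php
<?php
// "Stackù" as UTF-8: ù (U+00F9) is stored as the two bytes 0xC3 0xB9.
$str = "Stack\xC3\xB9";

$bytes = strlen($str);              // counts bytes
$chars = mb_strlen($str, 'UTF-8');  // counts characters (code points)

var_dump($bytes); // int(7)
var_dump($chars); // int(6)
```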

Spudley 2010-10-14 12:36:17

Answer 4

+1 A:

As your strings contain literal encodings of unicode chars (rather than being, say, UTF-8 encoded) you could obtain the length by simply replacing them with a dummy char, thus:

$length=strlen(preg_replace('/&#[0-9a-f]{4}/', '_', $raw));

If they were encoded with something PHP understands, like UTF-8, you could use mb_strlen instead.
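As a rough sketch of how the pattern might be loosened, the version below also accepts well-formed hexadecimal and decimal references. The function name referenceAwareLength is hypothetical, and the set of accepted forms is an assumption about what the input may contain:

```php
<?php
// Hypothetical helper (not a PHP built-in): counts each character
// reference as a single character by swapping it for a placeholder.
function referenceAwareLength(string $raw): int
{
    // Well-formed hex (&#x00f9;) and decimal (&#249;) references are tried
    // first, then the four-hex-digit form without "x" (semicolon optional).
    $pattern = '/&#(?:x[0-9a-f]+;|[0-9]+;|[0-9a-f]{4};?)/i';

    // strlen() counts bytes, so this assumes the remaining text is single-byte.
    return strlen(preg_replace($pattern, '_', $raw));
}

var_dump(referenceAwareLength('Stack&#00f9'));   // int(6)
var_dump(referenceAwareLength('Stack&#x00f9;')); // int(6)
var_dump(referenceAwareLength('Stack&#249;'));   // int(6)
```

Anything that does not match one of the three forms is counted character by character, as strlen would count it.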

Paul Dixon 2010-10-14 12:39:45

Due to Unicode normalization, this may erroneously report the length of a decomposed sequence (such as `e` followed by a combining acute accent) as being 2, when it should only be 1 (the letter `é`).
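The combining-character concern can be illustrated with a short sketch. It assumes UTF-8 input and PCRE built with Unicode support, and it counts grapheme clusters with the `\X` escape rather than normalizing the string, which is a swapped-in technique and not something proposed in this thread:

```php
<?php
// "e" followed by U+0301 COMBINING ACUTE ACCENT, written as raw UTF-8 bytes.
$decomposed = "e\xCC\x81";

// mb_strlen() counts code points, so the combining mark is counted separately.
$codePoints = mb_strlen($decomposed, 'UTF-8');

// \X matches one extended grapheme cluster (what a reader sees as one character).
$graphemes = preg_match_all('/\X/u', $decomposed);

var_dump($codePoints); // int(2)
var_dump($graphemes);  // int(1)
```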

Victor Nicollet 2010-10-14 12:56:20

The OP didn't specify the exact encoding, but the regex could be loosened (an exercise for the reader and all that :) ). I'd guess the OP intended HTML-style entities, in which case Gumbo's answer is a good one.

Paul Dixon 2010-10-14 13:05:31

Answer 5

+1 A:

strlen is a single-byte string function that fails on multi-byte strings as it only returns the number of bytes rather than the number of characters (since in single-byte strings every byte represents one character).

For multi-byte strings use strlen’s multi-byte counterpart mb_strlen instead and don’t forget to specify the proper character encoding.

And to have HTML character references being interpreted as a single character, use html_entity_decode to replace them by the characters they represent:

$str = html_entity_decode('Stack&#x00f9;', ENT_QUOTES, 'UTF-8');
var_dump(mb_strlen($str, 'UTF-8'));  // int(6)

Note that &#00f9 is not a valid character reference as it’s missing an x or X after &# for the hexadecimal notation and a ; after the hexadecimal value.
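If the input really does use the malformed &#00f9 form, one possible approach is to rewrite it into a valid reference before decoding. This is only a sketch: the function name lengthWithLooseReferences is hypothetical, and it assumes every &# followed by four hex digits is meant as a hexadecimal code point, so a four-digit decimal reference such as &#1234; would be misread.

```php
<?php
// Hypothetical helper (not a PHP built-in).
function lengthWithLooseReferences(string $raw): int
{
    // Rewrite "&#00f9" or "&#00f9;" into the valid form "&#x00f9;".
    // Assumption: four hex digits after "&#" always mean a hex code point.
    $fixed = preg_replace('/&#([0-9a-f]{4});?/i', '&#x$1;', $raw);

    // Decode the now-valid references, then count characters, not bytes.
    $decoded = html_entity_decode($fixed, ENT_QUOTES, 'UTF-8');
    return mb_strlen($decoded, 'UTF-8');
}

var_dump(lengthWithLooseReferences('Stack&#00f9')); // int(6)
```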

Gumbo 2010-10-14 12:39:58

Answer 6

+2 A:

I would go with:

$len = mb_strlen(html_entity_decode($myString, ENT_QUOTES, 'UTF-8'), 'UTF-8');

Although I would first question why you have HTML entities inside your strings, as opposed to manipulating actual UTF-8 encoded strings.

Also, be careful in that your HTML entities are not correctly written (they need to end with a semicolon). If you do not add the semicolon, any entity-related functions will fail, and many browsers will fail to render your entities correctly.

Victor Nicollet 2010-10-14 12:44:19
