I am trying to use the Microsoft Bing API:

$data = file_get_contents("http://api.microsofttranslator.com/V2/Ajax.svc/Speak?appId=APPID&text={$text}&language=ja&format=audio/wav");
$data = stripslashes(trim($data));

The returned data has an odd ' ' character as its first character. It is not a space, because I trimmed the data before returning it.

The ' ' character turned out to be %EF%BB%BF.

I wonder why this happens. Maybe it is a bug on Microsoft's side?

My question is simply: how can I remove this %EF%BB%BF in PHP?

Thank you

A: 

To remove it from the beginning of the string (only):

$data = preg_replace('/^%EF%BB%BF/', '', $data);
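
If the string holds the raw BOM bytes rather than the percent-encoded text, the same anchored replacement can be written with hex escapes. A minimal sketch, assuming PHP 7 or later; the helper name strip_utf8_bom is made up for illustration and handles both forms:

```php
<?php
// Hypothetical helper (not part of the original answer): removes a leading
// UTF-8 BOM, given either as raw bytes or as percent-encoded text.
function strip_utf8_bom(string $data): string
{
    // \xEF\xBB\xBF matches the three raw BOM bytes; the second alternative
    // matches the literal text "%EF%BB%BF". The ^ anchor restricts the match
    // to the start of the string, so the same bytes elsewhere are untouched.
    // preg_replace returns null on error; fall back to the input in that case.
    return preg_replace('/^(?:\xEF\xBB\xBF|%EF%BB%BF)/', '', $data) ?? $data;
}
```

Calling `$data = strip_utf8_bom($data);` leaves a string without a BOM unchanged.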
enobrev
+1  A: 

You could use substr to only get the rest without the UTF-8 BOM:

// if it’s binary UTF-8
$data = substr($data, 3);
// if it’s percent-encoded UTF-8
$data = substr($data, 9);
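
Note that substr cuts unconditionally, so it removes real content when no BOM is present. A guarded variant might look like the sketch below, assuming PHP 7 or later; remove_prefix is a hypothetical helper name, and the sample input is made up:

```php
<?php
// Hypothetical helper: drops $prefix from the start of $data only when the
// string really begins with it; otherwise returns $data unchanged.
function remove_prefix(string $data, string $prefix): string
{
    $len = strlen($prefix);
    // strncmp is binary-safe and compares only the first $len bytes.
    if (strncmp($data, $prefix, $len) === 0) {
        // The cast covers older PHP versions, where substr can return false.
        return (string) substr($data, $len);
    }
    return $data;
}

$data = "\xEF\xBB\xBFhello";                   // made-up input with a raw BOM
$data = remove_prefix($data, "\xEF\xBB\xBF");  // binary UTF-8 BOM
$data = remove_prefix($data, '%EF%BB%BF');     // percent-encoded UTF-8 BOM
```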
Gumbo
Thanks a lot! How did you learn this?
bn
Note: generally speaking, throwing away the BOM is not a good idea. The BOM is there to tell you how the rest of the string should be handled. If you just ignore it, assuming that it's a UTF-8 3-byte BOM, you're setting yourself up for some real problems if/when the encoding ever changes. ... Please have a look at my answer below for more details.
Lee
A: 

It's a byte order mark (http://en.wikipedia.org/wiki/Byte_order_mark), indicating the response is encoded as UTF-8. You can safely remove it, but you should parse the remainder as UTF-8.

Eric Bowman - abstracto -
A: 

$data = str_replace('%EF%BB%BF', '', $data);

You probably shouldn't be using stripslashes. Unless the API returns backslash-escaped data (and there is a 99.99% chance it doesn't), take that call out.

Coronatus
A: 

You should not simply discard the BOM unless you're 100% sure that the stream will: (a) always be UTF-8, and (b) always have a UTF-8 BOM.

The reasons:

  1. In UTF-8, the BOM is optional, so if the service stops sending it at some future point, you'll be throwing away the first three bytes of your response instead.
  2. The whole purpose of the BOM is to identify unambiguously the type of UTF stream being interpreted (UTF-8, UTF-16, or UTF-32) and also to indicate the endianness (byte order) of the encoded data. If you just throw it away, you're assuming that you're always getting UTF-8; this may not be a very good assumption.
  3. Not all BOMs are 3 bytes long; only the UTF-8 BOM is. The UTF-16 BOM is 2 bytes and the UTF-32 BOM is 4 bytes. So if the service switches to a wider UTF encoding in the future, your code will break.

I think a more appropriate way to handle this would be something like:

/* detect the encoding, then convert from detected encoding to ASCII */
$enc = mb_detect_encoding($data);
$data = mb_convert_encoding($data, "ASCII", $enc);
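
As an alternative, the BOM itself can be read to pick the encoding instead of inferring it from the content. The following is only a sketch under some assumptions: decode_by_bom is a hypothetical helper name, the target encoding is UTF-8 rather than ASCII so that non-ASCII text survives, the mbstring extension is loaded, and PHP 7 or later is used.

```php
<?php
// Hypothetical helper: identifies the encoding from a leading BOM, strips the
// BOM, and converts the remainder to UTF-8. Input without a recognized BOM is
// returned unchanged.
function decode_by_bom(string $data): string
{
    // 4-byte BOMs are listed first because the UTF-32LE BOM starts with the
    // same two bytes as the UTF-16LE BOM.
    $boms = [
        "\x00\x00\xFE\xFF" => 'UTF-32BE',
        "\xFF\xFE\x00\x00" => 'UTF-32LE',
        "\xEF\xBB\xBF"     => 'UTF-8',
        "\xFE\xFF"         => 'UTF-16BE',
        "\xFF\xFE"         => 'UTF-16LE',
    ];
    foreach ($boms as $bom => $encoding) {
        $len = strlen($bom);
        if (strncmp($data, $bom, $len) === 0) {
            $rest = (string) substr($data, $len);
            return mb_convert_encoding($rest, 'UTF-8', $encoding);
        }
    }
    return $data;
}
```

One known ambiguity: a UTF-16LE stream whose first character after the BOM is NUL looks the same as a UTF-32LE BOM, and this sketch treats it as UTF-32LE.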

good luck!

Lee