I have a CSS file that looks fine when I open it in gedit, but when it's read by PHP (to merge all the CSS files into one), this CSS has the following chars prepended to it: ï»¿

PHP removes all whitespace, so a random ï»¿ in the middle of the code messes up the entire thing. As I mentioned, I can't actually see these chars when I open the file in gedit, so I can't remove them very easily.

I googled the problem and it's clearly something wrong with the file encoding, which makes sense, since I've been moving the files around between different Linux/Windows servers via FTP and rsync, with a range of text editors. I don't really know much about character encoding though, so help would be appreciated.

If it helps, the file is being saved in UTF-8 format, and gedit won't let me save it in ISO-8859-15 format ("The document contains one or more characters that cannot be encoded using the specified character encoding"). I tried saving it with Windows and Linux line endings, but neither helped.

+11  A: 

Three words for you:

Byte Order Mark (BOM)

That's the representation of the UTF-8 BOM in ISO-8859-1. You have to tell your editor not to use BOMs, or use a different editor to strip them out.
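To see where those three characters come from, here is a minimal sketch, assuming the mbstring extension is available. It takes the three BOM bytes and deliberately misreads them as ISO-8859-1; the variable names are just for illustration.

```php
<?php
// The three bytes that encode U+FEFF (the BOM) in UTF-8.
$bom = "\xEF\xBB\xBF";

// Misread those bytes as ISO-8859-1, then re-encode the result as UTF-8
// so it can be printed on a UTF-8 terminal.
$misread = mb_convert_encoding($bom, 'UTF-8', 'ISO-8859-1');

echo bin2hex($bom), "\n"; // efbbbf
echo $misread, "\n";      // ï»¿
```

Each byte becomes its own ISO-8859-1 character (0xEF is ï, 0xBB is », 0xBF is ¿), which is exactly the junk that shows up at the start of the merged CSS.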

To automate the BOM's removal you can use awk as shown here: http://stackoverflow.com/questions/1068650/using-awk-to-remove-the-byte-order-mark

As another answer says, the best outcome would be for PHP to interpret the BOM correctly. You can set the encoding with mb_internal_encoding(), like this, but note that it only affects the mbstring functions; file-reading functions such as file_get_contents() still return the raw bytes, so the BOM may still have to be stripped explicitly:

 <?php
   //Store the previous encoding in case some other piece
   //of code is sensitive to encoding and counts on the default value.
   $previous_encoding = mb_internal_encoding();

   //Set the internal encoding used by the mbstring functions to UTF-8
   mb_internal_encoding('UTF-8');

   //Process the CSS files...

   //Finally, return to the previous encoding
   mb_internal_encoding($previous_encoding);

   //Rest of the code...
  ?>
Vinko Vrsalovic
Yeah I found that when I googled it, but how do I remove them?
Matt
Remove it manually with vim or something like that
Gus
+1  A: 

If you need to be able to remove the BOM from UTF-8 encoded files, you first need to get hold of an editor that is aware of them.

I personally use "E Text Editor", which is available here.

In the bottom right, there are options for character encoding, including the BOM tag. Load your file, deselect Byte Order Marker if it is selected, resave, and it should be done.

E is not free, but there is a free trial, and it is an excellent editor (with limited TextMate compatibility).

danp
+2  A: 

I don't know PHP, so I don't know if this is possible, but the best solution would be to read the file as UTF-8 rather than some other encoding. The BOM is actually a ZERO WIDTH NO-BREAK SPACE. This is whitespace, so if the file were being read in the correct encoding (UTF-8), then the BOM would be interpreted as whitespace and it would be ignored in the resulting CSS file.

Also, another advantage of reading the file in the correct encoding is that you don't have to worry about characters being misinterpreted. Your editor is telling you that the code page you want to save it in won't do all the characters that you need. If PHP is then reading the file in the incorrect encoding, then it is very likely that other characters besides the BOM are being silently misinterpreted. Use UTF-8 everywhere, and these problems disappear.
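For what it's worth, here is a rough PHP sketch of what "treating the data as UTF-8" looks like, assuming the mbstring and PCRE extensions are available. The sample `$css` string is made up for the example. Decoded as UTF-8, the three BOM bytes count as one character, and a regex with the `/u` modifier can match it as the code point U+FEFF.

```php
<?php
// Hypothetical sample: a CSS string that starts with a UTF-8 BOM.
$css = "\xEF\xBB\xBFbody { margin: 0; }";

echo strlen("\xEF\xBB\xBF"), "\n";             // 3 (bytes)
echo mb_strlen("\xEF\xBB\xBF", 'UTF-8'), "\n"; // 1 (character)

// With the /u modifier PCRE treats pattern and subject as UTF-8,
// so the BOM can be matched as the single code point U+FEFF.
$clean = preg_replace('/^\x{FEFF}/u', '', $css);

echo $clean, "\n"; // body { margin: 0; }
```

Note that this still removes the BOM explicitly rather than relying on it being skipped as whitespace.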

Jeffrey L Whitledge
+1  A: 

BOM is just a sequence of characters ($EF $BB $BF for UTF-8), so just remove them using scripts or configure the editor so it's not added.

From Removing BOM from UTF-8:

#!/usr/bin/perl
@file=<>;
$file[0] =~ s/^\xEF\xBB\xBF//;
print(@file);

I am sure it translates to PHP easily.
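A possible PHP translation, as a sketch only: `strip_utf8_bom()` and `merge_css()` are names invented here, and the merge step is a guess at what the asker's script does, not their actual code.

```php
<?php
// Hypothetical helper: drop a leading UTF-8 BOM (bytes EF BB BF) if present.
function strip_utf8_bom(string $text): string
{
    // No /u modifier: the pattern matches raw bytes, like the Perl version.
    return (string) preg_replace('/^\xEF\xBB\xBF/', '', $text);
}

// Hypothetical helper: concatenate CSS files, stripping each file's BOM first.
function merge_css(array $paths): string
{
    $merged = '';
    foreach ($paths as $path) {
        $css = file_get_contents($path);
        if ($css === false) {
            throw new RuntimeException("Cannot read $path");
        }
        $merged .= strip_utf8_bom($css) . "\n";
    }
    return $merged;
}
```

Stripping per file, before concatenating, matters here: once the files are merged, a BOM from the second file onwards is no longer at the start of the string and the `^` anchor would miss it.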

eed3si9n
Note that the BOM is not a sequence of characters, it is a single character. If the file is in UTF-8, then the character is represented in three *bytes*. If the file is in UTF-8, then viewing it in another encoding (i.e., one in which EF BB BF appears where the BOM should be) is an error. To remove the BOM from a UTF-8 file, one should remove the (single) character U+FEFF. Yeah, pedantry!
Jeffrey L Whitledge
I couldn't get that working in PHP (that's just my incompetence, not yours :P), so I did a check to see if the BOM is there and removed the first 3 bytes. Here's the code, if anyone needs it: if (substr($css, 0, 3) == pack("CCC", 0xef, 0xbb, 0xbf)) { $css = substr($css, 3); }
Matt