I am writing a function to dynamically generate my sitemap and sitemap index.

According to the docs on sitemaps.org, the file should be encoded in UTF-8.

My function for writing the file is a rather simplistic one, something along the lines of:

function generateFile()
{
  $xml = create_xml();
  $fp = @fopen('sitemap', 'w');
  fwrite($fp, $xml);
  fclose($fp);
}

[Edit - added after comments]

The create_xml() is simplistic, like so:

function create_xml()
{
return '<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9
                http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd">
    <url>
        <loc>http://example.com/</loc>
        <lastmod>2006-11-18</lastmod>
        <changefreq>daily</changefreq>
        <priority>0.8</priority>
    </url>
</urlset>';
}

Is there anything in particular I need to do to ensure that the file is encoded in UTF-8?

Additionally, I would like to gzip the file rather than leave it uncompressed. I know how to compress the file AFTER I have saved it to disk. Is it possible to compress it BEFORE writing it to disk, and if so, how?

A: 

Yes, you need to make sure your content (the output of create_xml()) is encoded as UTF-8. To ensure this, you can use utf8_encode(). You also need to make sure the XML file declares <?xml version="1.0" encoding="UTF-8"?>. And I'd suggest opening the file with fopen in 'wb' mode, the b meaning binary. This ensures the data gets written exactly as-is.
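
A minimal sketch of the points above, which also covers the gzip part of the question. The function name write_sitemap() and its parameters are hypothetical, and it assumes the mbstring and zlib extensions are available. It checks that the string is valid UTF-8, optionally compresses it in memory with gzencode(), and writes the bytes in 'wb' mode:

```php
<?php
// Hypothetical helper (not from the question): validate, optionally gzip, then write.
function write_sitemap(string $xml, string $path, bool $gzip = false): void
{
    // Refuse to write anything that is not valid UTF-8 (needs mbstring).
    if (!mb_check_encoding($xml, 'UTF-8')) {
        throw new InvalidArgumentException('Sitemap XML is not valid UTF-8');
    }

    // gzencode() (zlib extension) returns gzip-format data built in memory,
    // so nothing uncompressed has to touch the disk first.
    $data = $gzip ? gzencode($xml, 9) : $xml;
    if ($data === false) {
        throw new RuntimeException('gzip compression failed');
    }

    // 'wb' writes the bytes unchanged; this matters for compressed data.
    $fp = fopen($path, 'wb');
    if ($fp === false) {
        throw new RuntimeException("Cannot open $path for writing");
    }
    fwrite($fp, $data);
    fclose($fp);
}
```

It could be called as write_sitemap(create_xml(), 'sitemap.xml.gz', true); the file name here is only an example.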

igorw
Keep in mind that `utf8_encode()` doesn't magically make strings UTF-8. It converts from ISO-8859-1 to UTF-8. If used on input in any other encoding (including text that is already UTF-8), the output is still well-formed UTF-8 but contains the wrong characters.
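
To illustrate, here is a small sketch that converts from a known source encoding and then shows what happens when the same conversion is applied twice. It uses mb_convert_encoding() from the mbstring extension (assumed to be installed) with an explicit source encoding; the sample string is made up for the example:

```php
<?php
// "é" in ISO-8859-1 is the single byte 0xE9.
$latin1 = "caf\xE9";

// Convert from a *known* source encoding to UTF-8.
$utf8 = mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');
var_dump(mb_check_encoding($latin1, 'UTF-8')); // bool(false)
var_dump(mb_check_encoding($utf8, 'UTF-8'));   // bool(true)

// Converting a second time treats the UTF-8 bytes as ISO-8859-1
// and double-encodes them, which shows up as garbled text.
$twice = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');
var_dump($twice === $utf8);                    // bool(false)
```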
porneL
A: 

Your PHP script files should be saved as UTF-8.

Also, it's hard to say more without seeing what create_xml() does.

Pete
A: 

If you are using only ASCII characters, your file will always be valid UTF-8, because ASCII is a subset of UTF-8.
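
A quick way to check this, sketched with a hypothetical helper is_ascii() that just looks for any byte outside the 7-bit range:

```php
<?php
// Hypothetical helper: true when the string contains only 7-bit ASCII bytes.
function is_ascii(string $s): bool
{
    return preg_match('/[^\x00-\x7F]/', $s) === 0;
}

var_dump(is_ascii('<loc>http://example.com/</loc>')); // bool(true)
var_dump(is_ascii("caf\xC3\xA9"));                    // bool(false)
```

If is_ascii() returns true for the whole document, no conversion is needed; otherwise the encoding of the non-ASCII parts has to be known.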

Crack