Hi, I have the content of different websites stored in a variable named $content. What I want to do is search the content for META tags like this:

<meta http-equiv="Content-type" content="text/html; charset=utf-8" />

And then replace utf-8 with ISO-8859-1. How do I do that with preg_replace?

Note that not every occurrence looks like that meta tag. It can differ depending on which website you fetch.

A: 

You don't need to use preg_replace to do that. Just use str_replace:

$content = str_replace('; charset=utf-8', '; charset=ISO-8859-1', $content);
Josh Leitzel
Read what I commented on karim79's post. :)
This looks like the easiest solution... But what happens if "; charset=utf-8" appears somewhere else in the page? For instance, what happens if this code is run on this very page, which contains the "; charset=utf-8" string several times in its content (in this very answer, for instance ^^)?
Pascal MARTIN
@Pascal MARTIN - then a more complete string can be used - see my answer ;)
karim79
A: 

Hi,

What about something like this :

$input = 'sometext<meta http-equiv="Content-type" content="text/html; charset=utf-8" />someothertext';

$output = preg_replace('#<meta http-equiv="Content-type" content="text/html; charset=(utf-8)" />#',
    '<meta http-equiv="Content-type" content="text/html; charset=ISO-8859-1" />',
    $input);

var_dump($output);

Which simply replaces the first string with the second one, giving you:

string 'sometext<meta http-equiv="Content-type" content="text/html; charset=ISO-8859-1" />someothertext' (length=95)

Of course, this is considering the input meta is always the same, always written the same way, with attributes in the same order and all that.

A slightly more forgiving regex, which tolerates varying whitespace, might be:

$output = preg_replace('#<meta\s+http-equiv="Content-type"\s+content="text/html;\s*charset=(utf-8)"\s*/>#',
    '<meta http-equiv="Content-type" content="text/html; charset=ISO-8859-1" />',
    $input);

Of course, that is still not really forgiving ^^
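For what it's worth, a more tolerant pattern could look like the sketch below. It assumes the tag may use either quote style, any letter case and an optional closing slash, and it swaps only the charset value instead of rewriting the whole tag. The function name replaceMetaCharset is made up for this example, and attributes written in a different order are still not handled.

```php
<?php
// Hypothetical helper (not part of the answer above): replaces only the
// charset value inside a <meta http-equiv="Content-type" ...> tag.
function replaceMetaCharset(string $html, string $newCharset): string
{
    // Group 1 captures everything up to and including "charset=".
    // \2 forces the http-equiv value to close with the quote it opened with.
    // The trailing "i" flag makes the whole pattern case-insensitive.
    $pattern = '#(<meta\s+http-equiv=(["\'])Content-type\2\s+content=(["\'])[^"\'>]*?charset=)[\w-]+#i';

    $result = preg_replace_callback(
        $pattern,
        // A callback avoids any special handling of "$" or "\" in the new value.
        function (array $m) use ($newCharset) {
            return $m[1] . $newCharset;
        },
        $html
    );

    // preg_replace_callback() returns null on a regex error; fall back to the input.
    return $result ?? $html;
}

$input = "sometext<META HTTP-EQUIV='Content-Type' CONTENT='text/html;charset=UTF-8'>someothertext";
echo replaceMetaCharset($input, 'ISO-8859-1'), PHP_EOL;
```

Because only the value after charset= is replaced, the rest of the tag (quotes, spacing, closing slash) is left exactly as the site wrote it.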


But if you know the meta used as input will always be the same, you don't need a regex; str_replace will do the job just fine, I suppose...

Something like this:

$output = str_replace('<meta http-equiv="Content-type" content="text/html; charset=utf-8" />',
    '<meta http-equiv="Content-type" content="text/html; charset=ISO-8859-1" />',
    $input);
var_dump($output);

Which gets you the same output:

string 'sometext<meta http-equiv="Content-type" content="text/html; charset=ISO-8859-1" />someothertext' (length=95)



EDIT after the comments and the edit to the question
*(Yeah, I've seen another answer, based on str_replace, has been accepted... still, maybe this will be useful)*

If you really want to manipulate HTML that is not "fixed", over which you have no control, it might be better to not use regex at all, but some tool made exactly for that.

For instance, the bundled DOMDocument class and its DOMDocument::loadHTML method can probably help, maybe coupled with some XPath queries, even if it kinda feels like heavy artillery ^^

For more information, you can take a look at this answer I gave to another question a few days ago...

And, in your case, something like this would probably do :

$input = <<<HTML
<html>
<head>
    <meta http-equiv="Content-type" content="text/html; charset=utf-8" />
    <title>Test</title>
</head>
<body>
    <p>Hello, world!</p>
</body>
</html>
HTML;

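// Parse the HTML into a DOM tree
$dom = new DOMDocument();
$dom->loadHTML($input);

// Locate the content-type meta element with an XPath query
$xpath = new DOMXPath($dom);
$metas = $xpath->query('//meta[@http-equiv="Content-type"]');

if ($metas->length > 0) {
    $meta = $metas->item(0);
    $attribute = $meta->getAttribute('content');
    // Only touch the tag if it declares an HTML document
    if (strpos($attribute, 'text/html') === 0) {
        $meta->setAttribute('content', 'text/html; charset=ISO-8859-1');
    }
}

// Serialize the modified tree back to HTML
echo $dom->saveHTML();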

The most interesting parts are :

  • You are using a DOM parser, with standard DOM methods
  • You can do XPath queries to locate exactly the element you need


The resulting HTML will look like this :

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
<head>
<meta http-equiv="Content-type" content="text/html; charset=ISO-8859-1">
<title>Test</title>
</head>
<body>
    <p>Hello, world!</p>
</body>
</html>

Maybe a bit heavier, and requires more code... But, with that, it should always work (well, as long as the HTML used as input is not too messed up, I suppose).

And it will work for anything else in the document ;-)
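Since the pages come from different websites, the same idea can be made a bit more tolerant. The sketch below is only one possible variation: it loops over every meta element, compares the http-equiv value case-insensitively, and also handles the short `<meta charset="...">` form. The function name setMetaCharset is made up for this example, and silencing the parser warnings through libxml_use_internal_errors is a choice of this sketch, not something the question requires.

```php
<?php
// Hypothetical helper: rewrites the declared charset on every matching <meta> element.
function setMetaCharset(string $html, string $charset): string
{
    $dom = new DOMDocument();

    // Collect parser warnings about sloppy markup instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($dom->getElementsByTagName('meta') as $meta) {
        if (strcasecmp($meta->getAttribute('http-equiv'), 'Content-Type') === 0) {
            // Long form: <meta http-equiv="Content-Type" content="text/html; charset=...">
            $meta->setAttribute('content', 'text/html; charset=' . $charset);
        } elseif ($meta->hasAttribute('charset')) {
            // Short form: <meta charset="...">
            $meta->setAttribute('charset', $charset);
        }
    }

    return $dom->saveHTML();
}

$page = '<html><head><META HTTP-EQUIV="content-type" CONTENT="text/html; charset=utf-8">'
      . '<title>Test</title></head><body><p>Hello, world!</p></body></html>';
echo setMetaCharset($page, 'ISO-8859-1');
```

Like the code above, this only changes what the document declares, not the bytes of the content itself.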


Maybe it's a bit too much in your case, but, with some luck, you will remember this the day you have to parse some HTML, and won't end up fighting with/against any kind of mutant regex ^^


Oh, and, of course: changing the meta content-type will not change the real encoding of your content; you'll still have to do that yourself, if necessary (for instance, see iconv or utf8_decode; note that utf8_decode is deprecated as of PHP 8.2).

You might also need to change the HTTP Content-type header (not sure about how browsers deal with the meta if/when the HTTP header is set).
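As a rough illustration of those two extra steps, here is a minimal sketch. It assumes the fetched content really is valid UTF-8 and that the iconv extension is available; the function name toLatin1 is made up for this example. As far as I know, browsers give the charset in the HTTP header precedence over the meta tag, so it seems safest to keep the two in agreement.

```php
<?php
// Hypothetical helper: converts a UTF-8 string to ISO-8859-1 bytes.
function toLatin1(string $utf8): string
{
    // iconv() returns false if the conversion fails (on many systems this
    // happens when a character has no ISO-8859-1 equivalent). The @ hides
    // the accompanying notice so the failure can be handled here instead.
    $converted = @iconv('UTF-8', 'ISO-8859-1', $utf8);
    if ($converted === false) {
        throw new RuntimeException('Content could not be converted to ISO-8859-1');
    }

    return $converted;
}

// "\u{00E9}" is "é", stored as two bytes in UTF-8 and one byte in ISO-8859-1.
$content = toLatin1("caf\u{00E9}");

// In a web context, announce the same charset in the HTTP header.
// Headers can only be sent before any output, hence the guard.
if (!headers_sent()) {
    header('Content-Type: text/html; charset=ISO-8859-1');
}
```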

Pascal MARTIN
Thank you for your time, but you should read what I commented on the other posts, because not every occurrence looks like that...
OK about reading comments on other posts; at this exact moment, there is only one other answer, and the only comment it has is "Read what i commented on karim79's post. :)" ^^ So... ok :-D I suppose karim79 has deleted his answer ^^ If your question is not what you asked in the OP, we cannot guess what it is ;-) You should edit the OP to ask the "full" question; it will be way easier to help you this way :-)
Pascal MARTIN
I edited my first post. Read it now :)
I edited my answer to give a bit more information; it might be a bit "heavy", but might be useful one day or another ^^
Pascal MARTIN
A: 

You could just match 'charset=' followed by any value and replace that value, whatever it is, with "ISO-8859-1".

Something like this:

$content = preg_replace('/(charset=)[\w-]+/i', '${1}ISO-8859-1', $content);
Johan