I need a regex (to work in PHP) to replace American English words in HTML with British English words. So color would be replaced by colour, meters by metres and so on [I know that meters is also a British English word, but for the copy we'll be using it will always be referring to units of distance rather than measuring devices]. The pattern would need to work accurately in the following (slightly contrived) examples (although as I have no control over the actual input these could exist):

<span style="color:red">This is the color red</span>

[should not replace color in the HTML tag but should replace it in the sentence]

<p>Color: red</p>

[should replace word]

<p>Tony Brammeter lives 2000 meters from his sister</p>

[should replace meters for the word but not in the name]

I know there are edge cases where replacement wouldn't be useful (if his name was Tony Meter for example), but these are rare enough that we can deal with them when they come up.

A: 

You don't need to use a regex explicitly. You can try the str_replace function, or if you need it to be case insensitive use the str_ireplace function.

Example:

$str = "<p>Color: red</p>";
$new_str = str_ireplace('color', 'colour', $str);

You can pass an array with all the words that you want to search for, instead of the string.

rogeriopvl
Except that I'm fairly sure that would fail #1 and #3 of his examples; the latter would need word-boundary checking (`\bword\b` in PCRE-based regex), and the former at least some primitive tag-checking.
Twisol
+4  A: 

I think you'd rather need a dictionary and maybe even some grammatical analysis in order to get this working correctly, since you don't have control over the input. A pure regex solution is not really going to be able to process this kind of data correctly.

So I'd suggest first coming up with a list of the words that need to be replaced; there are many more than just "color" and "meter". Wikipedia has some information on the topic.
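
As a rough illustration of the word-list idea, here is a minimal sketch. The $dictionary entries and the britishise() helper are made up for this example, and it only handles whole-word matches on plain text (no tag awareness, no grammatical analysis):

```php
<?php
// Hypothetical word list: American spelling => British spelling.
$dictionary = array(
    'color'  => 'colour',
    'meter'  => 'metre',
    'meters' => 'metres',
);

// Build one case-insensitive alternation that matches any listed word as a whole word.
$quoted = array();
foreach (array_keys($dictionary) as $word) {
    $quoted[] = preg_quote($word, '/');
}
$pattern = '/\b(?:' . implode('|', $quoted) . ')\b/i';

function britishise($text, $pattern, $dictionary) {
    return preg_replace_callback($pattern, function ($m) use ($dictionary) {
        $replacement = $dictionary[strtolower($m[0])];
        // Keep a leading capital, so "Color" becomes "Colour".
        return ucfirst($m[0]) === $m[0] ? ucfirst($replacement) : $replacement;
    }, $text);
}

echo britishise('Color: red, 2000 meters', $pattern, $dictionary), "\n";
```

A real list would be much longer, and words with more than one meaning (like "meter") would still need a human decision.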

Lucero
+1  A: 

You do not want a regular expression for this. Regular expressions are by their very nature stateless, and you need some measure of state to be able to tell the difference between 'in a html tag' and 'in data'.

You want to be using an HTML parser in combination with something like str_replace, or even better, a proper dictionary and grammatical analysis, as Lucero suggests.

Matthew Scharley
+1  A: 

The second problem is easier - you want to replace when there are word boundaries around the word: http://www.regular-expressions.info/wordboundaries.html -- this will make sure you don't replace the meter in Brammeter.
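
To make the word-boundary point concrete, here is a small sketch using the third example sentence from the question. It only contrasts a plain substring replace with a `\b`-anchored pattern and does nothing about tags:

```php
<?php
$text = 'Tony Brammeter lives 2000 meters from his sister';

// Plain substring replacement also rewrites the "meter" inside the surname.
$naive = str_replace('meter', 'metre', $text);

// \b asserts a word boundary, so only the standalone word "meters" matches.
$bounded = preg_replace('/\bmeters\b/', 'metres', $text);

echo $naive, "\n";
echo $bounded, "\n";
```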

The first problem is much harder. You don't want to replace words inside HTML tags, that is, anything between < and > characters. So, your match must make sure that you last saw > or nothing, but never just <. This is either hard, and requires some combination of lookahead/lookbehind assertions, or just plain impossible with regular expressions.

A script implementing a state machine would work much better here.
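
A minimal sketch of such a state machine might look like the following. The function name replaceOutsideTags() is invented for this example, and it assumes well-behaved markup: a literal > inside an attribute value, comments, and the contents of script or style elements are not handled.

```php
<?php
function replaceOutsideTags($html, array $rules) {
    $patterns = array_keys($rules);
    $replacements = array_values($rules);
    $out = '';       // finished output
    $text = '';      // text collected since the last tag
    $inTag = false;  // the single bit of state: are we between < and >?
    $length = strlen($html);
    for ($i = 0; $i < $length; $i++) {
        $ch = $html[$i];
        if ($inTag) {
            // Inside a tag: copy characters through untouched.
            $out .= $ch;
            if ($ch === '>') {
                $inTag = false;
            }
        } elseif ($ch === '<') {
            // A tag starts: run the rules over the text collected so far.
            $out .= preg_replace($patterns, $replacements, $text) . $ch;
            $text = '';
            $inTag = true;
        } else {
            $text .= $ch;
        }
    }
    // Flush any text after the last tag.
    return $out . preg_replace($patterns, $replacements, $text);
}

$rules = array('/\bcolor\b/i' => 'colour');
echo replaceOutsideTags('<span style="color:red">This is the color red</span>', $rules), "\n";
```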

Igor
+4  A: 

HTML/XML should not be processed with regular expressions; it is really hard to write one that copes with every kind of markup. But you can use the built-in DOM extension and process your string recursively:

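# Warning: untested code!
function process($node, $replaceRules) {
    // Copy the live node list first, because replaceChild() modifies it.
    foreach (iterator_to_array($node->childNodes) as $childNode) {
        if ($childNode instanceof DOMText) {
            $text = preg_replace(
                array_keys($replaceRules),
                array_values($replaceRules),
                $childNode->data
            );
            // replaceChild() takes the new node first, then the old one.
            $node->replaceChild(new DOMText($text), $childNode);
        } elseif ($childNode->hasChildNodes()
            && !in_array($childNode->nodeName, array('script', 'style'))) {
            // Recurse into elements, but leave embedded CSS and JavaScript alone.
            process($childNode, $replaceRules);
        }
    }
}
$replaceRules = array(
    '/\bcolor\b/i' => 'colour',
    // (s?) also catches the plural "meters" from the question.
    '/\bmeter(s?)\b/i' => 'metre$1',
);
$doc = new DOMDocument();
$doc->loadHTML($htmlString);
process($doc, $replaceRules);
$htmlString = $doc->saveHTML();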
soulmerge
Cool. This seems to have worked well. I had to make some changes to the code to get it to work (DOMTextNode didn't work for me, while DOMText did; swapping the arguments around in $node->replaceChild etc), but so far it looks to have worked nicely. The only slight issue is that I want to do this on strings, and using new DOMDocument turns the string into an HTML page with a doctype and wrapped in html and body tags. I can remove this using standard str_replace etc., but is there a better way that does not create these in the first place?
Apemantus
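
One possible way to avoid the added doctype and html/body wrappers is sketched below. It assumes PHP 5.4 or later built against a libxml version that supports the LIBXML_HTML_NOIMPLIED and LIBXML_HTML_NODEFDTD options; the function name roundTripFragment() and the temporary div wrapper are choices made for this example, not part of the answer above.

```php
<?php
function roundTripFragment($html) {
    $doc = new DOMDocument();
    // Collect parser warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    // Wrap the fragment in a single div so it has exactly one root element,
    // and ask libxml not to add html/body elements or a default doctype.
    $doc->loadHTML(
        '<div>' . $html . '</div>',
        LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD
    );
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Any processing of the tree would happen here, on $doc->documentElement.

    // Serialise only the children of the wrapper, dropping the div itself.
    $out = '';
    foreach ($doc->documentElement->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

echo roundTripFragment('<p>Color: red</p>'), "\n";
```

Note that loadHTML() assumes ISO-8859-1 unless told otherwise, so non-ASCII input would need its encoding declared or converted first.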