Here's the goal: to replace all standalone ampersands with &amp;amp; but NOT replace those that are already part of an HTML entity such as &amp;nbsp;.

I think I need a regular expression for PHP (preferably for preg_ functions) that will match only standalone ampersands. I just don't know how to do that with preg_replace.

+6  A: 

You could always run html_entity_decode before you run htmlentities? Works unless you only want to do ampersands (and even then you can play with the charset parameters).

Much easier and faster than a regex.

Ross
I probably should have explained the overall goal. I have strings that are UTF-8 and *may* have HTML entities. I want to use them in the description element of a UTF-8 RSS feed. As far as I can tell, RSS/XML accepts the hex HTML entities but few or none of the alphanumeric ones.
Doug Kaye
Use Atom instead of RSS - it doesn't have problems with entities :)
porneL
+2  A: 

Ross led me to a good answer. Here's the code that seems to work fairly well. So far. :-) The goal, again, is to convert HTML to XML, specifically descriptions for RSS feeds. In the brief testing I've done so far (with some fairly quirky data) I've been able to take strings that were wrapped in CDATA and unwrap them. Passes validation tests. Thanks, Ross.

//decode all entities
$string=html_entity_decode($string,ENT_COMPAT,'UTF-8');

//entity-encode only &<> and double quotes
$string=htmlspecialchars($string,ENT_COMPAT,'UTF-8');
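
As a quick check of the two calls above, here is a minimal sketch that wraps them in a function and runs them on a mixed string. The function name html_to_xml_text and the sample input are made up for illustration; only html_entity_decode and htmlspecialchars come from the code above.

```php
<?php
// Hypothetical wrapper around the decode-then-encode steps shown above.
function html_to_xml_text(string $string): string
{
    // Decode all entities (named and numeric) into UTF-8 characters.
    $string = html_entity_decode($string, ENT_COMPAT, 'UTF-8');

    // Re-encode only & < > and double quotes.
    return htmlspecialchars($string, ENT_COMPAT, 'UTF-8');
}

// Sample input: an existing &amp;, two named entities, and a bare ampersand.
$in = 'Fish &amp; chips&nbsp;&eacute; R&D';

// &nbsp; and &eacute; become the literal characters U+00A0 and U+00E9;
// both ampersands end up as a single &amp;.
var_dump(html_to_xml_text($in) === "Fish &amp; chips\u{00A0}\u{00E9} R&amp;D"); // bool(true)
```

Note that named entities do not survive as entities here: they come out as literal UTF-8 characters, which is what a UTF-8 feed needs.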
Doug Kaye
+1  A: 

The others are good suggestions, and might be a better way to do it. But I thought I'd try to answer the question as asked, if only to provide a regex example.

The following is the special exploded (commented) form allowed in some engines. Of course, the odd thing is that an engine which allows commented regexes also allows other simplified expressions, though these are not as generic. I'll put those simplified expressions in parens in the comments.

&                      # an ampersand
( \#                   # a '#' character
  [1-9]                # followed by a non-zero digit, 
  [0-9]{1,3}           # with 2 to 4 digits in total      (\d{1,3} or \p{IsDigit}{1,3})
| [A-Za-z]             # OR a letter                      (\p{IsAlpha})
  [0-9A-Za-z]+         # followed by letters or numbers   (\p{IsAlnum}+)
)
;                      # all capped with a ';'

You could even throw a bunch of expected entities in there as well, to help out the regex scanner.

&                      # an ampersand
( amp | apos | gt | lt | nbsp | quot                 
                       # standard entities
| bull | hellip | [lr][ds]quo | [mn]dash | permil          
                       # some fancier ones
| \#                   # a '#' character
  [1-9]                # followed by a non-zero digit, 
  [0-9]{1,3}           # with 2 to 4 digits in total
| [A-Za-z]             # OR a letter
  [0-9A-Za-z]+         # followed by letters or numbers
)
;                      # all capped with a ';'
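
For PHP specifically, the exploded form can be used with the preg_ functions by adding the x modifier. The following is only a sketch: the helper name find_entities is made up, and the pattern is the first commented regex above with a non-capturing group and with double quotes in the comments (single quotes would end the PHP string).

```php
<?php
// Hypothetical helper: return every entity-like token found in $text.
function find_entities(string $text): array
{
    $pattern = '/
        &                      # an ampersand
        (?: \#                 # a "#" character
            [1-9]              # followed by a non-zero digit
            [0-9]{1,3}         # and one to three more digits
          | [A-Za-z]           # OR a letter
            [0-9A-Za-z]+       # followed by letters or digits
        )
        ;                      # all capped with a ";"
    /x';

    preg_match_all($pattern, $text, $matches);

    return $matches[0];
}

print_r(find_entities('a &amp; b &nbsp; &#160; & c'));
// matches, in order: &amp;  &nbsp;  &#160;
```

As written, the pattern only accepts decimal references of two to four digits and has no hexadecimal branch, so something like &#9; or &#xA0; would not be reported.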
Axeman
You can also have the hexadecimal '&amp;#x...;' type, so you need to add one more branch, but good point in actually stating a regexp.
subtenante
+1  A: 

PHP's htmlentities() has a double_encode argument for this.
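
A minimal sketch of that argument in use, shown with htmlspecialchars (which takes the same fourth parameter) so that only the special characters are touched. The function name encode_once and the sample string are made up for illustration.

```php
<?php
// Hypothetical wrapper: encode special characters but leave existing entities alone.
function encode_once(string $text): string
{
    // Fourth argument is double_encode; false means existing entities are kept as-is.
    return htmlspecialchars($text, ENT_COMPAT, 'UTF-8', false);
}

echo encode_once('Fish & chips &nbsp; &#160; &amp; R&D'), "\n";
// Fish &amp; chips &nbsp; &#160; &amp; R&amp;D
```

Named entities such as &nbsp; stay in the output, so this fits HTML output better than an XML feed that does not define them.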

If you want to do things like that in regular expressions, then negative lookahead assertions come in useful:

preg_replace('/&(?![a-z#]+;)/i','&amp;',$txt);
porneL
+1 ! But to be thorough, the regexp should be : )
subtenante
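
The character class [a-z#] in the preg_replace call above contains no digits, so a numeric reference such as &#160; is not recognised and its ampersand gets escaped again. One possible stricter lookahead, offered only as a sketch (the function name escape_bare_ampersands is made up, and this is not necessarily the pattern the comment above had in mind), spells out the decimal, hexadecimal and named forms:

```php
<?php
// Hypothetical helper: escape only ampersands that do not start an entity.
function escape_bare_ampersands(string $txt): string
{
    // Negative lookahead: skip "&" when followed by
    //   #digits;      decimal reference
    //   #xhexdigits;  hexadecimal reference
    //   name;         named entity (letter, then letters or digits)
    $pattern = '/&(?!(?:#[0-9]+|#x[0-9a-f]+|[a-z][a-z0-9]*);)/i';

    // preg_replace returns null on error; fall back to the input in that case.
    return preg_replace($pattern, '&amp;', $txt) ?? $txt;
}

echo escape_bare_ampersands('Q&A &nbsp; &#160; &#xA0; &amp; & done'), "\n";
// Q&amp;A &nbsp; &#160; &#xA0; &amp; &amp; done
```

This checks only the shape of an entity, not whether the name is one a given parser actually knows.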
A: 
WhoIsRich