ansaurus

Question

regex (in PHP) to match & that aren't HTML entities

Answer 1

+6 A:

You could always run html_entity_decode before you run htmlentities? Works unless you only want to do ampersands (and even then you can play with the charset parameters).

Much easier and faster than a regex.

Ross 2008-11-21 23:52:18

I probably should have explained the overall goal. I have strings that are UTF-8 and *may* have HTML entities. I want to use them in the description element of a UTF-8 RSS feed. As far as I can tell, RSS/XML accepts the hex HTML entities but few or none of the alphanumeric ones.

Doug Kaye 2008-11-22 00:10:22

Use Atom instead of RSS - it doesn't have problems with entities :)

porneL 2008-11-22 23:36:17

Answer 2

+2 A:

Ross led me to a good answer. Here's the code that seems to work fairly well. So far. :-) The goal, again, is to convert HTML to XML, specifically descriptions for RSS feeds. In the brief testing I've done so far (with some fairly quirky data) I've been able to take strings wrapped in CDATA and unwrap them. Passes validation tests. Thanks, Ross.

//decode all entities
$string=html_entity_decode($string,ENT_COMPAT,'UTF-8');

//entity-encode only &<> and double quotes
$string=htmlspecialchars($string,ENT_COMPAT,'UTF-8');
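As a rough illustration of what the two calls do together, here is a minimal sketch wrapped in a function. The function name html_to_xml_text and the sample string are made up for this example and are not part of the code above.

```php
<?php
// Hypothetical wrapper around the two calls above; the name is an assumption.
function html_to_xml_text($string)
{
    // Turn every HTML entity (named or numeric) into its UTF-8 character.
    $string = html_entity_decode($string, ENT_COMPAT, 'UTF-8');

    // Re-encode only the characters XML treats specially: & < > and ".
    return htmlspecialchars($string, ENT_COMPAT, 'UTF-8');
}

// Sample input mixing entities, bare ampersands and raw UTF-8.
echo html_to_xml_text('Caf&eacute; &amp; bar & grill &lt;b&gt; &#8211; été'), "\n";
// prints: Café &amp; bar &amp; grill &lt;b&gt; – été
```

Named entities such as &eacute; come out as plain UTF-8 characters, so the only entities left in the result are the ones htmlspecialchars produces.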

Doug Kaye 2008-11-22 00:32:57

Answer 3

+1 A:

The others are good suggestions, and might be a better way to do it. But I thought I'd try to answer the question as asked--if only to provide a regex example.

The following is the special exploded (commented) form allowed in some engines; in PHP's PCRE functions it is enabled with the x modifier. Of course, the odd thing is that an engine which allows commented regexes also allows other simplified expressions, though they are not as generic. I'll put those simplified expressions in parens in the comments.

&                      # an ampersand
( \#                   # a '#' character
  [1-9]                # followed by a non-zero digit,
  [0-9]{1,3}           # then 1 to 3 more, 2 to 4 in all  (\d{1,3} or \p{IsDigit}{1,3})
| [A-Za-z]             # OR a letter                      (\p{IsAlpha})
  [0-9A-Za-z]+         # followed by letters or numbers   (\p{IsAlnum}+)
)
;                      # all capped with a ';'

You could even throw a bunch of expected entities in there as well, to help out the regex scanner.

&                      # an ampersand
( amp | apos | gt | lt | nbsp | quot                 
                       # standard entities
| bull | hellip | [lr][ds]quo | [mn]dash | permil          
                       # some fancier ones
| \#                   # a '#' character
  [1-9]                # followed by a non-zero digit,
  [0-9]{1,3}           # then 1 to 3 more, 2 to 4 in all
|  [A-Za-z]            # OR a letter
  [0-9A-Za-z]+         # followed by letters or numbers
)
;                      # all capped with a ';'
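Both patterns above match entities, whereas the question asks for ampersands that are not entities. One way to bridge that gap, sketched here under assumptions, is to put the first pattern inside a negative lookahead and run it through preg_replace with the x modifier. The function name escape_bare_ampersands, the ~ delimiter, and the lookahead wrapping are choices made for this example and do not come from the answer itself.

```php
<?php
// Hypothetical helper: escape only the ampersands that do not start an entity.
function escape_bare_ampersands($text)
{
    // Nowdoc keeps backslashes literal; the x modifier allows the comments.
    $pattern = <<<'REGEX'
~
&                            # an ampersand
(?!                          # that is NOT followed by
  (?: \# [1-9] [0-9]{1,3}    #   a decimal reference of 2 to 4 digits
    | [A-Za-z] [0-9A-Za-z]+  #   or a name: a letter, then letters or digits
  )
  ;                          #   and a closing semicolon
)
~x
REGEX;

    return preg_replace($pattern, '&amp;', $text);
}

echo escape_bare_ampersands('AT&T &amp; &#160; &x'), "\n";
// prints: AT&amp;T &amp; &#160; &amp;x
```

Because the numeric branch is copied as written, a single-digit reference such as &#9; is treated as a bare ampersand and gets escaped.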

Axeman 2008-11-22 23:15:01

You can also have the hexadecimal type (such as &#xA0;), so you need to add one more branch, but good point in actually stating a regexp.

subtenante 2009-11-04 08:26:48

subtenante 2009-11-04 08:41:27

Answer 4

+1 A:

PHP's htmlentities() has a double_encode argument for this.
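A minimal sketch of that argument in use; the sample string is invented for illustration. double_encode is the fourth parameter (available since PHP 5.2.3, and also accepted by htmlspecialchars), and passing false tells PHP to leave existing entities alone.

```php
<?php
// Sample text with one bare ampersand and two existing entities.
$txt = 'AT&T &amp; caf&eacute;';

// Fourth argument is double_encode; false leaves existing entities alone.
echo htmlentities($txt, ENT_COMPAT, 'UTF-8', false), "\n";
// prints: AT&amp;T &amp; caf&eacute;

// With the default (true), the existing entities get encoded again.
echo htmlentities($txt, ENT_COMPAT, 'UTF-8'), "\n";
// prints: AT&amp;T &amp;amp; caf&amp;eacute;
```

Note that this keeps named entities such as &eacute; in the output, which may not suit an XML target like the RSS feed discussed above.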

If you want to do things like that in regular expressions, then negative assertions come in useful:

preg_replace('/&(?![a-z#]+;)/i','&amp;',$txt);
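One caveat with this pattern: the character class [a-z#] contains no digits, so a numeric reference such as &#160; is not recognised and its ampersand gets encoded again. A possible tightening, offered as a sketch rather than as the definitive pattern, spells out the named, decimal and hexadecimal forms. The function name encode_stray_ampersands is hypothetical.

```php
<?php
// Hypothetical helper; the pattern is one possible tightening, not the original.
function encode_stray_ampersands($txt)
{
    // Skip "&" only when it begins a name (&amp;), a decimal reference
    // (&#160;) or a hexadecimal reference (&#xA0;) that ends in ";".
    $pattern = '/&(?!(?:[a-z][a-z0-9]*|#[0-9]+|#x[0-9a-f]+);)/i';

    return preg_replace($pattern, '&amp;', $txt);
}

echo encode_stray_ampersands('R&D &amp; &#160; &#xA0; &foo'), "\n";
// prints: R&amp;D &amp; &#160; &#xA0; &amp;foo
```

The lookahead only checks syntax, so a made-up name like &foo; would still be left untouched.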

porneL 2008-11-22 23:34:16

+1 ! But to be thorough, the regexp should be : )

subtenante 2009-11-04 10:50:22

Answer 5

A:

WhoIsRich 2010-03-26 19:51:27
