I need a way to transform numeric HTML entities into their plain-text character equivalent. For example, I would like to turn the entity:

&#233;

into the character:

é

After some googling I found a function called HtmlUnEditFormat, but it only transforms named entities. Is there a way to decode numeric entities in ColdFusion?

+1  A: 

It should be quite easy to code one up yourself. Just edit the HtmlUnEditFormat() function you found and append the numeric entities and their characters to the end of the lEntities and lEntitiesChars lists.

Henry
I think there are over 100,000 possible numeric entities. I could definitely add a few hundred and cover the most-used ones, but I was hoping for something that would cover everything.
pb
If you were going down this route (don't; see my answer) you could do something like `Chr(rereplace(Arguments.Entity,'\D',''))` - after determining if it was a decimal entity. Hex entities would be similar but would need to convert the hex to decimal to use the Chr function.
Peter Boughton
(Oh, and there should be an 'all' argument at the end of the above rereplace call)
Peter Boughton
Actually, I would not be surprised if this approach turned out to be way faster than XML-parsing a single character with each function call. +1 from me.
Tomalak
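The approach described in these comments (pick out the digits, convert hex to decimal where needed, then turn the number into a character) can be sketched without any XML parsing. Below is a minimal version in Java, chosen because CF runs on the JVM. `NumericEntityDecoder` and `decode` are hypothetical names made up for this illustration, not part of ColdFusion or any library. It handles numeric entities only: named entities such as `&amp;` and numbers outside the valid code point range are left exactly as they were.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of ColdFusion or any library.
class NumericEntityDecoder {
    // Group 1 captures a decimal entity (&#233;), group 2 a hex entity (&#xE9;).
    private static final Pattern NUMERIC_ENTITY =
        Pattern.compile("&#(?:(\\d+)|[xX]([0-9a-fA-F]+));");

    static String decode(String input) {
        Matcher m = NUMERIC_ENTITY.matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String replacement = m.group(); // default: keep the entity text unchanged
            try {
                int codePoint = (m.group(1) != null)
                    ? Integer.parseInt(m.group(1))       // decimal form
                    : Integer.parseInt(m.group(2), 16);  // hex form, converted to decimal
                if (Character.isValidCodePoint(codePoint)) {
                    replacement = new String(Character.toChars(codePoint));
                }
            } catch (NumberFormatException e) {
                // Number does not fit in an int: leave the entity as it was.
            }
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(decode("caf&#233; and caf&#xE9; stay readable, &amp; is left alone"));
    }
}
```

`Character.toChars` is used instead of a single `char` cast so that code points above 65535 come out as a proper surrogate pair.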
+8  A: 

That linked function is icky - there's no need to name them explicitly, and as you say it doesn't do numerics.

Much simpler is to let CF do the work for you - using the XmlParse function:

<cffunction name="decodeHtmlEntity" returntype="String" output="false">
    <cfargument name="Entity" type="String" hint="&##<number>; or &<name>;" />
    <cfreturn XmlParse('<xml>#Arguments.Entity#</xml>').XmlRoot.XmlText />
</cffunction>

That one works with Railo. I can't remember whether CF supports that syntax yet, though, so you might need to change it to:

<cffunction name="decodeHtmlEntity" returntype="String" output="false">
    <cfargument name="Entity" type="String" hint="&##<number>; or &<name>;" />
    <cfset var XmlDoc = XmlParse('<xml>#Arguments.Entity#</xml>') />
    <cfreturn XmlDoc.XmlRoot.XmlText />
</cffunction>
Peter Boughton
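For anyone wondering why the XmlParse trick works: an XML parser has to resolve character references while it parses, so wrapping the entity in a dummy element and reading the text back does the decoding. Here is a rough equivalent using only the JDK's DOM parser, as a sketch. The class name `XmlEntityDecoder` is made up for illustration, and this is not necessarily the parser CF uses internally. It also makes one caveat visible: XML predefines only five named entities (`amp`, `lt`, `gt`, `quot`, `apos`), so with this parser an HTML name such as `&eacute;` is a parse error unless a DTD declares it.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration; mirrors XmlParse('<xml>...</xml>').XmlRoot.XmlText.
class XmlEntityDecoder {
    static String decode(String entity) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // Wrap the input in a dummy root element so it forms a complete document.
            Document doc = builder.parse(
                new InputSource(new StringReader("<xml>" + entity + "</xml>")));
            // The parser has already resolved the references; read the text back.
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            // Undeclared named entities and loose ampersands end up here.
            throw new IllegalArgumentException("Not well-formed XML: " + entity, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(decode("&#233;"));
    }
}
```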
this is smart! cool!
Henry
+1 Very nice! (though a bit resource-heavy)
Tomalak
+1 Nice out-of-the-box thinking for a built-in function.
Al Everett
For a more lightweight option, we should in theory be able to dip into the Java classes that are used to implement XmlParse and find the specific entity decoding/resolving method to use - but I've just been looking through the apidocs and not been able to find anything.
Peter Boughton
This is excellent and works well, many thanks! For fun I've been digging into a Java solution to this and found that the Apache Commons string escaping utilities contain an *unescapeHtml* function. Docs here: http://tinyurl.com/3n9pem That might do the same thing with less overhead, but it requires installing a new Java class on the server, restarting, etc., so I haven't tried it yet. For now, this works perfectly. Thanks again!
pb
A: 

I found this question while working with a method that, by the black-box principle, can't assume whether an incoming string is already HTML entity encoded or not.

I've adapted Peter Boughton's function so that it can be used safely on strings that haven't already been entity encoded. (The only time this seems to matter is when loose ampersands - e.g. "Cats & Dogs" - are present in the target string.) This modified version will also fail somewhat gracefully on any unforeseen XML parse error.

<cffunction name="decodeHtmlEntity" returntype="string" output="false">
    <cfargument name="str" type="string" hint="&##<number>; or &<name>;" />
    <cfset var XML = '<xml>#arguments.str#</xml>' />
    <cfset var XMLDoc = '' />

    <!--- ampersands that aren't pre-encoded as entities cause errors;
          numeric entities may be decimal with any number of digits, or hex --->
    <cfset XML = REReplace(XML, '&(?!(\##\d+|\##x[0-9a-fA-F]+|\w+);)', '&amp;', 'all') />

    <cftry>
        <cfset XMLDoc = XmlParse(XML) />
        <cfreturn XMLDoc.XMLRoot.XMLText />
        <cfcatch>
            <cfreturn arguments.str />
        </cfcatch>
    </cftry>
</cffunction>

This would support the following use case safely:

<cffunction name="notifySomeoneWhoCares" access="private" returntype="void">
    <cfargument name="str" type="string" required="true"
        hint="String of unknown preprocessing" />
    <cfmail from="[email protected]" to="[email protected]"
        subject="Comments from Web User" format="html">
        Some Web User Spoke Thus:<br />
        <cfoutput>#HTMLEditFormat(decodeHTMLEntity(arguments.str))#</cfoutput>
    </cfmail>
</cffunction>

This function is now incredibly useful for ensuring web-submitted content is entity-safe (think XSS) before it's sent out by email or submitted into a database table.
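The ampersand-escaping step in decodeHtmlEntity is the part most worth testing on its own. Here is a small Java sketch of the same negative-lookahead idea, with a made-up class name `LooseAmpersandEscaper`. This sketch accepts decimal references of any length as well as hexadecimal ones. An ampersand is escaped only when it is not followed by something shaped like an entity.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of ColdFusion or any library.
class LooseAmpersandEscaper {
    // Matches an ampersand NOT followed by "#digits;", "#xhex;" or "name;".
    private static final Pattern LOOSE_AMPERSAND =
        Pattern.compile("&(?!(?:#\\d+|#[xX][0-9a-fA-F]+|\\w+);)");

    static String escape(String input) {
        return LOOSE_AMPERSAND.matcher(input).replaceAll("&amp;");
    }

    public static void main(String[] args) {
        System.out.println(escape("Cats & Dogs"));
        System.out.println(escape("caf&#233; &amp; it&#8217;s"));
    }
}
```

One limitation of this kind of pattern: text that merely looks like an entity, such as the `&D;` in "R&D; done", is left alone and will still trip up a strict XML parser, which is why the try/catch fallback in the function above is still useful.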

Hope this helps.

Eric Kolb