I have an Erlang string which may contain characters like &, ", < and so on:

1> Unenc = "string & \"stuff\" <".
"string & \"stuff\" <"

Is there an Erlang function somewhere that parses the string and encodes all the needed HTML/XML entities, such as:

2> Enc = xmlencode(Unenc).
"string &amp; &quot;stuff&quot; &lt;".

?

My use case is for relatively short strings, which come from user input. The output strings of the xmlencode function will be the content of XML attributes:

<company name="Acme &amp; C." currency="&euro;" />

The final XML will be sent over the wire appropriately.

+2  A: 

If your needs are simple, you could do this with a map over the chars in the string.

quote($<) -> "&lt;";
quote($>) -> "&gt;";
quote($&) -> "&amp;";
quote($") -> "&quot;";
quote(C) -> C.

Then you would do

1> Raw = "string & \"stuff\" <".
2> Quoted = lists:map(fun quote/1, Raw).

Note that Quoted would not be a flat list: it mixes plain characters with nested strings. That is still fine if you are going to write it to a file or send it as an HTTP reply, because those accept nested lists. See Erlang's iolists.

More recent Erlang releases also include functions for converting between multibyte UTF-8 data and lists of Unicode code points; see the Erlang unicode module.
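
To make the map-and-iolist idea concrete, here is a minimal sketch that wraps the quote/1 clauses above in a module. The module name xml_quote and the functions escape/1 and escape_flat/1 are made up for illustration; only lists:map/2 and lists:flatten/1 come from the standard library.

```erlang
-module(xml_quote).
-export([quote/1, escape/1, escape_flat/1]).

%% Replace the four characters handled above; pass everything else through.
quote($<) -> "&lt;";
quote($>) -> "&gt;";
quote($&) -> "&amp;";
quote($") -> "&quot;";
quote(C) -> C.

%% Returns a deep list: plain characters mixed with nested strings.
%% Suitable wherever an iolist is accepted.
escape(Str) when is_list(Str) ->
    lists:map(fun quote/1, Str).

%% Same result, flattened into an ordinary string.
escape_flat(Str) ->
    lists:flatten(escape(Str)).
```

With this, xml_quote:escape_flat("string & \"stuff\" <") gives "string &amp; &quot;stuff&quot; &lt;". If you only need to send the result somewhere, escape/1 avoids the extra flattening pass.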


Reformatted comments, to make code examples stand out:

ettore: That's kind of what I am doing, although I do have to support multibyte characters. Here's my code:

xmlencode([], Acc) -> Acc;
xmlencode([$<|T], Acc) -> xmlencode(T, Acc ++ "&lt;");
% euro symbol
xmlencode([226,130,172|T], Acc) -> xmlencode(T, Acc ++ "&#8364;");
xmlencode([OneChar|T], Acc) -> xmlencode(T, lists:flatten([Acc,OneChar])).

Although I would prefer not to reinvent the wheel if possible.

dsmith: The string that you are using would normally be a list of Unicode code points (i.e. a list of numbers), and so any given byte encoding is irrelevant. You would only need to worry about specific encodings if you are working directly with binaries.

To clarify, the Unicode code-point for the euro symbol (decimal 8364) would be a single element in your list. So you would just do this:

xmlencode([8364|T], Acc) -> xmlencode(T, Acc ++ "&#8364;"); 
Christian
Thanks, but matching on xmlencode([8364|T], Acc) -> doesn't work for me. I'm actually storing the string as a binary in MySQL, and when I read it back I convert the binary to a string, so the pattern never matches. Similarly, when the client sends the euro symbol it uses UTF-8, so the server receives the 3 bytes listed above, not the integer 8364. Unless I'm missing something obvious.

What's interesting is that if my outbound XML header specifies the UTF-8 encoding, the euro symbol seems to be parsed fine, with no entity encoding needed. (I use UTF-8 across the board.) Can I rely on this for other characters, though?
ettore
Again, the standard representation of character strings using lists is one element per character; see EEP 10 at http://www.erlang.org/eeps/eep-0010.html#lists. That is, each element should be a single Unicode code point, not a UTF-8 byte. EEP 10 also describes routines to convert a binary containing UTF-8 encoded data to a list of Unicode code points. This is implemented in the unicode module of STDLIB.
dsmith
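
As a rough illustration of the point about code points versus UTF-8 bytes, the sketch below decodes a UTF-8 binary with unicode:characters_to_list/2 before escaping, so the euro sign arrives as the single integer 8364. The module name xmlenc_sketch and the choice to emit every non-ASCII character as a numeric character reference are assumptions for the example, not something the thread prescribes.

```erlang
-module(xmlenc_sketch).
-export([xmlencode/1]).

%% Accepts a UTF-8 encoded binary (e.g. as read from the database or a
%% socket) or a list of Unicode code points; returns a flat string.
%% Invalid UTF-8 makes characters_to_list/2 return an error tuple, which
%% is not handled here and will crash with function_clause.
xmlencode(Bin) when is_binary(Bin) ->
    xmlencode(unicode:characters_to_list(Bin, utf8));
xmlencode(Str) when is_list(Str) ->
    lists:flatten([escape(C) || C <- Str]).

%% The list must hold code points, not raw UTF-8 bytes; a list of bytes
%% would be escaped byte by byte, which is wrong.
escape($<) -> "&lt;";
escape($>) -> "&gt;";
escape($&) -> "&amp;";
escape($") -> "&quot;";
escape(C) when C > 127 -> ["&#", integer_to_list(C), ";"];
escape(C) -> C.
```

For example, xmlenc_sketch:xmlencode(<<"Acme & C. ", 226,130,172>>) returns "Acme &amp; C. &#8364;". If the data has already been turned into a list of bytes with binary_to_list/1, convert it back with list_to_binary/1 before calling this.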
+2  A: 

There is a function in the Erlang distribution that escapes angle brackets and ampersands, but it isn't documented, so it's probably best not to rely on it:

1> xmerl_lib:export_text("string & \"stuff\" <").
"string &amp; \"stuff\" &lt;"

If you want to build or encode XML structures (rather than just a single string), then the xmerl API would be a good option, e.g.

2> xmerl:export_simple([{foo, [], ["string & \"stuff\" <"]}], xmerl_xml).
["<?xml version=\"1.0\"?>",
 [[["<","foo",">"],
   ["string &amp; \"stuff\" &lt;"],
   ["</","foo",">"]]]]
Tim Fletcher
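
Since the question is about attribute values, here is a small sketch of the same xmerl:export_simple/2 call with the raw strings passed as attributes instead of element content. The module name xmerl_attr_sketch and the function company/2 are hypothetical, and the statement that xmerl escapes attribute values on export is an assumption you should check against your OTP version.

```erlang
-module(xmerl_attr_sketch).
-export([company/2]).

%% Build a <company .../> element from raw, unescaped strings and let
%% xmerl do the escaping while exporting (assumed behaviour, verify it).
company(Name, Currency) ->
    Simple = {company, [{name, Name}, {currency, Currency}], []},
    %% export_simple/2 returns a deep list; flatten it for inspection.
    lists:flatten(xmerl:export_simple([Simple], xmerl_xml)).
```

Calling xmerl_attr_sketch:company("Acme & C.", "EUR") should yield a document whose name attribute reads name="Acme &amp; C.", preceded by the XML declaration that export_simple adds.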
ettore
So you want a function like PHP's htmlentities, not PHP's htmlspecialchars? I'd port it from something like http://htmlentities.rubyforge.org/ or http://phpjs.org/functions/htmlentities
Tim Fletcher
+1  A: 

I'm not aware of one in the included OTP packages. However, Mochiweb's mochiweb_html module (mochiweb_html.erl) has an escape function that handles lists, binaries, and atoms.

For URL encoding, check out the mochiweb_util module (mochiweb_util.erl) with its urlescape function.

You could use either of those modules to get what you need.

Jeremy Wall
Thanks, I had a look at mochiweb_html:escape; however, it looks like it escapes only the predefined entities. Any other characters, whatever byte content they may have, seem to be left as-is, unless I missed something:

escape_attr([C | Rest], Acc) -> escape_attr(Rest, [C | Acc]).

For the purpose of building well-formed XML attributes this is enough, I suppose?
ettore
It's enough to build well-formed XML. It may or may not be enough to support your escaping needs, though. If you have application-specific escaping needs, you may need to implement them yourself.
Jeremy Wall