ansaurus

Question

Answer 1

+2 A:

If your needs are simple, you could do this with a map over the chars in the string.

quote($<) -> "&lt;";
quote($>) -> "&gt;";
quote($&) -> "&amp;";
quote($") -> "&quot;";
quote(C) -> C.

Then you would do

1> Raw = "string & \"stuff\" <".
2> Quoted = lists:map(fun quote/1, Raw).

Note that Quoted is not a flat list: the escaped characters appear as nested strings. That is still fine if you are going to write it to a file or send it as an HTTP reply, since both accept Erlang's iolists.

More recent Erlang releases include functions for converting between multibyte UTF-8 and lists of Unicode code points; see the Erlang unicode module.
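If a plain flat string is needed, the nested result can be flattened. A minimal sketch, assuming a hypothetical module name `xml_quote` and helper `quote_string/1` (neither is part of OTP); the `quote/1` clauses are the ones from the answer:

```erlang
-module(xml_quote).
-export([quote/1, quote_string/1]).

%% Map one character to its XML-escaped form (same clauses as in the answer).
quote($<) -> "&lt;";
quote($>) -> "&gt;";
quote($&) -> "&amp;";
quote($") -> "&quot;";
quote(C) -> C.

%% Hypothetical helper: escape every character, then flatten the
%% nested list into an ordinary string.
quote_string(Raw) ->
    lists:flatten(lists:map(fun quote/1, Raw)).
```

For example, `xml_quote:quote_string("string & \"stuff\" <")` returns `"string &amp; &quot;stuff&quot; &lt;"`. When the result only goes to a socket or file, flattening can be skipped.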

Reformatted comments, to make code examples stand out:

ettore: That's kind of what I am doing, although I do have to support multibyte characters. Here's my code:

xmlencode([], Acc) -> Acc;
xmlencode([$<|T], Acc) -> xmlencode(T, Acc ++ "&lt;");
% euro symbol
xmlencode([226,130,172|T], Acc) -> xmlencode(T, Acc ++ "&#8364;");
xmlencode([OneChar|T], Acc) -> xmlencode(T, lists:flatten([Acc,OneChar])).

Although I would prefer not to reinvent the wheel if possible.

dsmith: The string that you are using would normally be a list of Unicode code points (i.e. a list of numbers), so any given byte encoding is irrelevant. You would only need to worry about specific encodings if you are working directly with binaries.

To clarify, the Unicode code-point for the euro symbol (decimal 8364) would be a single element in your list. So you would just do this:

xmlencode([8364|T], Acc) -> xmlencode(T, Acc ++ "&#8364;");

Christian 2010-07-26 22:06:23

That's kind of what I am doing, although I do have to support multibyte characters. Here's my code:

xmlencode([], Acc) -> Acc;
xmlencode([$<|T], Acc) -> xmlencode(T, Acc ++ "&lt;");
% euro symbol
xmlencode([226,130,172|T], Acc) -> xmlencode(T, Acc ++ "&#8364;");
xmlencode([OneChar|T], Acc) -> xmlencode(T, lists:flatten([Acc,OneChar])).

Although I would prefer not to reinvent the wheel if possible.

ettore 2010-07-27 00:23:50

@ettore - The string that you are using would normally be a list of Unicode code points (i.e. a list of numbers), so any given byte encoding is irrelevant. You would only need to worry about specific encodings if you are working directly with binaries.

dsmith 2010-07-27 14:31:21

To clarify, the Unicode code-point for the euro symbol (decimal 8364) would be a single element in your list. So you would just do this:

xmlencode([8364|T], Acc) -> xmlencode(T, Acc ++ "&#8364;");

dsmith 2010-07-27 14:41:58

Thx, but using xmlencode([8364|T],Acc)-> doesn't work for me. I'm in fact storing the string as a binary in MySQL, and when I read it back I convert the binary to a string, which never matches the pattern. Similarly, when the client sends the euro symbol, since it uses UTF8, the server receives the 3 bytes listed above, not the integer 8364. Unless I'm missing something obvious.

What's interesting is that if my outbound XML header specifies the UTF8 encoding, the euro symbol seems to be parsed fine, with no entity-encoding needed. (I use UTF8 across the board.) Can I rely on this for other chars, though?

ettore 2010-07-28 18:55:01

Again, the standard representation of character strings using lists is one element per character; see EEP 10 at http://www.erlang.org/eeps/eep-0010.html#lists. That is, each element should be a single Unicode code point, not a UTF-8 byte. EEP 10 also provides routines to convert a binary containing UTF-8 encoded data to a list of Unicode code points. This is implemented in the unicode module of STDLIB.
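As an illustration of that conversion, here is a minimal sketch that decodes a UTF-8 binary with `unicode:characters_to_list/2` and then escapes code points rather than bytes. The module name `xml_utf8` and the functions `encode/1` and `escape/1` are hypothetical, and emitting numeric character references for everything above 127 is an assumption of this sketch, not something XML requires when the document is declared as UTF-8:

```erlang
-module(xml_utf8).
-export([encode/1]).

%% Decode a UTF-8 binary into a list of code points, then escape each one.
encode(Utf8Bin) when is_binary(Utf8Bin) ->
    case unicode:characters_to_list(Utf8Bin, utf8) of
        Chars when is_list(Chars) ->
            lists:flatten([escape(C) || C <- Chars]);
        Error ->
            %% {error, _, _} or {incomplete, _, _}: input was not valid UTF-8.
            erlang:error({invalid_utf8, Error})
    end.

%% Escape one code point. The > 127 clause is this sketch's assumption:
%% it writes non-ASCII characters as numeric character references.
escape($<) -> "&lt;";
escape($>) -> "&gt;";
escape($&) -> "&amp;";
escape($") -> "&quot;";
escape(C) when C > 127 -> ["&#", integer_to_list(C), ";"];
escape(C) -> C.
```

With this, the three bytes 226,130,172 arrive as the single code point 8364, so `xml_utf8:encode(<<226,130,172>>)` returns `"&#8364;"` without a byte-level pattern for each character.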

dsmith 2010-07-30 14:24:59

Answer 2

+2 A:

There is a function in the Erlang distribution that escapes angle brackets and ampersands, but it isn't documented, so it's probably best not to rely on it:

1> xmerl_lib:export_text("string & \"stuff\" <").
"string &amp; \"stuff\" &lt;"

If you want to build or encode XML structures (instead of just encoding a single string), then the xmerl API would be a good option, e.g.

2> xmerl:export_simple([{foo, [], ["string & \"stuff\" <"]}], xmerl_xml).
["<?xml version=\"1.0\"?>",
 [[["<","foo",">"],
   ["string &amp; \"stuff\" &lt;"],
   ["</","foo",">"]]]]
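The nested result above is an iolist and can be written out directly. If one flat string is preferred, it can be flattened; a small sketch, assuming a hypothetical module `xml_export` with a helper `to_string/1` that wraps the text in the same `foo` element as the example:

```erlang
-module(xml_export).
-export([to_string/1]).

%% Hypothetical helper: export Text inside a <foo> element using xmerl
%% (shipped with OTP) and flatten the nested result into one string.
to_string(Text) ->
    Nested = xmerl:export_simple([{foo, [], [Text]}], xmerl_xml),
    lists:flatten(Nested).
```

Calling `xml_export:to_string("string & \"stuff\" <")` yields the prolog and the `foo` element as a single string, with the ampersand and angle bracket escaped as shown above.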

Tim Fletcher 2010-07-26 22:18:15

ettore 2010-07-27 00:16:46

So you want a function like PHP's htmlentities, not PHP's htmlspecialchars? I'd port it from something like http://htmlentities.rubyforge.org/ or http://phpjs.org/functions/htmlentities

Tim Fletcher 2010-07-27 14:07:48

Answer 3

+1 A:

I'm not aware of one in the included OTP packages. However, Mochiweb's mochiweb_html module (mochiweb_html.erl) has an escape function that handles lists, binaries, and atoms.

For URL encoding, check out the mochiweb_util module (mochiweb_util.erl) with its urlescape function.

You could use either of those libraries to get what you needed.

Jeremy Wall 2010-07-28 06:04:50

Thanks, I had a look at mochiweb_html:escape, however it looks like it escapes only the predefined entities. Any other characters, whatever byte content they may have, seem to be left as-is, unless I missed something:

escape_attr([C | Rest], Acc) -> escape_attr(Rest, [C | Acc]).

For the purpose of building well-formed XML attributes this is enough, I suppose?

ettore 2010-07-28 19:22:37

It's enough to build well-formed XML. It may or may not be enough to support your escaping needs, though. If you have application-specific escaping needs, you may need to implement them yourself.

Jeremy Wall 2010-08-05 17:41:38


How do I XML-encode a string in Erlang?
