ansaurus

Question

How do I handle unicode user input in Scala safely (esp XML entities)

Answer 1

+1 A:

Ok, I am trying this simple hack. Comments welcome:

def secureEscape(text: String) = {
  val s = new StringBuilder()
  // Iterate over the characters directly; String.elements is deprecated since Scala 2.8
  for (c <- text) c match {
    case '<' => s.append("&lt;")
    case '>' => s.append("&gt;")
    case _   => s.append(c)
  }
  s.toString
}

This will basically escape < and >.

I then use this function to parse the incoming form input and then dish it out without further processing to the client.
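
Since the function above only touches < and >, here is a slightly broader sketch that also escapes &, double quotes and single quotes, which matters when the text ends up inside an attribute value. The names HtmlEscape and escapeHtml are made up for illustration and are not from the answer; treat it as a starting point, not a complete defence against script injection.

```scala
// Hypothetical helper, not part of the original answer.
object HtmlEscape {
  // Escapes the five characters that carry meaning in HTML/XML markup.
  def escapeHtml(text: String): String = {
    val sb = new StringBuilder(text.length)
    text.foreach {
      case '&'  => sb.append("&amp;")
      case '<'  => sb.append("&lt;")
      case '>'  => sb.append("&gt;")
      case '"'  => sb.append("&quot;")
      case '\'' => sb.append("&#39;")
      case c    => sb.append(c)
    }
    sb.toString
  }
}
```

Escaping & along with the others means that entity references typed by a user are displayed literally instead of being interpreted by the browser.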

HRJ 2010-01-09 16:10:06

Answer 2

A:

Really, the browser should be responsible for the correct UTF-8 encoding and escaping of characters (this appears to be happening). Your web framework should then handle the unescaping and decoding.

This can be a tricky business, with several steps involved, all of which may have to be explicitly configured for correct UTF-8 operation, especially when working with older frameworks and servers, caching proxies, content delivery networks, etc.

The point is that, internally, you want to see the expected Unicode characters, not the entity references. Likewise, you should output native Unicode and handle any required encoding at the boundary of your system; preferably your web framework will handle this automatically.
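
To make the "decode at the boundary" point concrete, here is a minimal sketch using only java.net.URLDecoder from the JDK. It assumes a form field arrived percent-encoded as UTF-8 bytes; the object and value names are invented for the example. On a servlet engine, the equivalent step is usually calling request.setCharacterEncoding("UTF-8") before reading any parameters.

```scala
import java.net.URLDecoder

// Hypothetical example: the same raw form field decoded with two charsets.
object FormDecoding {
  // "%E3%82%84" is the percent-encoded UTF-8 byte sequence for や (U+3084)
  val rawField = "%E3%82%84"

  // Decoding with the charset the browser actually used yields one character
  val asUtf8: String = URLDecoder.decode(rawField, "UTF-8")

  // Decoding with the wrong charset yields three unrelated characters
  val asLatin1: String = URLDecoder.decode(rawField, "ISO-8859-1")
}
```

If every layer agrees on UTF-8, the application only ever sees asUtf8 style strings and never has to deal with mangled bytes.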

In order to give you the correct solution, it's necessary to know what software stack(s) you're using and how the form is being submitted (i.e. GET/POST/AJAX+JSON)

Kevin Wright 2010-01-09 18:08:44

Yup, getting the whole chain configured for UTF-8 is a pain. I have a home-brewed stack running over a servlet engine. So this is a back-to-basics question. The form is being submitted via POST.

HRJ 2010-01-09 19:59:34

Answer 3

+2 A:

Parse the string containing entity references as a fragment of XML. To safely output the Unicode characters in XML, you can be paranoid and use XML entity references for them, as in the function escape below.

scala> import xml.parsing.ConstructingParser
import xml.parsing.ConstructingParser

scala> import io.Source
import io.Source

scala> val d = ConstructingParser.fromSource(Source.fromString("<dummy>&#12420;</dummy>"), true).document
d: scala.xml.Document = <dummy>や</dummy>

scala> val t = d(0).text
t: String = や

scala> import xml._
import xml._

scala> def escape(xmlText: String): NodeSeq = {
     |   def escapeChar(c: Char): xml.Node =
     |     if (c > 0x7F || Character.isISOControl(c))
     |       xml.EntityRef("#" + Integer.toString(c, 10))
     |     else
     |       xml.Text(c.toString)
     | 
     |   new xml.Group(xmlText.map(escapeChar(_)))
     | }
escape: (xmlText: String)scala.xml.NodeSeq

scala> <foo>{escape(t)}</foo>                            
res3: scala.xml.Elem = <foo>&#12420;</foo>
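
One caveat with the Char-based escape above: characters outside the Basic Multilingual Plane are stored as two surrogate Chars, so each half would get its own reference. Below is a rough sketch of a code-point-aware variant that returns a plain String and has no scala.xml dependency. NumericEscape and escapeNonAscii are hypothetical names, and the function deliberately leaves ASCII (including < and &) untouched, so it would need to be combined with normal markup escaping.

```scala
// Hypothetical variant of `escape`, working per Unicode code point.
object NumericEscape {
  // Replaces every non-ASCII code point with a decimal numeric character reference.
  def escapeNonAscii(text: String): String = {
    val sb = new StringBuilder
    var i = 0
    while (i < text.length) {
      val cp = text.codePointAt(i)
      if (cp > 0x7F) sb.append("&#").append(cp.toString).append(";")
      else sb.append(cp.toChar)
      // Advance by 2 for a surrogate pair, by 1 otherwise
      i += Character.charCount(cp)
    }
    sb.toString
  }
}
```

For example, escapeNonAscii("a\u3084b") gives "a&#12420;b", matching the reference used in the session above.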

retronym 2010-01-10 22:16:59

Bingo! Thanks. I am still digesting the paranoid half. But the first half is on the mark.

HRJ 2010-01-11 12:40:36

If you don't trust a client of your XML output to correctly decode UTF-8 (for example, if it may be edited by Notepad!), you can restrict yourself to ASCII output and use XML entity references to escape everything else. JDOM makes this really easy. I haven't found the corresponding mechanism in Scala XML, hence the hand-rolled function escape above. JDOM: `format.setEscapeStrategy(new EscapeStrategy() { public boolean shouldEscape(char ch) { return !isAscii(ch) || defaultEscapeStrategy.shouldEscape(ch); } })`

retronym 2010-01-11 15:31:27

@retronym ah ok. I had wrongly assumed that scala.xml.Text() took care of that, but apparently it doesn't.

HRJ 2010-01-11 15:52:59

Answer 4

A:

Browsers only encode input characters as numeric character references when the character is outside the character set the page was served in. Save yourself a lot of trouble and serve your pages in UTF-8, properly tagged as UTF-8. Scala, Java and JavaScript string processing is all in Unicode, and restricting your web pages to ISO-8859-1 invites conversion problems like this in all directions. If your existing content is ASCII then conversion should be painless.
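
As a rough illustration of "served in UTF-8 and tagged as UTF-8", the sketch below builds the page bytes and the matching Content-Type value with JDK classes only. Utf8Page, contentType and render are hypothetical names, and bodyHtml is assumed to be already escaped; how the header is actually sent depends on your server stack.

```scala
import java.nio.charset.StandardCharsets

// Hypothetical helper: keeps the declared charset and the actual byte encoding in sync.
object Utf8Page {
  // Value for the Content-Type response header
  val contentType = "text/html; charset=UTF-8"

  // Wraps an (already escaped) body in a page that declares UTF-8, then encodes it as UTF-8
  def render(bodyHtml: String): Array[Byte] = {
    val page =
      "<!DOCTYPE html><html><head><meta charset=\"UTF-8\"></head>" +
      "<body>" + bodyHtml + "</body></html>"
    page.getBytes(StandardCharsets.UTF_8)
  }
}
```

With the header, the meta tag and the byte encoding all agreeing, the browser has no reason to fall back to numeric character references when the form is submitted.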

Joseph Boyle 2010-01-12 22:43:36

Did I not understand you fully, or did you miss the part about script attacks?

HRJ 2010-01-13 05:05:52

If your pages are in UTF-8 or a Japanese character set, you will receive user input as actual Japanese characters and not as entity escapes. If you are not getting the entities in the first place, you will not be outputting them and are therefore not susceptible to that kind of script attack.

Joseph Boyle 2010-01-13 22:35:49
