On my website I have a form that takes in some textual user input. All works fine for "normal" characters. However when unicode characters are input... well, the plot thickens.

User inputs something like

やっぱ死にかけてる

This comes in to the server as text containing XML entity refs:

&#12420;&#12387;&#12401;&#27515;&#12395;&#12363;&#12369;&#12390;&#12427;?

Now, when I want to serve this back to the client in HTML, how do I do it?

If I simply output the string as it is, there could be a chance for a script attack. If I try to encode it with scala.xml.Text it gets converted to:

&amp;#12420;&amp;#12387;&amp;#12401;&amp;#27515;&amp;#12395;&amp;#12363;&amp;#12369;&amp;#12390;&amp;#12427;?

Is there a better ready-made solution in Scala which can detect entity refs and not escape them, yet escape XML tags?

+1  A: 

Ok, I am trying this simple hack. Comments welcome:

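def secureEscape(text: String): String = {
  // Escape only the angle brackets; '&' is left alone so that incoming
  // character references such as &#12420; pass through unchanged.
  val s = new StringBuilder()
  for (c <- text) c match {
    case '<' => s.append("&lt;")
    case '>' => s.append("&gt;")
    case _   => s.append(c)
  }
  s.toString
}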

This will basically escape < and >.

I then run the incoming form input through this function and serve the result to the client without further processing.
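The hack above leaves `&` and quote characters untouched, so it is only adequate when the output lands in element content, not inside an attribute value. A slightly fuller sketch of the same idea follows, assuming the only references that need to survive are numeric ones such as `&#12420;`. It escapes `<`, `>`, both quote characters, and any `&` that does not start a numeric character reference. The names `SafeEscape` and `escapeKeepingRefs` are made up for illustration.

```scala
object SafeEscape {
  // Matches a decimal (&#12420;) or hexadecimal (&#x3084;) character reference.
  private val NumericRef = """&#(?:[0-9]{1,7}|[xX][0-9a-fA-F]{1,6});""".r

  /** Escapes HTML metacharacters but copies numeric character references verbatim. */
  def escapeKeepingRefs(text: String): String = {
    val out = new StringBuilder
    var i = 0
    while (i < text.length) {
      text.charAt(i) match {
        case '<'  => out.append("&lt;"); i += 1
        case '>'  => out.append("&gt;"); i += 1
        case '"'  => out.append("&quot;"); i += 1
        case '\'' => out.append("&#39;"); i += 1
        case '&'  =>
          // Keep the '&' only when it begins a well-formed numeric reference.
          NumericRef.findPrefixOf(text.substring(i)) match {
            case Some(ref) => out.append(ref); i += ref.length
            case None      => out.append("&amp;"); i += 1
          }
        case c => out.append(c); i += 1
      }
    }
    out.toString
  }
}
```

A reference such as `&#60;` still passes through, but the browser renders it as a literal `<` character rather than treating it as markup.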

HRJ
A: 

Really, the browser should be responsible for the correct UTF-8 encoding and escaping of characters (this appears to be happening). Your web framework should then handle the unescaping and decoding.

This can be a tricky business, with several steps involved, all of which may have to be explicitly configured for correct UTF-8 operation, especially when working with older frameworks and servers, caching proxies, content delivery networks, etc.

The point is that, internally, you want to be seeing the expected Unicode characters, not the entity refs. Likewise, you should output native Unicode and handle any required encoding at the boundary of your system; preferably this will be handled automatically by your choice of web framework.

In order to give you the correct solution, it's necessary to know what software stack(s) you're using and how the form is being submitted (i.e. GET/POST/AJAX+JSON).
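Where the framework does not do the decoding for you, the step described above (turning references back into native Unicode at the input boundary) can be sketched with the standard library alone. This assumes the input contains only numeric references, decimal or hexadecimal; `DecodeRefs` and `decode` are illustrative names, not part of any framework.

```scala
import scala.util.matching.Regex

object DecodeRefs {
  // Group 1 captures a decimal reference, group 2 a hexadecimal one.
  private val NumericRef = """&#(?:([0-9]{1,7})|[xX]([0-9a-fA-F]{1,6}));""".r

  /** Replaces numeric character references with the characters they denote. */
  def decode(text: String): String =
    NumericRef.replaceAllIn(text, m => {
      val cp =
        if (m.group(1) != null) m.group(1).toInt
        else Integer.parseInt(m.group(2), 16)
      // Leave out-of-range values and lone surrogates as they were written.
      val isScalar = Character.isValidCodePoint(cp) && !(cp >= 0xD800 && cp <= 0xDFFF)
      val replacement = if (isScalar) new String(Character.toChars(cp)) else m.matched
      Regex.quoteReplacement(replacement)
    })
}
```

The decoded string can contain real `<` and `&` characters (for example from `&#60;`), so it still has to be escaped when it is written back into HTML.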

Kevin Wright
Yup, getting the whole chain configured for UTF-8 is a pain. I have a home-brewed stack running over a servlet engine. So this is a back-to-basics question. The form is being submitted via POST.
HRJ
+2  A: 

Parse the string containing entity references as a fragment of XML. To safely output the Unicode characters in XML, you can be paranoid and use XML entity references for them, as the function escape below does.

scala> import xml.parsing.ConstructingParser
import xml.parsing.ConstructingParser

scala> import io.Source
import io.Source

scala> val d = ConstructingParser.fromSource(Source.fromString("<dummy>&#12420;</dummy>"), true).document
d: scala.xml.Document = <dummy>や</dummy>

scala> val t = d(0).text
t: String = や

scala> import xml._
import xml._

scala> def escape(xmlText: String): NodeSeq = {
     |   def escapeChar(c: Char): xml.Node =
     |     if (c > 0x7F || Character.isISOControl(c))
     |       xml.EntityRef("#" + Integer.toString(c, 10))
     |     else
     |       xml.Text(c.toString)
     | 
     |   new xml.Group(xmlText.map(escapeChar(_)))
     | }
escape: (xmlText: String)scala.xml.NodeSeq

scala> <foo>{escape(t)}</foo>                            
res3: scala.xml.Elem = <foo>&#12420;</foo>
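One caveat with a Char-by-Char escape like the one above: a character outside the Basic Multilingual Plane is stored as two surrogate Chars, and each would get its own reference, which is not a valid character reference in XML. A standard-library-only sketch that walks code points instead is shown below. It returns a String rather than a NodeSeq, so it escapes the markup characters itself; `AsciiEscape` and `toAsciiXml` are hypothetical names.

```scala
object AsciiEscape {
  /** Produces ASCII-only output: markup characters become named entities,
    * control and non-ASCII characters become decimal character references. */
  def toAsciiXml(text: String): String = {
    val out = new StringBuilder
    var i = 0
    while (i < text.length) {
      // codePointAt joins a surrogate pair into a single code point.
      val cp = text.codePointAt(i)
      if (cp == '<') out.append("&lt;")
      else if (cp == '>') out.append("&gt;")
      else if (cp == '&') out.append("&amp;")
      else if (cp == '"') out.append("&quot;")
      else if (cp > 0x7F || Character.isISOControl(cp)) out.append(s"&#$cp;")
      else out.append(cp.toChar)
      i += Character.charCount(cp)
    }
    out.toString
  }
}
```

As in the original function, tabs and newlines count as ISO control characters and are therefore written as references too.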
retronym
Bingo! Thanks. I am still digesting the paranoid half. But the first half is on the mark.
HRJ
If you don't trust a client of your XML output to correctly decode UTF-8 (for example, if it may be edited by Notepad!), you can restrict yourself to ASCII output, and use XML entity references to escape everything else. JDOM makes this really easy. I haven't found the corresponding mechanism in Scala XML, hence the hand-rolled function escape above. JDOM: `format.setEscapeStrategy(new EscapeStrategy() { public boolean shouldEscape(char ch) { return !isAscii(ch) || defaultEscapeStrategy.shouldEscape(ch); } })`
retronym
@retronym ah ok. I had wrongly assumed that scala.xml.Text() took care of that, but apparently it doesn't.
HRJ
A: 

Browsers only encode input characters to numeric character references when the character is outside the character set the page was served in. Save yourself a lot of trouble and serve your pages in UTF-8, properly tagged as UTF-8. Scala, Java and JavaScript string processing is all in Unicode, and restricting your web pages to ISO-8859-1 is inviting conversion problems like this in all directions. If your existing content is ASCII then conversion should be painless.
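To see which input would trigger that fallback, one can ask a charset encoder whether it is able to represent a given string. This is only a small illustration using `java.nio.charset`; the object name `CharsetCheck`, the method `representable`, and the constant `HtmlContentType` are made up for the example.

```scala
import java.nio.charset.{Charset, StandardCharsets}

object CharsetCheck {
  // Content-Type header value that tags an HTML response as UTF-8.
  val HtmlContentType = "text/html; charset=UTF-8"

  /** True when every character of `text` can be encoded in `charset`. */
  def representable(text: String, charset: Charset): Boolean =
    charset.newEncoder().canEncode(text)
}
```

For the Japanese sample text, `representable` returns false for `StandardCharsets.ISO_8859_1` and true for `StandardCharsets.UTF_8`, which is why a page served as ISO-8859-1 ends up receiving character references.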

Joseph Boyle
Did I not understand you fully, or did you miss the part about script attacks?
HRJ
If your pages are in UTF-8 or a Japanese character set, you will receive user input as actual Japanese characters and not the entity escapes. If you are not getting the entities in the first place, you will not be outputting them and therefore not susceptible to that kind of script attack.
Joseph Boyle