views:

205

answers:

2
+1  Q: 

Markdown and XSS

Ok, so I have been reading about markdown here on SO and elsewhere and the steps between user-input and the db are usually given as

  1. convert markdown to html
  2. sanitize html (w/whitelist)
  3. insert into database

but to me it makes more sense to do the following:

  1. sanitize markdown (remove all tags - no exceptions)
  2. convert to html
  3. insert into database

Am I missing something? This seems to me to be pretty nearly xss-proof

+1  A: 

There are two issues with what you've proposed:

  1. I don't see a way for your users to be able to format posts. You took advantage of Markdown to provide nice numbered lists, for example. In the proposed no-tags-no-exceptions world, I'm not seeing how the end user would be able to do such a thing.
  2. Considerably more important: When using Markdown as the "native" formatting language, and whitelisting the other available tags,you are limiting not just the input side of the world, but the output as well. In other words, if your display engine expects Markdown and only allows whitelisted content out, even if (God forbid) somebody gets to the database and injects some nasty malware-laden code into a bunch of posts, the actual site and its users are protected because you are sanitizing it upon display, as well.

There are some good resources on the web about output sanitization:

John Rudy
+1  A: 

Well certainly removing/escaping all tags would make a markup language more secure. However the whole point of Markdown is that it allows users to include arbitrary HTML tags as well as its own forms of markup(*). When you are allowing HTML, you have to clean/whitelist the output anyway, so you might as well do it after the markdown conversion to catch everything.

*: It's a design decision I don't agree with at all, and one that I think has not proven useful at SO, but it is a design decision and not a bug.

Incidentally, step 3 should be ‘output to page’; this normally takes place at the output stage, with the database containing the raw submitted text.

bobince