views: 77
answers: 3
I'm writing a blog app with Django. I want to enable comment writers to use some tags (like <strong>, <a>, et cetera) but disable all others.

In addition, I want to let them put code in <code> tags, and have pygments parse them.

For example, someone might write this comment:

I like this article, but the third code example <em>could have been simpler</em>:

<code lang="c">
#include <stdbool.h>
#include <stdio.h>

int main()
{
    printf("Hello World\n");
}
</code>

Problem is, when I parse the comment with BeautifulSoup to strip disallowed HTML tags, it also parses the insides of the <code> blocks, and treats <stdbool.h> and <stdio.h> as if they were HTML tags.

How could I tell BeautifulSoup not to parse the <code> blocks? Maybe there are other HTML parsers better for this job?

+1  A: 

The problem is that <code> is treated according to the normal rules for HTML markup: content inside <code> tags is still HTML (the tag exists mainly to drive CSS formatting, not to change the parsing rules).

What you are trying to do is create a different markup language that is very similar, but not identical, to HTML. The simple solution would be to assume certain rules, such as, "<code> and </code> must appear on a line by themselves," and do some pre-processing yourself.

  1. A very simple — though not 100% reliable — technique is to replace ^<code>$ with <code><![CDATA[ and ^</code>$ with ]]></code>. It isn't completely reliable, because if the code block contains ]]>, things will go horribly wrong.
  2. A safer option is to replace dangerous characters inside code blocks (<, > and & probably suffice) with their equivalent character entity references (&lt;, &gt; and &amp;). You can do this by passing each block of code you identify to cgi.escape(code_block).

Once you've completed preprocessing, submit the result to BeautifulSoup as usual.
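Option #2 can be sketched with a regular expression, under the "tags on their own are simple" assumption above; `escape_code_blocks` and the pattern are my own illustrative names, not part of the answer:

```python
import re

try:
    from html import escape        # Python 3
except ImportError:
    from cgi import escape         # Python 2, as used in the answer

# Matches a whole <code ...>...</code> block, keeping the tags as groups.
# Note: regexes over HTML are fragile; this assumes no nested <code> tags.
CODE_RE = re.compile(r'(<code[^>]*>)(.*?)(</code>)', re.DOTALL)

def escape_code_blocks(comment):
    """Escape <, > and & inside each <code> block so that a later
    HTML parse leaves the code contents untouched."""
    return CODE_RE.sub(
        lambda m: m.group(1) + escape(m.group(2)) + m.group(3),
        comment)
```

Text outside the code blocks passes through unchanged, so the normal tag-stripping step still sees it as markup.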

Marcelo Cantos
Option #2 seems like a winner. How would I go about that? Regular expressions, or some sophisticated string processing algorithm?
Dor
@Dor: I've amended my answer to cover this.
Marcelo Cantos
I've tried this, but obviously cgi.escape expects a string, not a BeautifulSoup tag object :) How can I escape the contents of the tag prior to the parsing?
Dor
You should extract the text between the `<code>` and `</code>` lines as per my original answer, pass it through `cgi.escape` and concatenate it all back together. Then (and only then) pass the whole thing to BeautifulSoup.
Marcelo Cantos
A: 

Unfortunately, BeautifulSoup cannot be told to skip parsing inside the code blocks.

One way to achieve what you want is to:

1) Remove the code blocks

soup = BeautifulSoup(unicode(content))
code_blocks = soup.findAll(u'code')
for block in code_blocks:
    # replaceWith() needs a Tag, not a string: a string placeholder
    # would be escaped as plain text when the soup is rendered
    placeholder = Tag(soup, u'code', [(u'class', u'removed')])
    block.replaceWith(placeholder)

2) Do the usual parsing to strip the non-allowed tags.

3) Re-insert the code blocks and re-generate the html.

stripped_code = stripped_soup.findAll(u"code", u"removed")
# re-insert pygment formatted code
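The remove/strip/re-insert cycle above can also be sketched without touching BeautifulSoup until the stripping step, by stashing the blocks behind placeholder tokens first; the `@@CODE0@@` token format and function names here are my own assumptions, not from the answer:

```python
import re

# Grab each whole <code ...>...</code> block (assumes no nesting).
CODE_RE = re.compile(r'<code[^>]*>.*?</code>', re.DOTALL)

def pull_code_blocks(comment):
    """Step 1: swap each <code> block for a numbered placeholder
    and remember the original blocks."""
    saved = []
    def stash(match):
        saved.append(match.group(0))
        return '@@CODE%d@@' % (len(saved) - 1)
    return CODE_RE.sub(stash, comment), saved

def push_code_blocks(stripped, saved):
    """Step 3: put the saved blocks back after tag stripping
    (or pygments formatting) is done."""
    for i, block in enumerate(saved):
        stripped = stripped.replace('@@CODE%d@@' % i, block)
    return stripped
```

Because the placeholders contain no angle brackets, step 2 (the usual tag-stripping parse) cannot damage them.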

I would have answered with some code, but I recently read a blog post that does this elegantly.

pyfunc
When I first parse the string, BeautifulSoup inserts the closing </stdbool.h> and </stdio.h> tags. So even if I used this technique I'd still get these closing tags in my code blocks.
Dor
A: 

From the Python wiki:

>>> import cgi
>>> cgi.escape("<string.h>")
'&lt;string.h&gt;'

>>> BeautifulSoup('&lt;string.h&gt;',
...               convertEntities=BeautifulSoup.HTML_ENTITIES)
N 1.1
That way I'd have to write every possible tag, wouldn't I?
Dor
@Dor: why? just pass everything inside `<code>` to `cgi.escape`
N 1.1
That's the main part of the question - how?
Dor