ansaurus

Question

Parsing a document with BeautifulSoup while not-parsing the contents of <code> tags

Answer 1

+1 A:

The problem is that <code> is treated according to the normal rules for HTML markup, and content inside <code> tags is still HTML (the tag exists mainly to drive CSS formatting, not to change the parsing rules).

What you are trying to do is create a different markup language that is very similar, but not identical, to HTML. The simple solution would be to assume certain rules, such as, "<code> and </code> must appear on a line by themselves," and do some pre-processing yourself.

A very simple, though not 100% reliable, technique is to replace ^<code>$ with <code><![CDATA[ and ^</code>$ with ]]></code>. It isn't completely reliable, because if the code block contains ]]>, the CDATA section ends early and the rest of the block is parsed as markup.

A safer option is to replace dangerous characters inside code blocks (<, > and & probably suffice) with their equivalent character entity references (&lt;, &gt; and &amp;). You can do this by passing each block of code you identify to cgi.escape(code_block).

Once you've completed preprocessing, submit the result to BeautifulSoup as usual.
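For illustration, the CDATA substitution described above might look like the following minimal sketch. The function name `wrap_code_in_cdata` is hypothetical, the sketch assumes `<code>` and `</code>` each sit alone on a line with `\n` line endings, and it simply refuses any input containing `]]>` rather than trying to handle it.

```python
import re


def wrap_code_in_cdata(text):
    """Wrap the body of each <code> block in a CDATA section.

    Assumption: <code> and </code> each appear alone on their own line.
    """
    if "]]>" in text:
        # A literal "]]>" would close the CDATA section early, so this
        # conservative sketch rejects the whole input instead.
        raise ValueError("input contains ']]>'; CDATA wrapping is unsafe")
    # re.MULTILINE makes ^ and $ match at the start and end of each line.
    text = re.sub(r"^<code>$", "<code><![CDATA[", text, flags=re.MULTILINE)
    text = re.sub(r"^</code>$", "]]></code>", text, flags=re.MULTILINE)
    return text
```

How a given parser treats CDATA sections inside HTML varies, so it is worth checking the parsed output before relying on this.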

Marcelo Cantos 2010-10-24 07:44:37

Option #2 seems like a winner. How would I go about that? Regular expressions, or some sophisticated string processing algorithm?

Dor 2010-10-26 19:07:44

@Dor: I've amended my answer to cover this.

Marcelo Cantos 2010-10-26 20:49:15

I've tried this, but obviously cgi.escape expects a string, not a BeautifulSoup tag object :) How can I escape the contents of the tag prior to the parsing?

Dor 2010-10-26 22:40:49

You should extract the text between the `<code>` and `</code>` lines as per my original answer, pass it through `cgi.escape` and concatenate it all back together. Then (and only then) pass the whole thing to BeautifulSoup.
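The extract, escape and concatenate steps could be sketched roughly as below. This is only an illustration under the same "tags on their own lines" assumption: `escape_code_blocks` is a hypothetical name, and `html.escape` stands in for `cgi.escape`, which was removed from the standard library in Python 3.8.

```python
import html


def escape_code_blocks(text):
    """Escape <, > and & on every line between <code> and </code> lines."""
    out = []
    in_code = False
    for line in text.splitlines(keepends=True):
        stripped = line.strip()
        if not in_code and stripped == "<code>":
            in_code = True
            out.append(line)       # keep the opening tag as markup
        elif in_code and stripped == "</code>":
            in_code = False
            out.append(line)       # keep the closing tag as markup
        elif in_code:
            # quote=False mirrors the old cgi.escape default: only
            # <, > and & are replaced, quotes are left alone.
            out.append(html.escape(line, quote=False))
        else:
            out.append(line)       # ordinary HTML passes through untouched
    return "".join(out)
```

The returned string is what would then be handed to BeautifulSoup, so the parser only ever sees entity references inside the code blocks.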

Marcelo Cantos 2010-10-26 22:58:33

Answer 2

A:

Unfortunately, BeautifulSoup cannot be prevented from parsing the code blocks.

One way to achieve what you want is to:

1) Remove the code blocks

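from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3, Python 2

soup = BeautifulSoup(unicode(content))
code_blocks = soup.findAll(u'code')
# Swap each parsed <code> element for an empty placeholder.
for block in code_blocks:
    block.replaceWith(u'<code class="removed"></code>')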

2) Do the usual parsing to strip the non-allowed tags.

3) Re-insert the code blocks and re-generate the html.

stripped_code = stripped_soup.findAll(u"code", u"removed")
# re-insert pygment formatted code

I would have answered with more code, but I recently read a blog post that does this elegantly:

http://iboris.com/page/add-source-code-syntax-highlighting-your-django-content-pygments.html
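A variant of the remove and re-insert idea is to stash the code blocks before any parsing happens, so the parser never sees their contents. The sketch below is an assumption-laden illustration rather than the approach from the linked post: the function names and the `@@CODEn@@` placeholder format are made up, and it assumes that a code block never contains a literal `</code>` and that the placeholder text does not occur in the document.

```python
import re

# Non-greedy match so each <code>...</code> pair is captured separately.
CODE_RE = re.compile(r"<code>.*?</code>", re.DOTALL)


def stash_code_blocks(text):
    """Replace each code block with a placeholder; return (text, blocks)."""
    blocks = []

    def _stash(match):
        blocks.append(match.group(0))
        return "@@CODE%d@@" % (len(blocks) - 1)

    return CODE_RE.sub(_stash, text), blocks


def restore_code_blocks(text, blocks):
    """Put the stashed blocks back in place of their placeholders."""
    return re.sub(r"@@CODE(\d+)@@", lambda m: blocks[int(m.group(1))], text)
```

Between the two calls, the placeholder text can go through the usual tag stripping. The stashed blocks are restored verbatim here, so they would still need escaping or syntax highlighting before being served as HTML.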

pyfunc 2010-10-24 07:49:59

When I first parse the string, BeautifulSoup inserts the closing </stdbool.h> and </stdio.h> tags. So even if I used this technique I'd still get these closing tags in my code blocks.

Dor 2010-10-24 15:45:48

Answer 3

A:

From the Python wiki:

>>> import cgi
>>> cgi.escape("<string.h>")
'&lt;string.h&gt;'

>>> from BeautifulSoup import BeautifulSoup
>>> BeautifulSoup('&lt;string.h&gt;',
...               convertEntities=BeautifulSoup.HTML_ENTITIES)
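On current Python versions `cgi.escape` is no longer available (it was removed in Python 3.8), so a rough modern equivalent of the round trip uses the standard `html` module. This is just a sketch with a made-up sample string: it shows that escaping and then decoding the entities gives back the original text.

```python
import html

original = "#include <string.h>"

# quote=False matches the old cgi.escape default (only <, > and &).
escaped = html.escape(original, quote=False)

# Decoding the entities recovers the text exactly as it was written.
restored = html.unescape(escaped)

print(escaped)               # #include &lt;string.h&gt;
print(restored == original)  # True
```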

N 1.1 2010-10-24 07:50:59

That way I'd have to write every possible tag, wouldn't I?

Dor 2010-10-24 14:04:01

@Dor: Why? Just pass everything inside `<code>` to `cgi.escape`.

N 1.1 2010-10-24 14:11:35

That's the main part of the question - how?

Dor 2010-10-24 15:47:57
