ansaurus

Question

Parse HTML-like metalanguage with Regular Expressions (Python)

Answer 1

A:

Try Beautiful Soup instead:

http://www.crummy.com/software/BeautifulSoup/

Here's a note on using lxml together with Beautiful Soup:

http://codespeak.net/lxml/elementsoup.html

Eric Snow 2010-08-13 07:24:15

Answer 2

A:

If you're looking for a nice, robust, powerful HTML parser, lxml is where it's at. Way better than BeautifulSoup in basically every way.

Aaron Gallagher 2010-08-13 07:26:34

Answer 3

+1 A:

How to parse HTML with regular expressions: Don't

HTML is too complex to be parsed by regular expressions. Find yourself an HTML parser and learn it.

haydenmuhl 2010-08-13 07:43:47

It's less 'too complex' and more 'HTML isn't a regular language.'

Aaron Gallagher 2010-08-13 07:46:34

I know that, but I don't expect everyone to be familiar with the Chomsky hierarchy or formal grammars.

haydenmuhl 2010-08-13 18:09:51

Answer 4

+3 A:

Your question in some way includes the answer. You want to parse HTML and at first glance Regular Expressions look ideal for the job. However, even with a quick look you're finding problems which seem difficult to solve.

It turns out the answer is not a more complex regular expression; in fact that approach will end in madness.

The clever thing to do now is realise you've made a mistake that lots of other people have made, and step away from the Regular Expression and find the correct tool for the job.

The formal answer is that HTML is not a Regular language and so can't be successfully parsed with Regular Expressions.
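To make the point about nesting concrete, here is a small sketch using Python's re module. The sample strings and variable names are made up for illustration. A lazy pattern pairs the outer opening tag with the inner closing tag, and a greedy pattern swallows two sibling elements as one; neither can count nesting depth.

```python
import re

# Nested elements: the lazy pattern stops at the FIRST closing tag,
# so the outer opening tag gets paired with the inner closing tag.
nested = '<div>outer <div>inner</div> tail</div>'
lazy = re.search(r'<div>(.*?)</div>', nested).group(1)
print(lazy)    # outer <div>inner

# Sibling elements: the greedy pattern runs to the LAST closing tag,
# so two separate elements are reported as a single match.
siblings = '<div>one</div> and <div>two</div>'
greedy = re.search(r'<div>(.*)</div>', siblings).group(1)
print(greedy)  # one</div> and <div>two
```

Whichever quantifier you pick, one of the two inputs comes out wrong, which is the practical face of the "not a regular language" argument.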

The right tool to use is an HTML parser. Have a look at Beautiful Soup, lxml or sgmllib (sgmllib is Python 2 only; it was removed in Python 3).
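As a rough illustration of what "use a parser" looks like, here is a sketch built on html.parser from the Python 3 standard library. That module is a stand-in chosen so the example has no third-party dependencies; it is not one of the libraries named above. The class name TextByTag and the sample markup are invented for this example.

```python
from html.parser import HTMLParser


class TextByTag(HTMLParser):
    """Record each piece of text together with the path of open tags."""

    def __init__(self):
        super().__init__()
        self.stack = []   # names of the currently open tags
        self.found = []   # (tag path, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        # Only pop when the closing tag matches the innermost open tag.
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.found.append(('/'.join(self.stack), text))


parser = TextByTag()
parser.feed('<div>outer <div>inner</div> tail</div>')
parser.close()
print(parser.found)
# [('div', 'outer'), ('div/div', 'inner'), ('div', 'tail')]
```

The parser reports the nesting correctly because it tracks open tags as it goes, which is exactly the state a regular expression has no way to keep.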

Dave Webb 2010-08-13 08:01:19

Answer 5

+2 A:

Reading your comment, I see that you don't want to parse HTML -- you want to parse your own custom template library. As all the other posters have said, don't use regex.

The correct solution is pyparsing.

katrielalex 2010-08-13 09:04:10

Answer 6

+1 A:

Here's a way to parse your meta-language with pyparsing:

import pyparsing as p

lbrace = p.Literal('{').suppress()
rbrace = p.Literal('}').suppress()
equals = p.Literal('=').suppress()
slash = p.Literal('/').suppress()
identifier = p.Word(p.alphas, p.alphanums)
qs = p.QuotedString('"', '\\')
tag = p.Group(
    lbrace
    + identifier.setResultsName('tag_name')
    + p.Group(p.ZeroOrMore(
        p.Group(
            identifier.setResultsName('attr_name')
            + equals
            + (identifier | qs).setResultsName('value')
        ).setResultsName('attribute')
    )).setResultsName('attributes')
    + rbrace
).setResultsName('tag')
close_tag = p.Group(
    lbrace
    + slash
    + identifier.setResultsName('tag_name')
    + rbrace
).setResultsName('closetag')

any_tag = tag | close_tag

s = """
what
{foo} {/foo} {bar baz="bat" bat="baz"}{b}dongs{/b} moredongs{/bar}
"""

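# asXML() is available in pyparsing 2.x; it was removed in pyparsing 3.0,
# where dump() is the closest debugging aid.
print(''.join(tok.asXML() for tok, st, en in any_tag.scanString(s)))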

Output:

<tag>
  <tag>
    <tag_name>foo</tag_name>
    <attributes>
    </attributes>
  </tag>
</tag>
<closetag>
  <closetag>
    <tag_name>foo</tag_name>
  </closetag>
</closetag>
<tag>
  <tag>
    <tag_name>bar</tag_name>
    <attributes>
      <attribute>
        <attr_name>baz</attr_name>
        <value>bat</value>
      </attribute>
      <attribute>
        <attr_name>bat</attr_name>
        <value>baz</value>
      </attribute>
    </attributes>
  </tag>
</tag>
<tag>
  <tag>
    <tag_name>b</tag_name>
    <attributes>
    </attributes>
  </tag>
</tag>
<closetag>
  <closetag>
    <tag_name>b</tag_name>
  </closetag>
</closetag>
<closetag>
  <closetag>
    <tag_name>bar</tag_name>
  </closetag>
</closetag>

You should look at what scanString does to see how to use this in your own code. Getting the text between tags is left as an exercise for the reader. Hint: use a list as a stack for tags.
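For anyone who wants a starting point for that exercise, here is one possible sketch of the stack idea. To keep it runnable without pyparsing, the hypothetical scan_tags function below stands in for any_tag.scanString: it yields (kind, name, start, end) tuples, which is the information you would pull out of each scanString result. Both scan_tags and build_tree are names invented for this example, and the stand-in scanner is deliberately naive (it ignores attributes and would break on a quoted value containing a brace).

```python
def scan_tags(text):
    """Naive stand-in for any_tag.scanString: yield (kind, name, start, end)."""
    pos = 0
    while True:
        start = text.find('{', pos)
        if start == -1:
            return
        end = text.find('}', start)
        if end == -1:
            return
        body = text[start + 1:end].strip()
        pos = end + 1
        if not body:
            continue  # skip an empty "{}"
        if body.startswith('/'):
            yield ('close', body[1:].strip(), start, end + 1)
        else:
            # First word is the tag name; attributes are ignored here.
            yield ('open', body.split()[0], start, end + 1)


def build_tree(text):
    """Nest tags and the text between them, using a list as a stack."""
    root = {'name': None, 'children': []}
    stack = [root]   # stack[-1] is the innermost open tag
    pos = 0          # end offset of the previous tag
    for kind, name, start, end in scan_tags(text):
        if text[pos:start]:
            # Text between two tags belongs to the innermost open tag.
            stack[-1]['children'].append(text[pos:start])
        if kind == 'open':
            node = {'name': name, 'children': []}
            stack[-1]['children'].append(node)
            stack.append(node)
        else:
            if len(stack) == 1 or stack[-1]['name'] != name:
                raise ValueError('unexpected closing tag: ' + name)
            stack.pop()
        pos = end
    if text[pos:]:
        stack[-1]['children'].append(text[pos:])
    if len(stack) != 1:
        raise ValueError('unclosed tag: ' + stack[-1]['name'])
    return root


tree = build_tree('{bar baz="bat"}{b}dongs{/b} moredongs{/bar}')
bar = tree['children'][0]
print(bar['name'], bar['children'])
# bar [{'name': 'b', 'children': ['dongs']}, ' moredongs']
```

With the real grammar you would replace scan_tags with a loop over any_tag.scanString(s) and read the tag name and offsets from each result; the stack handling stays the same.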

Aaron Gallagher 2010-08-13 11:14:24
