Hi!

I want to parse HTML tags manually with regular expressions.

I see the following problems:

  1. A string containing tags: <h1 title="<this not a tag>">...</h1> - how do I parse this tag correctly when tags appear inside a string attribute?
  2. A tag can be malformed, like <h1 title="not closed string> or <h1 title=not opened string"> - such tags should also be parsed correctly.

In general, I need to get all tags <...> from an HTML document and do something with them. I would like to write this code with regular expressions (in my case, in Python).

Can you please help me with this? Thanks!

An incorrect solution:

  • <[^<|]+?> - this fails to parse <h1 title="<not a tag>">
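
The failure described in that bullet can be reproduced in a few lines. This is a minimal sketch using Python's re module; the sample string is the one from the question with some heading text added for illustration.

```python
import re

# The pattern from the question: '<', then one or more characters that are
# neither '<' nor '|' (matched lazily), then '>'.
pattern = re.compile(r'<[^<|]+?>')

html = '<h1 title="<not a tag>">Heading</h1>'

# The real start tag is never matched as a whole, because the '<' inside the
# attribute value stops the character class before any '>' is reached.
# The text inside the attribute is reported as a tag instead.
matches = pattern.findall(html)
print(matches)
```

The closing tag is found, but the opening <h1 ...> tag is lost and the attribute value is picked up as if it were a tag.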
A: 

Try BeautifulSoup instead:

http://www.crummy.com/software/BeautifulSoup/

Here's a tidbit on lxml and beautifulsoup:

http://codespeak.net/lxml/elementsoup.html

Eric Snow
A: 

If you're looking for a nice, robust, powerful HTML parser, lxml is where it's at. Way better than BeautifulSoup in basically every way.

Aaron Gallagher
+1  A: 

How to parse HTML with regular expressions: Don't

HTML is too complex to be parsed by regular expressions. Find yourself an HTML parser and learn it.

haydenmuhl
It's less 'too complex' and more 'HTML isn't a regular language.'
Aaron Gallagher
I know that, but I don't expect everyone to be familiar with the Chomsky hierarchy or formal grammars.
haydenmuhl
+3  A: 

Your question in some way includes the answer. You want to parse HTML and at first glance Regular Expressions look ideal for the job. However, even with a quick look you're finding problems which seem difficult to solve.

It turns out the answer is not a more complex regular expression; in fact that approach will end in madness.

The clever thing to do now is realise you've made a mistake that lots of other people have made, and step away from the Regular Expression and find the correct tool for the job.

The formal answer is that HTML is not a Regular language and so can't be successfully parsed with Regular Expressions.

The right tool to use is an HTML Parser. Have a look at Beautiful Soup, lxml or sgmllib.
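
To show what a parser buys you on the first case from the question, here is a small sketch. It uses html.parser from the Python standard library rather than any of the libraries named above, only so that it runs without installing anything; the TagCollector class is a made-up name for illustration.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Collects a (kind, name, attrs) tuple for every tag the parser reports."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs.
        self.tags.append(('start', tag, dict(attrs)))

    def handle_endtag(self, tag):
        self.tags.append(('end', tag, {}))


collector = TagCollector()
collector.feed('<h1 title="<this not a tag>">Heading</h1>')
collector.close()

for kind, name, attrs in collector.tags:
    print(kind, name, attrs)
```

The quoted attribute value is kept intact, angle brackets included, and only the h1 start and end tags are reported. How malformed tags such as an unclosed quote are recovered differs between parsers, so test the one you pick against your own input.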

Dave Webb
+2  A: 

Reading your comment, I see that you don't want to parse HTML -- you want to parse your own custom template library. As all the other posters have said, don't use regex.

The correct solution is pyparsing.

katrielalex
+1  A: 

Here's a way to parse your meta-language with pyparsing:

import pyparsing as p

lbrace = p.Literal('{').suppress()
rbrace = p.Literal('}').suppress()
equals = p.Literal('=').suppress()
slash = p.Literal('/').suppress()
identifier = p.Word(p.alphas, p.alphanums)
qs = p.QuotedString('"', '\\')
tag = p.Group(
    lbrace
    + identifier.setResultsName('tag_name')
    + p.Group(p.ZeroOrMore(
        p.Group(
            identifier.setResultsName('attr_name')
            + equals
            + (identifier | qs).setResultsName('value')
        ).setResultsName('attribute')
    )).setResultsName('attributes')
    + rbrace
).setResultsName('tag')
close_tag = p.Group(
    lbrace
    + slash
    + identifier.setResultsName('tag_name')
    + rbrace
).setResultsName('closetag')

any_tag = tag | close_tag

s = """
what
{foo} {/foo} {bar baz="bat" bat="baz"}{b}dongs{/b} moredongs{/bar}
"""

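# print() call works on Python 3; asXML() requires pyparsing 2.x (it is not available in pyparsing 3).
print(''.join(tok.asXML() for tok, st, en in any_tag.scanString(s)))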

Output:

<tag>
  <tag>
    <tag_name>foo</tag_name>
    <attributes>
    </attributes>
  </tag>
</tag>
<closetag>
  <closetag>
    <tag_name>foo</tag_name>
  </closetag>
</closetag>
<tag>
  <tag>
    <tag_name>bar</tag_name>
    <attributes>
      <attribute>
        <attr_name>baz</attr_name>
        <value>bat</value>
      </attribute>
      <attribute>
        <attr_name>bat</attr_name>
        <value>baz</value>
      </attribute>
    </attributes>
  </tag>
</tag>
<tag>
  <tag>
    <tag_name>b</tag_name>
    <attributes>
    </attributes>
  </tag>
</tag>
<closetag>
  <closetag>
    <tag_name>b</tag_name>
  </closetag>
</closetag>
<closetag>
  <closetag>
    <tag_name>bar</tag_name>
  </closetag>
</closetag>

You should look at what scanString does to see how to use this in your own code. Getting the text between tags is left as an exercise for the reader. Hint: use a list as a stack for tags.
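
A rough sketch of the stack idea from the hint, kept independent of pyparsing so it runs on its own. The function name and the (is_close, name, start, end) event tuples are assumptions made for illustration; in real code they would be built from the tokens and start/end offsets that scanString yields.

```python
def text_between_tags(text, events):
    """Pair open and close tags with a stack; return (tag_name, inner_text) pairs.

    `events` is a hypothetical list of (is_close, name, start, end) tuples in
    document order, where start/end are the offsets of the tag itself.
    """
    stack = []    # open tags still waiting for a close: (name, end_of_open_tag)
    results = []
    for is_close, name, start, end in events:
        if not is_close:
            stack.append((name, end))
            continue
        if not stack or stack[-1][0] != name:
            raise ValueError('unexpected closing tag: ' + name)
        _, inner_start = stack.pop()
        # Inner text runs from the end of the open tag to the start of the close tag.
        results.append((name, text[inner_start:start]))
    if stack:
        raise ValueError('unclosed tag: ' + stack[-1][0])
    return results


text = '{bar}x{b}y{/b}z{/bar}'
events = [
    (False, 'bar', 0, 5),
    (False, 'b', 6, 9),
    (True, 'b', 10, 14),
    (True, 'bar', 15, 21),
]
pairs = text_between_tags(text, events)
print(pairs)
```

Inner tags are reported before the tags that enclose them, because a pair is only complete once its closing tag has been seen. The inner text of an outer tag still contains the raw markup of any nested tags.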

Aaron Gallagher