Is it possible/practical to build a single regular expression that matches hierarchical data?

For example:

<h1>Action</h1>
  <h2>Title1</h2><div>data1</div>
  <h2>Title2</h2><div>data2</div>
<h1>Adventure</h1>
  <h2>Title3</h2><div>data3</div>

I would like to end up with these matches:

"Action", "Title1", "data1"
"Action", "Title2", "data2"
"Adventure", "Title3", "data3"

As I see it, this would require knowing that there is a hierarchical structure at play. If I code the pattern to capture the H1, it only matches the first entry of that hierarchy; if I don't code for H1, I can't capture it at all. Are there any special tricks I can employ to solve this?

This is a .NET project.

A: 

Regex is a poor fit for this type of data. Arbitrarily nested markup is not a regular language.

You should use an XML parser for this.
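
As a rough illustration of the parser approach, here is a minimal sketch in Python (the question targets .NET, so treat it as a language-neutral outline rather than a drop-in solution). It swaps in the standard-library `html.parser.HTMLParser`, an event-based HTML parser, instead of a strict XML parser, because the sample is a fragment with no single root element. The names `HeadingRows` and `extract_rows` are made up for this example, and the code assumes the flat h1/h2/div layout shown in the question.

```python
from html.parser import HTMLParser

SAMPLE = """<h1>Action</h1>
  <h2>Title1</h2><div>data1</div>
  <h2>Title2</h2><div>data2</div>
<h1>Adventure</h1>
  <h2>Title3</h2><div>data3</div>
"""


class HeadingRows(HTMLParser):
    """Collect (h1, h2, div) triples from a flat h1/h2/div sequence.

    Hypothetical helper: assumes each element holds plain text only.
    """

    def __init__(self):
        super().__init__()
        self.rows = []
        self._tag = None  # tag whose text we are currently inside
        self._h1 = None   # most recent <h1> text
        self._h2 = None   # most recent <h2> text

    def handle_starttag(self, tag, attrs):
        if tag in ("h1", "h2", "div"):
            self._tag = tag

    def handle_endtag(self, tag):
        self._tag = None

    def handle_data(self, data):
        text = data.strip()
        if not text or self._tag is None:
            return  # whitespace between elements, or text we do not track
        if self._tag == "h1":
            self._h1 = text
        elif self._tag == "h2":
            self._h2 = text
        else:  # a <div>: emit one row using the headings seen so far
            self.rows.append((self._h1, self._h2, text))


def extract_rows(html):
    parser = HeadingRows()
    parser.feed(html)
    parser.close()
    return parser.rows


rows = extract_rows(SAMPLE)
```

The point of the sketch is that the parser remembers the current H1 while it walks the document, which is exactly the state a single pattern match cannot carry from one match to the next.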

Jeff B
+2  A: 

It's generally considered bad practice to attempt to parse HTML/XML with RegEx, precisely because it's hierarchical. You COULD use a recursive function to do so, but a better solution in this case is to use a real XML parser. I couldn't give you better advice than that without knowing the platform you're using.

EDIT: Regex can also be slow on large documents, which is another reason it's a poor fit for processing HTML; however, I don't know that an XML/DOM processor would be any faster, and it's likely to use a lot more memory.

If you JUST want data from a simple document like the one you've demonstrated, and/or if you want to build a solution yourself, it's not that tough to do. Just build a simple, recursive, state-based stream processor that looks for tags and passes the contents to the next recursive level.

For example:

- In a recursive function, seek out a "<" character.
- Now find a ">" character.
- Preserve everything you find until the next "<" character.
- Find a ">" character.
- Pass whatever you found between those tags into the recursive function.

You'd have to work out error checking yourself, but the base case (when you return back up to the previous level) is just when there's nothing else to find.
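
One possible reading of the steps above, sketched in Python rather than .NET: each call finds one opening tag, keeps the text up to the next "<", skips the closing tag, and then recurses on the rest of the input. The function names `scan` and `to_rows` are invented for this example. It assumes tags carry no attributes and do not nest, and it does almost no error checking.

```python
SAMPLE = """<h1>Action</h1>
  <h2>Title1</h2><div>data1</div>
  <h2>Title2</h2><div>data2</div>
<h1>Adventure</h1>
  <h2>Title3</h2><div>data3</div>
"""


def scan(text, pos=0, found=None):
    """Recursively collect (tag, content) pairs from simple, flat markup."""
    if found is None:
        found = []
    open_start = text.find("<", pos)           # seek out a "<"
    if open_start == -1:
        return found                           # base case: nothing left to find
    open_end = text.find(">", open_start)      # now find a ">"
    if open_end == -1:
        return found                           # malformed tail: give up
    close_start = text.find("<", open_end + 1)  # keep text until the next "<"
    if close_start == -1:
        return found
    close_end = text.find(">", close_start)    # find the ">" of the closing tag
    if close_end == -1:
        return found
    tag = text[open_start + 1:open_end]
    content = text[open_end + 1:close_start]
    found.append((tag, content))
    return scan(text, close_end + 1, found)    # recurse on the remainder


def to_rows(pairs):
    """Attach each <div> to the most recent h1 and h2 seen before it."""
    rows, h1, h2 = [], None, None
    for tag, content in pairs:
        if tag == "h1":
            h1 = content
        elif tag == "h2":
            h2 = content
        elif tag == "div":
            rows.append((h1, h2, content))
    return rows


rows = to_rows(scan(SAMPLE))
```

Because the recursion goes one level deeper per element, a long document would hit Python's recursion limit; a plain loop over the same four `find` calls would avoid that.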

Maybe this helps, maybe not. Good luck to you.

Brian Lacy
+3  A: 

The solution is to not use regular expressions. They're not powerful enough for this sort of thing.

What you want is a parser - since it looks like you're trying to match HTML, there are plenty to choose from.

Anon.
@snives yes, depending on the language: ANTLR, lex/yacc, or Spirit would do the trick. Putting this comment here so you can google them.
Hassan Syed
Agreed, nice succinct answer.
Brian Lacy
Interesting, I'll check into these, thank you.
Snives