views:

613

answers:

4

I need to match a string holding HTML using a regex to pull out all the nested spans. I assume there is a way to do this with a regex, but I have had no success all morning.

So for a sample input string of

<DIV id=c445c9c2-a02e-4cec-b254-c134adfa4192 style="BORDER-RIGHT: #000000 1px solid; BORDER-TOP: #000000 1px solid; BORDER-LEFT: #000000 1px solid; BORDER-BOTTOM: #000000 1px solid; BACKGROUND-COLOR: #eeeeee">
<SPAN id=b8db8cd1-f600-448f-be26-2aa56ea09a9c>
<SPAN id=304ccd38-8161-4def-a557-1a048c963df4>
<IMG src="http://avis.co.uk/Assets/build/menu.gif">
</SPAN>
</SPAN>
<SPAN id=bc88c866-5370-4c72-990b-06fbe22038d5>
<SPAN id=55b88bbe-15ca-49c9-ad96-cecc6ca7004e>UK<BR></SPAN>
</SPAN>
<SPAN id=52bb62ca-8f0a-42f1-a13b-9b263225ff1d>
<SPAN id=0e1c3eb6-046d-4f07-96c1-d1ac099d5f1c>
<IMG src="http://avis.co.uk/Assets/build/menu.gif">
</SPAN>
</SPAN>
<SPAN id=4c29eef2-cd77-4d33-9828-e442685a25cb>
<SPAN id=0d5a266a-14ae-4a89-9263-9e0ab57f7ad2>Italy</SPAN>
</SPAN>
<SPAN id=f0a72eea-fddd-471e-89e6-56e9b9efbece>
<SPAN id=b7d9ada7-ade0-49fe-aa5f-270237e87c2b>
<IMG src="http://avis.co.uk/Assets/build/menu.gif">
</SPAN>
</SPAN>
<SPAN id=7604df94-34ba-4c89-bf11-125df01731ff>
<SPAN id=330d6429-4f1b-46a2-a485-9001e2c6b8c1>Netherlands</SPAN>
</SPAN>
<SPAN id=a18fb516-451e-4c32-ab31-3e3be29235f6>
<SPAN id=6c70238d-78f9-468f-bb8d-370fff13c909>
<IMG src="http://avis.co.uk/Assets/build/menu.gif">
</SPAN>
</SPAN>
<SPAN id=5a2465eb-b337-4f94-a4f8-6f5001dfbd75>
<SPAN id=47877a9e-a7d5-4f13-a41e-6948f899e385>Malta &amp; Gozo

I want to get each outer span together with the span it contains, so for the text above there should be eight results.

Any help gladly accepted.

+5  A: 

Once again: use an HTML parser to walk the DOM. Regexes will never be robust enough to do this.

annakata
regexHtmlParserQuestions++ ;-)
Tomalak
I think this is a good name for a tag ;-)
bruno conde
you forgot the "++" :)
annakata
Obviously it has to be: ++regexHtmlParserQuestions++;
Tomalak
+4  A: 

It's actually impossible to solve this using standard regular expressions, since they basically implement type 3 grammars in the Chomsky hierarchy (finite state automata), whereas you need at least a type 2 grammar (some sort of stack or recursion) to correctly recognize arbitrarily nested structures.
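To make the "stack or recursion" point concrete, here is a minimal sketch that uses a regex only to find individual SPAN tags and keeps the nesting depth in an ordinary counter. The class name SpanScanner and the method topLevelSpans are made up for this illustration, and the tag pattern assumes that attribute values never contain a ">" character. It is not a substitute for a real HTML parser.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SpanScanner {
    // Matches one opening <SPAN ...> tag or one closing </SPAN> tag.
    // Assumption: no attribute value contains '>'.
    private static final Pattern SPAN_TAG =
        Pattern.compile("<(/?)SPAN\\b[^>]*>", Pattern.CASE_INSENSITIVE);

    /** Returns each outermost SPAN element, including everything nested in it. */
    public static List<String> topLevelSpans(String html) {
        List<String> result = new ArrayList<>();
        Matcher m = SPAN_TAG.matcher(html);
        int depth = 0;   // number of SPAN elements currently open
        int start = -1;  // offset where the current outermost SPAN began
        while (m.find()) {
            boolean closing = !m.group(1).isEmpty();
            if (!closing) {
                if (depth == 0) {
                    start = m.start();
                }
                depth++;
            } else if (depth > 0) {
                depth--;
                if (depth == 0) {
                    result.add(html.substring(start, m.end()));
                }
            }
            // A stray </SPAN> at depth 0 is silently ignored.
        }
        return result;
    }

    public static void main(String[] args) {
        String html = "<DIV><SPAN id=a>\n<SPAN id=b>UK<BR></SPAN>\n</SPAN>\n"
                    + "<SPAN id=c><SPAN id=d>Italy</SPAN></SPAN></DIV>";
        for (String span : topLevelSpans(html)) {
            System.out.println(span);
            System.out.println("----");
        }
    }
}
```

The counter is the part a plain regular expression cannot express. A SPAN that is never closed (as at the end of a truncated input) simply produces no result.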

However, if you restrict the maximum nesting depth, then it's probably possible, though I still doubt that regexps are the best solution.
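As a rough illustration of the bounded-depth idea, the sketch below builds a regex for "a SPAN nested at most N levels deep" by wrapping the pattern for N-1 levels inside another SPAN. The names BoundedNesting, spanRegex and findAll are invented for this example. The pattern grows with every level, and in Java an alternation inside a repetition can overflow the stack on long inputs, so treat it as a demonstration and not as production code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BoundedNesting {
    /** Builds regex source for a SPAN element nested at most maxDepth levels deep. */
    static String spanRegex(int maxDepth) {
        // Any single character that does not begin a SPAN open or close tag.
        String filler = "(?!</?SPAN\\b).";
        // Level 1: a SPAN that contains no further SPAN tags.
        String regex = "<SPAN\\b[^>]*>(?:" + filler + ")*</SPAN>";
        for (int level = 2; level <= maxDepth; level++) {
            // Each extra level may contain filler or a complete SPAN of the level below.
            regex = "<SPAN\\b[^>]*>(?:" + filler + "|" + regex + ")*</SPAN>";
        }
        return regex;
    }

    /** Returns every match of the depth-limited pattern in html. */
    static List<String> findAll(String html, int maxDepth) {
        Pattern p = Pattern.compile(spanRegex(maxDepth),
                Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
        List<String> result = new ArrayList<>();
        Matcher m = p.matcher(html);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        String html = "<SPAN id=a>\n<SPAN id=b>UK<BR></SPAN>\n</SPAN>\n"
                    + "<SPAN id=c><SPAN id=d>Italy</SPAN></SPAN>";
        System.out.println(findAll(html, 2).size()); // 2 (the outer spans)
        System.out.println(findAll(html, 1).size()); // 2 (only the inner spans fit)
    }
}
```

If the input is nested more deeply than maxDepth, the outermost element is not matched as a whole; the search falls through to inner elements that do fit the limit.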

Michael Borgwardt
.NET's implementation supports counting (balancing groups), so it is possible to match begin/end tags etc., but for anything beyond simple matching a real parser would be the way to go.
Brian Rasmussen
A: 

Basically, I agree with the advice above: using regexes to parse HTML is asking for code that breaks some day on strange but legal HTML constructs (not to mention malformed HTML that browsers accept...). Finding and using a good HTML parser can be rewarding in many ways...

Now, I am pragmatic (and I can't resist a little regex challenge...) and sometimes I use REs against machine-generated HTML (often an export feature), because I know the structure I see is unlikely to change, unlike hand-written pages where the author can make typos... It is mostly for quick hacks that I can adapt if the output ever changes.

In your case, the HTML is quite regular, linear and predictable, so the RE is quite simple. I give Java code because I don't know C#, but adapting it should be trivial.

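import java.util.regex.Matcher;
import java.util.regex.Pattern;

// html holds the input as a single line: '.' does not match line breaks here,
// and the two closing tags must be directly adjacent.
Pattern p = Pattern.compile("(<SPAN id.*?<SPAN id.*?</SPAN></SPAN>)");
Matcher m = p.matcher(html);
while (m.find())
{
  System.out.println(m.group(1)); // one outer span together with its inner span
}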

HTH.

PhiLho
A: 

Try this:

@"(?is)<SPAN\b[^>]*>\s*(<SPAN\b[^>]*>.*?</SPAN>)\s*</SPAN>"

This is basically the same as PhiLho's regex, except that it permits whitespace between the tags at either end. I also had to add the SingleLine/DOTALL modifier to accommodate line separators within the matched text. I don't know if either of those changes was really necessary; the sample data the OP posted was all on one line, but PhiLho broke it up (thereby breaking his own regex).
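For anyone who wants to see the difference, here is a small Java sketch that runs both patterns against a one-line and a multi-line version of a shortened sample. The class name SpanPairs and the two-country test strings are invented for the comparison; the C# verbatim string above becomes a Java string with doubled backslashes, and (?is) works as an inline flag in Java too.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SpanPairs {
    // PhiLho's pattern as posted: no flags, closing tags must be adjacent.
    static final Pattern STRICT =
        Pattern.compile("(<SPAN id.*?<SPAN id.*?</SPAN></SPAN>)");

    // The pattern from this answer: (?is) = case-insensitive + DOTALL,
    // with optional whitespace around the inner span.
    static final Pattern LENIENT = Pattern.compile(
        "(?is)<SPAN\\b[^>]*>\\s*(<SPAN\\b[^>]*>.*?</SPAN>)\\s*</SPAN>");

    /** Counts how many times p matches in html. */
    static int count(Pattern p, String html) {
        Matcher m = p.matcher(html);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        String oneLine = "<SPAN id=a><SPAN id=b>UK<BR></SPAN></SPAN>"
                       + "<SPAN id=c><SPAN id=d>Italy</SPAN></SPAN>";
        String multiLine = "<SPAN id=a>\n<SPAN id=b>UK<BR></SPAN>\n</SPAN>\n"
                         + "<SPAN id=c>\n<SPAN id=d>Italy</SPAN>\n</SPAN>";

        System.out.println(count(STRICT, oneLine));    // 2
        System.out.println(count(STRICT, multiLine));  // 0
        System.out.println(count(LENIENT, oneLine));   // 2
        System.out.println(count(LENIENT, multiLine)); // 2
    }
}
```

Both patterns still assume exactly two levels of SPAN; a third level would end the match at the wrong closing tag.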

Alan Moore