I have a regex to extract the text from an HTML font tag:

<FONT FACE=\"Excelsior LT Std Bold\"(.*)>(.*)</FONT>

That works fine until the string contains more than one font tag. Instead of matching

<FONT FACE="Excelsior LT Std Bold">Fett</FONT>

the result for string

<FONT FACE="Excelsior LT Std Bold">Fett</FONT> + <U>Unterstrichen</U> + <FONT FACE="Excelsior LT Std Italic">Kursiv</FONT> und Normal

is

<FONT FACE="Excelsior LT Std Bold">Fett</FONT> + <U>Unterstrichen</U> + <FONT FACE="Excelsior LT Std Italic"

How do I get only the first tag?

+3  A: 

You must use the non-greedy star:

<FONT FACE=\"Excelsior LT Std Bold\"[^>]*>(.*?)</FONT>
                                    ^^^^^  ^^^
                                      |     |
     match any character except ">" --+     +--------+
                                                     |
   match anything, but only up to the next </FONT> --+
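
To see the difference, here is a minimal sketch in Python (the question does not say which language or regex engine is in use, so Python's re module is an assumption, as are the variable names). It runs the original greedy pattern and the non-greedy one against the sample string from the question:

```python
import re

# Sample string from the question.
text = ('<FONT FACE="Excelsior LT Std Bold">Fett</FONT> + <U>Unterstrichen</U> + '
        '<FONT FACE="Excelsior LT Std Italic">Kursiv</FONT> und Normal')

# Original pattern: both groups are greedy, so the match runs to the LAST </FONT>.
greedy = re.compile(r'<FONT FACE="Excelsior LT Std Bold"(.*)>(.*)</FONT>')

# Fixed pattern: [^>]* cannot leave the opening tag, and .*? stops at the FIRST </FONT>.
lazy = re.compile(r'<FONT FACE="Excelsior LT Std Bold"[^>]*>(.*?)</FONT>')

greedy_match = greedy.search(text)
lazy_match = lazy.search(text)

print(greedy_match.group(0))  # spans both FONT elements
print(lazy_match.group(0))    # only the first FONT element
print(lazy_match.group(1))    # → Fett
```

With the greedy pattern, the first group swallows everything up to the ">" of the second opening tag, which is why the result in the question ends at "Excelsior LT Std Italic".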

The usual warnings about using regex to process HTML apply: You shouldn't.

Tomalak
+8  A: 

You need to disable greedy matching by using .*? instead of .*.

<FONT FACE=\"Excelsior LT Std Bold\"([^>]*)>(.*?)</FONT>

Note that this will fail if there is an attribute like BadAttribute="<FooBar>" somewhere after the FACE attribute of the <FONT> tag. The first group then stops at the ">" inside the attribute value, so the two matching groups get mixed up, and it could get completely messed up if an attribute contained </FONT>. There is no way around this because regular expressions cannot count matching tags or quotes. So I absolutely agree with Tomalak: try to avoid using regular expressions for processing XML, HTML, and other markup languages like these.
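
As a rough illustration of that failure, and of what a parser-based alternative might look like, here is a sketch in Python (an assumption, since the question names no language). It uses the standard library's html.parser; the class name FontTextExtractor and the test string are made up for this example:

```python
import re
from html.parser import HTMLParser

pattern = re.compile(r'<FONT FACE="Excelsior LT Std Bold"([^>]*)>(.*?)</FONT>')

# Hypothetical input: an attribute value that contains ">".
tricky = '<FONT FACE="Excelsior LT Std Bold" BadAttribute="<FooBar>">Fett</FONT>'

m = pattern.search(tricky)
print(repr(m.group(1)))  # → ' BadAttribute="<FooBar'
print(repr(m.group(2)))  # → '">Fett'


class FontTextExtractor(HTMLParser):
    """Collects the text inside <font> elements with a given face attribute."""

    def __init__(self, face):
        super().__init__()
        self.face = face
        self.depth = 0      # > 0 while inside a matching <font> element
        self.current = []   # text pieces of the element being read
        self.results = []   # one string per matching element

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names.
        if tag != "font":
            return
        if self.depth > 0:
            self.depth += 1  # nested <font> inside a matching one
        elif dict(attrs).get("face") == self.face:
            self.depth = 1
            self.current = []

    def handle_endtag(self, tag):
        if tag == "font" and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                self.results.append("".join(self.current))

    def handle_data(self, data):
        if self.depth > 0:
            self.current.append(data)


extractor = FontTextExtractor("Excelsior LT Std Bold")
extractor.feed(tricky)
extractor.close()
print(extractor.results)  # → ['Fett']
```

The parser tracks quoting and nesting itself, so the ">" inside the attribute value no longer splits the tag. Other languages have comparable HTML parsers; this is only one way to do it.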

Daniel Brückner
+2  A: 

You need to use a non-greedy capture, denoted by '?':

<FONT FACE=\"Excelsior LT Std Bold\"(.*?)>(.*?)</FONT>

Jeremy