Hi All,

I've got a string with very unclean HTML. Before I parse it, I want to convert this:

<TABLE><TR><TD width="33%" nowrap=1><font size="1" face="Arial">
NE
</font> </TD>
<TD width="33%" nowrap=1><font size="1" face="Arial">
DEK
</font> </TD>
<TD width="33%" nowrap=1><font size="1" face="Arial">
143
</font> </TD>
</TR></TABLE>

into NE DEK 143 so that it is a bit easier to parse. I've got this regular expression (RegexKitLite):

NSString *str = [dataString stringByReplacingOccurrencesOfRegex:@"<TABLE><TR><TD width=\"33%\" nowrap=1><font size=\"1\" face=\"Arial\">(.+?)<\\/font> <\\/TD>(.+?)<TD width=\"33%\" nowrap=1><font size=\"1\" face=\"Arial\">(.+?)<\\/font> <\\/TD>(.+?)<TD width=\"33%\" nowrap=1><font size=\"1\" face=\"Arial\">(.+?)<\\/font> <\\/TD>(.+?)<\\/TR><\\/TABLE>" 
                                                     withString:@"$1 $3 $5"];

I'm not an expert in regex. Can someone help me out here?

Regards, dodo

+1  A: 

Amarghosh and bobince (the winning answerer of the linked question) are generally right about this. However, since you are just sanitising a small, known snippet rather than parsing arbitrary HTML, regexps are actually fine here.

First, strip the tags (globally, so that every tag is removed and not just the first one):

s/<.*?>//g

Then collapse each run of whitespace into a single space:

s/\s+/ /g

Then remove leading and trailing whitespace:

s/^\s+|\s+$//g

Then get the values:

^([^ ]+) ([^ ]+) ([^ ]+)$
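
The four steps above can be sketched in Python as a quick check against the sample from the question. This is only an illustration, not RegexKitLite code: the function name extract_values is made up for the example, and re.sub replaces every match, which corresponds to a global (g) substitution.

```python
import re

# Sample markup copied from the question.
html = """<TABLE><TR><TD width="33%" nowrap=1><font size="1" face="Arial">
NE
</font> </TD>
<TD width="33%" nowrap=1><font size="1" face="Arial">
DEK
</font> </TD>
<TD width="33%" nowrap=1><font size="1" face="Arial">
143
</font> </TD>
</TR></TABLE>"""


def extract_values(markup):
    """Hypothetical helper: apply the four steps and return the three values."""
    text = re.sub(r"<.*?>", "", markup)       # 1. strip tags (lazy match, all occurrences)
    text = re.sub(r"\s+", " ", text)          # 2. collapse whitespace runs into one space
    text = re.sub(r"^\s+|\s+$", "", text)     # 3. trim leading/trailing whitespace
    match = re.match(r"^([^ ]+) ([^ ]+) ([^ ]+)$", text)  # 4. capture the values
    return match.groups() if match else None


print(extract_values(html))  # ('NE', 'DEK', '143')
```

The final pattern only matches when exactly three space-separated values remain, so the helper returns None for any other shape of input.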
Delan Azabani
-1: `s/<.*>//` will erase all of `<TD>NE</TD>`
Dan
No it won't; .* is not greedy.
Delan Azabani
Actually, .* is, but .*? isn't. I'll update the answer.
Delan Azabani
Now it works; thanks for picking that up.
Delan Azabani
What about a tag with an embedded newline? Or something like this: `<!-- a > b || c < d -->`
Dan
As most of you know, regexps are /bad/ for parsing markup. My answer was just there because OP asked for a regexp method to extract a few data pieces. Again, regexps are /bad/ because they don't catch edge cases, like you pointed out.
Delan Azabani
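
The points raised in the comments above (greedy versus lazy matching, a tag containing a newline, and a comment containing angle brackets) can be checked with a short Python sketch. The test strings are invented for illustration, and Python's re module is used here as a stand-in for the engine behind RegexKitLite, so flag names may differ there.

```python
import re

row = "<TD>NE</TD>"

# Greedy: .* runs to the LAST '>' on the line, so the whole row is one match.
greedy = re.sub(r"<.*>", "", row)
# Lazy: .*? stops at the FIRST '>', so each tag is matched on its own.
lazy = re.sub(r"<.*?>", "", row)

# A tag split across two lines: '.' does not match a newline by default,
# so the opening tag is missed and only </TD> is removed.
multiline_tag = '<TD\nwidth="33%">NE</TD>'
missed = re.sub(r"<.*?>", "", multiline_tag)
# With re.DOTALL, '.' also matches newlines and both tags are removed.
caught = re.sub(r"<.*?>", "", multiline_tag, flags=re.DOTALL)

# A comment containing '>' and '<' is cut at the wrong places,
# leaving part of the comment body behind.
comment = "<!-- a > b || c < d -->"
leftover = re.sub(r"<.*?>", "", comment, flags=re.DOTALL)

print(repr(greedy))    # ''
print(repr(lazy))      # 'NE'
print(repr(missed))    # '<TD\nwidth="33%">NE'
print(repr(caught))    # 'NE'
print(repr(leftover))  # ' b || c '
```

The last case is the kind of edge case that a real HTML parser handles and a tag-stripping regex does not.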
A: 

I have a few suspicions about why your regex might fail (without knowing the rules for string escaping in the iPhone SDK): the dot . is used in places where it would have to match newlines, which it doesn't do by default; the slash looks like it's escaped unnecessarily; and so on.

But in your example, the text you're trying to extract is characterized by not being surrounded by tags: each value sits on a line of its own that contains no angle brackets.

So a search for all occurrences of (?m)^[^<>\r\n]+$ should find all matches.
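
As a rough check of this idea, here is a Python sketch that runs the line-based pattern over the sample from the question. It uses the pattern with a + quantifier after the character class, so that a line of more than one character can match; Python's re module is an assumption standing in for RegexKitLite.

```python
import re

# Sample markup copied from the question.
html = """<TABLE><TR><TD width="33%" nowrap=1><font size="1" face="Arial">
NE
</font> </TD>
<TD width="33%" nowrap=1><font size="1" face="Arial">
DEK
</font> </TD>
<TD width="33%" nowrap=1><font size="1" face="Arial">
143
</font> </TD>
</TR></TABLE>"""

# (?m) makes ^ and $ match at every line boundary; the character class
# accepts only lines that contain no angle brackets at all.
values = re.findall(r"(?m)^[^<>\r\n]+$", html)

print(values)            # ['NE', 'DEK', '143']
print(" ".join(values))  # NE DEK 143
```

Note that a line consisting only of spaces would also match this pattern, so real input may need an extra filter or a strip step.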

Tim Pietzcker
A: 

If you are sure of the structure of your HTML, then you can simply extract the text enclosed by the font tags:

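// C# example (requires: using System.Text.RegularExpressions;)
// The named group desiredText captures everything between <font ...> and </font>.
Regex r = new Regex(@"<\s*font((\s+[^<>]*)|(\s*))>(?<desiredText>[^<>]*)<\s*/\s*font\s*>");
foreach (Match m in r.Matches(txt))
    result += m.Groups["desiredText"].Value.Trim();

The result is the text enclosed by the font tags, with leading and trailing whitespace trimmed from each piece.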
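
For comparison, the same font-tag approach can be sketched in Python. This is an illustrative translation, not part of the original answer: Python spells the named group (?P<desiredText>...), and the values are collected in a list and joined with spaces as an assumption about the desired output format.

```python
import re

# Sample markup copied from the question.
html = """<TABLE><TR><TD width="33%" nowrap=1><font size="1" face="Arial">
NE
</font> </TD>
<TD width="33%" nowrap=1><font size="1" face="Arial">
DEK
</font> </TD>
<TD width="33%" nowrap=1><font size="1" face="Arial">
143
</font> </TD>
</TR></TABLE>"""

# Opening font tag (with or without attributes), tag-free text, closing font tag.
font_text = re.compile(
    r"<\s*font((\s+[^<>]*)|(\s*))>(?P<desiredText>[^<>]*)<\s*/\s*font\s*>"
)

# [^<>]* also matches newlines, so the text may span several lines;
# strip() removes the surrounding line breaks from each captured piece.
values = [m.group("desiredText").strip() for m in font_text.finditer(html)]

print(values)            # ['NE', 'DEK', '143']
print(" ".join(values))  # NE DEK 143
```

The pattern is case-sensitive as written, so markup using <FONT> would need re.IGNORECASE.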

chapluck