Hello,

I am writing an iPhone app that has to pull raw HTML off a website and grab the URL and the displayed text of each link.

For example, given a link like <a href="www.google.com">Click here to go to google</a>

it would grab url = www.google.com and text = Click here to go to google

I'm using the RegexKitLite library, but I'm in no way an expert on regular expressions. I have tried several things to get this working.

I want to use the following code:

NSString *searchString  = @"$10.23, $1024.42, $3099";
NSString *regexString   = @"\\$((\\d+)(?:\\.(\\d+)|\\.?))";
NSArray  *capturesArray = NULL;

capturesArray = [searchString arrayOfCaptureComponentsMatchedByRegex:regexString];

So my question is: can someone tell me what the regexString would need to be to parse HTML links, or point me to a clear tutorial on how RegexKitLite works? I have tried reading the documentation at http://regexkit.sourceforge.net/RegexKitLite/ and I don't understand it.

Thanks in advance,

Zen_silence

+4  A: 

In short, don't do that. Regular expressions are a horrible way to parse HTML. HTML documents are highly structured with a hierarchy of tags whose contents may span lines without said lines appearing in the rendered form.

Assuming well-structured HTML, you can use an XML parser.

In particular, the iPhone SDK offers NSXMLParser, along with some good usage examples.
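As a rough illustration of the XML-parser route, here is a minimal sketch of an NSXMLParser delegate that collects the href and text of each anchor. The class name LinkCollector and its properties are made up for this example, and the code assumes ARC. It only works when the input is well-formed XML (for instance clean XHTML); on malformed markup, -parse simply returns NO.

```objc
#import <Foundation/Foundation.h>

// Hypothetical delegate (not part of the SDK) that gathers @[href, text] pairs.
@interface LinkCollector : NSObject <NSXMLParserDelegate>
@property (nonatomic, strong) NSMutableArray *links;         // array of @[href, text]
@property (nonatomic, copy) NSString *currentHref;           // nil while outside an <a>
@property (nonatomic, strong) NSMutableString *currentText;  // text seen inside the current <a>
@end

@implementation LinkCollector

- (instancetype)init {
    if ((self = [super init])) {
        _links = [NSMutableArray array];
    }
    return self;
}

// Called for every opening tag; remember the href when an anchor starts.
- (void)parser:(NSXMLParser *)parser
didStartElement:(NSString *)elementName
  namespaceURI:(NSString *)namespaceURI
 qualifiedName:(NSString *)qName
    attributes:(NSDictionary *)attributeDict {
    if ([elementName isEqualToString:@"a"] && attributeDict[@"href"] != nil) {
        self.currentHref = attributeDict[@"href"];
        self.currentText = [NSMutableString string];
    }
}

// May be called several times per text node, so append rather than assign.
- (void)parser:(NSXMLParser *)parser foundCharacters:(NSString *)string {
    if (self.currentHref != nil) {
        [self.currentText appendString:string];
    }
}

// Called for every closing tag; store the pair when the anchor ends.
- (void)parser:(NSXMLParser *)parser
 didEndElement:(NSString *)elementName
  namespaceURI:(NSString *)namespaceURI
 qualifiedName:(NSString *)qName {
    if ([elementName isEqualToString:@"a"] && self.currentHref != nil) {
        [self.links addObject:@[self.currentHref, [self.currentText copy]]];
        self.currentHref = nil;
    }
}

@end
```

To use it, create an NSXMLParser with initWithData:, set a LinkCollector instance as its delegate, call -parse, and read the collector's links array if the call returned YES.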

bbum
That would be great if HTML were actually highly structured, but you said it yourself: "Assuming well-structured HTML". In the general case you cannot actually assume that, and it's nuts to try to parse a whole HTML document as a DOM when you just want to get a link out.
Kendall Helmstetter Gelner
Yah -- but, in the case of badly structured HTML, you'll often run into anchors that are across multiple lines and otherwise goofy to the point of confusing a regex. Best to use an HTML parser of some kind and deal with whatever DOM it spews, if you need to deal with broken input.
bbum
I'm working with very bad HTML, and I have no control over its structure, so it's easier if I just grab everything that is a link. I have already cut my HTML down to a manageable size using substring searches, and I would rather use regular expressions to grab the couple of things I need, unless you can point me to a good HTML parsing wrapper; the one provided in the SDK is not great, and I have used it before. I have tried the hpple library, but I couldn't figure out how to get it running.
Zen_silence
bbum: Even if you have anchors across multiple lines, you can tell the regex to ignore linefeeds (which you would for a whole-text regex anyway). If you want to parse most of a text, walking the DOM makes more sense, I agree. But if you just need a little info embedded in the middle of a lot of real-world HTML, regex is way more flexible and less brittle, and probably even better performance-wise, since you don't spend all your time constructing nodes.
Kendall Helmstetter Gelner
A: 

searchString would be the whole raw HTML text, and regexString should be more like:

NSString *regexString = @"href=\"(.*?)\">(.*?)<";

Then you would use capturing matches to pull out capture groups 1 and 2 (the URL and the link text), repeating the match through the HTML text and using the range option so that each search skips past what you have already searched.

I don't know what you are trying to do with searchString and the numbers though.
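To make the "repeat the match over a moving range" idea concrete, here is a minimal sketch. It swaps in Foundation's NSRegularExpression (iOS 4 and later) so that it is self-contained; with RegexKitLite the same pattern could presumably be handed to arrayOfCaptureComponentsMatchedByRegex: as in the question. The function name ExtractLinks and the pattern are assumptions made for this example, not a general HTML parser: the pattern expects double-quoted href values and no nested tags in the link text.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: returns an array of @[url, text] pairs, one per anchor found.
static NSArray *ExtractLinks(NSString *html) {
    // Lazy quantifiers (.*?) stop at the first closing quote and the first </a>.
    NSString *pattern = @"<a\\s+[^>]*?href=\"(.*?)\"[^>]*>(.*?)</a>";
    // Ignore case, and let '.' match linefeeds so anchors may span lines.
    NSRegularExpressionOptions options =
        NSRegularExpressionCaseInsensitive | NSRegularExpressionDotMatchesLineSeparators;

    NSError *error = nil;
    NSRegularExpression *regex =
        [NSRegularExpression regularExpressionWithPattern:pattern
                                                  options:options
                                                    error:&error];
    if (regex == nil) {
        NSLog(@"Bad pattern: %@", error);
        return @[];
    }

    NSMutableArray *links = [NSMutableArray array];
    NSRange searchRange = NSMakeRange(0, html.length);

    while (searchRange.length > 0) {
        NSTextCheckingResult *match = [regex firstMatchInString:html
                                                        options:0
                                                          range:searchRange];
        if (match == nil) {
            break;  // no more anchors in the remaining text
        }
        // Capture group 1 is the URL, group 2 is the displayed text.
        NSString *url  = [html substringWithRange:[match rangeAtIndex:1]];
        NSString *text = [html substringWithRange:[match rangeAtIndex:2]];
        [links addObject:@[url, text]];

        // Skip past what has already been searched.
        NSUInteger next = NSMaxRange(match.range);
        searchRange = NSMakeRange(next, html.length - next);
    }
    return links;
}
```

Calling ExtractLinks with the whole raw HTML string gives back one [url, text] pair per anchor, which is close to the array of captures the question was after.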

Kendall Helmstetter Gelner
searchString is just an example string that I was playing with to try to learn regular expressions. I thought that with regular expressions I could make an array of the two matches.
Zen_silence
You kind of can: if you have fixed text you are looking for, you can do multiple matches in one regex. For something like this, though, it would probably be better to do multiple matches, limiting the range of text you are searching if possible. I highly recommend a good book on regex, like the O'Reilly book "Mastering Regular Expressions". It's really like a whole other programming language, and very powerful.
Kendall Helmstetter Gelner
Excellent book, Kendall. I have a friend who happened to have a copy, and he let me borrow it. Just from quickly flipping through it I was able to form this line: NSString *regexString = @"<a href=\"([a-zA-Z0-9\\.\\?\\]*[>]?)"; That string gets the URL, but I have no idea how to pull out the text in between the <a href> </a> tags. I also have my code dumping each match group into its own place in a multidimensional array.
Zen_silence
A: 

In case anyone else has this same question, the regex string to match an HTML link is:

NSString *regexString = @"<a href=([^>]*)>([^>]*) - ";

The O'Reilly book "Mastering Regular Expressions" helped me figure this out really quickly. I highly recommend reading it if you are trying to use regular expressions.

Zen_silence