Parsing source of a webpage with Objective-C

views: 685

+2 Q:

Is there a way to parse a website's source on the iPhone to get the URLs of the photos on that page? If so, how would you do that?

Thanks

+1 A:

You could try it using regular expressions, but I wouldn't recommend that. You should have a look at NSXMLParser, assuming the webpage is coded to be XHTML compliant. TouchXML is another good library.

Dan Lorenc 2009-07-07 20:44:33

+2 A:

There is no super easy way. When I had to do it, I wrote a libxml2 SAX parser. libxml2 has an HTML reader that works fairly well with malformed HTML, and libxml2 is included with the base system.

Louis Gerbarg 2009-07-07 20:45:09

Take a look at Event Driven XML Parsing in the iPhone reference library.

ctshryock 2009-07-07 20:50:52

Are you OK with whatever approach you use not picking up images that are loaded dynamically via JavaScript?

The closest thing I could see working is to parse out any JavaScript imports, load those up too, and then run a regular expression across the whole file looking for anything that ends in ".jpg", ".gif" or ".png" and grab the full URL from that. The libxml approach would miss references to images that are not in img tags... But it might well be good enough.
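
To make the scanning step concrete, here is a rough sketch in plain C using POSIX regex.h, which compiles as-is inside an Objective-C project. It assumes the page (and any script files) have already been downloaded into a NUL-terminated buffer. The function name scan_image_refs and the set of delimiter characters are made up for illustration, and the pattern will truncate URLs that carry a query string after the extension.

```c
#include <assert.h>
#include <regex.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: scans text for anything that ends in .jpg, .gif or
 * .png and stores up to max_refs malloc'd copies in refs[].
 * Returns the number of references found, or -1 if the pattern fails to
 * compile. The caller frees each refs[i]. */
int scan_image_refs(const char *text, char **refs, int max_refs)
{
    regex_t re;
    regmatch_t m[1];
    int count = 0;

    assert(text != NULL && refs != NULL);

    /* A run of non-delimiter characters followed by an image extension.
     * The delimiter set is an assumption; tune it for the pages at hand. */
    if (regcomp(&re, "[^[:space:]\"'<>()=,;]+\\.(jpg|gif|png)",
                REG_EXTENDED | REG_ICASE) != 0)
        return -1;

    const char *cursor = text;
    while (count < max_refs && regexec(&re, cursor, 1, m, 0) == 0) {
        size_t len = (size_t)(m[0].rm_eo - m[0].rm_so);
        char *ref = malloc(len + 1);
        if (ref == NULL)
            break;
        memcpy(ref, cursor + m[0].rm_so, len);
        ref[len] = '\0';
        refs[count++] = ref;
        cursor += m[0].rm_eo;   /* continue after this match */
    }

    regfree(&re);
    return count;
}
```

Relative references such as /img/b.gif come back as written, so they still have to be resolved against the page's base URL before they can be fetched.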

Kendall Helmstetter Gelner 2009-07-07 21:08:05

+3 A:

I'd say go for regular expressions - there is a one-page library that wraps C regexes that you can drop into your project.

2009-07-07 21:22:23

Agree, you don't need to parse the entire doc just to get the <img> tags.

Marco Mustapic 2009-07-07 21:43:53

+2 A:

I recommend regular expressions. There's a great open source Regex library for Cocoa called RegexKit. For the most part, you can just drop it in your code and it'll "just work".

Getting all the URLs of images wouldn't be too difficult (less than 20 lines of code) if you assume that all images are going to be in <img> tags. You'd just grab all the image tags (something like: <img\s+[^>]+>), then iterate through those matches. For each match, you'd pull out whatever's in the src attribute: src\s*=\s*("|')?\s*([^\s"']+)(\s|"|')

You might need to tweak that a bit, but it shouldn't be too bad.
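
For anyone who wants to try the two-step idea without pulling in a library, here is a rough sketch in plain C with POSIX regex.h rather than RegexKit. POSIX extended syntax has no \s, so the patterns above are translated to use [[:space:]]; the function name extract_img_srcs is made up for illustration, and the HTML is assumed to already be in a NUL-terminated buffer.

```c
#include <assert.h>
#include <regex.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: finds <img> tags in html and stores up to max_urls
 * malloc'd copies of their src values in urls[].
 * Returns the number of values stored, or -1 if a pattern fails to compile.
 * The caller frees each urls[i]. */
int extract_img_srcs(const char *html, char **urls, int max_urls)
{
    regex_t tag_re, src_re;
    regmatch_t tag_m[1], src_m[3];
    int count = 0;

    assert(html != NULL && urls != NULL);

    /* Step 1 pattern: a whole <img ...> tag. */
    if (regcomp(&tag_re, "<img[[:space:]]+[^>]+>",
                REG_EXTENDED | REG_ICASE) != 0)
        return -1;
    /* Step 2 pattern: the src attribute; group 2 holds the value. */
    if (regcomp(&src_re,
                "src[[:space:]]*=[[:space:]]*(\"|')?[[:space:]]*"
                "([^[:space:]\"']+)([[:space:]]|\"|')",
                REG_EXTENDED | REG_ICASE) != 0) {
        regfree(&tag_re);
        return -1;
    }

    const char *cursor = html;
    while (count < max_urls && regexec(&tag_re, cursor, 1, tag_m, 0) == 0) {
        /* Copy the tag out so the src search cannot run past its end. */
        size_t tag_len = (size_t)(tag_m[0].rm_eo - tag_m[0].rm_so);
        char *tag = malloc(tag_len + 1);
        if (tag == NULL)
            break;
        memcpy(tag, cursor + tag_m[0].rm_so, tag_len);
        tag[tag_len] = '\0';

        if (regexec(&src_re, tag, 3, src_m, 0) == 0) {
            size_t url_len = (size_t)(src_m[2].rm_eo - src_m[2].rm_so);
            char *url = malloc(url_len + 1);
            if (url != NULL) {
                memcpy(url, tag + src_m[2].rm_so, url_len);
                url[url_len] = '\0';
                urls[count++] = url;
            }
        }
        free(tag);
        cursor += tag_m[0].rm_eo;   /* continue after this tag */
    }

    regfree(&tag_re);
    regfree(&src_re);
    return count;
}
```

One tweak that is likely needed: because the src pattern requires whitespace or a quote after the value, an unquoted value that runs straight into the closing bracket (as in <img src=foo.png>) is skipped.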

Dave DeLong 2009-07-07 21:45:10
