ansaurus

Question

Answer 1

A:

That's most definitely not the way to do it, in any language.

If any site online exposes its data in a nice way, it'll be Wikipedia.

Look into getting an article as XML, as RDF, or maybe even as JSON.

Oren Mazor 2009-10-27 20:09:30

That's what I'm asking, if there is any way to retrieve the data in a nice format, but it doesn't seem like it, from what I've seen.

Jorge Israel Peña 2009-10-27 20:23:52

Answer 2

+1 A:

I'm going to go with suggesting regex for targeted data extraction in a mixed HTML data stream.

There are already regex libraries on the phone, though they are somewhat hidden. You can expose them with a few simple calls using RegexKitLite (make sure to scroll down and get the lite version). It ends up being a class with a few extensions on NSString that let you run regexes. You would then define a regex with two captured matches, one for the number and one for the content, along with a number of non-captured matches for the enclosing and intermediate tags. Even though it's a "lite" version of a standard regex library, it still supports just about any ability you would need.
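As a rough illustration of the pattern described above (two captured groups plus non-captured groups for the surrounding tags), here is a minimal sketch in Python rather than RegexKitLite's Objective-C API. The HTML fragment, the `EVENT_RE` pattern, and the `extract_events` helper are all hypothetical; real Wikipedia HTML is messier and the pattern would need adjusting.

```python
import re

# Hypothetical HTML fragment with placeholder text; real markup will differ.
SAMPLE_HTML = """
<ul>
<li><a href="/wiki/1275">1275</a> &#8211; First <a href="/wiki/Sample">sample</a> event.</li>
<li><a href="/wiki/1904">1904</a> &#8211; Second sample event.</li>
</ul>
"""

EVENT_RE = re.compile(
    r"<li>\s*"
    r"(?:<a[^>]*>)?"                # non-captured: optional opening link tag
    r"(\d+)"                        # captured group 1: the number
    r"(?:</a>)?"                    # non-captured: optional closing link tag
    r"\s*(?:&#8211;|&ndash;|–)\s*"  # non-captured: the dash separator
    r"(.*?)"                        # captured group 2: the content
    r"\s*</li>",
    re.DOTALL,
)

# Used to drop any remaining tags (mostly links) from the captured content.
TAG_RE = re.compile(r"<[^>]+>")


def extract_events(html):
    """Return a list of (number, plain-text content) tuples."""
    events = []
    for number, content in EVENT_RE.findall(html):
        events.append((int(number), TAG_RE.sub("", content).strip()))
    return events


if __name__ == "__main__":
    for number, content in extract_events(SAMPLE_HTML):
        print(number, content)
```

The same two-group pattern should carry over to any regex engine that supports non-capturing groups, but it is only as robust as the assumption that every entry sits in its own `<li>`.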

The API approach is promising, but once you get the raw markup you're probably going to have to take a similar regex approach to parse data out of it. It still might make sense if it reduces regex complexity and data transfer time, though; there's no reason you can't combine both approaches.

Kendall Helmstetter Gelner 2009-10-27 20:45:33

Thanks for that, I appreciate it. I think the way I'm gonna go (The only way I can see of doing this) is getting the bit of raw data and then somehow parsing it. I've included an example of the data above, though I will most likely create a new question for that.

Jorge Israel Peña 2009-10-27 21:49:17

That new data is much easier to parse. I'd handle it by looking for the string range that starts after Events, then doing a match against bracketed pure numbers, along with anything after the ndash up to the end of the line. Then you'd just need to strip out all "[" and "]" characters and you'd be all set. That's easier to process than the HTML, which is super link-heavy.
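A minimal sketch of those steps in Python, assuming the markup looks like the hypothetical sample below (a heading containing "Events", then one `[[number]] &ndash; text` entry per line). Stopping at the next `==` heading is an extra assumption that the description above does not mention, and `parse_events` is an illustrative name.

```python
import re

# Hypothetical wiki markup with placeholder text.
SAMPLE_MARKUP = """\
==Events==
* [[1275]] &ndash; First [[sample]] event.
* [[1904]] &ndash; Second sample event in [[Some City]].
==Births==
* [[1811]] &ndash; A [[sample]] person.
"""

# A bracketed pure number, the dash, then everything up to the end of the line.
LINE_RE = re.compile(r"\[\[(\d+)\]\]\s*(?:&ndash;|–)\s*(.*)$", re.MULTILINE)


def parse_events(markup):
    """Return (number, text) tuples for entries listed after 'Events'."""
    start = markup.find("Events")
    if start == -1:
        return []
    # Skip the rest of the heading line so the section starts on the first entry.
    newline = markup.find("\n", start)
    if newline == -1:
        return []
    section = markup[newline + 1:]
    # Assumption: the section ends at the next '==' heading, if there is one.
    next_heading = re.search(r"^==", section, re.MULTILINE)
    if next_heading:
        section = section[:next_heading.start()]
    events = []
    for match in LINE_RE.finditer(section):
        text = match.group(2).replace("[", "").replace("]", "").strip()
        events.append((int(match.group(1)), text))
    return events


if __name__ == "__main__":
    for number, text in parse_events(SAMPLE_MARKUP):
        print(number, text)
```

Note that blindly stripping brackets would leave the pipe in place for piped links such as `[[Target|label]]`, so those would need separate handling.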

Kendall Helmstetter Gelner 2009-10-28 01:27:38

Thanks, would you mind replying to my subsequent question regarding the parsing? http://stackoverflow.com/questions/1634012/how-to-parse-some-wiki-markup Thanks!

Jorge Israel Peña 2009-10-29 15:47:17

I'll add a comment later today, the regex for that should not be too hard...

Kendall Helmstetter Gelner 2009-10-29 18:09:10

Answer 3

+1 A:

Given that pages on Wikipedia are stored as plaintext, and input by users as plaintext, you're not going to get a structured data set from it.

kprevas 2009-10-27 20:45:53

Answer 4

+2 A:

Add a &format=fmt to the end of your query, as described at API:Data_formats. Your query becomes: JSON query, for example. You can specify XML, JSON, or many other formats.

You can easily parse the overall sections, and then just display the HTML formatted output into a webview.
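For illustration, a minimal sketch of building such a query with a `format` parameter and decoding a JSON reply, in Python. The endpoint and the `action=parse`, `page`, and `prop=text` parameters are assumptions about the specific query (the original link is not shown here), and the response shape in `SAMPLE_RESPONSE` is a hand-written assumption rather than captured output, so check it against a real reply before relying on it.

```python
import json
from urllib.parse import urlencode

# Assumed endpoint for the English Wikipedia API.
API_ENDPOINT = "https://en.wikipedia.org/w/api.php"


def build_parse_url(title, fmt="json"):
    """Build a query URL that asks for a page's rendered HTML in the given format."""
    params = {
        "action": "parse",
        "page": title,
        "prop": "text",
        "format": fmt,  # the format=fmt parameter mentioned above
    }
    return API_ENDPOINT + "?" + urlencode(params)


def extract_html(response_text):
    """Pull the HTML out of a JSON reply.

    Assumes the reply looks like {"parse": {"text": {"*": "<html...>"}}};
    this shape is an assumption and may differ by API version.
    """
    data = json.loads(response_text)
    return data["parse"]["text"]["*"]


# Hand-written stand-in for a reply, not real API output.
SAMPLE_RESPONSE = json.dumps(
    {"parse": {"title": "Sample", "text": {"*": "<p>Sample body</p>"}}}
)

if __name__ == "__main__":
    print(build_parse_url("October 27"))
    print(extract_html(SAMPLE_RESPONSE))
```

Fetching the URL is left out on purpose; any HTTP client would do, and the decoded HTML string could then be handed to a web view as suggested above.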

Matt B. 2009-10-27 20:49:12

Thanks! Yeah I had seen that, but the returned file is a lot larger than the raw file I was able to retrieve. The downside is that it's in wiki markup instead of HTML, but I wasn't planning on rendering the returned content into a webview anyways. I'd rather have the actual data so that I can manipulate its presentation easily. I appreciate the response though.

Jorge Israel Peña 2009-10-27 20:54:20

Answer 5

+2 A:

I have scraped a lot of data from WP in various ways. The format depends on a lot of things, including what type of subdomain the information is in and when it was entered. The main text is free format and there is no simple way to scrape it. The infoboxes are in a special WP format which has changed over the years. It wasn't designed to be scraped.

There is a database backing WP which is somewhat more structured.

By far your best strategy is to contact the Wikipedians in the domain you wish to scrape - they will know about the database format and may well be able to help - they will certainly want to help as they will want to see WP in semantic form (such as DBPedia - http://dbpedia.org/About).

peter.murray.rust 2009-10-27 21:07:59

Answer 6

+2 A:

Does Python count? ;) It is accessible from Objective-C. And there are great modules for scraping purposes: Beautiful Soup and/or mechanize; you can also consider lxml.

piobyz 2009-10-27 21:09:43

Answer 7

A:

I've got an iPhone app which does a screen scrape using the following:

YQL (http://developer.yahoo.com/yql)
Yahoo's Objective-C Libraries (http://github.com/yahoo/yos-social-objc)

Using YQL you can get whatever information you need from the web by using XPATH queries against the DOM.
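To show what an XPath-style query against a parsed document looks like, here is a minimal sketch in Python using the standard library's `xml.etree.ElementTree`. It is a stand-in, not YQL or the Yahoo libraries: ElementTree only supports a limited XPath subset and needs well-formed XML, which real web pages often are not. The sample document, the `events` id, and the `list_items` helper are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed XHTML fragment with placeholder text.
SAMPLE_XHTML = """
<html><body>
<h2>Events</h2>
<ul id="events">
<li><a href="/wiki/1275">1275</a> - First sample event.</li>
<li><a href="/wiki/1904">1904</a> - Second sample event.</li>
</ul>
<ul id="births">
<li><a href="/wiki/1811">1811</a> - A sample person.</li>
</ul>
</body></html>
""".strip()


def list_items(xhtml, list_id):
    """Return the plain text of each <li> in the <ul> with the given id."""
    root = ET.fromstring(xhtml)
    # XPath-style query: any <ul> with a matching id attribute, then its <li> children.
    items = root.findall(f".//ul[@id='{list_id}']/li")
    # itertext() walks the text of the element and all its descendants.
    return ["".join(li.itertext()).strip() for li in items]


if __name__ == "__main__":
    for line in list_items(SAMPLE_XHTML, "events"):
        print(line)
```

The appeal over regex is that the query names the structure (`ul`, `li`) instead of matching tag text character by character.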

Personally I think it's much better than using regex. Then again, I only know very simple regular expressions.

nolim1t 2009-10-28 10:54:24

Scraping and Parsing a Wikipedia Page
