wikipedia

Getting Wikipedia Article Summary using NSScanner Problem

Hello, I am trying to get the summary of an article and download it as a string. This works well for some articles, but the markup of the Wikipedia website is inconsistent, so NSScanner often fails on others. Here's my NSScanner implementation: NSString *separatorString = @"<table id=\"toc\" class=\"toc\">"; ...

Problem with the Wikipedia API

I have a problem with the Wikipedia API. I use this PHP script: <?php $xmlDoc = new DOMDocument(); $xmlDoc->load("http://en.wikipedia.org/w/api.php?action=query&prop=revisions&titles=New_York_Yankees&rvprop=content&format=xml"); print $xmlDoc->saveXML(); ?> and I get this result in the browser. Why? Warning: DOMDocum...

Java: splitting up a large XML file with SAXParser

Hi all, I am trying to split a large XML file into smaller files using Java's SAXParser (specifically the Wikipedia dump, which is about 28 GB uncompressed). I have a PageHandler class which extends DefaultHandler: private class PageHandler extends DefaultHandler { private StringBuffer text; ... @Override public void startEl...
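For comparison, the same streaming idea can be sketched in Python with `xml.etree.ElementTree.iterparse` instead of Java's SAXParser. This is only a rough sketch, not the poster's code: the `page` tag name, the helper names `iter_pages` and `chunked`, and the batch size are assumptions made for illustration.

```python
import io
import xml.etree.ElementTree as ET


def iter_pages(stream, tag="page"):
    """Yield each <page> element as a string while streaming the input."""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        # Dump elements may carry an XML namespace; compare the local name only.
        if elem.tag.rsplit("}", 1)[-1] == tag:
            yield ET.tostring(elem, encoding="unicode")
            # Drop the element's children so memory use stays roughly flat.
            elem.clear()


def chunked(items, size):
    """Group an iterable into lists of at most `size` items."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


# Tiny in-memory stand-in for the real dump file (hypothetical data).
sample = (b"<mediawiki><page><title>A</title></page>"
          b"<page><title>B</title></page>"
          b"<page><title>C</title></page></mediawiki>")
pages = list(iter_pages(io.BytesIO(sample)))
batches = list(chunked(pages, 2))
```

Each batch could then be written to its own output file. For a real dump you would pass an open file object rather than a `BytesIO` and consume `iter_pages` lazily instead of building a list.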

Person names disambiguation

Hi, I am currently doing a project on person name disambiguation. The idea behind the project is that it will be able to identify the correct person when there are multiple people with the same name. I have used Wikipedia for this. I want to evaluate my project on some standard data, so I am looking for some test data. I am not familiar ...

Wikipedia (MediaWiki) URI encoding scheme

Folks, does anybody know how Wikipedia, or MediaWiki in general, encodes the URI from the title? It's not normal URI encoding: spaces are replaced with underscores, single quotation marks are not encoded, and so on. Is there any reference on that? Cheers, Parsa ...
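A minimal Python sketch of the two rules described in the question (spaces become underscores, single quotes stay unescaped). The function name `title_to_path` and the exact set of characters left unescaped are assumptions for illustration, not MediaWiki's documented behavior.

```python
from urllib.parse import quote

# Assumption: characters left unescaped in addition to the letters, digits
# and "_.-~" that quote() never escapes. Adjust to match the target wiki.
UNESCAPED = "/:'(),!*;@$"


def title_to_path(title):
    """Turn a page title into a URL path segment, per the rules above."""
    # Rule 1: spaces are replaced with underscores.
    with_underscores = title.strip().replace(" ", "_")
    # Rule 2: percent-encode the rest as UTF-8, except the UNESCAPED set.
    return quote(with_underscores, safe=UNESCAPED)


path = title_to_path("Rock 'n' Roll")
```

Here `path` is `Rock_'n'_Roll`, and a title such as `Café` comes out as `Caf%C3%A9`.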

Parser for Wikipedia

Hi, I downloaded a Wikipedia dump and want to convert it from wiki format to my own object format. Is there a wiki parser available that converts the object into XML? Thank you ...

App Engine DeadlineExceededError for cron jobs and task queue in a Wikipedia crawler

Hello, I am trying to build a Wikipedia link crawler on Google App Engine. I want to store an index in the datastore, but I run into DeadlineExceededError for both cron jobs and the task queue. For the cron job I have this code: def buildTree(self): start=time.time() self.log.info(" Start Time: %f" % start) nobranches...

Example using WikipediaTokenizer in Lucene

Hi, I want to use WikipediaTokenizer in a Lucene project - http://lucene.apache.org/java/3_0_2/api/contrib-wikipedia/org/apache/lucene/wikipedia/analysis/WikipediaTokenizer.html But I have never used Lucene. I just want to convert a Wikipedia string into a list of tokens. However, I see that there are only four methods available in this class, end...

How can I retrieve the longest matches for substrings enclosed by "{{" and "}}" ?

I am trying to parse a wikitext file received through Wikipedia's API and the problem is that some of its templates (i.e. snippets enclosed in {{ and }}) are not automatically expanded into wikitext, so I have to manually look for them in the article source and replace them eventually. The question is, can I use regex in .NET to get the ...
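One regex-free alternative is a small depth-counting scanner. The sketch below is in Python rather than .NET and is not the approach the question asks about. It returns only the outermost `{{ ... }}` spans, so nested templates stay inside their parent. The function name `outermost_templates` is made up for illustration, and triple-brace parameters such as `{{{1}}}` are not handled specially.

```python
def outermost_templates(text):
    """Return the outermost {{...}} spans in text, in order of appearance."""
    spans = []
    depth = 0      # how many "{{" are currently open
    start = None   # index where the current outermost template began
    i = 0
    while i < len(text) - 1:
        pair = text[i:i + 2]
        if pair == "{{":
            if depth == 0:
                start = i
            depth += 1
            i += 2
        elif pair == "}}" and depth > 0:
            depth -= 1
            if depth == 0:
                spans.append(text[start:i + 2])
            i += 2
        else:
            i += 1
    return spans


found = outermost_templates("a {{x|{{y}}}} b {{z}}")
```

For the sample string, `found` is `['{{x|{{y}}}}', '{{z}}']`. An unclosed `{{` simply produces no span.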

Wikipedia Reader on iPhone

I want to make a Wikipedia reader for the iPhone. What's the best approach? I've already given it some thought. Loading the content of a Wikipedia page is quite easy using the Wikipedia API. The difficulty is how to display the content in a nice way, because the content is marked up with Wikipedia tags, not HTML. My idea is to pars...

How do I get all articles about people from Wikipedia?

Hi, what would be the easiest way to get all articles about people from Wikipedia? I know I can download a dump of all the pages, but then how do I filter those and get only the ones about people? I need as many as I can get (preferably more than a million), so using any sort of API is probably not an option. ...

Does the Wikipedia API support searches for a specific template?

Is it possible to query the Wikipedia API for articles that contain a specific template? The docs at http://en.wikipedia.org/w/api.php do not describe any action that would filter search results to pages that contain a template. Specifically, I am after pages that contain Template:Persondata. After that, I am hoping to be able to retrie...
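One candidate worth checking is the MediaWiki API's `list=embeddedin` query, which lists pages that transclude a given page. The Python sketch below only builds the request URL and does not send it. The helper name `embeddedin_url`, the `https` base URL, and the default limit are assumptions for illustration.

```python
from urllib.parse import urlencode

API_BASE = "https://en.wikipedia.org/w/api.php"  # assumed endpoint


def embeddedin_url(template, limit=50, base=API_BASE):
    """Build a query URL listing pages that transclude `template`."""
    params = {
        "action": "query",
        "list": "embeddedin",
        "eititle": template,   # e.g. "Template:Persondata"
        "einamespace": 0,      # restrict results to the article namespace
        "eilimit": limit,
        "format": "json",
    }
    return base + "?" + urlencode(params)


url = embeddedin_url("Template:Persondata")
```

Fetching `url` should return one batch of page titles. Larger result sets are paged, so a real client would also need to follow the API's continuation values.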