I have a Wikipedia article and I want to fetch the first z lines (or the first x characters, or the first y words, it doesn't matter) from it.

The problem: I can get either the raw wikitext (via the API) or the parsed HTML (via a direct HTTP request, possibly for the print version), but how can I find the first lines that are actually displayed? Normally the source (both HTML and wikitext) starts with the infoboxes and images, and the first real display text is somewhere further down in the code.

For example: Albert Einstein on Wikipedia (print version). Look at the code: the first real text line, "Albert Einstein (pronounced /ˈælbərt ˈaɪnstaɪn/; German: [ˈalbɐt ˈaɪ̯nʃtaɪ̯n]; 14 March 1879 – 18 April 1955) was a theoretical physicist.", is not at the start. The same applies to the wiki source; it starts with the same infobox and so on.

So how would you accomplish this task? The programming language is Java, but that shouldn't matter.

A solution that came to my mind was an XPath query, but such a query seemed rather complicated if it had to handle all the border cases. [update]It wasn't that complicated, see my solution below![/update]

Thanks!

A: 

Well, when using the wiki source itself you could just strip out all templates at the start. That might work well enough for most articles that have infoboxes or maintenance messages at the top.

However, some articles put the opening blurb itself into a template, which would make things a little difficult there.
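A rough sketch of the stripping idea in Java, tracking {{ }} nesting by hand (a toy under stated assumptions: real wikitext can also open with comments, images, and other constructs this ignores):

// Skip leading (possibly nested) {{...}} templates in wikitext.
static String stripLeadingTemplates(String wikitext) {
    String s = wikitext.trim();
    while (s.startsWith("{{")) {
        int depth = 0, i = 0;
        while (i < s.length() - 1) {
            if (s.startsWith("{{", i)) { depth++; i += 2; }
            else if (s.startsWith("}}", i)) { depth--; i += 2; if (depth == 0) break; }
            else { i++; }
        }
        s = s.substring(i).trim();  // drop the template we just scanned past
    }
    return s;
}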

Another way, perhaps more reliable, would be to take the contents of the first <p> tag that appears directly in the article text (i.e. not nested in a table or the like). That should skip infoboxes and other leading material, as those are probably (I'm not exactly sure) <table>s or <div>s.
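A minimal sketch of that approach, using the jsoup HTML parser (my choice of library, not something this answer prescribes); the child selector only matches paragraphs that sit directly inside the content div:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class FirstParagraph {
    public static void main(String[] args) throws Exception {
        Document doc = Jsoup.connect("http://en.wikipedia.org/wiki/Albert_Einstein").get();
        // "#bodyContent > p" matches only direct children of the content div,
        // so paragraphs nested in infobox <table>s are not picked up.
        Element p = doc.select("#bodyContent > p").first();
        System.out.println(p == null ? "no paragraph found" : p.text());
    }
}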

Generally, Wikipedia is written for human consumption with only very minimal support for anything semantic. That makes automatic extraction of specific information from the articles pretty painful.

Joey
+2  A: 

You need a parser that can read Wikipedia markup. Try WikiText or the parsers that come with XWiki.

That will allow you to ignore anything you don't want (headlines, tables).

Aaron Digulla
A: 

Try DBpedia
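For context: DBpedia publishes structured data extracted from Wikipedia, including article abstracts, queryable over its public SPARQL endpoint. A sketch, where the endpoint URL and the http://dbpedia.org/ontology/abstract property are my assumptions about the dataset, not something this answer states:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLEncoder;

public class DbpediaAbstract {
    public static void main(String[] args) throws Exception {
        // Ask for the English abstract of the Albert Einstein resource.
        String sparql = "SELECT ?abs WHERE { "
            + "<http://dbpedia.org/resource/Albert_Einstein> "
            + "<http://dbpedia.org/ontology/abstract> ?abs . "
            + "FILTER (lang(?abs) = 'en') }";
        String url = "http://dbpedia.org/sparql"
            + "?format=" + URLEncoder.encode("application/sparql-results+json", "UTF-8")
            + "&query=" + URLEncoder.encode(sparql, "UTF-8");
        // Print the raw SPARQL result (JSON); feed it to any JSON library.
        BufferedReader in = new BufferedReader(
            new InputStreamReader(new URL(url).openStream(), "UTF-8"));
        for (String line; (line = in.readLine()) != null; ) {
            System.out.println(line);
        }
        in.close();
    }
}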

Ken
A: 

As you expect, you will probably end up parsing the source, the compiled HTML, or both. However, the Wikipedia:Lead_section guideline may give you some indication of what to expect in well-written articles.

Tim
+3  A: 

I had the same need and wrote some Python code to do this.

The script downloads the Wikipedia article with the given name, parses it using BeautifulSoup, and returns the first few paragraphs.

Code is at http://github.com/anandology/sandbox/blob/master/wikisnip/wikisnip.py.

Anand Chitipothu
A wonderfully pragmatic solution, but note that it depends on how the wiki markup is transformed to HTML. If you can, I'd suggest parsing the wiki markup directly.
gnud
I tried. But it turned out to be very hard because the markup contains function calls of the form `{{...}}`, for example `{{convert|1.2|km|mi|spell=us}}`. Here is my attempt: http://github.com/anandology/sandbox/blob/master/wikipedia/wikitext.py
Anand Chitipothu
+1  A: 

I opened the Albert Einstein article in Firefox and clicked View Source. It's pretty easy to parse with an HTML parser: focus on the <p> elements and strip the other HTML from within them.

Geo
+2  A: 

Wikipedia offers an abstracts download [1]. While it is quite a large file (currently 2.5 GB), it offers exactly the info you want, for all articles.

[1] http://download.wikimedia.org/enwiki/latest/
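If you go this route, the file is far too big for a DOM, so stream it. A sketch using StAX; the file name and the <doc>/<title>/<abstract> element names are my reading of the dump format, so check the actual file first:

import java.io.FileInputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

public class AbstractLookup {
    public static void main(String[] args) throws Exception {
        XMLStreamReader r = XMLInputFactory.newInstance().createXMLStreamReader(
            new FileInputStream("enwiki-latest-abstract.xml"));
        String element = null, title = null;
        while (r.hasNext()) {
            switch (r.next()) {
            case XMLStreamConstants.START_ELEMENT:
                element = r.getLocalName();
                break;
            case XMLStreamConstants.CHARACTERS:
                // Note: character data may arrive in chunks; a real
                // implementation should accumulate it per element.
                if ("title".equals(element)) {
                    title = r.getText();
                } else if ("abstract".equals(element)
                        && title != null && title.endsWith("Albert Einstein")) {
                    System.out.println(r.getText());
                    return;
                }
                break;
            case XMLStreamConstants.END_ELEMENT:
                element = null;
                break;
            }
        }
    }
}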

PanMan
+1  A: 

For example, if you have the result in a string, you would search for the text

<div id="bodyContent">

and after that index find the first

<p>

That index marks the first paragraph you mentioned.
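In Java, that plain string search looks roughly like this (html is assumed to hold the downloaded page source; checks for -1 are omitted for brevity, and the result may still contain inline markup):

// Find the content div, then the first paragraph after it.
int body = html.indexOf("<div id=\"bodyContent\">");
int start = html.indexOf("<p>", body);
int end = html.indexOf("</p>", start);
String firstParagraph = html.substring(start + "<p>".length(), end);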

Try this URL: Link to the content (just works in the browser)

Gabriel Guimarães
Thanks for the answer; this led me to my solution below (selecting the first paragraph of the bodyContent div).
theomega
+1  A: 

I worked out the following solution: an XPath query on the XHTML source code (I took the print version because it is shorter, but it also works on the normal version).

//html/body//div[@id='bodyContent']/p[1]

This works on the German and the English Wikipedia, and I haven't found an article where it doesn't output the first paragraph. The solution is also quite fast. I thought of taking only the first x characters of the XHTML instead, but that would render the XHTML invalid.

If someone is searching for the Java code, here it is:

import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;

import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// "logger" is whatever logging facade the surrounding class uses (e.g. SLF4J or Log4j).

private static DocumentBuilderFactory dbf;
static {
    dbf = DocumentBuilderFactory.newInstance();
    // Skip fetching the XHTML DTD from the network; parsing is much faster without it.
    dbf.setAttribute("http://apache.org/xml/features/nonvalidating/load-external-dtd", false);
}
private static XPathFactory xpathf = XPathFactory.newInstance();
private static String xexpr = "//html/body//div[@id='bodyContent']/p[1]";

private static String getPlainSummary(String url) {
    try {
        // Open the wiki page; the User-Agent header avoids being rejected as a bot
        URL u = new URL(url);
        URLConnection uc = u.openConnection();
        uc.setRequestProperty("User-Agent", "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.1) Gecko/20090616 Firefox/3.5");
        InputStream uio = uc.getInputStream();
        InputSource src = new InputSource(uio);

        // Construct the builder and parse the document
        DocumentBuilder builder = dbf.newDocumentBuilder();
        Document docXML = builder.parse(src);

        // Apply the XPath expression
        XPath xpath = xpathf.newXPath();
        XPathExpression xpathe = xpath.compile(xexpr);
        String s = xpathe.evaluate(docXML);

        // Return the text of the first paragraph, or null if nothing matched
        return s.length() == 0 ? null : s;
    } catch (IOException ioe) {
        logger.error("Can't get XML", ioe);
        return null;
    } catch (ParserConfigurationException pce) {
        logger.error("Can't get DocumentBuilder", pce);
        return null;
    } catch (SAXException se) {
        logger.error("Can't parse XML", se);
        return null;
    } catch (XPathExpressionException xpee) {
        logger.error("Can't parse XPath", xpee);
        return null;
    }
}

Use it by calling getPlainSummary("http://de.wikipedia.org/wiki/Uma_Thurman");

theomega