ansaurus

Question

How can I strip all tags from Wikipedia pages, or otherwise make the pages more readable?

Answer 1

+2 A:

You can start by taking a look at PHP's strip_tags function.

Konamiman 2009-11-24 08:23:09

Looks cool. Is there something in C#, or some sort of web service, too? I don't want to route each page request through my web servers.

Priyank Bolia 2009-11-24 08:29:01

Answer 2

A:

What about the Html Agility Pack?

A similar thread is available on Stack Overflow:

Is there a Wikipedia API?
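
As a rough illustration of what that thread points to, here is a minimal Python sketch that only builds a request URL for the MediaWiki API. It assumes the English Wikipedia endpoint and the `extracts` property with `explaintext` (provided by the TextExtracts extension); the helper name `plain_text_extract_url` is made up for this example, and no request is actually sent.

```python
from urllib.parse import urlencode

# Assumed endpoint: the English Wikipedia API.
API_ENDPOINT = "https://en.wikipedia.org/w/api.php"


def plain_text_extract_url(title):
    """Build a query URL asking the API for a tag-free extract of one page.

    Hypothetical helper for illustration; it does not perform the request.
    """
    params = {
        "action": "query",
        "prop": "extracts",   # assumes the TextExtracts extension is enabled
        "explaintext": 1,     # ask for plain text instead of limited HTML
        "format": "json",
        "titles": title,
    }
    return API_ENDPOINT + "?" + urlencode(params)


print(plain_text_extract_url("Alan Turing"))
```

Fetching that URL with any HTTP client should return JSON containing the page text, so the HTML never has to be parsed on your own servers.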

Try this function.

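Function StripTags(ByVal strHtmlString As String) As String ' StripTags is a hypothetical name
    Dim pattern As String = "<(.|\n)*?>"
    Return System.Text.RegularExpressions.Regex.Replace(strHtmlString, pattern, String.Empty).Trim()
End Function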

Anuraj 2009-11-24 08:44:12

Bad choice: regex should not be used for HTML parsing. There are a lot of questions and internet articles with the details. http://www.codinghorror.com/blog/archives/001311.html

Priyank Bolia 2009-11-24 08:48:12

That would create another problem of its own: how to build a web page from the XML. I would then have to write even more code to generate the HTML from the parsed XML.

Priyank Bolia 2009-11-24 10:03:40

Answer 3

A:

I want to strip all tags and remove the [show]/[hide] stuff from Wikipedia. Or is there some website that presents the pages in a more readable format?

You should take a look at DBpedia: it is Wikipedia, but just the data.

http://dbpedia.org/About

Cups 2009-11-24 09:06:09

That doesn't look like the right thing. It is more like a semantic web page: it just has the headings, the links, and meta info about the articles. I don't need the meta info or semantic info; I need a very simple web page, similar to a text file, without many tags except images, paragraphs, etc.

Priyank Bolia 2009-11-24 09:14:54

Answer 4

A:

You could use an HTML parser, such as BeautifulSoup (Python) or Simple HTML DOM (PHP). Or you could try using an XML parser.
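
To give an idea of the parser approach, here is a minimal Python sketch. It swaps in the standard library's `html.parser` instead of BeautifulSoup so that it runs without installing anything; the names `TextExtractor` and `strip_tags` are made up for this example, and skipping `script`/`style` content is an assumption about what counts as readable.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text content and drops all markup (hypothetical helper)."""

    SKIPPED = {"script", "style"}  # assumption: their content is not readable text

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self.skip_depth == 0:
            self.parts.append(data)


def strip_tags(html):
    """Return the text of an HTML fragment with tags removed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Collapse the runs of whitespace left behind by the removed tags.
    return " ".join("".join(parser.parts).split())


print(strip_tags("<p>Hello <b>world</b></p><script>var x = 1;</script>"))
```

With BeautifulSoup installed, `BeautifulSoup(html, "html.parser").get_text()` would be a comparable one-liner, and extending the handlers above to keep images or paragraph breaks is where a parser pays off over a regex.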

Vinz 2009-11-24 10:31:29

I think Simple HTML DOM looks the best: easy and extensible.

Priyank Bolia 2009-11-24 16:10:14
