Hi, I have the Lucene search extension (http://www.mediawiki.org/wiki/Extension_talk:Lucene-search) integrated with my MediaWiki installation. It's all working really well; however, Lucene seems to have indexed all the MediaWiki/HTML markup as well, and it is showing up in the results.

For example, searching for "green" returns results containing markup such as style="background:green; color:white

Is there a way to strip all the markup from the search results? I believe Wikipedia uses the same search plugin. How are they doing it?

+2  A: 

You will probably have to transform the raw wiki markup before indexing it with Lucene. When dealing with pure XML content, it's possible to just use an XSL transform with <xsl:value-of select="text()"/> to extract the text content.
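For reference, here is a rough Python analogue of that idea for well-formed XML, as a minimal sketch only. The function name `xml_text` is made up for illustration, and it gathers the text of every descendant node, which is broader than a bare `text()` select.

```python
import xml.etree.ElementTree as ET


def xml_text(xml_source: str) -> str:
    """Return the text content of an XML document with all tags removed.

    Hypothetical helper for illustration; only works on well-formed XML.
    """
    root = ET.fromstring(xml_source)
    # itertext() walks the tree and yields every text fragment in document order.
    raw = "".join(root.itertext())
    # Collapse runs of whitespace so the result is a single clean line.
    return " ".join(raw.split())


if __name__ == "__main__":
    sample = "<page><title>Green</title> <body>A <b>green</b> leaf</body></page>"
    print(xml_text(sample))
```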

I'm afraid that won't work for wiki markup, but maybe you can capture each page after it has been rendered to HTML and index the text extracted from that instead?
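If you go the rendered-HTML route, the tag-stripping step could look something like the sketch below. It assumes you already have the rendered HTML of a page as a string (how you get it out of MediaWiki is not shown), and the names `TextExtractor` and `html_to_text` are made up for illustration. This is not how the Lucene-search extension itself does it, just one way to drop tags and attributes such as style="background:green; color:white" before the text reaches the indexer.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text from HTML, ignoring tags and their attributes."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        # Attributes (e.g. style="...") are simply never recorded.
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a script/style element.
        if not self._skip_depth:
            self._chunks.append(data)

    def text(self):
        # Join fragments with spaces, then collapse repeated whitespace.
        return " ".join(" ".join(self._chunks).split())


def html_to_text(html: str) -> str:
    """Hypothetical helper: return plain text suitable for indexing."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


if __name__ == "__main__":
    rendered = '<td style="background:green; color:white">Approved</td>'
    print(html_to_text(rendered))
```

Joining fragments with a space can split a word that is partly wrapped in an inline tag, which is usually acceptable for a search index but worth knowing about.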

cbeer