Hi, I am working with XPath and Java and want to extract some text from an HTML page. The text sits under a div with whitespace markup in between, such as &nbsp; and <br>. I want these converted to a space and a newline respectively during extraction. The method I am using to extract text is Element.getTextContent(), which does not respect these whitespace characters.

Could somebody tell me whether there is a way to extract the text with whitespace normalization, or to extract the whole HTML markup under the Node so that I can do the replacement myself? Thanks, Nayn

+1  A: 

XPath cannot replace nodes with strings.

A simple XSLT transformation can carry out this task.

For example:

<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output omit-xml-declaration="yes" indent="yes"/>

 <xsl:template match="node()|@*">
     <xsl:copy>
       <xsl:apply-templates select="node()|@*"/>
     </xsl:copy>
 </xsl:template>

 <xsl:template match="text()">
   <xsl:value-of select="translate(.,'&#xA0;', ' ')"/>
 </xsl:template>

 <xsl:template match="br">
   <xsl:text>&#10;</xsl:text>
 </xsl:template>
</xsl:stylesheet>

When this transformation is applied to the following XML document:

<p>&#xA0;<br/></p>

the wanted result is produced:

<p> 

</p>
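
Since the question is about Java, here is a minimal sketch of how a stylesheet like this might be applied through the standard javax.xml.transform API. The class name ApplyWhitespaceStylesheet and the method transform are made up for illustration. The embedded stylesheet is a trimmed copy declared as version 1.0, on the assumption that only the JDK's bundled XSLT 1.0 processor is available, and it adds an explicit priority to the text() template so the choice between the two matching templates does not depend on processor recovery behavior.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical helper class; not part of the original answer.
class ApplyWhitespaceStylesheet {

    // Trimmed copy of the stylesheet, with two labeled changes:
    // version="1.0" (assumes the JDK's built-in XSLT 1.0 processor) and
    // priority="1" on the text() template, because node() and text()
    // otherwise share the same default priority.
    static final String XSLT =
        "<xsl:stylesheet version=\"1.0\""
      + " xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
      + "<xsl:output omit-xml-declaration=\"yes\"/>"
      + "<xsl:template match=\"node()|@*\">"
      + "<xsl:copy><xsl:apply-templates select=\"node()|@*\"/></xsl:copy>"
      + "</xsl:template>"
      + "<xsl:template match=\"text()\" priority=\"1\">"
      + "<xsl:value-of select=\"translate(., '&#xA0;', ' ')\"/>"
      + "</xsl:template>"
      + "<xsl:template match=\"br\"><xsl:text>&#10;</xsl:text></xsl:template>"
      + "</xsl:stylesheet>";

    // Runs the stylesheet over an XML string and returns the serialized result.
    static String transform(String xml) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(XSLT)));
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xml)),
                                  new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            // Unchecked wrapper keeps call sites short in this sketch.
            throw new IllegalStateException("transformation failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(transform("<p>&#xA0;<br/></p>"));
    }
}
```

Note that the input has to be well-formed XML for this to work; real-world HTML would first need to be cleaned up by whatever parser produced the DOM in the question.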
Dimitre Novatchev
This is useful for my future needs. Thanks.
Nayn
+1  A: 

<br> isn't text content; it's an element. I'm not sure what you're looking for. Try visiting all the text nodes underneath the element (remembering to recursively check element children) and calling getNodeValue() on each one.
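
A rough sketch of that recursive walk, using only the standard org.w3c.dom API, might look like the following. The class name WhitespaceAwareText and its method names are invented for illustration, and mapping the non-breaking space to a plain space and <br> to a newline is an assumption taken from what the question asks for.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper class; names are illustrative only.
class WhitespaceAwareText {

    // Returns the text under `node`, with U+00A0 turned into a space
    // and every <br> element turned into a newline.
    static String extractText(Node node) {
        StringBuilder out = new StringBuilder();
        collect(node, out);
        return out.toString();
    }

    private static void collect(Node node, StringBuilder out) {
        short type = node.getNodeType();
        if (type == Node.TEXT_NODE || type == Node.CDATA_SECTION_NODE) {
            // Text node: append its value, normalizing non-breaking spaces.
            out.append(node.getNodeValue().replace('\u00A0', ' '));
        } else if (type == Node.ELEMENT_NODE
                && "br".equalsIgnoreCase(node.getNodeName())) {
            // <br> carries no text of its own, so emit the line break here.
            out.append('\n');
        } else {
            // Any other node: recurse into its children, in document order.
            for (Node child = node.getFirstChild(); child != null;
                    child = child.getNextSibling()) {
                collect(child, out);
            }
        }
    }

    // Convenience wrapper for trying the walk on a well-formed XML string.
    static String extractTextFromXml(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            return extractText(doc.getDocumentElement());
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse input", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(
            extractTextFromXml("<div>Hello&#xA0;world<br/>second line</div>"));
    }
}
```

In practice extractText would be called on the Element that the XPath query returned, in place of getTextContent().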

Adrian Mouat
This one was simple. The problem was that getTextContent concatenates all the strings, ignoring &nbsp; and <br>. I wrote a small recursive method that inserts spaces in between the texts. Thanks.
Nayn