Hello,

I've been racking my brain over this but can't seem to get it right, and I'm not hitting the correct keywords on Google.

I've recently started to play around with XSLT and XPath to create an XML description of natural language glossaries – for a project of mine.

The problem is that I have chosen to use 'mixed content' complex elements for some words and in some instances want to fetch just the text node.

Here's a portion of the XML document:

...
<entry category="substantiv">
  <word lang="sv">semester</word>
  <word lang="de">
    <article>der</article>Urlaub
    <plural>Urlaube</plural>
  </word>
</entry>
...

There are many entry elements in my document, and in this instance I want to fetch 'Urlaub' by using /entry/word[@lang='de']/text(), which won't work because of my line breaks. I've discovered that there are actually three text nodes, so .../text()[2] will work, of course. However, I don't know beforehand where there will be line breaks, or how many. If the XML is formatted like the following, my first version of the path will work but not the second:

...
<word lang="de"><article>der</article>Urlaub
  <plural>Urlaube</plural>
</word>
...
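
To make the "three text nodes" observation concrete, here is a minimal sketch in Python rather than XSLT, using the standard-library xml.dom.minidom parser. The helper name child_text_nodes and the constants INDENTED and COMPACT are made up for this illustration. It lists the text nodes that are direct children of <word lang="de"> for both layouts shown above.

```python
from xml.dom import minidom

# The two layouts from the question (illustrative copies).
INDENTED = """<entry category="substantiv">
  <word lang="sv">semester</word>
  <word lang="de">
    <article>der</article>Urlaub
    <plural>Urlaube</plural>
  </word>
</entry>"""

COMPACT = """<entry category="substantiv">
  <word lang="sv">semester</word>
  <word lang="de"><article>der</article>Urlaub
  <plural>Urlaube</plural>
</word>
</entry>"""


def child_text_nodes(xml_source, lang):
    """Return the raw data of every text node directly under <word lang=...>."""
    doc = minidom.parseString(xml_source)
    for word in doc.getElementsByTagName("word"):
        if word.getAttribute("lang") == lang:
            return [n.data for n in word.childNodes
                    if n.nodeType == n.TEXT_NODE]
    return []


indented_nodes = child_text_nodes(INDENTED, "de")
compact_nodes = child_text_nodes(COMPACT, "de")

# repr() makes the line breaks and indentation visible.
print(len(indented_nodes), [repr(t) for t in indented_nodes])
print(len(compact_nodes), [repr(t) for t in compact_nodes])
```

In the indented layout 'Urlaub' sits in the second text node, in the compact layout in the first, which is why a fixed position such as text()[2] only works for one of them.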

What I think I want to do is select all the immediate text nodes of word[@lang='de'], and then remove unnecessary white space using normalize-space(). However, how do I do this using XPath? Or is there a better way? It seems like it would be easy but I can't figure it out. I am by the way trying to do this within an XSLT document.

normalize-space(/entry/word[@lang='de']/text()[*]) is one of the things I have tried, but that seems to do something else.

/Grateful for any help.

Update:

Here is part of the XSLT, as requested:

...
<xsl:choose>
    <xsl:when test="@category='substantiv'">
        <em><xsl:value-of select="word[@lang='de']/article" /></em>
        <xsl:value-of select="normalize-space(word[@lang='de']/text()[2])" />
        <em>pl. <xsl:value-of select="word[@lang='de']/plural" /></em>
    </xsl:when>
...

This code works just fine with the first version of formatting. To clarify, what I want to do is to grab the value of the text node in the complex element <word lang="de">, however it might be formatted with line breaks and white space. What I will do with the value depends on context, but right now I will just put it in an xhtml doc.

Update2: I am now using <xsl:strip-space elements="*"/> which eliminates the problem of having empty text nodes. I am also using:

...
<xsl:choose>
  <xsl:when test="@category='substantiv'">
    <em><xsl:value-of select="word[@lang='de']/article" /></em>
    <xsl:text> </xsl:text>
    <xsl:value-of select="normalize-space(word[@lang='de']/text())" />
    <xsl:text>, </xsl:text>
    <em>pl. <xsl:value-of select="word[@lang='de']/plural" /></em>
  </xsl:when>
...

Still have to normalize though since a space is still added after "Urlaub" in the XML.

When I need to reach the text node "Urlaub" outside of the XSLT document I use:
<xsl:value-of select="normalize-space(word[@lang='de']/text()[normalize-space() != ''])" />

Thanks for all the help folks!

Update 3: Tried to improve the title

A: 

Now that I see your code I recommend this:

<xsl:choose>
  <xsl:when test="@category='substantiv'">
    <em><xsl:value-of select="word[@lang='de']/article" /></em>
    <!-- select the first non-empty text node and normalize it -->
    <xsl:value-of select="normalize-space(word[@lang='de']/text()[normalize-space() != ''][1])" />
    <em>pl. <xsl:value-of select="word[@lang='de']/plural" /></em>
  </xsl:when>

Original Version of the answer

To get you started:

<entry category="substantiv">
  <word lang="sv">semester</word>
  <word lang="de">
    <article>der</article>Urlaub
    <plural>Urlaube</plural>
  </word>
</entry>

When passed through this XSLT 1.0:

<!-- identity template copies everything 1:1, unless other templates apply -->
<xsl:template match="*|@*">
  <xsl:copy>
    <xsl:apply-templates select="*|@*" />
  </xsl:copy>
</xsl:template>

<!-- empty template: ignore every white-space-only text-node child of <word> -->
<xsl:template match="word/text()[normalize-space() = '']" />

Would produce this:

<entry category="substantiv">
  <word lang="sv">semester</word>
  <word lang="de"><article>der</article>Urlaub<plural>Urlaube</plural></word>
</entry>

This answer is a guess and may not be exactly what you are after. Your question needs clarification in any case. What you think you want is not always the same as what you actually want.

Tomalak
Ah, yes I was not clear at all. I didn't want to change the formatting, only handle different scenarios of formatting. But you helped me with something else so your answer was still useful. Thanks! :)
nimbus77
@nimbus: Did you notice that the top section of my answer changed?
Tomalak
Yes I did, that change does the trick. Thanks for helping out. I'm a bit confused now though as to how exactly text() is supposed to work, but I'll start a new question tomorrow for that if I can't figure it out.
nimbus77
@nimbus: `text()` is, despite the parentheses, not a function. At least not the way you probably think it would be. It selects text nodes, the same way as `foo` would select `<foo>` elements. The parentheses are a way to separate it from `text`, which would select `<text>` elements.
Tomalak
@Tomalak: Yea I was fooled by that. I also found out today that it is called a node test. I also thought it would automatically concatenate the text nodes into one string like if I had ended the XPath with: `word[@lang="de"]`. But, now I know better. :)
nimbus77
A: 

Try:

/entry/word[@lang='de']/child::text()[normalize-space(.) != '']

Meaning, grab all child text nodes but not those that normalize to an empty string.

-Oisin

x0n
Mentioning the `child::` axis is superfluous. Also, `normalize-space()` operates on the current node by default, so mentioning it through `.` is not necessary.
Tomalak
@x0n, typing word[@lang='de']/text()[normalize-space() != ''] does the trick. Thanks!
nimbus77
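
For readers who want to see the effect of the predicate [normalize-space() != ''] outside an XSLT processor, the following is a rough Python analogue, not XPath itself. The function names normalize_space and non_empty_child_text are hypothetical; normalize_space is written to mirror the XPath 1.0 definition (trim, then collapse runs of space, tab, CR and LF).

```python
import re
from xml.dom import minidom

XML = """<entry category="substantiv">
  <word lang="sv">semester</word>
  <word lang="de">
    <article>der</article>Urlaub
    <plural>Urlaube</plural>
  </word>
</entry>"""


def normalize_space(text):
    """Mimic XPath normalize-space(): collapse whitespace runs and trim."""
    return re.sub(r"[ \t\r\n]+", " ", text).strip(" ")


def non_empty_child_text(xml_source, lang):
    """Analogue of word[@lang=...]/text()[normalize-space() != ''],
    with each surviving text node normalized."""
    doc = minidom.parseString(xml_source)
    for word in doc.getElementsByTagName("word"):
        if word.getAttribute("lang") == lang:
            return [normalize_space(n.data) for n in word.childNodes
                    if n.nodeType == n.TEXT_NODE
                    and normalize_space(n.data) != ""]
    return []


result = non_empty_child_text(XML, "de")
print(result)
```

The whitespace-only nodes before <article> and after <plural> are filtered out, so the result no longer depends on how the element is indented.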
A: 

I think this is the skeleton of what you want, minus any normalize-space() to get things to look exactly the way you want.

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
  <xsl:template match="/">
    <xsl:apply-templates select=".//word"/>
  </xsl:template>
  <xsl:template match="word">
    <xsl:apply-templates select=".//text()"/>
  </xsl:template>
  <xsl:template match="text()"><xsl:value-of select="."/><xsl:text> </xsl:text></xsl:template>  
</xsl:stylesheet>

The key is the .//text() which returns the concatenation of ALL child text nodes at any nesting level below the context node().

Jim Garrison
That's what I thought `.//text()` would do too. Maybe I'm doing it wrong? If I use `<xsl:value-of select="normalize-space(word[@lang='de']//text())" />` (haven't started using templates yet, going to though) I get nothing. But if I test it in my XPath evaluator it finds 5 possible text nodes, since 'der' and 'Urlaube' are also added.
nimbus77
@Jim: *"The key is the `.//text()` which returns the concatenation of ALL child text nodes"* - Actually, that's wrong. `//text()` *selects* all the text nodes, it returns a node-set of separate nodes, not a concatenated string.
Tomalak
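
The empty result reported above fits how XPath 1.0 converts a node-set to a string: only the first node in document order is used. The sketch below imitates that in Python with xml.dom.minidom; it is an analogue under that assumption, not an XPath engine, and the helper names descendant_text_nodes and normalize_space are invented for the example.

```python
import re
from xml.dom import minidom

XML = """<entry category="substantiv">
  <word lang="sv">semester</word>
  <word lang="de">
    <article>der</article>Urlaub
    <plural>Urlaube</plural>
  </word>
</entry>"""


def normalize_space(text):
    """Mimic XPath normalize-space(): collapse whitespace runs and trim."""
    return re.sub(r"[ \t\r\n]+", " ", text).strip(" ")


def descendant_text_nodes(node):
    """Collect all descendant text nodes in document order (like .//text())."""
    found = []
    for child in node.childNodes:
        if child.nodeType == child.TEXT_NODE:
            found.append(child.data)
        elif child.nodeType == child.ELEMENT_NODE:
            found.extend(descendant_text_nodes(child))
    return found


doc = minidom.parseString(XML)
word_de = [w for w in doc.getElementsByTagName("word")
           if w.getAttribute("lang") == "de"][0]

all_text = descendant_text_nodes(word_de)

# XPath 1.0 string conversion of a node-set: only the first node counts.
first_only = normalize_space(all_text[0]) if all_text else ""

# What a "concatenate everything" reading would have produced instead.
non_empty = [normalize_space(t) for t in all_text if normalize_space(t)]

print(len(all_text), repr(first_only), non_empty)
```

Five separate text nodes are selected, but the first one is the whitespace before <article>, so normalizing it yields an empty string.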
A: 

This transformation:

<xsl:stylesheet version="1.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:output omit-xml-declaration="yes" indent="yes"/>
 <xsl:strip-space elements="*"/>

 <xsl:template match="/">
  <xsl:value-of select="/*/entry/word[@lang='de']/text()[1]"/>
 </xsl:template>
</xsl:stylesheet>

when applied on the provided XML document (wrapped in a dict top element):

<dict>
    <entry category="substantiv">
        <word lang="sv">semester</word>
        <word lang="de">
            <article>der</article>Urlaub
            <plural>Urlaube</plural>
        </word>
    </entry>
</dict>

produces exactly the wanted result:

Urlaub

Do note: the use of the <xsl:strip-space> instruction to eliminate all white-space-only text nodes from the source XML document.

Therefore, no additional processing (normalize-space(), etc) is necessary.

Dimitre Novatchev
That was a really nice solution. Vielen Dank! :)
nimbus77
Turns out there is still white space after "Urlaub" but that is not a problem.
nimbus77
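
The remaining white space after "Urlaub" can be reproduced with a small Python imitation of what <xsl:strip-space elements="*"/> does to the source tree. This is only a sketch using xml.dom.minidom: the function strip_whitespace_only is a hypothetical stand-in for the XSLT processor's behaviour, which removes whitespace-only text nodes but leaves text nodes with other content untouched.

```python
from xml.dom import minidom

XML = """<dict>
    <entry category="substantiv">
        <word lang="sv">semester</word>
        <word lang="de">
            <article>der</article>Urlaub
            <plural>Urlaube</plural>
        </word>
    </entry>
</dict>"""


def strip_whitespace_only(node):
    """Remove whitespace-only text nodes below node, recursively."""
    for child in list(node.childNodes):
        if child.nodeType == child.TEXT_NODE:
            if child.data.strip(" \t\r\n") == "":
                node.removeChild(child)
        elif child.nodeType == child.ELEMENT_NODE:
            strip_whitespace_only(child)


doc = minidom.parseString(XML)
strip_whitespace_only(doc.documentElement)

word_de = [w for w in doc.getElementsByTagName("word")
           if w.getAttribute("lang") == "de"][0]
remaining = [n.data for n in word_de.childNodes
             if n.nodeType == n.TEXT_NODE]

# Analogue of text()[1] after stripping: the only text node left.
first_text = remaining[0]
print(len(remaining), repr(first_text))
```

Only one text node is left, so text()[1] finds it, but it still carries the line break and indentation that follow "Urlaub"; wrapping the expression in normalize-space() removes that trailing white space.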