Question

Using XSLT 1.0, given a string with arbitrary characters how can I get back a string that meets the following rules.

  1. First character must be one of these: a-z, A-Z, colon, or underscore
  2. All other characters must be any of those above or 0-9, period, or hyphen
  3. If any character does not meet the above rules, replace it with an underscore

Background

In an XSLT I'm translating some attributes into elements, but I need to be sure the attribute doesn't contain any values that can't be used in an element name. I don't care much about the integrity of the attribute being converted to the name as long as it's being converted predictably. I also don't need to compensate for every valid character in an element name (there's a bunch).

The problem I was having was with the attributes having spaces coming in, which the translate function can easily convert to underscores:

translate(@name,' ','_')

But soon after, I found some of the attributes using slashes, so I have to add that now too. This will quickly get out of hand. I want to be able to define a whitelist of allowed characters and replace any non-allowed characters with an underscore, but translate works by replacing characters from a blacklist.

+1  A: 

As far as I'm aware, XSLT 1.0 doesn't have a built-in for this. XSLT 2.0 allows you to use regexes, though I'm sure you're all too aware of that.
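
For readers who can use XSLT 2.0, the regex approach might look something like the sketch below. It assumes an XSLT 2.0 processor and that the value lives in @name as in the question; the variable name safeName is made up for illustration. The inner replace() swaps every character outside the allowed set for an underscore, and the outer one then fixes up a leading digit, period, or hyphen:

<xsl:variable name="safeName"
  select="replace(replace(@name, '[^a-zA-Z0-9:_.\-]', '_'),
                  '^[^a-zA-Z:_]', '_')" />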

If, on the off chance, you're using the MS parser, you can write .NET extension libraries that you can leverage in your XSLT; I wrote about this some months ago here.

If you're using something like Saxon, I am pretty certain they also provide ways of coding your own extensions, and they may indeed have an extension of their own already, but I'm unfamiliar with that engine.

Hope this helps.

Jim Burger
Oh how I wish we could use XSLT 2.0 :/ You should see the stuff we have to do for dates in our XSLTs... Extending would be a great way to solve it, but we're unable to do that because of third-party constraints. Thanks for your answers; they're very good options, just ones I can't personally use.
phloopy
I completely empathise, I was in a similar position a few years ago. I could write the templates but a third party was in charge of the parser, and thus string manipulation was quite hard.
Jim Burger
+2  A: 

You could write a recursive template to do this, working through the characters in the string one by one, testing them and changing them if necessary. Something like:

<xsl:template name="normalizeName">
  <xsl:param name="name" />
  <xsl:param name="isFirst" select="true()" />
  <xsl:if test="$name != ''">
    <xsl:variable name="first" select="substring($name, 1, 1)" />
    <xsl:variable name="rest" select="substring($name, 2)" />
    <xsl:choose>
      <xsl:when test="contains('abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ:_', $first) or
                      (not($isFirst) and contains('0123456789.-', $first))">
        <xsl:value-of select="$first" />
      </xsl:when>
      <xsl:otherwise>
        <xsl:text>_</xsl:text>
      </xsl:otherwise>
    </xsl:choose>
    <xsl:call-template name="normalizeName">
      <xsl:with-param name="name" select="$rest" />
      <xsl:with-param name="isFirst" select="false()" />
    </xsl:call-template>
  </xsl:if>
</xsl:template>

However, there is a shorter way of doing this if you're prepared for some hackery. First declare some variables:

<xsl:variable name="underscores"
  select="'_______________________________________________________'" />
<xsl:variable name="initialNameChars"
  select="'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ:_'" />
<xsl:variable name="nameChars"
  select="concat($initialNameChars, '0123456789.-')" />

Now the technique is to take the name and identify the characters that aren't legal by replacing all the characters in the name that are legal with nothing. You can do this with the translate() function. Once you've got the set of illegal characters that appear in the string, you can replace them with underscores using the translate() function again. Here's the template:

<xsl:template name="normalizeName">
  <xsl:param name="name" />
  <xsl:variable name="first" select="substring($name, 1, 1)" />
  <xsl:variable name="rest" select="substring($name, 2)" />
  <xsl:variable name="illegalFirst"
    select="translate($first, $initialNameChars, '')" />
  <xsl:variable name="illegalRest"
    select="translate($rest, $nameChars, '')" />
  <xsl:value-of select="concat(translate($first, $illegalFirst, $underscores),
                               translate($rest, $illegalRest, $underscores))" />
</xsl:template>
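
To see how the two translate() calls fit together, here is a hand trace for a made-up input, '2nd item/a b', assuming the three variables above are declared at the top level of the stylesheet:

<!-- Hand trace for $name = '2nd item/a b' (made-up input):
     $first        = '2'
     $rest         = 'nd item/a b'
     $illegalFirst = '2'      (the digit is not an initial name character)
     $illegalRest  = ' / '    (space, slash, space; duplicates are kept)
     translate($first, $illegalFirst, $underscores) = '_'
     translate($rest, $illegalRest, $underscores)   = 'nd_item_a_b'
     result        = '_nd_item_a_b' -->
<xsl:call-template name="normalizeName">
  <xsl:with-param name="name" select="'2nd item/a b'" />
</xsl:call-template>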

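The only thing you have to watch out for is that the string of underscores needs to be long enough to cover all the illegal characters that might appear within a single name: translate() deletes any character in its second argument that has no counterpart in the third, so an illegal character that first shows up past the end of $underscores would be dropped rather than replaced. Making it the same length as the longest name you're likely to encounter will do the trick (though probably you could get away with it being a lot shorter).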
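
Putting it to use for the attribute-to-element conversion in the question might look like the following sketch. The match pattern item and the variable name elementName are placeholders, not anything from the original stylesheet. Two caveats: an empty @name still yields an empty (and so invalid) element name, and because xsl:element treats the name as a QName, a value containing a colon is interpreted as prefix:local-name, so it will fail unless that prefix is declared.

<xsl:template match="item">
  <!-- Capture the normalized name as a string -->
  <xsl:variable name="elementName">
    <xsl:call-template name="normalizeName">
      <xsl:with-param name="name" select="@name" />
    </xsl:call-template>
  </xsl:variable>
  <!-- Use it as the element name via an attribute value template -->
  <xsl:element name="{$elementName}">
    <xsl:value-of select="." />
  </xsl:element>
</xsl:template>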

JeniT
I was trying to steer clear of the recursive template route for efficiency. I figured there had to be a better way to do it and your very hackish double translate solution is exactly what I was looking for. I'm actually laughing at how hacky it is, but I love it. Thanks!
phloopy
A: 

As another alternative, there is a string function in the XSLT Standard Library that might work for you: http://xsltsl.sourceforge.net/string.html#template.str:string-match

Jim Burger