ansaurus

Question

XSLT character padding for European characters to fixed width output.

Answer 1

A:

This is not an XSLT issue, but probably an encoding issue of the output. How is your XSLT executed? Probably, you will have to change the settings for the output writer.

As Oded remarked, the problem might lie with the encoding used to read the input rather than with the output encoding. According to the XPath specification, string-length counts characters, so you may be counting an Ä that was decoded into more than one character. Maybe the input is UTF-8 but your configuration reads it as a single-byte encoding?

Frank 2010-07-01 08:07:20

It could also be an encoding issue with the input.

Oded 2010-07-01 08:09:52

Yes, that might well be the case!

Frank 2010-07-01 08:19:01

Answer 2

A:

Are you counting bytes or characters? The Ã you are mentioning is 1 character, but 2 bytes (when using UTF-8, which seems to be the case). Characters in UTF-8 can take 1-4 bytes.

If string-length counts bytes, the result is correct.
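To make the distinction concrete, here is a minimal Python sketch (Python rather than XSLT, since XSLT has no byte-level view). The sample string "Ãbcd" is only an illustration and assumes the precomposed character U+00C3:

```python
# Sample value; \u00c3 is the precomposed "Ã" (LATIN CAPITAL LETTER A WITH TILDE).
text = "\u00c3bcd"

char_count = len(text)                  # number of Unicode code points
byte_count = len(text.encode("utf-8"))  # number of bytes once encoded as UTF-8

print(char_count, byte_count)  # 4 5
```

A count of 5 is therefore what a byte-based length would report for UTF-8 data, while a character-based length reports 4.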

Piskvor 2010-07-01 08:09:21

string-length counts characters, see http://www.w3.org/TR/2007/REC-xpath-functions-20070123/#func-string-length. So this is probably an issue of the input reader using the wrong encoding, as Oded proposed in his comment.

Frank 2010-07-01 08:20:43

Answer 3

+3 A:

The problem is that combining diacritical marks can be used instead of single characters. This is what gives you the "wrong length".

See http://en.wikipedia.org/wiki/Combining_character for more info on those characters.

If you have XSLT 2, there is a built-in function to normalize them which should work: fn:normalize-unicode
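The effect of normalization can be sketched outside XSLT with Python's standard unicodedata module. This is only an illustration of what composing to NFC does to the length, and the sample string is an assumption, not taken from the question:

```python
import unicodedata

# "A" followed by U+0303 COMBINING TILDE: renders like "Ã" but is two code points.
decomposed = "A\u0303bcd"

# NFC composes base + combining mark into the precomposed form where one exists.
composed = unicodedata.normalize("NFC", decomposed)

print(len(decomposed), len(composed))  # 5 4
```

Normalization only helps where a precomposed character exists; sequences without one keep their combining marks.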

For XSLT 1.0, you'd have to use some function to count the characters excluding the combining characters. One possibility may be the use of translate:

translate($input, '&#768;&#769;&#770;&#771;&#772;&#773;&#774;&#775;&#776;&#777;&#778;&#779;&#780;&#781;&#782;&#783;&#784;&#785;&#786;&#787;&#788;&#789;&#790;&#791;&#792;&#793;&#794;&#795;&#796;&#797;&#798;&#799;&#800;&#801;&#802;&#803;&#804;&#805;&#806;&#807;&#808;&#809;&#810;&#811;&#812;&#813;&#814;&#815;&#816;&#817;&#818;&#819;&#820;&#821;&#822;&#823;&#824;&#825;&#826;&#827;&#828;&#829;&#830;&#831;&#832;&#833;&#834;&#835;&#836;&#837;&#838;&#839;&#840;&#841;&#842;&#843;&#844;&#845;&#846;&#847;&#848;&#849;&#850;&#851;&#852;&#853;&#854;&#855;&#856;&#857;&#858;&#859;&#860;&#861;&#862;&#863;&#864;&#865;&#866;&#867;&#868;&#869;&#870;&#871;&#872;&#873;&#874;&#875;&#876;&#877;&#878;&#879;', '')
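For comparison, the same idea as the translate call can be sketched in Python: delete the code points 768 to 879 (U+0300 to U+036F) and count what is left. The function name visible_length is hypothetical, and the sketch covers only that one block of combining marks, just like the expression above:

```python
# Map every code point in U+0300..U+036F (decimal 768..879) to None,
# which tells str.translate to delete it.
COMBINING_MARKS = {cp: None for cp in range(0x0300, 0x0370)}

def visible_length(s):
    """Count characters after removing the combining marks listed above."""
    return len(s.translate(COMBINING_MARKS))

print(visible_length("A\u0303bcd"))  # 4
print(visible_length("\u00c3bcd"))   # 4
```

Unlike normalization, stripping also covers base-plus-mark pairs that have no precomposed equivalent, but it still ignores combining characters from other Unicode blocks.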

Note that you'll have even more problems if you have Asian characters which are combined.

Quote from http://www.dpawson.co.uk/xsl/characters.html

However if the Unicode combining character is used and the input file has e' (where ' is really the combining acute character) then while any Unicode aware renderer is supposed to make this into an e acute for rendering, to an XML engine it is two characters, e and acute.

Lucero 2010-07-01 08:13:41

I think it should be emphasized that this is not an XSLT problem but a rendering problem: two distinct strings (one with a single character and one with a character plus a diacritical mark) can be rendered in the same way. Therefore, the problem is how to reproduce the rendering algorithm in XSLT (which XSLT has no reason to know beforehand).

Alejandro 2010-07-01 16:28:23

@Alejandro, you're completely right. But my suggestion basically does address exactly this: it tries to get the `string-length()` to return the rendered width instead of the character width.

Lucero 2010-07-02 11:39:08

@Lucero: Your answer is excellent and well documented. But the quote from Dave Pawson's site could be interpreted as criticism of the capabilities of XSLT (in the sense that XSLT should be aware of the Unicode rendering algorithm). This is just an "editorial" comment.

Alejandro 2010-07-02 13:33:54

@Alejandro, I see. I think the problem is that most people don't know much about Unicode and related things such as the UTF encodings, combining characters and much more. Unicode is great, but its complexity and feature set are very often underestimated. That being said, XSLT (and any other character-based processing) usually cannot be aware of the rendering point of view. Things like non-breaking spaces, line breaks, tabs etc. are also heavily dependent on the rendering, but people are aware of most of those and therefore instinctively know how they will behave or how to handle them.

Lucero 2010-07-02 14:41:32

Answer 4

+1 A:

string-length(), like all of XSLT/XPath, is character-based, not byte based, so string-length("Ãbcd") should definitely give 4. If it gives 5 then either:

your Ã is actually two separate characters, one of them a combining tilde diacritical, and it's actually correct even if it means the columns don't visually line up. But I'm guessing probably not, since the version you pasted here is a single composed character, U+00C3 LATIN CAPITAL LETTER A WITH TILDE. or,
your input XML has been read using the wrong encoding, actually being in UTF-8 (the default for XML) but having been read as something else, typically ISO-8859-1, making the U+00C3 character, represented by the byte sequence 0xC3,0x83, come out as two characters U+00C3,U+0083 (Ã).
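The second case is easy to reproduce. A minimal Python sketch, assuming the file really contains UTF-8 and the reader wrongly decodes it as ISO-8859-1 (the sample string is illustrative):

```python
original = "\u00c3bcd"              # "Ãbcd" with the precomposed U+00C3

raw = original.encode("utf-8")      # bytes as stored in a UTF-8 file
misread = raw.decode("iso-8859-1")  # wrong decoder: every byte becomes one character

print(len(original), len(misread))          # 4 5
print([hex(ord(c)) for c in misread[:2]])   # ['0xc3', '0x83']
```

The misread string starts with U+00C3 followed by the invisible control character U+0083, which is why the count goes up by one even though the text may look almost right.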

It's not just “weird European characters” you have to worry about; if you are getting Unicode wrong then all characters outside of the basic 7-bit ASCII set are going to get mangled, including many that even insular Americans like to use.

In any case there is the question of what encoding SAP wants for its FWV input format. It's all very well treating Ã as a single character and adding the right number of padding characters for one character, but if you then output to UTF-8 and SAP doesn't actually read UTF-8, it's still going to break the import.

You'll need to find out the encoding expected by the target SAP installation (if it's not UTF-8, cp1252 is another good guess to try), and whether the fixed columns of the format are based on Unicode characters or bytes. From this (related?) spec I believe they're actually based on bytes, in which case 5 would actually be the correct byte length, if your database is supposed to contain UTF-8.

Unfortunately XSLT is all about characters and doesn't give you the chance to work with bytes, so if the input file is byte-based you'll have to either:

remove all non-ASCII characters, making the point moot, or
use another tool outside XSLT to do this processing, one that knows about bytes. To be honest this makes most sense to me: XSLT is ideal for XML-to-XML transforms and largely awful for other string processing tasks. Your template above could be made more readable and efficient by rewriting it in a couple of lines of a modern scripting language like Python.
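As a rough idea of what the byte-aware version might look like, here is a minimal sketch. The function name pad_to_bytes, the default encoding and the choice to raise on overflow are all assumptions; what the SAP import actually expects has to be checked first:

```python
def pad_to_bytes(value, width, encoding="utf-8", fill=" "):
    """Encode value and right-pad it so the result is exactly width bytes.

    Assumes the fill character encodes to a single byte (true for an
    ASCII space in UTF-8 and cp1252).
    """
    data = value.encode(encoding)
    if len(data) > width:
        # Cutting bytes could split a multi-byte character, so refuse instead.
        raise ValueError("value does not fit in %d bytes" % width)
    return data + fill.encode(encoding) * (width - len(data))

print(pad_to_bytes("\u00c3bcd", 8))                     # b'\xc3\x83bcd   '
print(pad_to_bytes("\u00c3bcd", 8, encoding="cp1252"))  # b'\xc3bcd    '
```

With UTF-8 the field gets three padding bytes because "Ã" already occupies two; with cp1252 it occupies one, so four are added. If the columns turn out to be character-based instead, padding the string before encoding would be the equivalent step.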

bobince 2010-07-01 08:34:12
