Is it appropriate to use XML tags (element names) written in non-ASCII natural languages? The XML spec allows it (see Names and Exceptions), but I couldn't find any best practices about this at W3C and related pages.

What I'm looking for is practical advice regarding which tools support this, whether important XML-related technologies such as XSLT and XForms may have problems with it, etc.

I think Andrey and Tomalak are missing the point. XML is not necessarily read by programmers, it is read by many different professionals. So the arguments comparing it to source code don't necessarily apply.

Let me clarify: I mean a Bulgarian legal domain, where many terms are specific to the Bulgarian legal process, and may not even have exact English translations. Translating them would be laborious, imprecise and impractical. Transliterating to ASCII is suboptimal.

So back to the question: what tool limitations would I face? (Eclipse supports UTF, so writing xpaths wouldn't be a problem.)

To get people started in the technical direction that I'd like: in several systems we've used generation techniques to ensure perfect correspondence between XML schemas, Java beans and database schemas.

+4  A: 

It is a bad idea, just like naming variables in your native language. You automatically make your program unreadable for the majority of developers.

Andrey
XML <> program. XML is often read by professionals other than developers
Vladimir Alexiev
@Vladimir Alexiev Replace the word "developers" with "professionals"; the meaning remains the same.
Andrey
+2  A: 

Short answer: You can name your XML elements any way you please.

Slightly longer answer: If you want to use the most portable/maintainable XML, you should probably use ASCII-only element names. I can think of no good reason to use other characters in the element name, and it certainly helps dealing with the XML in all kinds of places.

Think of handling XML nodes with some programming language that does not necessarily have its source code files UTF-8 encoded. You would have a hard time writing working XPath expressions, for example, in such a language. Or think of maintainers/programmers who do not speak the language that your element names are in, but are in charge of the source code. You are kind of locking yourself in when your element names are in Cyrillic script, for example. Element names should carry structure and meaning, and there is no apparent reason that would rule out ASCII for that purpose.
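To make the source-encoding concern concrete, here is a minimal sketch in Python using the standard library's `xml.etree.ElementTree`. The document and its element names (`дело`, `ищец`, `ответник`) are made up for illustration and do not come from any real schema. It looks up a Cyrillic-named element twice: once with the name typed literally, which relies on the source file being saved as UTF-8, and once with `\u` escapes, which keeps the source file pure ASCII at the cost of readability.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; the element names are illustrative only.
DOC = """<?xml version="1.0" encoding="UTF-8"?>
<дело номер="42">
  <ищец>Иван Петров</ищец>
  <ответник>Мария Георгиева</ответник>
</дело>"""

# Parse from bytes so the parser honours the encoding in the XML declaration.
root = ET.fromstring(DOC.encode("utf-8"))

# Path written literally: needs a UTF-8 (or otherwise Cyrillic-capable) source file.
plaintiff = root.findtext("ищец")

# The same path written with escapes: works from an ASCII-only source file.
plaintiff_escaped = root.findtext("\u0438\u0449\u0435\u0446")

print(root.tag, root.get("номер"), plaintiff, plaintiff == plaintiff_escaped)
```

Whether the escaped form is acceptable in practice depends on how many such paths the code contains; it is shown here only as a workaround for toolchains that cannot store non-ASCII source text.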

Tomalak
I wonder how hard it would be for you to think of good reasons if Latin letters were as foreign to you as Cyrillic is.
Michael Borgwardt
@Michael: See my "Short answer". Apart from that, if Latin letters are as foreign to you as Cyrillic ones are to me (I can read Cyrillic, BTW), then you very probably are not a programmer and do not have problems involving XML files in the first place. This has nothing to do with my personal exposure to one foreign script or the other; ASCII *is* the least common denominator when it comes to communicating with a computer.
Tomalak
A: 

It depends on you and your development rules. But XML tag names should be easily readable and understandable by everyone. Even someone who joins you later should be able to understand them. So it is better to name them according to proper naming conventions.

See the example below.

<user name="hero">
  <address>
    <street></street>
  </address>
</user>

thanks.

Paarth
"Proper naming conventions" cannot mean "excludes Cyrillic" and "understandable by everyone" cannot mean "English-readers, maybe developers". How about Bulgarian legal professionals?
Vladimir Alexiev
+2  A: 

Write your XML in whatever language you like. Make sure that the encoding supports the character set you are using, and that you state the correct encoding in the XML declaration.

That will help to separate tools that support XML from tools that only claim to.
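As a rough illustration of the encoding advice, the following Python sketch serializes a document with Cyrillic element names in a legacy single-byte encoding and parses it back. It uses only the standard library's `xml.etree.ElementTree`; the element names and the choice of `windows-1251` are assumptions made for the example, not something taken from the question.

```python
import xml.etree.ElementTree as ET

# Build a tiny document with hypothetical Cyrillic element names.
root = ET.Element("дело")
ET.SubElement(root, "ищец").text = "Иван Петров"

# Serialize in a legacy Cyrillic encoding. For encodings other than
# UTF-8 and US-ASCII, ElementTree emits an XML declaration naming the encoding.
data = ET.tostring(root, encoding="windows-1251")

# Parse the bytes back; the parser reads the encoding from the declaration.
parsed = ET.fromstring(data)

print(data.split(b"\n")[0])
print(parsed.tag, parsed.findtext("ищец"))
```

If the declaration and the actual byte encoding disagree, a conforming parser will either fail or produce the wrong names. Note also that character references such as `&#x434;` are not allowed inside element names, so every name character has to be representable in the declared encoding.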

John Saunders
+1, but I think Vladimir's question is more along the lines of 'Which common XML tools have technical problems with non-latin tags (despite the fact that the spec allows them)?'.
whybird
I don't know of any that have such a problem. Any that have problems should be publicized and publicly ridiculed.
John Saunders
Totally agree..
whybird
+4  A: 

If the content of the documents will be in Bulgarian, then the markup should be able to be in Bulgarian too.

If your tool chain can't parse the tags in that language then how can you be sure that it is handling the content correctly?

Programmers will always have to learn the language of the target domain, whether it be finance, genetics, engineering or the Bulgarian legal system. Compromising usability for the convenience of the programmer is almost always a 'Bad Thing'. Whatever effort is saved up front ends up getting lost as impeded end user productivity and in support effort/cost over the lifetime of the product.

Matthew S
+1, also this is what XML was designed for! :)
Porges
A: 

I'm sorry to say this, but if your non-technical users need to read raw XML, your application is broken. And the data you store will not usually have a 1-1 correspondence with user messages, either: many things are stored redundantly in XML, and other things are implicit in the data.

I think you should, yes, store all your XML data in Bulgarian, using the UTF-8 encoding. But in attributes, not in the XML tag structure.

What I have in mind is this: you could design your program so that any part of the legal structure can be modified freely from the user interface (maybe on a special "admin" panel, but still far from the code), and is in no way hard-coded into the file format. The reason for this is that laws change, jurisprudence changes and legal terms may change as well. (Well, some don't.)

This may enable you to create a fairly general file format (think about one that could be used in the US or Japan, too; even if you don't plan to actually do it, your chances of designing a flexible file format will be greater that way).

This may be harder. You need to be prepared to handle inconsistent, incomplete or otherwise poor data. But you should be doing this anyway. And you may be rewarded, too: the file format could be cleaner and future-proof, making your software more flexible. Or maybe not. Notice the "may"s and "could"s here. It actually depends on your specific design trade-offs.

And, of course, you need to have some balance here. At the end of the day, the burden of designing a reliable, flexible system is on you. You may take the approach of writing the tags in Bulgarian. I'm from Brazil, and I find it odd to think about something like , but it could work.

About your actual concerns on tool limitations: I have no idea. You should first look at the documentation of your favorite XML library and see if it explicitly claims to support this. Even the most widely used programs may not fully support a feature that is rarely used.

Why are you thinking "one application"? The context is the design of eGovernment XML exchange patterns and rules. Of course we'll look at GJXDM and NIEM (the US have done a great job, expressing a police state in IT ;-). Notice that GJXDM/NIEM have many US idiosyncrasies, e.g. "apparent academic term of an international exchange student", so it's not a world-wide universal legal XML schema. Similarly, there are many BG-specific legal concepts that are best expressed in Bulgarian.
Vladimir Alexiev
A: 

what tool limitations would I face?

If I recall correctly, the set of allowed characters in XML names was originally different in XML 1.0 and XML 1.1, the latter also allowing some previously excluded South-East Asian scripts. This changed in the fifth (= latest) edition of the XML 1.0 recommendation, and now the allowed name characters are the same. So, at least theoretically, some tools that claim to be XML 1.0 compatible could have problems with these newly allowed characters if they check name character validity and conform only to the fourth edition of XML 1.0.

But in your case this problem is merely theoretical if you only use ASCII and Bulgarian characters.
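For anyone who wants to check candidate element names up front, here is a small Python sketch of the name rule as I understand it from the XML 1.0 Fifth Edition grammar (the `NameStartChar` and `NameChar` productions). The function name `is_xml_name_5e` and the sample names are made up for this example. It does not model the stricter fourth-edition rules, so it cannot tell you whether an older tool would reject a name.

```python
# Code point ranges for NameStartChar in XML 1.0 (Fifth Edition).
NAME_START_RANGES = [
    (0x3A, 0x3A), (0x41, 0x5A), (0x5F, 0x5F), (0x61, 0x7A),
    (0xC0, 0xD6), (0xD8, 0xF6), (0xF8, 0x2FF), (0x370, 0x37D),
    (0x37F, 0x1FFF), (0x200C, 0x200D), (0x2070, 0x218F),
    (0x2C00, 0x2FEF), (0x3001, 0xD7FF), (0xF900, 0xFDCF),
    (0xFDF0, 0xFFFD), (0x10000, 0xEFFFF),
]

# Additional code points allowed after the first character (NameChar):
# "-", ".", digits, U+00B7, combining marks, and U+203F..U+2040.
NAME_EXTRA_RANGES = [
    (0x2D, 0x2E), (0x30, 0x39), (0xB7, 0xB7), (0x300, 0x36F), (0x203F, 0x2040),
]


def _in_ranges(ch, ranges):
    """Return True if the character's code point falls in any (low, high) range."""
    cp = ord(ch)
    return any(low <= cp <= high for low, high in ranges)


def is_xml_name_5e(name):
    """Check a string against the Fifth Edition Name production (hypothetical helper)."""
    if not name:
        return False
    if not _in_ranges(name[0], NAME_START_RANGES):
        return False
    return all(
        _in_ranges(ch, NAME_START_RANGES) or _in_ranges(ch, NAME_EXTRA_RANGES)
        for ch in name[1:]
    )


for candidate in ["дело", "дело-1", "1дело", "street", ""]:
    print(repr(candidate), is_xml_name_5e(candidate))
```

A check like this only says that a name is legal under the specification; it says nothing about whether a particular editor, XSLT processor or code generator actually handles it, which still has to be tested tool by tool.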

jasso