It is often handy to have a canonical representation of a language (in my case, usually a domain-specific language); however, I believe there are strict limits on the expressiveness of the languages involved that determine whether a canonical form can be defined and/or computed for an arbitrary program in that language. Unfortunately, I've been unable to find the references I (vaguely) recall reading on this topic.
On one hand, it seems reasonable that computing a canonical representation of a program is of comparable complexity to hard graph problems (e.g., graph isomorphism); on the other hand, if I recall correctly, compilers like gcc, yhc, and ghc use intermediate representations to generate output in various formats (assembly, JavaScript, etc.), so this is, at least in some form, a solved problem.
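To illustrate the intermediate-representation point, here's a toy sketch in Haskell (the `IR` type and `emit` functions are hypothetical, not drawn from any real compiler): one shared representation lowered to two different output formats.

```haskell
-- Toy sketch only: the IR type and emit* functions are made up to
-- illustrate "one IR, many backends", not taken from gcc/yhc/ghc.
module Main where

data IR
  = Lit Int
  | Add IR IR
  | Mul IR IR

-- Backend 1: emit a Lisp-like s-expression form.
emitSexpr :: IR -> String
emitSexpr (Lit n)   = show n
emitSexpr (Add a b) = "(+ " ++ emitSexpr a ++ " " ++ emitSexpr b ++ ")"
emitSexpr (Mul a b) = "(* " ++ emitSexpr a ++ " " ++ emitSexpr b ++ ")"

-- Backend 2: emit JavaScript-flavoured infix syntax.
emitJs :: IR -> String
emitJs (Lit n)   = show n
emitJs (Add a b) = "(" ++ emitJs a ++ " + " ++ emitJs b ++ ")"
emitJs (Mul a b) = "(" ++ emitJs a ++ " * " ++ emitJs b ++ ")"

main :: IO ()
main = do
  let prog = Add (Lit 1) (Mul (Lit 2) (Lit 3))
  putStrLn (emitSexpr prog)  -- prints (+ 1 (* 2 3))
  putStrLn (emitJs prog)     -- prints (1 + (2 * 3))
```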
When is it possible to determine or generate a canonical form for a given language? (How expressive can that language be, and how does the language's expressiveness affect the utility of its canonical forms?) Please provide references or proofs if at all possible.
Edit: For example, a regular language (e.g., the 'pure' form of regular expressions) cannot express many of the things that a Turing-complete language can: you can't write a web server in a regular language, but you can with the lambda calculus. My question is about the theoretical possibilities, and it does have a specific answer relating to complexity theory. If I have a DSL that needs to be transmitted to another system, it is often beneficial to generate a canonical form of the code before transmitting it, since that decouples the representations used independently by the two systems. However, if translating a Turing-complete language into a canonical form is PSPACE-complete or NP-complete, then you shouldn't waste time trying to build one -- either find another way to do it, or reduce the language's expressiveness to something that can be canonicalized in polynomial time.
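To make concrete what I mean by a canonical form, here's a toy Haskell sketch (the `Expr` type and `canon` function are purely illustrative, not from any real system): a deliberately weak language of variables combined by a commutative, associative sum, where flattening nested sums and sorting operands gives a polynomial-time canonical form, so equivalence checking reduces to syntactic equality.

```haskell
-- Toy sketch only: `Expr` and `canon` are invented for illustration.
-- The language is deliberately weak (variables plus a commutative,
-- associative sum), which is what makes a cheap canonical form possible.
module Main where

import Data.List (sort)

data Expr = Var String | Sum [Expr]
  deriving (Show, Eq, Ord)

-- Canonicalize by flattening nested sums and sorting the operands.
canon :: Expr -> Expr
canon (Var x)  = Var x
canon (Sum es) =
  case sort (concatMap flatten (map canon es)) of
    [e] -> e          -- collapse a one-element sum
    es' -> Sum es'
  where
    flatten (Sum inner) = inner
    flatten e           = [e]

main :: IO ()
main = do
  -- Two syntactically different but equivalent programs...
  let p1 = Sum [Var "a", Sum [Var "b", Var "c"]]
      p2 = Sum [Sum [Var "c", Var "a"], Var "b"]
  -- ...map to the same representative.
  print (canon p1 == canon p2)  -- True
```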