Hi all,

I'm currently working on a "proper" URI validator, and it all comes down to hostname validation; the rest isn't that tricky.

I'm stuck at IDN hostname labels (i.e. labels containing Unicode; any punycode-encoded strings have already been decoded at this point).

My first idea was basically one regex for TLDs that don't support IDN and one for those that do (http://www.mozilla.org/projects/security/tld-idn-policy-list.html (?)).

Respectively: ^[a-zA-Z0-9-]+$ and ^[a-zA-Z0-9\p{L}-]+$

However, this is not ideal, since every IDN registry can decide for itself which characters to allow.

What I'm looking for is a proper, consistent, up-to-date data table of the Unicode characters allowed in the various TLDs. I'm getting the impression that I'd have to collect all the data myself from Russian and Chinese registry sites (which is quite difficult).

So before scouring the web, I wondered: is there such a list? Or are there better approaches, best/common practices, etc.? (I want the validation to be as strict as possible.)

Any help is welcome!

// Roland

A: 

Can't you convert all Unicode domains to punycode and validate that? Since DNS doesn't support real UTF-8 characters anyway, this might be the best solution.
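
A minimal sketch of that idea in Python, in case it helps: convert the user's Unicode hostname internally and then check each resulting ASCII label. The function name `to_ascii_hostname` and the exact limits checked are my own choices for illustration. Note that Python's built-in `idna` codec implements the older IDNA 2003 rules, and that this only checks syntax; it says nothing about which characters a given registry actually permits.

```python
import re

# LDH rule for a single ASCII label: letters, digits and hyphens only,
# no leading or trailing hyphen, 1 to 63 characters.
LDH_LABEL = re.compile(r"(?!-)[a-z0-9-]{1,63}(?<!-)", re.IGNORECASE)


def to_ascii_hostname(hostname):
    """Return the ASCII (punycode) form of hostname, or None if it is invalid.

    Hypothetical helper, not part of any library.
    """
    try:
        # The stdlib "idna" codec applies nameprep and punycode per label.
        ascii_host = hostname.encode("idna").decode("ascii")
    except UnicodeError:
        # Raised for empty or over-long labels and for prohibited characters.
        return None

    # Allow one trailing dot (fully qualified form) and drop it.
    if ascii_host.endswith("."):
        ascii_host = ascii_host[:-1]

    if not ascii_host or len(ascii_host) > 253:
        return None

    # Every label of the converted name must satisfy the LDH rule.
    if all(LDH_LABEL.fullmatch(label) for label in ascii_host.split(".")):
        return ascii_host
    return None
```

For example, `to_ascii_hostname("bücher.example")` gives `"xn--bcher-kva.example"`, while `to_ascii_hostname("-bad.example")` gives `None`. Users still type Unicode; the conversion happens behind the scenes.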

Byron Whitlock
True, I thought of that too. However, this is about user input; I can't tell my users to enter URIs that are already converted to punycode. So that leaves me with (what you probably meant) converting it internally to punycode. Still, that doesn't mean the hostname is really valid (correct me if I'm wrong), so in that case it's basically the same as matching any Unicode letter (\p{L}) and considering it valid. That will be my fallback if I can't find a good solution; if it comes to that, would you suggest sticking to the list Mozilla provides (i.e. two regexes)?
Roland Franssen
To clarify the above: TLDs listed by Mozilla -> [a-zA-Z0-9\-\p{L}] / all other TLDs -> [a-zA-Z0-9\-]. Would this be proper validation?
Roland Franssen
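
For reference, the two-pattern check from the comment above could be sketched in Python roughly like this. Python's `re` module has no `\p{L}`, so the sketch uses `str.isalpha()`, which is true for characters in the Unicode letter categories. The `IDN_TLDS` set is a placeholder assumption for illustration, not the actual Mozilla list, and `label_allowed` is a made-up name.

```python
import re

# Placeholder sample of IDN-enabled TLDs, for illustration only; a real
# list would have to be loaded from a maintained source.
IDN_TLDS = {"de", "jp"}

# Pattern for TLDs without IDN support: ASCII letters, digits, hyphen.
ASCII_LABEL = re.compile(r"[a-zA-Z0-9-]+")


def is_idn_label_char(ch):
    # str.isalpha() covers the Unicode letter categories, i.e. \p{L};
    # digits are restricted to ASCII 0-9 on purpose.
    return ch.isalpha() or ch in "0123456789-"


def label_allowed(label, tld):
    """Check one decoded hostname label against the pattern for its TLD."""
    if not label:
        return False
    if tld.lower() in IDN_TLDS:
        return all(is_idn_label_char(ch) for ch in label)
    return ASCII_LABEL.fullmatch(label) is not None
```

With the placeholder set, `label_allowed("bücher", "de")` is true and `label_allowed("bücher", "com")` is false. As the question already notes, this accepts any letter for an IDN-enabled TLD, so it is looser than the per-registry character tables.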