For now I'm using this regexp:

^\A([a-z0-9\.\-_\+]+)@((?:[-a-z0-9]+\.)+[a-z]{2,})\Z$

I don't think it is very good. What is the best regular expression you have written or seen for validating JIDs?

For reference, Section 3 of the XMPP core standard defines a JID in Augmented Backus-Naur Form as

jid             = [ node "@" ] domain [ "/" resource ]
domain          = fqdn / address-literal
fqdn            = (sub-domain 1*("." sub-domain))
sub-domain      = (internationalized domain label)
address-literal = IPv4address / IPv6address
+3  A: 

Your regexp is wrong in at least the following respects:

  1. It requires the JID to contain an '@', although JIDs without one (a bare domain, optionally followed by a resource) can also be valid.
  2. It doesn't check the maximum length, although the standard you linked says "Each allowable portion of a JID MUST NOT be more than 1023 bytes in length".
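
Both points are easy to reproduce. Here is a minimal sketch, assuming Python's `re` module (the question does not say which regex engine is in use, and `OLD_JID_RE` is only an illustrative name):

```python
import re

# The regexp from the question, unchanged. Python accepts \A and \Z;
# other engines may treat these anchors slightly differently.
OLD_JID_RE = re.compile(r"^\A([a-z0-9\.\-_\+]+)@((?:[-a-z0-9]+\.)+[a-z]{2,})\Z$")

# Point 1: a bare domain is allowed by the grammar but has no '@',
# so the regexp rejects it.
bare_domain_rejected = OLD_JID_RE.match("example.com") is None

# Point 2: a node far longer than 1023 bytes is still accepted,
# because nothing in the regexp limits the length.
overlong_jid = "a" * 2000 + "@example.com"
overlong_accepted = OLD_JID_RE.match(overlong_jid) is not None

print(bare_domain_rejected, overlong_accepted)
```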

I think one huge regexp is the wrong way to go. You would be better off writing a little more code that splits the JID into its parts (node, domain, resource) and then checks each part separately. That is better in several respects:

  • easier testing (you can unit test each of the parts independently)
  • better performance
  • simpler code
  • reusability
  • etc.
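
To make the idea concrete, here is a rough Python sketch of the split-then-check approach. The function names `split_jid` and `is_plausible_jid` are made up for illustration, and the checks are deliberately shallow: it only enforces the structure and the 1023-byte limit quoted above, and assumes the limit is measured on the UTF-8 encoding. It does not attempt the domain-label or character-profile validation a full implementation would need.

```python
MAX_PART_BYTES = 1023  # per-part limit quoted from the standard above


def split_jid(jid):
    """Split a JID into (node, domain, resource); absent parts are None."""
    # The resource starts after the first "/" and may itself contain
    # "/" or "@", so it has to be cut off before looking for the node.
    rest, slash, resource = jid.partition("/")
    if not slash:
        resource = None
    # In what remains, anything before the "@" is the node.
    node, at, domain = rest.partition("@")
    if not at:
        node, domain = None, rest
    return node, domain, resource


def is_plausible_jid(jid):
    """Shallow structural check; not a full validator."""
    node, domain, resource = split_jid(jid)
    # The domain is mandatory and must not contain a second "@".
    if not domain or "@" in domain:
        return False
    # A node or resource that is present must not be empty.
    if node == "" or resource == "":
        return False
    # Each part that is present must fit in the byte limit
    # (assumption: counted on the UTF-8 encoding).
    for part in (node, domain, resource):
        if part is not None and len(part.encode("utf-8")) > MAX_PART_BYTES:
            return False
    return True
```

Each function can be unit tested on its own, and a stricter domain or node check can later be added in one place without touching the splitting logic.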
Olexiy
Thanks for the tip.
Anton Mironov
+1  A: 

Try this:

^(?:([^@/<>'\"]+)@)?([^@/<>'\"]+)(?:/([^<>'\"]*))?$

It's not quite right: it matches many strings that aren't valid JIDs, particularly in the domain name portion. However, it should accept and parse all valid JIDs, with group 1 being the node, group 2 being the domain, and group 3 being the resource.


Test Data:

foo                 (None,  'foo', None)
foo@example.com     ('foo', 'example.com', None)
foo@example.com/bar ('foo', 'example.com', 'bar')
example.com/bar     (None,  'example.com', 'bar')
example.com/bar@baz (None,  'example.com', 'bar@baz')
example.com/bar/baz (None,  'example.com', 'bar/baz')
bär@exämple.com/bäz ('bär', 'exämple.com', 'bäz')
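
The table above can be checked with a short script. This is a sketch assuming Python's `re` module; `JID_RE` and `CASES` are just illustrative names, and the pattern is the one given above, unchanged:

```python
import re

# Raw triple-quoted string so the ' and \" in the character classes
# need no extra escaping on the Python side.
JID_RE = re.compile(r"""^(?:([^@/<>'\"]+)@)?([^@/<>'\"]+)(?:/([^<>'\"]*))?$""")

# (input, expected (node, domain, resource)) pairs from the test data.
CASES = [
    ("foo", (None, "foo", None)),
    ("foo@example.com", ("foo", "example.com", None)),
    ("foo@example.com/bar", ("foo", "example.com", "bar")),
    ("example.com/bar", (None, "example.com", "bar")),
    ("example.com/bar@baz", (None, "example.com", "bar@baz")),
    ("example.com/bar/baz", (None, "example.com", "bar/baz")),
    ("bär@exämple.com/bäz", ("bär", "exämple.com", "bäz")),
]

for jid, expected in CASES:
    match = JID_RE.match(jid)
    # groups() returns None for optional groups that did not take part.
    assert match is not None and match.groups() == expected, jid
```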


Aside: if you aren't familiar with the construct (?:...), it's a non-capturing group: a set of parens that groups part of the pattern without adding a group to the output.

Joe Hildebrand