JID parsing too strict?

Sat Jul 1 16:12:43 UTC 2006

The regex DJabberd uses for parsing JIDs is as follows:

^(?: ([\w\.]{1,1023}) \@)?              # optional node
            ([^/]{1,1023})               # domain
            (?: /(.{1,1023})   )?        # optional resource

So this means that:
   * The node can be up to 1023 characters in length and may only
     contain letters, digits, underscores and dots.
   * The domain can be up to 1023 characters in length and can contain
     any character except slash.
   * The resource can be up to 1023 characters in length and can contain
     anything.

Was this set of rules derived from any spec?

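I ask because I note that the conference server on jabber.org gives some 
of its rooms names starting with a hash (#), which the node part of the 
above regex does not allow. Because the node group is optional and the 
domain part accepts anything except a slash (including @), the regex 
still matches, but the whole JID ends up being treated as the domain.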
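
To make the failure concrete, here is a rough Python port of the regex quoted above. This is only a sketch: the original is Perl, the room JID and hostnames are made up for illustration, and split_jid is a hypothetical helper, not anything in DJabberd. Python's \w may also differ slightly from Perl's for non-ASCII input.

```python
import re

# Python port of the regex quoted above (assumption: the Perl original
# is compiled with /x, so whitespace and comments are ignored).
JID_RE = re.compile(r"""
    ^(?: ([\w.]{1,1023}) @ )?    # optional node
    ([^/]{1,1023})               # domain
    (?: /(.{1,1023}) )?          # optional resource
""", re.VERBOSE)

def split_jid(jid):
    """Return (node, domain, resource), or None if the regex does not match."""
    m = JID_RE.match(jid)
    return m.groups() if m else None

# An ordinary JID splits as expected.
print(split_jid("user@example.com/home"))
# A room name starting with '#' (hypothetical JID): the node group is
# skipped and everything, including the '@', lands in the domain.
print(split_jid("#room@conference.example.org"))
```

The second call returns no node at all, with "#room@conference.example.org" as the domain.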

The best reference I could find on the syntax of a JID is in JEP-0029, 
which is apparently retracted. However, the syntax it gives is as follows:

<JID> ::= [<node>"@"]<domain>["/"<resource>]
<node> ::= <conforming-char>[<conforming-char>]*
<domain> ::= <hname>["."<hname>]*
<resource> ::= <any-char>[<any-char>]*
<hname> ::= <let>|<dig>[[<let>|<dig>|"-"]*<let>|<dig>]
<let> ::= [a-z] | [A-Z]
<dig> ::= [0-9]
<conforming-char> ::= #x21 | [#x23-#x25] | [#x28-#x2E] |
                       [#x30-#x39] | #x3B | #x3D | #x3F |
                       [#x41-#x7E] | [#x80-#xD7FF] |
                       [#xE000-#xFFFD] | [#x10000-#x10FFFF]
<any-char> ::= [#x20-#xD7FF] | [#xE000-#xFFFD] |
                [#x10000-#x10FFFF]

...so the node may contain many characters that DJabberd's regex 
rejects, including the hash.

For want of any better specification on this (anyone know where the 
*real* spec for JIDs lives?) it seems like a good idea to go ahead and 
support this grammar, since DJabberd's current regex rejects some JIDs 
that are already in use in the wild.
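
As a starting point, the grammar above could be translated into a regex along these lines. This is a sketch in Python rather than a patch for DJabberd: the names are hypothetical, the <hname> rule is read as "alphanumeric at both ends, hyphens allowed in between" (the BNF is a bit ambiguous there), and no length limits are enforced because the grammar as quoted doesn't give any.

```python
import re

# Character classes transcribed from the grammar quoted above.
CONFORMING = (r"[\x21\x23-\x25\x28-\x2E\x30-\x39\x3B\x3D\x3F\x41-\x7E"
              r"\x80-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]")
ANY_CHAR = r"[\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
# One reading of <hname>: alphanumeric at both ends, hyphens inside.
HNAME = r"[A-Za-z0-9](?:[A-Za-z0-9-]*[A-Za-z0-9])?"

JEP29_JID_RE = re.compile(
    rf"(?:({CONFORMING}+)@)?"      # optional node
    rf"({HNAME}(?:\.{HNAME})*)"    # domain: dot-separated hnames
    rf"(?:/({ANY_CHAR}+))?"        # optional resource
)

def parse_jid(jid):
    """Return (node, domain, resource), or None if the JID is not valid."""
    m = JEP29_JID_RE.fullmatch(jid)
    return m.groups() if m else None

# Hypothetical JIDs for illustration.
print(parse_jid("#room@conference.example.org"))
print(parse_jid("user@example.com/home office"))
print(parse_jid("bad node@example.com"))
```

With this the hash-prefixed room name parses with "#room" as the node, while a node containing a space is rejected outright instead of being swallowed into the domain.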