JID parsing too strict?
Martin Atkins
mart at degeneration.co.uk
Sat Jul 1 16:12:43 UTC 2006
The regex DJabberd uses for parsing JIDs is as follows:
^(?: ([\w\.]{1,1023}) \@)? # optional node
([^/]{1,1023}) # domain
(?: /(.{1,1023}) )? # optional resource
So this means that:
* The node can be up to 1023 characters in length and may only
contain letters, digits, underscores and dots.
* The domain can be up to 1023 characters in length and can contain
any character except slash.
* The resource can be up to 1023 characters in length and can contain
anything.
Was this set of rules derived from any spec?
I ask because I note that the conference server on jabber.org gives some
of its rooms names starting with a hash (#), which the node part of the
above regex does not allow. Since the node is optional, the match still
succeeds, but the whole JID ends up treated as the domain.
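To make the failure mode concrete, here is a minimal sketch that
transliterates the regex quoted above into Python's verbose regex syntax
(DJabberd itself is Perl, so this is an illustration, not its actual
code). The helper name split_jid and the example.org/example.com JIDs
are made up for the demonstration.

```python
import re

# The regex as quoted above, in Python's re.VERBOSE form.
# Assumption: no trailing anchor, exactly as shown in the post.
DJABBERD_JID = re.compile(r"""
    ^(?: ([\w\.]{1,1023}) \@ )?   # optional node
    ([^/]{1,1023})                # domain
    (?: /(.{1,1023}) )?           # optional resource
""", re.VERBOSE)


def split_jid(jid):
    """Return (node, domain, resource), or None if the regex does not match."""
    m = DJABBERD_JID.match(jid)
    return m.groups() if m else None


# An ordinary JID splits as expected.
print(split_jid("mart@example.com/home"))
# ('mart', 'example.com', 'home')

# A hypothetical room name starting with '#': the node group cannot match,
# so the domain group swallows everything up to the first slash.
print(split_jid("#perl@conference.example.org"))
# (None, '#perl@conference.example.org', None)
```

Note that the second JID is not rejected outright; it is silently
mis-parsed, which is arguably worse than a hard failure.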
The best reference I could find on the syntax of a JID is in JEP-0029,
which is apparently retracted. However, the syntax it gives is as follows:
<JID> ::= [<node>"@"]<domain>["/"<resource>]
<node> ::= <conforming-char>[<conforming-char>]*
<domain> ::= <hname>["."<hname>]*
<resource> ::= <any-char>[<any-char>]*
<hname> ::= <let>|<dig>[[<let>|<dig>|"-"]*<let>|<dig>]
<let> ::= [a-z] | [A-Z]
<dig> ::= [0-9]
<conforming-char> ::= #x21 | [#x23-#x25] | [#x28-#x2E] |
[#x30-#x39] | #x3B | #x3D | #x3F |
[#x41-#x7E] | [#x80-#xD7FF] |
[#xE000-#xFFFD] | [#x10000-#x10FFFF]
<any-char> ::= [#x20-#xD7FF] | [#xE000-#xFFFD] |
[#x10000-#x10FFFF]
...so this allows loads of different characters including the hash in
addition to the characters allowed by DJabberd's regex.
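As a rough sketch of what supporting that grammar could look like, the
productions above translate fairly directly into a regex. This is
written in Python for illustration only; the names CONFORMING, ANY_CHAR,
HNAME, DOMAIN and parse_jid are invented here, and since the grammar as
quoted states no length limits, none are enforced.

```python
import re

# <conforming-char>: character ranges copied from the grammar above.
CONFORMING = (r"[\x21\x23-\x25\x28-\x2E\x30-\x39\x3B\x3D\x3F"
              r"\x41-\x7E\x80-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]")
# <any-char>
ANY_CHAR = r"[\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
# <hname>: starts and ends with a letter or digit, hyphens allowed inside.
HNAME = r"[A-Za-z0-9](?:[A-Za-z0-9-]*[A-Za-z0-9])?"
# <domain>: one or more hnames separated by dots.
DOMAIN = HNAME + r"(?:\." + HNAME + r")*"

# <JID> ::= [<node>"@"]<domain>["/"<resource>]
JEP29_JID = re.compile("".join([
    r"(?:(", CONFORMING, r"+)@)?",   # optional node
    r"(", DOMAIN, r")",              # domain
    r"(?:/(", ANY_CHAR, r"+))?",     # optional resource
]))


def parse_jid(jid):
    """Return (node, domain, resource), or None if the whole string is not a JID."""
    m = JEP29_JID.fullmatch(jid)
    return m.groups() if m else None


# Hypothetical JIDs for illustration.
print(parse_jid("#perl@conference.example.org"))
# ('#perl', 'conference.example.org', None)
print(parse_jid("mart@example.com/home"))
# ('mart', 'example.com', 'home')
print(parse_jid("bad node@example.com"))
# None  (a space is not a conforming-char)
```

Unlike the current regex, this version requires the whole string to
match, so a node containing a disallowed character makes the JID invalid
rather than shifting it into the domain.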
For want of any better specification (does anyone know where the *real*
spec for JIDs lives?), it seems like a good idea to go ahead and support
this grammar, since DJabberd's current regex mis-parses some JIDs that
are already in use in the wild.