I18N domain names (was: Re: foreign languages)
- Date: Wed, 11 Dec 1996 11:30:45 +0100 (MET)
- From: "Martin J. Duerst" <mduerst@ifi.unizh.ch>
- Subject: I18N domain names (was: Re: foreign languages)
Working on the internationalization (i18n) of URLs, I was considering
various ways of internationalizing domain names, i.e. allowing
characters outside [A-Za-z0-9\-] in domain names.
The recent posting of a proposal by Michael Dillon has prompted
me to split out my proposal for domain names as a separate draft.
I will submit it as an internet draft as soon as possible (on Dec.
16th). Because of its length (27k) I don't want to attach it here.
Please download it from:
ftp://ftp.ifi.unizh.ch/pub/multilingual/draft-duerst-dns-i18n-00.txt
Please have a look at it and send me comments (or post them to this
list if they are relevant to this list).
Here is the abstract:
Internet domain names are currently limited to a very restricted
character set. This document proposes the introduction of a new
"zero-level" domain (ZLD) to allow the use of arbitrary characters
from the Universal Character Set (ISO 10646/Unicode) in domain names.
The proposal is fully backwards compatible and does not need any
changes to DNS.
What I call a ZLD above is syntactically a TLD and therefore has some
connection with the discussion on this list. However, semantically
it is very close to in-addr.arpa, which exists for a completely
different reason, and is managed in a completely different way,
from all the other TLDs and SLDs.
I therefore think that the current IAHC discussion can be conducted
independently of this proposal, and the ZLD can be created by a separate
decision after reaching consensus on domain name i18n. Nevertheless,
I think that it is good for the IAHC members and discussion participants
to keep such things as the need for ZLDs for structural reasons in mind.
What is of more importance for the IAHC discussion than the creation
of ZLDs is how rules on iTLDs translate (or don't translate) to rules
on the creation of i18n iTLDs (these would syntactically be SLDs,
but semantically TLDs). My draft contains some remarks on this
topic, but not much (it is hard to think through the effects of a
given set of iTLD rules across all kinds of scripts and languages
without knowing what these rules may look like).
If somebody thinks that my draft should be submitted as an IAHC
draft, please tell me.
On Fri, 6 Dec 1996, Michael Dillon wrote:
> On Fri, 6 Dec 1996, Simon Higgs wrote:
>
> > It's more a question of supportable character sets in DNS. How about
> > Russian, Chinese or Japanese TLDs?
>
> I had an idea that would deal with this that I've been mulling over for a
> while. The following very rough document was my first attempt to
> systematize the idea. You'll note the intention to encompass UNICODE
> however the rough draft below still falls short in number of code points
> although it might handle everything up to but not including chinese.
Michael's draft contains some very good ideas, and I am especially
grateful to him for bringing up this topic and prompting me to get
my ideas out. I hope we can work together to find a solution for i18n
domain names very soon.
For those interested in the details, here are the main differences
between my approach and Michael's:
- To distinguish between current domain names and i18n domain names,
I use a ZLD, whereas Michael uses a leading "-" in each
affected domain name part. Michael's solution is more
flexible in that it is easier to internationalize only
part of a full domain name. However, it has been pointed
out that using a leading "-" may not be backwards compatible.
- My encoding scheme can encode all of UCS-4 (2^31, or about 2 billion, code positions), whereas
Michael's scheme currently goes up to only ~1600 characters,
which is not enough.
- Michael combines the issue of what characters from Unicode are
allowed in domain names with the encoding (only those
characters that are allowed can be encoded, the others
don't get a number). In my proposal, any characters
can be encoded, and separate guidelines specify
which characters must not be used. This simplifies
the definition and the implementation of the encoding.
With the enormous number of characters in Unicode, and
with new characters being added continuously, having
to decide for each character whether or not it can
be used in a domain name before knowing how this
character can be encoded is not feasible.
- Michael's encoding tries to be as compact as possible by
using base 36, with modifications for the first
position to create a variable-length encoding.
I use a consistent base 16 scheme, with a fifth
bit used for length indication. This makes
implementation more straightforward.
- Michael preserves the letters A-Z. In many cases, I
preserve the hexadecimal representation of
the ISO 10646 code value. Both schemes help
in understanding the encoded domain names where
this is necessary (e.g. debugging). But outside A-Z,
Michael's scheme does not help, whereas my scheme
can very well be used together with a Unicode book
for all characters.
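
For illustration, the base 16 scheme with a length-indicating fifth
bit could be sketched roughly as follows. This is a minimal sketch
under assumptions, not the encoding as specified in the draft: it
assumes a 32-symbol alphabet 0-9A-V, that each character is written as
the hex digits of its ISO 10646 code value without leading zeros, and
that the fifth bit is set only on the first digit of each character
(so G-V start a character, 0-9A-F continue it). The names DIGITS,
encode_char, encode_label and decode_label are hypothetical.

```python
# Assumed 32-symbol alphabet: 5 bits per symbol, low 4 bits = one hex digit.
DIGITS = "0123456789ABCDEFGHIJKLMNOPQRSTUV"


def encode_char(code_point: int) -> str:
    """Encode one UCS-4 code value as hex digits, first digit flagged."""
    if not 0 <= code_point < 2**31:
        raise ValueError("code value outside the 31-bit UCS-4 range")
    hex_digits = format(code_point, "X")  # uppercase hex, no leading zeros
    # Assumption: setting the fifth bit (+16) marks the start of a character.
    first = DIGITS[int(hex_digits[0], 16) + 16]
    return first + hex_digits[1:]


def encode_label(label: str) -> str:
    """Encode every character of one domain name part."""
    return "".join(encode_char(ord(ch)) for ch in label)


def decode_label(encoded: str) -> list[int]:
    """Decode back to a list of code values (ints, since UCS-4 is
    larger than the range Python's chr() accepts)."""
    code_points = []
    value = None
    for symbol in encoded.upper():
        quintet = DIGITS.index(symbol)  # ValueError on an unknown symbol
        if quintet >= 16:
            # Fifth bit set: the previous character is complete.
            if value is not None:
                code_points.append(value)
            value = quintet - 16
        else:
            if value is None:
                raise ValueError("continuation digit before any start digit")
            value = value * 16 + quintet
    if value is not None:
        code_points.append(value)
    return code_points


print(encode_label("\u00e4"))        # U+00E4 -> "U4"
print(encode_label("\u65e5\u672c"))  # U+65E5 U+672C -> "M5E5M72C"
```

Under these assumptions, everything after the first symbol of each
character reads directly as the hex code value (5E5 and 72C above), so
an encoded name can still be looked up by hand in a Unicode code chart,
and the fifth bit removes the need for any separator or length prefix.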
Looking forward to your comments, Martin.