
I18N domain names (was: Re: foreign languages)



While working on the internationalization (i18n) of URLs, I have been
considering various ways of internationalizing domain names, i.e.
allowing characters outside [A-Za-z0-9\-] in domain names.

The recent posting of a proposal by Michael Dillon has prompted
me to split out my proposal for domain names as a separate draft.
I will submit it as an internet draft as soon as possible (on Dec.
16th). Because of its length (27k) I don't want to attach it here.
Please download it from:

ftp://ftp.ifi.unizh.ch/pub/multilingual/draft-duerst-dns-i18n-00.txt

Please have a look at it and send me comments (or post them to this
list if they are relevant to this list).

Here is the abstract:

   Internet domain names are currently limited to a very restricted
   character set. This document proposes the introduction of a new
   "zero-level" domain (ZLD) to allow the use of arbitrary characters
   from the Universal Character Set (ISO 10646/Unicode) in domain names.
   The proposal is fully backwards compatible and does not need any
   changes to DNS.

What I call a ZLD above is syntactically a TLD and therefore has some
connection with the discussion on this list. However, semantically
it is very close to in-addr.arpa, which exists for a completely
different reason, and is managed in a completely different way,
from all the other TLDs and SLDs.

I therefore think that the current IAHC discussion can be conducted
independently of this proposal, and the ZLD can be created by a separate
decision after reaching consensus on domain name i18n. Nevertheless,
I think that it is good for the IAHC members and discussion participants
to keep such things as the need for ZLDs for structural reasons in mind.

What matters more for the IAHC discussion than the creation of ZLDs
is how rules on iTLDs translate (or don't translate) to rules on the
creation of i18n iTLDs (these would syntactically be SLDs, but
semantically TLDs). My draft contains some remarks on this topic,
but not many: it is hard to think through the effects of a given set
of iTLD rules across all kinds of scripts and languages without
knowing what these rules may look like. If somebody thinks that my
draft should be submitted as an IAHC draft, please tell me.


On Fri, 6 Dec 1996, Michael Dillon wrote:

> On Fri, 6 Dec 1996, Simon Higgs wrote:
> 
> > It's more a question of supportable character sets in DNS. How about
> > Russian, Chinese or Japanese TLDs?
> 
> I had an idea that would deal with this that I've been mulling over for a
> while. The following very rough document was my first attempt to
> systematize the idea. You'll note the intention to encompass UNICODE
> however the rough draft below still falls short in number of code points
> although it might handle everything up to but not including chinese.

Michael's draft contains some very good ideas, and I am especially
grateful to him for bringing up this topic and prompting me to get
my ideas out. I hope we can work together to find a solution for i18n
domain names very soon.

For those interested in the details, here are the main differences
between my approach and Michael's:

- To distinguish between current domain names and i18n domain names,
	I use a ZLD, whereas Michael uses a leading "-" in each
	affected domain name part. Michael's solution is more
	flexible in that it is easier to internationalize only
	part of a full domain name. However, it has been pointed
	out that using a leading "-" may not be backwards compatible.

- My encoding scheme can encode all of UCS-4 (2^31, i.e. about
	2 billion, code positions), whereas Michael's scheme currently
	goes up to only ~1600 characters, which is not enough.

- Michael combines the issue of what characters from Unicode are
	allowed in domain names with the encoding (only those
	characters that are allowed can be encoded, the others
	don't get a number). In my proposal, any characters
	can be encoded, and separate guidelines specify
	which characters must not be used. This simplifies
	the definition and the implementation of the encoding.
	With the enormous number of characters in Unicode, and
	with new characters being added continuously, having
	to decide for each character whether or not it can
	be used in a domain name before knowing how this
	character can be encoded is not feasible.

- Michael's encoding tries to be as compact as possible by
	using base 36, with modifications for the first
	position to create a variable-length encoding.
	I use a consistent base 16 scheme, with a fifth
	bit used for length indication. This makes
	implementation more straightforward.

- Michael preserves the letters A-Z. In many cases, I preserve
	the hexadecimal representation of the ISO 10646 code value.
	Both schemes help in understanding the encoded domain names
	where this is necessary (e.g. for debugging). But outside A-Z,
	Michael's scheme does not help, whereas with my scheme any
	character can be looked up in a Unicode book.
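
To make the "base 16 plus a fifth bit" idea above more concrete, here
is a rough sketch in Python. It is NOT the encoding defined in the
draft; it only illustrates the general technique under assumptions
made up for this example: the 32-symbol alphabet 0-9A-V, the choice to
set the fifth bit on the last hex digit of each character, and the
placeholder ZLD label "zld" are all hypothetical.

```python
# Illustrative sketch only; the alphabet, the flag position and the
# "zld" label are assumptions, not taken from the draft.

# 32 symbols, all legal in a DNS label. Values 0-15 are plain hex digits;
# values 16-31 are the same hex digits with the fifth bit (0x10) set.
ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUV"


def encode_label(text: str) -> str:
    """Encode each character as its hex code value, flagging the last digit."""
    out = []
    for ch in text:
        hex_digits = format(ord(ch), "X")  # variable length, no leading zeros
        for i, digit in enumerate(hex_digits):
            value = int(digit, 16)
            if i == len(hex_digits) - 1:
                value |= 0x10  # fifth bit: this digit ends the character
            out.append(ALPHABET[value])
    return "".join(out)


def decode_label(encoded: str) -> str:
    """Inverse of encode_label; case-insensitive, as DNS names are."""
    chars = []
    code = 0
    pending = False  # True while we are in the middle of a character
    for symbol in encoded.upper():
        value = ALPHABET.index(symbol)  # raises ValueError on illegal symbols
        code = (code << 4) | (value & 0x0F)
        pending = True
        if value & 0x10:  # fifth bit set: character is complete
            chars.append(chr(code))
            code = 0
            pending = False
    if pending:
        raise ValueError("encoded label ends in the middle of a character")
    return "".join(chars)


def to_zld_name(name: str, zld: str = "zld") -> str:
    """Encode every label of a dotted name and append the placeholder ZLD."""
    labels = [encode_label(label) for label in name.split(".")]
    return ".".join(labels + [zld])
```

For example, encode_label("é") gives "EP": the code value E9 stays
readable as hex except for its last digit, where 9 with the fifth bit
set becomes P. Only one symbol per character carries the flag, so all
the other symbols can be read directly against a code chart.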


Looking forward to your comments,	Martin.