foreign languages
- Date: Fri, 6 Dec 1996 17:51:40 -0800 (PST)
- From: Michael Dillon <michael@memra.com>
- Subject: foreign languages
On Fri, 6 Dec 1996, Simon Higgs wrote:
> > be created/reserved. The Internet is global, but national/regional
> > interests do conflict at times and there are many languages: being truly
> > global requires taking this aspect into consideration.
>
> It's more a question of supportable character sets in DNS. How about
> Russian, Chinese or Japanese TLDs?
I had an idea that would deal with this that I've been mulling over for a
while. The following very rough document was my first attempt to
systematize the idea. You'll note the intention to encompass Unicode;
however, the rough draft below still falls short in the number of code
points, although it might handle everything up to but not including Chinese.
Network Working Group Michael Dillon
Request for Comments: #### Memra Software Inc.
28 November 1996
Multilingual Domain Names
Status of this Memo
This memo specifies an Internet standard for multilingual domain
names using any character set which can be represented in Unicode.
Distribution of this memo is unlimited.
Overview and Rationale
The current system of domain names restricts the names to the digits
0 through 9, the 26 letters of the English alphabet and the dash (-).
The letters are not case sensitive; thus Memra.COM is equivalent to
memra.com. This system poses problems for people using other
languages with character sets that cannot be mapped directly to the
English alphabet. While an approximate mapping can often be achieved
for many languages which use a Latin-based alphabet by using
non-accented versions of accented characters, this often results in
confusing representations of words which are normally distinguished only
by different accents.
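The restriction described above can be sketched in a few lines of Python. This is only an illustration and not part of the proposal: the names is_current_label and same_name are made up for this sketch, and the check covers just the character rule stated here (digits, the 26 English letters and the dash), not length limits or any other rule found in the DNS specifications.

```python
import string

# Characters currently allowed in a domain name segment:
# the digits 0-9, the 26 English letters, and the dash.
ALLOWED = set(string.ascii_letters + string.digits + "-")


def is_current_label(segment: str) -> bool:
    """True if the segment is non-empty and uses only allowed characters."""
    return len(segment) > 0 and all(ch in ALLOWED for ch in segment)


def same_name(a: str, b: str) -> bool:
    """Compare two names without regard to letter case."""
    return a.lower() == b.lower()


print(same_name("Memra.COM", "memra.com"))  # True
print(is_current_label("memra"))            # True
print(is_current_label("o\u00f9"))          # False: u-grave is outside the set
```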
Thus, this proposal suggests a way in which any language with codes
defined in the Unicode character set can be used in domain names
without any negative impact on the existing system.
Distinguishing New from Old
In order to supply the information necessary to distinguish other
character sets, we must conform to the existing standards for domain
naming while introducing an escape sequence that notifies newer soft-
ware that the domain name is actually an encoded form. It is proposed
that the dash serve this purpose when it occurs as the first charac-
ter of a domain name segment.
A fully qualified domain name such as www.memra.com consists of more
than one segment separated by dots. Currently the dash is used only
rarely and never occurs at the beginning of a segment. This proposal
would reserve the dash at the beginning of a segment to signify that
the remaining characters in the segment constitute an encoded form of
Unicode. The characters following the dash will be interpreted as
digits of base 36 numbers which represent the code positions of the
Unicode characters in a table derived from Unicode.
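As a rough sketch of how newer software might spot encoded segments, the Python below splits a name on dots and flags each segment that begins with a dash. The function name split_segments is hypothetical and chosen only for this illustration; the sketch does not interpret the characters after the dash.

```python
def split_segments(name: str) -> list[tuple[str, bool]]:
    """Split a dotted name and flag segments that start with a dash."""
    result = []
    for segment in name.split("."):
        encoded = segment.startswith("-")
        # For an encoded segment, keep only the part after the dash.
        body = segment[1:] if encoded else segment
        result.append((body, encoded))
    return result


print(split_segments("www.memra.com"))
# [('www', False), ('memra', False), ('com', False)]
print(split_segments("-o1w.fr"))
# [('o1w', True), ('fr', False)]
```

Older software would simply see a segment that happens to start with a dash, which is why the dash can act as an escape without changing the existing name syntax.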
Dillon [Page 1]
RFC #### Multilingual Domain Names 28 November 1996
The base 36 number is interpreted in groupings of one, two or three
digits from left to right as follows:
A-Z - single digit code (1 through 26)
1-8 - double digit code (27 through 314)
9 - three digit code (315 through 1610)
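The grouping rule above can be sketched as a small Python routine that reads the first character of each code to decide how many characters belong to it. The name split_groups is an assumption made for this sketch. The draft does not say what a leading 0 means, so the sketch rejects it, and it makes no attempt to turn a group into a table position.

```python
def split_groups(body: str) -> list[str]:
    """Split the characters after the dash into one, two or three digit codes."""
    body = body.upper()  # names are not case sensitive
    groups = []
    i = 0
    while i < len(body):
        lead = body[i]
        if "A" <= lead <= "Z":
            width = 1
        elif "1" <= lead <= "8":
            width = 2
        elif lead == "9":
            width = 3
        else:
            # A leading 0 (or any other character) is not defined by the
            # draft, so this sketch treats it as an error.
            raise ValueError(f"unexpected leading character {lead!r}")
        group = body[i:i + width]
        if len(group) < width:
            raise ValueError(f"truncated code {group!r}")
        groups.append(group)
        i += width
    return groups


print(split_groups("o1w"))    # ['O', '1W']
print(split_groups("memra"))  # ['M', 'E', 'M', 'R', 'A']
```

Because the first character alone fixes the width of each code, a segment can be read left to right with no separators between codes.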
The base 36 digits are as follows:
0 - 0
A-Z - 1-26
1-9 - 27-35
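The short Python check below counts how many codes each form can carry and compares the totals with the ranges given above. It assumes that the nine symbols 1 through 9 take the values 27 through 35, so that the 36 digit symbols cover the values 0 through 35; the names SYMBOLS and DIGIT_VALUE are invented for this sketch.

```python
import string

# Digit symbols in value order: 0, then A-Z, then 1-9 (assumed order).
SYMBOLS = "0" + string.ascii_uppercase + "123456789"
DIGIT_VALUE = {symbol: value for value, symbol in enumerate(SYMBOLS)}

base = len(SYMBOLS)        # 36 distinct digit symbols
single = 26                # A-Z, one digit each
double = 8 * base          # lead digit 1-8, then any one digit
triple = 1 * base * base   # lead digit 9, then any two digits

print(DIGIT_VALUE["0"], DIGIT_VALUE["A"], DIGIT_VALUE["Z"], DIGIT_VALUE["9"])
# 0 1 26 35
print(single, single + double, single + double + triple)
# 26 314 1610
```

The running totals 26, 314 and 1610 match the upper ends of the three ranges, which is where the limit of 1610 table positions comes from.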
The table of characters referenced by the numbers is drawn from
Unicode by removing lower case characters and other instances where
two glyphs unambiguously represent the same symbol. The first
twenty-six positions in this table will contain the letters A-Z so
that an encoded name containing Latin letters can be more easily
recognized by people. Under this encoding scheme the domain name
-memra.-com would be equivalent to memra.com; however, unless a .-com
domain is officially created by IANA, this name could not be used on
the global Internet. Top level domains beginning with a dash are more
likely to be created to represent Cyrillic or Japanese names.
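The equivalence between -memra.-com and memra.com can be illustrated with a minimal Python sketch. It handles only the simplest case, where an encoded segment consists entirely of the letters A-Z, and leaves every other segment untouched; the name decode_letters_only is hypothetical.

```python
def decode_letters_only(name: str) -> str:
    """Drop the leading dash from segments made up only of the letters A-Z."""
    decoded = []
    for segment in name.split("."):
        body = segment[1:]
        if segment.startswith("-") and body.isascii() and body.isalpha():
            # Positions 1-26 of the table are A-Z, so each letter is a
            # single digit code that stands for itself.
            decoded.append(body)
        else:
            decoded.append(segment)
    return ".".join(decoded)


print(decode_letters_only("-memra.-com"))  # memra.com
print(decode_letters_only("-o1w.fr"))      # -o1w.fr
```

The second name is left alone because it contains a two digit code, which would need the full character table to interpret.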
Examples
Here are some examples of domains using the new system.
The French word for "where" is represented as the letter O followed
by U with a grave accent. According to our system this would be
encoded as the two numbers 0015 0049. Using our base 36 encoding
scheme, we get O as the single digit representation for 0015 and 1W
for 0049; thus we could represent the French translation of where.fr
as -o1w.fr.
Table of Characters
This table is not yet worked out other than the first 26 positions.
A sample table is included here to illustrate some sample domain
names.
0001 A
0002 B
0003 C
0004 D
0005 E
0006 F
0007 G
0008 H
0009 I
0010 J
0011 K
0012 L
0013 M
0014 N
0015 O
0016 P
0017 Q
0018 R
0019 S
0020 T
0021 U
0022 V
0023 W
0024 X
0025 Y
0026 Z
0027 0
0028 1
0029 2
0030 3
0031 4
0032 5
0033 6
0034 7
0035 8
0036 9
0037 Eacute
0038 Egrave
0039 Ecirc
0040 Aacute
0041 Agrave
0042 Auml
0043 Icirc
0044 Iuml
0045 Ouml
0046 Ocirc
0047 Uuml
0048 Uacute
0049 Ugrave
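For experimenting with the sample table, it can be held as a simple lookup from position to character name. The Python below is a sketch that mirrors only the 49 sample rows listed here; the names ACCENTED, ENTRIES and TABLE are invented for the illustration, and the accented characters are kept as the names used in the table rather than as actual Unicode characters.

```python
import string

ACCENTED = [
    "Eacute", "Egrave", "Ecirc", "Aacute", "Agrave", "Auml", "Icirc",
    "Iuml", "Ouml", "Ocirc", "Uuml", "Uacute", "Ugrave",
]

# Positions 1-26 are A-Z, 27-36 are the digits 0-9, and 37-49 are the
# accented letters, in the order listed in the sample table.
ENTRIES = list(string.ascii_uppercase) + list(string.digits) + ACCENTED
TABLE = {position: name for position, name in enumerate(ENTRIES, start=1)}

print(TABLE[15], TABLE[49])  # O Ugrave
print(len(TABLE))            # 49
```

Positions 15 and 49 are the two numbers used in the French example.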
Security Considerations
This RFC raises no security issues.
Author's Address
Michael Dillon
Memra Software Inc.
C-4 Powerhouse, RR #2
Armstrong, BC V0E 1B0
CANADA
Phone: +1-250-546-8022
Fax: +1-250-546-3049
EMail: michael@memra.com
Michael Dillon - Internet & ISP Consulting
Memra Software Inc. - Fax: +1-604-546-3049
http://www.memra.com - E-mail: michael@memra.com