IMAP4r1 Mailbox Names vs. Unicode
=================================
:author: Matthias_Andree_(ed.)_and_Mark_Crispin
:email: matthias.andree@gmx.de
:author initials: MA and MC
:revision: 1.001
:revdate: 2010-05-28
:toc:
:data-uri:
:icons:
:numbered:

''''

.Acknowledgment
****
This article would not have been possible without the
substantial contributions from Mark Crispin.
-- Matthias Andree, editor
****

.Abstract
****
IMAP4rev1 is a widely used Internet Standards Track protocol for remote
email access. Its adoption in international environments raised problems
with the construction and interpretation of mailbox names; in particular,
it raised the question of whether IMAP4rev1 contains contradictory
information.

This article describes the problem and shows that IMAP4rev1 is
consistent with respect to mailbox names. We document how the evolution
of Unicode character sets and transformation formats made the IMAP4rev1
standard difficult to interpret, and how it is to be interpreted
properly.

Finally, we show that UTF-7, which IMAP4rev1 uses in modified form to
encode mailbox names, does not impose artificial restrictions on the
Unicode character set.
****

== IMAP Mailbox Names in RFC-3501

In May 2010, some confusion arose on the getmail mailing list around a bug
report to Debian, http://bugs.debian.org/513116[Debian Bug#513116], which
complained that getmail4 would not allow non-ASCII characters in an IMAP
folder name, and around how
http://tools.ietf.org/html/rfc3501[RFC-3501] supports international
mailbox names. At first glance, it seemed that IMAP4rev1 was limited to
the Basic Multilingual Plane of Unicode.

=== Problem statement

Notably, RFC-3501 mandates that mailbox names are 7-bit, yet clients are
supposed to accept 8-bit data and interpret it as UTF-8. This appears
contradictory, or at least superfluous, because 7-bit names contain no
8-bit data that would need interpreting.

Let us look at the IMAP4rev1 standard:

[quote, Mark Crispin, RFC3501]
____
5.1. Mailbox Naming

Mailbox names are 7-bit. Client implementations MUST NOT attempt to
create 8-bit mailbox names, and SHOULD interpret any 8-bit mailbox names
returned by LIST or LSUB as UTF-8. Server implementations SHOULD
prohibit the creation of 8-bit mailbox names, and SHOULD NOT return
8-bit mailbox names in LIST or LSUB. See section 5.1.3 for more
information on how to represent non-ASCII mailbox names. [...]
____

[quote, Mark Crispin, RFC3501]
____
5.1.3. Mailbox International Naming Convention

By convention, international mailbox names in IMAP4rev1 are specified
using a modified version of the UTF-7 encoding described in [UTF-7].
Modified UTF-7 may also be usable in servers that implement an earlier
version of this protocol. [...]
____

This appears to be contradictory, because UTF-7 is not UTF-8. However, a
modified UTF-7 mailbox name consists only of 7-bit characters and so is not
an 8-bit mailbox name; hence the clause "interpret any
8-bit mailbox names ... as UTF-8" does not apply to it. Mark writes:

=== Clarification
_by Mark Crispin_

8-bit octets are prohibited in mailbox names. Clients MUST use 7-bit
names, and servers MUST reject CREATE commands that contain 8-bit
octets.

However, clients MUST also interpret any 8-bit names in a list of
mailbox names (from LIST or LSUB) as UTF-8.

To understand the history here, we must go back to the 1990s, when
people (in spite of being told not to do so) were writing IMAP2 clients
and servers which used ISO-8859-1 and Shift-JIS mailbox names. At that
time, it was by no means certain that UTF-8 would become the standard
Internet character set; I played an important role in making that
happen, but that was still a few years in the future.

The adoption of UTF-8 offered a chance to exterminate non-UTF-8 8-bit
mailbox names, and in 1996 the current rules were adopted. The
transition to IMAP4 (which required substantial changes to any IMAP2
servers) provided an opportunity to exterminate these non-interoperable
names once and for all.

The modified UTF-7 was a temporary expedient to allow non-ASCII mailbox
names while remaining within the 7-bit framework. Had punycode existed at
the time, it would have been a much better choice than UTF-7. But
punycode did not exist until several years later, when it arrived with
IDN. In fact, punycode was created because people learned the problems of
UTF-7 from IMAP.

The intent was always to move to a UTF-8 only environment and leave
behind UTF-7. When that happens, clients will start encountering UTF-8
names. It is therefore necessary to tell clients that, even though they
are not permitted to send them, they need to be written to handle them
so they work properly when the restriction is relaxed in the future.

=== Recommendations
_by Mark Crispin_

*Options for server implementors*

From the perspective of a server implementor, you have two choices
for how to implement MUTF-7:
footnote:[editor's note: Modified UTF-7 as specified by the ensemble of RFC-2152 and RFC-3501]

[horizontal]
[S1]:: Ignore it; just forbid 8-bit octets in the CREATE command.
[S2]:: Convert mailbox names in commands from MUTF-7 to UTF-8. When doing a
LIST or LSUB, convert mailbox names from UTF-8 to MUTF-7 before sending
them to the client.

Servers of type [S1] were far more common in the 1990s. [S2] is more
common today. However, a client neither knows nor cares which type of
server it is talking to, because the rules make both types interoperate
in the same way.

*Options for client implementors*

[horizontal]
[C1]:: Ignore it; you're an ASCII client.
[C2]:: Convert mailbox names from UTF-8 to MUTF-7 when sending a command.
When receiving a listing of mailboxes, convert MUTF-7 to UTF-8.

This all works, and works well. The routines to do the conversions are
quite straightforward. The only thing that you can't do well is mixing
wildcards with strings in non-ASCII names; and that is primarily a
curiosity since no clients do that with ASCII names.

== Unicode, UCS-2, UTF-16, and UTF-7

.Incomplete specification:
WARNING: This section and its subsections are not normative references,
and are insufficient to implement UCS-2, UTF-16 or UTF-7 based
software.

=== UCS-2 and UTF-16
_by Mark Crispin_

RFC-3501 uses http://tools.ietf.org/html/rfc2152[RFC-2152] by reference.
Some of the confusion on the getmail list arose from the fact that
RFC-2152 talks about UCS-2 representation, which is limited to the Basic
Multilingual Plane (BMP) range U+0000 to U+FFFF.

However, RFC-2152 also (page 5) refers to the handling of surrogate
pairs, which are defined in UTF-16 but not UCS-2.

The correct interpretation is that the wording in RFC-2152 was written
at a time when "UCS-2" was interpreted as a synonym for "16-bit value"
as opposed to "BMP-only codepoints". This happens frequently in older
standards. Since UTF-7 is deprecated, nobody has done the work to
update RFC-2152 to clarify this point.

Using surrogate pairs extends the capability of 16-bit words beyond the
BMP range.

The 0x0000 to 0xFFFF range includes the so-called surrogates, two
ranges (0xD800 to 0xDBFF and 0xDC00 to 0xDFFF) of 1024 values (2^10^)
each. These ranges are technically removed from the BMP (thus there is
no such thing as U+D800); hence the BMP only contains 63,488
possible codepoints (65,536 minus 2,048).

Both the UTF-7 and UTF-16 transformations leverage these ranges to map
Unicode code points in the range from U+010000 to U+10FFFF (which is the
highest Unicode code point) to a pair of 16-bit values in the
surrogate ranges.

This happens by first subtracting 0x10000, which maps the input into the
range 0x0 to 0xFFFFF, representable in 20 bits. The most significant
10-bit portion is mapped into the range 0xD800…0xDBFF, the least
significant 10-bit portion into the range 0xDC00…0xDFFF, and these two
16-bit values are used in this order. UTF-7 does a further step of
encoding in modified BASE64.
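
The surrogate arithmetic described above can be sketched in a few lines of
Python. This is an illustration only, not a normative implementation; the
function names `to_surrogates` and `from_surrogates` are made up for this
example.

```python
def to_surrogates(code_point):
    """Split a code point above the BMP into a (high, low) surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not in U+10000..U+10FFFF")
    value = code_point - 0x10000       # now 0x00000..0xFFFFF, i.e. 20 bits
    high = 0xD800 | (value >> 10)      # most significant 10 bits
    low = 0xDC00 | (value & 0x3FF)     # least significant 10 bits
    return high, low


def from_surrogates(high, low):
    """Combine a (high, low) surrogate pair back into one code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + (((high - 0xD800) << 10) | (low - 0xDC00))


high, low = to_surrogates(0x1F600)
print(hex(high), hex(low))             # 0xd83d 0xde00
```

The pair computed for U+1F600 is the same 16-bit sequence that Python's own
`utf-16-be` codec produces for that character.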

Thus, UTF-7 and UTF-16 both deal with ``16-bit values'' and use the same
surrogate pair mechanism to access non-BMP codepoints. Although not
strictly accurate (the two are technically independent encodings of
Unicode), it may be helpful to think of UTF-7 as a further encoding of
UTF-16.

=== UTF-7

UTF-7 is a 7-bit representation of Unicode that makes use of character set
shifting. A character that is directly representable represents itself. Other
characters are encoded as 16-bit values in a modified BASE64 (which omits the
padding "=" characters at the end of a group). Such a BASE64 run is introduced
by a "+" character and terminated either by a "-" character, which is
discarded, or by any other character not in the modified BASE64 set, which
remains in the stream.

As a special case, the sequence "\+-" is a shorthand to represent
the "+" character itself.

The modified BASE64 character set uses the characters A-Z, a-z, digits 0-9,
and the characters "+" and "/", omitting "=" to avoid collisions with
RFC-2047 encoding.
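
The shifting rules can be tried out with the `utf-7` codec that ships with
Python's standard library. The byte strings below are illustrative inputs
chosen for this sketch (the first one is the "Hi Mom" example from RFC-2152);
note that this codec implements plain UTF-7, not the modified variant used by
IMAP.

```python
# RFC-2152 UTF-7 using the "utf-7" codec from Python's standard library.

# A "+" opens a BASE64 run; the terminating "-" is absorbed.
smiley = b"Hi Mom -+Jjo--!".decode("utf-7")   # U+263A between two hyphens

# Any other non-BASE64 character also ends the run, but stays in the stream.
euro = b"+IKw.".decode("utf-7")               # U+20AC followed by "."

# "+-" is the shorthand for a literal plus sign.
plus = b"1 +- 1".decode("utf-7")

# A code point above the BMP survives a round trip (via a surrogate pair).
emoji = "\U0001F600"
round_trip = emoji.encode("utf-7").decode("utf-7")

# ascii() keeps the output printable on any terminal encoding.
print(ascii(smiley), ascii(euro), ascii(plus), round_trip == emoji)
```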

=== Modified UTF-7

Modified UTF-7 works similarly to UTF-7, but mandates that printable ASCII
characters 0x20...0x7E except 0x26 (the ampersand "&") represent themselves,
and uses yet another BASE64 alphabet consisting of the upper- and lowercase
letters, the digits, and the characters "+" and ",", with some further rules
specified in RFC-3501. The leading shift character is the ampersand "&"
instead of "+", the trailing shift character remains "-", and the "&" itself
is encoded as "&-".
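
A minimal sketch of these rules in Python follows. It is not a complete or
validated implementation of RFC-3501 section 5.1.3 (for instance, it does not
reject non-canonical input); the function names `mutf7_encode` and
`mutf7_decode` are hypothetical and chosen only for this example. It builds on
the observation that a BASE64 run is the UTF-16 (big-endian) form of the text,
BASE64-encoded without padding and with "," in place of "/".

```python
import base64


def mutf7_encode(name):
    """Encode a str mailbox name into modified UTF-7 bytes (sketch)."""
    out = bytearray()
    pending = []  # consecutive characters that need BASE64 encoding

    def flush():
        # Emit all pending characters as a single "&...-" run.
        if pending:
            utf16 = "".join(pending).encode("utf-16-be")
            b64 = base64.b64encode(utf16).rstrip(b"=").replace(b"/", b",")
            out.extend(b"&" + b64 + b"-")
            pending.clear()

    for ch in name:
        if ch == "&":
            flush()
            out.extend(b"&-")          # literal ampersand
        elif 0x20 <= ord(ch) <= 0x7E:
            flush()
            out.append(ord(ch))        # printable ASCII represents itself
        else:
            pending.append(ch)
    flush()
    return bytes(out)


def mutf7_decode(data):
    """Decode modified UTF-7 bytes into a str mailbox name (sketch)."""
    text = data.decode("ascii")        # raises on 8-bit octets
    out = []
    i = 0
    while i < len(text):
        if text[i] != "&":
            out.append(text[i])
            i += 1
            continue
        end = text.index("-", i + 1)   # ValueError if the run is unterminated
        b64 = text[i + 1:end]
        if not b64:
            out.append("&")            # "&-" is a literal ampersand
        else:
            b64 = b64.replace(",", "/")
            b64 += "=" * (-len(b64) % 4)   # restore the omitted padding
            out.append(base64.b64decode(b64).decode("utf-16-be"))
        i = end + 1
    return "".join(out)


print(mutf7_encode("Tom & Jerry"))     # b'Tom &- Jerry'
print(mutf7_encode("\U0001F600"))      # b'&2D3eAA-'
```

The second call encodes U+1F600, a code point outside the BMP, as a surrogate
pair inside a single BASE64 run. The example name from RFC-3501,
`~peter/mail/&U,BTFw-/&ZeVnLIqe-`, decodes with this sketch to a path with one
Chinese and one Japanese component.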

== Conclusions

IMAP clients that want to support international mailbox names should send
modified UTF-7, and be prepared to receive modified UTF-7 (if the name
contains no 8-bit data) or UTF-8 (if it does).
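
On the receiving side, this rule amounts to a simple dispatch on whether the
raw name contains any octet with the high bit set. The following Python sketch
assumes the client already has the raw name as bytes; `decode_listed_name` and
`_mutf7_to_str` are hypothetical names, and the small decoder is included only
to keep the example self-contained (it leaves an unterminated "&" untouched
rather than rejecting it).

```python
import base64
import re


def _mutf7_to_str(ascii_name):
    """Hypothetical helper: decode the "&...-" runs of a 7-bit name."""
    def unshift(match):
        b64 = match.group(1)
        if not b64:
            return "&"                         # "&-" is a literal ampersand
        b64 = b64.replace(",", "/")
        b64 += "=" * (-len(b64) % 4)           # restore omitted padding
        return base64.b64decode(b64).decode("utf-16-be")
    return re.sub(r"&([^-]*)-", unshift, ascii_name)


def decode_listed_name(raw):
    """Turn a raw mailbox name from LIST or LSUB into a str."""
    if any(octet >= 0x80 for octet in raw):
        return raw.decode("utf-8")             # 8-bit data: interpret as UTF-8
    return _mutf7_to_str(raw.decode("ascii"))  # 7-bit data: modified UTF-7


# Both spellings of the same mailbox yield the same string.
seven_bit = decode_listed_name(b"Entw&APw-rfe")
eight_bit = decode_listed_name("Entw\u00fcrfe".encode("utf-8"))
print(seven_bit == eight_bit)                  # True
```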

Modified UTF-7 as specified in RFC-3501 is not limited to the Unicode Basic
Multilingual Plane, but maps the entire Unicode range.