SC22/WG20 N795

Known problems with case mapping tables in ISO/IEC TR 14652

From: Kenneth Whistler [kenw@sybase.com]
Sent: Monday, October 23, 2000 11:10 PM

Ann,

> My very strong preference is that TR 14652 be brought in line with
> UnicodeData.txt, except for explicable differences.

I think that WG20 should take this as a very clearly expressed preference.

I have raised this issue in WG20 on more than one occasion, most
recently at the Québec meeting in May. Unfortunately, Keld has bristled
and balked on each occasion, giving me little confidence that this
is something that WG20 can actually accomplish.

The claim, in the most recent instance, was that the LC_CTYPE data
had been "checked" already, by which I inferred that it had been
run through a POSIX-compliant parser and had no syntax errors in it,
but not that it had been validated against UnicodeData.txt, so that any
differences could either be corrected or explained. In fact, Keld
was very passionate on this issue, claiming that "for case mapping the
data should be *ours* [i.e. SC22's] and [that] the UTC has no business
defining it." I, on the other hand, feel that ISO technical committees
should be in the business of producing consensus standards, and if
there is preestablished practice in wide use by many implementers and
vendors, there is a pretty good prima facie case for using that
demonstrated consensus as the basis for developing any standard in
that particular technical area. At any rate, we will argue this once
again at the WG20 meeting next week.

> In order for COBOL to use UnicodeData.txt, the database would have to be
> copied into the COBOL standard and probably made reader-friendly, adding to
> the size of the COBOL standard and introducing the opportunity for errors.
> I would like to avoid that.  The worst effect is that it would subject the
> content to comment from a  review audience having a lot less expertise than
> WG20 and the Unicode consortium; this can only result in delay for COBOL
> even if all the comments are invalid.

I agree that this would be an undesirable outcome. And, if we can solve
the particular problem of the case mapping tables for DTR 14652, this
one reference problem for the COBOL standard can be avoided.

However, I think this is only the tip of the iceberg for SC22 programming
languages dealing with 10646/Unicode. The fact is that significant,
widely implemented consensus standards regarding implementation
details of 10646/Unicode (e.g., the bidi algorithm, the normalization
algorithm, line breaking property tables, name preparation for IDN,
etc.) are being developed by non-JTC1 standardization organizations
like the UTC, the IETF, or the W3C. It is just not feasible (or
desirable) for WG20 to try to replicate this work into ISO standards
or to try to create competing ISO standards in the same area. So one
of these days the SC22 language committees and JTC1 are going to
have to come to grips with how to reference and make use of consensus,
implemented standards that don't happen to have ISO labels on them.

> 
> From my limited viewpoint, I cannot imagine inexplicable differences
> between TR 14652 and UnicodeData.txt.  I had understood there were errors
> resulting from the difficulty of ensuring consistency between the two.  Now
> there are edge cases.  I need to understand what these are and why there
> are differences.  

Well, I guess I am going to have to catalog the entire list for the
WG20 meeting, since it is apparent that the editor of DTR 14652 is
not going to do so.

Issues that I know of right now (a rough validation sketch follows the list):

1. (major) The LC_CTYPE definition is based on the Unicode 2.0 repertoire,
   which is now 4 years old. The implementers are moving on to Unicode 3.0,
   and at this point, it makes sense for any LC_CTYPE definition for a
   not-yet-finished TR to be based on the *current* ISO/IEC 10646-1:2000
   publication. One obvious omission that ought to give Europeans pause:
   U+20AC EURO SIGN is not included in the current 14652 table.

2. The table omits an uppercase mapping for 0131 and a lowercase
   mapping for 0130. This was a deliberate choice by the editor, to
   avoid the "Turkish case mapping problem". But of course, the problem
   is not avoided by omitting it.

3. The table erroneously includes an uppercase mapping for Mkhedruli
   Georgian characters, which are caseless. This error has been pointed
   out before, but the editor does not want to change it, since it would
   result in asymmetrical case mappings for the Georgian alphabets.

4. A lowercase mapping for 019F is missing from the table.

5. The case pair for 01A6/0280 is not recognized, and is missing from
   both tables.
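
These are exactly the kind of discrepancies that fall out of a direct
comparison with UnicodeData.txt. As a rough illustration only, here is a
minimal Python sketch that pulls the simple case mappings out of a local
copy of UnicodeData.txt; the function name and the spot-checked code
points are mine, chosen to match the issues above:

    def load_case_mappings(path="UnicodeData.txt"):
        # Fields are semicolon-separated; field 12 is the simple uppercase
        # mapping, field 13 the simple lowercase mapping (both in hex).
        upper, lower = {}, {}
        for line in open(path, encoding="utf-8"):
            fields = line.split(";")
            cp = int(fields[0], 16)
            if fields[12]:
                upper[cp] = int(fields[12], 16)
            if fields[13]:
                lower[cp] = int(fields[13], 16)
        return upper, lower

    upper, lower = load_case_mappings()
    print(hex(upper.get(0x0131, 0)))   # 0x49: dotless i does uppercase to I (issue 2)
    print(hex(lower.get(0x019F, 0)))   # 0x275: the mapping missing from the table (issue 4)
    print(hex(upper.get(0x0280, 0)))   # 0x1a6: the unrecognized case pair (issue 5)

Comparing the 14652 toupper/tolower pairs against dictionaries built this
way would surface every difference, which could then be corrected or
explained.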

> It would be far better for COBOL to reference TR 14652
> and override the cases that might need to be different if there are such.

I believe that if DTR 14652 were updated to match UnicodeData.txt, the
only case that you might have to override would be the Turkish i's,
depending on how you want the equivalence classes for i's to work
out for COBOL identifiers.
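
To make the i's concrete: here are the default simple uppercase mappings
from UnicodeData.txt next to the conventional Turkish ones, in a purely
illustrative Python fragment (the dictionary names are mine):

    # Default (UnicodeData.txt) simple uppercase mappings for the i's:
    default_upper = {0x0069: 0x0049, 0x0131: 0x0049}   # i -> I, dotless i -> I
    # Conventional Turkish uppercase mappings, shown for contrast:
    turkish_upper = {0x0069: 0x0130, 0x0131: 0x0049}   # i -> dotted I, dotless i -> I
    print(chr(default_upper[0x0069]), chr(turkish_upper[0x0069]))   # I İ

Under the default mappings i, I, and dotless i all end up in one
equivalence class; under the Turkish mappings i pairs with dotted I
instead. That is the choice COBOL would have to make for its identifiers.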

> 
> For Bill Klein:   As currently specified, COBOL does not fold from or to
> the small dotless I, the capital dotted I, or the final sigma - using TR
> 14652 as a reference and folding tolower.
> 

> 
> >  No, 30 characters would mean a maximum of 120 bytes in the worst case,
> >  since all encodable characters are guaranteed to be in the range
> 
> COBOL is counting in "code units".  COBOL uses the term "character
> position", which means the same.  Each of the code elements of a combining
> sequence are one character position.  My IBM source on Unicode has said
> that the industry direction for Unicode data is to treat each "character"
> of a combining sequence as a separate character. 

I think there is still a confusion of terminology here. The last statement
is correct: a combining character sequence is a sequence of characters,
and most processes that count "characters" would count each of the
characters of that sequence, rather than trying to do the high-level
analysis to determine a graphemic count.

However, "code unit" doesn't mean what you think it does. In Unicode-speak,
"code point" is equivalent to the COBOL usage "character position". That
is the term for the encoded character, regardless of the number of bytes
required to express that character in a computer register in a particular
encoding form.

Code unit, on the other hand, refers to the integral data unit used
as the basis for expressing the character in a particular encoding form.

Here are three examples, to illustrate the distinctions.

U+0041 LATIN CAPITAL LETTER A

   U+0041      the code point of the character (its encoding)
   0x41        UTF-8 encoding form: 1 code unit (byte)
   0x0041      UTF-16 encoding form: 1 code unit (wyde)
   0x00000041  UTF-32 encoding form: 1 code unit (word)

U+4E00 CJK UNIFIED IDEOGRAPH-4E00 (the Chinese character for 'one')

   U+4E00      the code point of the character (its encoding)
   0xE4 0xB8 0x80  UTF-8 encoding form: 3 code units (byte)
   0x4E00          UTF-16 encoding form: 1 code unit (wyde)
   0x00004E00      UTF-32 encoding form: 1 code unit (word)

U+1D103 MUSICAL SYMBOL DOUBLE SHARP

   U+1D103     the code point of the character (its encoding)
   0xF0 0x9D 0x84 0x83  UTF-8 encoding form: 4 code units (byte)
   0xD834 0xDD03        UTF-16 encoding form: 2 code units (wyde)
   0x0001D103           UTF-32 encoding form: 1 code unit (word)
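
To put the same distinction in executable terms, here is a minimal Python
sketch (purely illustrative): a Python string is a sequence of code
points, and the length of its encoded form, divided by the code unit
size, gives the code unit count for each encoding form.

    # The three characters above: A, the CJK ideograph for 'one', and the
    # musical double sharp (a supplementary-plane character).
    for ch in "\u0041", "\u4E00", "\U0001D103":
        utf8 = ch.encode("utf-8")
        utf16 = ch.encode("utf-16-be")     # big-endian, no BOM; 2 bytes per code unit
        utf32 = ch.encode("utf-32-be")     # big-endian, no BOM; 4 bytes per code unit
        print("U+%04X: 1 code point; %d UTF-8, %d UTF-16, %d UTF-32 code units"
              % (ord(ch), len(utf8), len(utf16) // 2, len(utf32) // 4))

A "character position" in the COBOL sense corresponds to one pass through
that loop (one code point), whatever the code unit counts turn out to be.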

> I thought being
> consistent for identifiers would be best for users.  The COBOL spec doesn't
> say how many bytes this takes.  Each implementor can work this out for the
> codeset(s) supported for source code.  Most current COBOL implementations
> count bytes for Kanji characters in identifiers, but I don't want that in
> the standard.

My original assumption about Japanese implementations was that they
would be counting bytes for double-byte characters. And that is why I
assumed you would want to be counting bytes (= code units) for support
of UTF-8 in identifiers in COBOL.

But if the intention is to define the extensions for identifiers in
COBOL in terms of *characters* (character positions) instead of bytes,
despite existing COBOL implementations for Japanese, then the right
answer for UTF-8 as well would be to count characters (code points) --
the conclusion that I reached after the current intentions for the
COBOL standard in this regard were explained.

So once again, I will state the consequences for implementing Unicode
in COBOL identifiers under that assumption; a short sketch of the
arithmetic follows the list.

For UTF-8, the maximum storage needed for 30 characters is 120 bytes
           (4 bytes each).
For UTF-16, the maximum storage needed for 30 characters is 120 bytes
           (two 2-byte wydes each).
For UTF-32, the maximum (and minimum) storage needed for 30 characters
           is 120 bytes (one 4-byte word each).
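
Here is that arithmetic as a throwaway Python sketch (the names are mine;
the numbers simply restate the list above):

    N = 30                                              # identifier length in code points
    max_units = {"UTF-8": 4, "UTF-16": 2, "UTF-32": 1}  # worst-case code units per code point
    unit_size = {"UTF-8": 1, "UTF-16": 2, "UTF-32": 4}  # bytes per code unit
    for form in max_units:
        print(form, N * max_units[form] * unit_size[form], "bytes")   # 120 for each form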

Easy, no?

--Ken

> 
> Thanks for all the time you're devoting to COBOL.
> 
> _
> Ann Bennett