[kaffe] Slow byte to char conversion

Dalibor Topic kaffe@rufus.w3.org
Thu, 31 Aug 2000 00:03:13 +0200


On Tue, 29 Aug 2000, Dalibor Topic wrote:

> > Is it not fair to assume that converting n bytes will result in less than
> > or equal to n characters?
> 
> For most of the encodings that I've seen, it is a safe assumption.
> Unfortunately, I haven't seen 'em all :) 
> 
> I suspect that it's
> possible for a single byte to encode several characters.

I dug around Unicode.org today to see if I could find some interesting
mappings from native character sets to Unicode which violate that
assumption. I found the Devanagari and Farsi encodings from Apple.

Here is an example from MacFarsi, the character set used to encode
Persian. It's online at:
http://www.unicode.org/Public/MAPPINGS/VENDORS/APPLE/FARSI.TXT

#   For example, the mapping of 0x2B is given as <LR>+0x002B; the
#   mapping of 0xAB is given as <RL>+0x002B. If we map an isolated
#   instance of 0x2B to Unicode, it should be mapped as follows (LRO
#   indicates LEFT-RIGHT OVERRIDE, PDF indicates POP DIRECTION
#   FORMATTING):
#
#     0x2B ->  0x202D (LRO) + 0x002B (PLUS SIGN) + 0x202C (PDF)


So, in this case, a single Mac OS Farsi code point results in three
Unicode characters. It can actually get even worse:

#   In the TrueType variant of Mac OS Farsi, 0xA4 is a ligature for the
#   currency unit "rial". This is mapped using the grouping hint followed
#   by the Arabic characters for "rial"
#   
#     (TrueType variant) 0xA4 -> 0xF86B+0x0631+0x064A+0x0627+0x0644

Here a single code point is encoded by five (5) Unicode characters.
The grouping hint (0xF86B) seems to be a vendor-specific extension
from Apple, though. Even without it, that's still 4:1.
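
To make the expansion concrete, here is a minimal sketch in Java. It
hard-codes only the two mappings quoted above; the class name, the table
and the decode() helper are made up for illustration and are not part of
kaffe or of any real converter. Unmapped bytes are simply replaced with
U+FFFD, which is an assumption of this sketch, not what FARSI.TXT says.

```java
import java.util.HashMap;
import java.util.Map;

public class MacFarsiSketch {
    // Hypothetical partial table: just the two entries quoted from FARSI.TXT.
    static final Map<Integer, String> TABLE = new HashMap<>();
    static {
        // 0x2B -> LRO + PLUS SIGN + PDF (three characters)
        TABLE.put(0x2B, "\u202D+\u202C");
        // 0xA4 (TrueType variant) -> grouping hint + four Arabic letters
        TABLE.put(0xA4, "\uF86B\u0631\u064A\u0627\u0644");
    }

    // Decode byte by byte; bytes missing from the table become U+FFFD.
    static String decode(byte[] in) {
        StringBuilder out = new StringBuilder();
        for (byte b : in) {
            String mapped = TABLE.get(b & 0xFF);
            out.append(mapped != null ? mapped : "\uFFFD");
        }
        return out.toString();
    }

    public static void main(String[] args) {
        byte[] in = { 0x2B, (byte) 0xA4 };
        String out = decode(in);
        // Prints "2 bytes -> 8 chars"
        System.out.println(in.length + " bytes -> " + out.length() + " chars");
    }
}
```

A char array sized to the number of input bytes would overflow on this
input.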

Sun doesn't seem to have included any Farsi or Devanagari conversion
mechanisms, so kaffe doesn't really have to support such exotic
encodings. But ... it may one day. So I'd recommend staying on the safe
side.
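
One way of staying on the safe side, sketched in Java: start with one
char per byte as a first guess, but grow the output whenever a mapping
produces more. The ByteMapping interface and the convert() method are
hypothetical names for this example only; this is not kaffe's converter
code, just an illustration of the idea.

```java
import java.util.Arrays;

public class GrowingDecoder {
    // Hypothetical per-byte mapping: returns the chars for one input byte.
    interface ByteMapping {
        char[] map(int b);
    }

    static char[] convert(byte[] in, ByteMapping mapping) {
        // Optimistic first guess: one char per byte.
        char[] out = new char[in.length];
        int used = 0;
        for (byte b : in) {
            char[] chunk = mapping.map(b & 0xFF);
            if (used + chunk.length > out.length) {
                // Grow instead of assuming n bytes give at most n chars.
                int wanted = Math.max(out.length * 2, used + chunk.length);
                out = Arrays.copyOf(out, wanted);
            }
            System.arraycopy(chunk, 0, out, used, chunk.length);
            used += chunk.length;
        }
        // Trim to the number of chars actually produced.
        return Arrays.copyOf(out, used);
    }

    public static void main(String[] args) {
        // Toy mapping: 0x2B expands to three chars, everything else to one.
        ByteMapping m = b -> b == 0x2B
                ? new char[] { '\u202D', '+', '\u202C' }
                : new char[] { (char) b };
        char[] out = convert(new byte[] { 'a', 0x2B, 'b' }, m);
        // Prints "3 bytes -> 5 chars"
        System.out.println("3 bytes -> " + out.length + " chars");
    }
}
```

The common case (one char per byte) still needs only the initial
allocation; the extra copies only happen for the exotic encodings.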

Dali

