[kaffe] Slow byte to char conversion

Dalibor Topic kaffe@rufus.w3.org
Tue, 29 Aug 2000 17:25:52 +0200


Hi Godmar,

On Tue, 29 Aug 2000, Godmar Back wrote:
> I was looking at this function in String.java:
> 
> ----
> private static StringBuffer decodeBytes(byte[] bytes, int offset,
>                 int len, ByteToCharConverter encoding) {
>         StringBuffer sbuf = new StringBuffer(len);
>         char[] out = new char[512];
>         int outlen = encoding.convert(bytes, offset, len, out, 0, out.length);
>         while (outlen > 0) {
>                 sbuf.append(out, 0, outlen);
>                 outlen = encoding.flush(out, 0, out.length);
>         }
>         return sbuf;
> }
> ----
> 
> Why can't this function be rewritten to read:
> 
> ----
> private static StringBuffer decodeBytes(byte[] bytes, int offset,
>                 int len, ByteToCharConverter encoding) {
>         char[] out = new char[len];
>         int outlen = encoding.convert(bytes, offset, len, out, 0, out.length);
>         return new StringBuffer(outlen).append(out, 0, outlen);
> }
> ----
> 
> Is it not fair to assume that converting n bytes will result in less than
> or equal to n characters?

For most of the encodings that I've seen, it is a safe assumption.
Unfortunately, I haven't seen 'em all :)

I suspect, though, that it is possible for a single byte to encode
several characters. Here is why: Unicode supports "combining"
characters, which are used to modify other characters, for example to
add accents to normal characters. Since Unicode is designed to allow
easy conversion to and from existing character sets, it includes many
precomposed characters, like the German umlauts ä, ö, ü. But combining
characters are still needed to fully represent some scripts, such as
Thai. Markus Kuhn says in his "UTF-8 and Unicode FAQ for Unix/Linux"
[1]: "with the Thai script, up to two combining characters are needed
on a single base character."
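
To make this concrete, here is a small Java illustration. The
decomposed spelling is hypothetical converter behaviour, but the code
points are real: the single Latin-1 byte 0xE4 ("ä") could be decoded
either to the precomposed U+00E4 or, by a decomposing converter, to
two chars (base letter plus combining mark):

----
// One byte, two legitimate Unicode spellings: a decomposing converter
// would turn the single Latin-1 byte 0xE4 into *two* chars.
public class CombiningDemo {
        public static void main(String[] args) {
                char[] precomposed = { '\u00E4' };      // ä as 1 char
                char[] decomposed  = { 'a', '\u0308' }; // a + COMBINING DIAERESIS
                System.out.println(new String(precomposed)); // prints ä
                System.out.println(new String(decomposed));  // also renders as ä
        }
}
----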

In his article on "Forms of Unicode" [2], Mark Davis dispels some of
the myths about characters vs. code points vs. code units. It features
a table with some unexpected entries. There is a single code point for
the fi ligature, for example [3]. Some Arabic characters' Unicode
representation depends on the context. And some characters require
several code points to be represented properly: "The Devanagari
syllable ksha is represented by three code points."
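
For reference, those three code points are U+0915 DEVANAGARI LETTER
KA, U+094D DEVANAGARI SIGN VIRAMA and U+0937 DEVANAGARI LETTER SSA,
and Java stores each of them as one char; a quick check:

----
// "ksha" spelled as three code points; each is a single Java char.
public class KshaDemo {
        public static void main(String[] args) {
                String ksha = "\u0915\u094D\u0937"; // KA + VIRAMA + SSA
                System.out.println(ksha.length());  // prints 3
        }
}
----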

I haven't seen an encoding for Devanagari, so I don't know whether the
encoding for "ksha" would be fewer than three bytes. While doing
research for this post today, I did look at other encodings, collected
by Mark Leisher as a supplement to the official Unicode conversion
tables. Some of them, like I3342, map a single byte to several
characters [4]. I don't think any of these encodings is supported by
Sun's JDK 1.3, though.
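
To see why such a mapping matters for decodeBytes, here is a toy
decoder. It is only a sketch that mimics the convert()/flush() shape
from the quoted code, not the real ByteToCharConverter, and it applies
the I3342 mapping from [4]:

----
// Toy decoder following the convert()/flush() contract quoted above.
// Byte 0xA4 expands to the four chars 0x0631 0x064A 0x0627 0x0644
// (PERSIAN RIAL SIGN, see [4]); everything else passes through as
// Latin-1. So n input bytes can produce up to 4n output chars.
class ToyRialDecoder {
        private final StringBuffer pending = new StringBuffer();

        int convert(byte[] in, int off, int len,
                        char[] out, int outOff, int outLen) {
                for (int i = 0; i < len; i++) {
                        int b = in[off + i] & 0xFF;
                        if (b == 0xA4) {
                                pending.append("\u0631\u064A\u0627\u0644");
                        } else {
                                pending.append((char) b);
                        }
                }
                return drain(out, outOff, outLen);
        }

        int flush(char[] out, int outOff, int outLen) {
                return drain(out, outOff, outLen);
        }

        // Hand out as much buffered output as fits; return the count.
        private int drain(char[] out, int outOff, int outLen) {
                int n = Math.min(pending.length(), outLen);
                pending.getChars(0, n, out, outOff);
                pending.delete(0, n);
                return n;
        }
}
----

Feed this decoder n bytes of 0xA4 and the original decodeBytes
recovers all 4n chars through the flush() loop, while the proposed
rewrite's char[len] buffer returns only the first len chars and
silently drops the rest.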

To sum it up: I'm not convinced. I guess taking a look at the GNU libc
iconv functionality should provide some more insight, but I don't have
the sources around right now. The GNU libc folks have done a massive
job supporting a wide variety of encodings, so that might be another
direction to look for advice.
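
Until this is settled, one middle ground (just a sketch, assuming the
convert()/flush() contract shown above) would be to take Godmar's
len-sized buffer but keep the defensive flush loop:

----
private static StringBuffer decodeBytes(byte[] bytes, int offset,
                int len, ByteToCharConverter encoding) {
        StringBuffer sbuf = new StringBuffer(len);
        // len chars is the right guess for most encodings; the flush
        // loop below still catches converters that expand their input.
        char[] out = new char[len > 0 ? len : 1];
        int outlen = encoding.convert(bytes, offset, len, out, 0, out.length);
        while (outlen > 0) {
                sbuf.append(out, 0, outlen);
                outlen = encoding.flush(out, 0, out.length);
        }
        return sbuf;
}
----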

Read ya,

Dali

[1] http://www.cl.cam.ac.uk/~mgk25/unicode.html
[2] ftp://www6.software.ibm.com/software/developer/library/utfencodingforms.pdf
[3] \uFB01 according to Unicode-Data-3.0.txt
[4] 0xA4 -> 0x0631 0x064A 0x0627 0x0644 for PERSIAN RIAL SIGN

