[kaffe] Slow byte to char conversion

Dalibor Topic kaffe@rufus.w3.org
Tue, 29 Aug 2000 21:13:08 +0200


On Mon, 28 Aug 2000, Artur Biesiadowski wrote:
> Godmar Back wrote:

> I've looked at this and I don't see a reason for CharToByteConverter to
> go through encode/flush steps - it would work perfectly all right with
> a single-step method, returning a new byte[] for example. For

I think there are two possible issues:
a) Unicode characters followed by combining characters
\u0041\u0308 is actually just another way to encode 'Ä' (\u00C4). You
have the same reason for saving state as with multibyte encodings: you
don't know for sure what you've read until you've read the last bit of
it.
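
To make the combining character case concrete, here is a small sketch.
It assumes java.text.Normalizer, which only appeared in later Java
releases and is not part of kaffe's converters; the class and method
names are made up for illustration.

```java
import java.text.Normalizer;

class CombiningDemo {
    // Composes a base character plus combining marks into the
    // precomposed form (Unicode normalization form C).
    static String compose(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        String decomposed = "A\u0308";          // 'A' followed by COMBINING DIAERESIS
        String composed = compose(decomposed);

        System.out.println(decomposed.length());          // prints 2
        System.out.println(composed.equals("\u00C4"));    // prints true
    }
}
```

A converter that sees only the 'A' cannot know yet whether it should
emit 'A' or wait for a combining mark, which is the state problem
described above.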

b) Performance
With multibyte encodings, it can be hard to determine the size of the
byte array in advance. So you'd have to do the encoding into a
temporary byte array, and then create a new one, with the right size,
and copy the bytes, before you return it. If all the caller does is
copy the bytes again into the appropriate positions in its own byte
array, then you'd be doing a lot of useless work. It might be interesting to see
how char to byte conversion is used in kaffe.

Having flush functionality allows you to stop encoding when you run
out of space in the byte array. You can save the unencoded rest in the
encoder and either throw an exception or continue with the unencoded
characters the next time your conversion routine is called.
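
As an illustration of the stop-and-continue pattern, here is a sketch
using java.nio.charset.CharsetEncoder. That API comes from a later Java
release and is not what kaffe's converters look like; the class name,
method names and the minimum buffer size check are my own assumptions.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;

class ChunkedEncode {
    // Encodes text to UTF-8 through a small fixed-size buffer. Each time
    // the encoder reports overflow, the buffer is drained and encoding
    // resumes with the characters that have not been consumed yet.
    static byte[] encodeInChunks(String text, int bufferSize) {
        if (bufferSize < 4) {
            // A UTF-8 sequence can take 4 bytes; a smaller buffer could
            // never make progress on such a character.
            throw new IllegalArgumentException("bufferSize must be >= 4");
        }
        CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder();
        CharBuffer in = CharBuffer.wrap(text);
        ByteBuffer out = ByteBuffer.allocate(bufferSize);
        ByteArrayOutputStream sink = new ByteArrayOutputStream();

        while (true) {
            CoderResult result = encoder.encode(in, out, true);
            drain(out, sink);
            if (result.isUnderflow()) {
                break;                       // all input consumed
            }
            if (result.isError()) {
                throw new IllegalArgumentException(result.toString());
            }
            // Otherwise: overflow, so loop with the emptied buffer.
        }
        while (encoder.flush(out).isOverflow()) {
            drain(out, sink);
        }
        drain(out, sink);
        return sink.toByteArray();
    }

    // Moves everything written so far from the buffer into the sink.
    private static void drain(ByteBuffer out, ByteArrayOutputStream sink) {
        out.flip();
        sink.write(out.array(), 0, out.limit());
        out.clear();
    }
}
```

Here the unconsumed characters stay in the input buffer rather than in
the encoder, so the caller decides when to continue.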

Unfortunately, unless you can guarantee that you'd never run out of
space in the byte buffer (which you can't with the current interface),
every stateless converter becomes stateful in the sense that it needs to
carry around unconverted remainders waiting to be flushed. I'm starting
to realize that there are some (undocumented, of course) pitfalls in the
current design of converters which are harder to get around than I
thought. Unless of course ...

> ByteToCharConverter things are a bit different, as streams can stop
> inside a multibyte encoding. It could be worked around by changing the
> interface a bit and allowing the converter to return the number of
> rejected bytes, which would have to be fed to it again on the next
> call. This moves the need to remember state to OutputStreamWriter, and
> that is OK as it will synchronize itself.

Sun's "undocumented" [1] sun.io.CharToByteConverter supports something
like that: you can get the index just past the last converted
character, and by comparing it with the supplied arguments, figure out
that not everything got converted. A similar interface exists for
sun.io.ByteToCharConverter.

I think your idea to delegate responsibility for state management to
synchronized methods in calling objects could be an elegant way to make
converters stateless.
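
A rough sketch of what such a stateless converter could look like,
for UTF-8. Everything here (the class, the Result type, the convert
signature, replacing unpaired surrogates with '?') is hypothetical and
is not the sun.io or kaffe interface; it only shows the idea of
reporting the index just past the last converted character so that the
caller can feed the rest again.

```java
class StatelessUtf8 {
    // Indices just past the last consumed char and the last written byte.
    static final class Result {
        final int nextChar;
        final int nextByte;
        Result(int nextChar, int nextByte) {
            this.nextChar = nextChar;
            this.nextByte = nextByte;
        }
    }

    // Converts in[inOff..inEnd) into out[outOff..outEnd) and stops as soon
    // as the next character does not fit, or when a surrogate pair is cut
    // off at the end of the input. No state is kept between calls.
    static Result convert(char[] in, int inOff, int inEnd,
                          byte[] out, int outOff, int outEnd) {
        int i = inOff;
        int o = outOff;
        while (i < inEnd) {
            int cp = in[i];
            int used = 1;
            if (Character.isHighSurrogate(in[i])) {
                if (i + 1 == inEnd) {
                    break;                   // pair incomplete: caller resupplies it
                }
                if (Character.isLowSurrogate(in[i + 1])) {
                    cp = Character.toCodePoint(in[i], in[i + 1]);
                    used = 2;
                } else {
                    cp = '?';                // unpaired high surrogate
                }
            } else if (Character.isLowSurrogate(in[i])) {
                cp = '?';                    // unpaired low surrogate
            }
            int need = cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
            if (o + need > outEnd) {
                break;                       // out of space: stop before this char
            }
            if (need == 1) {
                out[o++] = (byte) cp;
            } else if (need == 2) {
                out[o++] = (byte) (0xC0 | (cp >> 6));
                out[o++] = (byte) (0x80 | (cp & 0x3F));
            } else if (need == 3) {
                out[o++] = (byte) (0xE0 | (cp >> 12));
                out[o++] = (byte) (0x80 | ((cp >> 6) & 0x3F));
                out[o++] = (byte) (0x80 | (cp & 0x3F));
            } else {
                out[o++] = (byte) (0xF0 | (cp >> 18));
                out[o++] = (byte) (0x80 | ((cp >> 12) & 0x3F));
                out[o++] = (byte) (0x80 | ((cp >> 6) & 0x3F));
                out[o++] = (byte) (0x80 | (cp & 0x3F));
            }
            i += used;
        }
        return new Result(i, o);
    }
}
```

The caller compares nextChar with the end index it passed in; if they
differ, it keeps the remaining characters itself (inside its own
synchronized method) and passes them in again on the next call.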

Dalibor Topic

[1] As Sun don't document their sun.* packages, there is no API
spec to work from [2].

But there is a document on Sun's website which describes
the internals of character set conversion for JDK 1.1. It's marked as
deprecated but contains a nice description of some implementation
details: http://java.sun.com/products/jdk/1.1/intl/html/intlspec.doc7.html

DIGITAL's JDK 1.1.3 includes documentation for these "undocumented" I/O
classes. It's online at http://infoshako.sk.tsukuba.ac.jp/InfoRes/jdoc/Languages/Java/digital-java/api/sun.io.CharToByteConverter.html

[2] But there is a working group with Doug Lea, people from IBM, Sun
and some other companies trying to define some new I/O APIs for the next
Java release. It's just started within the Java Community Process. They
plan to specify an API for character set conversion. That's good news,
I guess.

