[kaffe] Slow byte to char conversion

Godmar Back kaffe@rufus.w3.org
Mon, 28 Aug 2000 16:00:38 -0600 (MDT)



Dali,

I was looking at this function in String.java:

----
private static StringBuffer decodeBytes(byte[] bytes, int offset,
                int len, ByteToCharConverter encoding) {
        StringBuffer sbuf = new StringBuffer(len);
        char[] out = new char[512];
        int outlen = encoding.convert(bytes, offset, len, out, 0, out.length);
        while (outlen > 0) {
                sbuf.append(out, 0, outlen);
                outlen = encoding.flush(out, 0, out.length);
        }
        return sbuf;
}
----

Why can't this function be rewritten to read:

----
private static StringBuffer decodeBytes(byte[] bytes, int offset,
                int len, ByteToCharConverter encoding) {
        char[] out = new char[len];
        int outlen = encoding.convert(bytes, offset, len, out, 0, out.length);
        return new StringBuffer(outlen).append(out, 0, outlen);
}
----

Isn't it fair to assume that converting n bytes will yield at most
n characters?
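
One way to check that assumption is sketched below with the
java.nio.charset API. This is not Kaffe's ByteToCharConverter, and the
class and helper names (DecodeSizing, worstCaseChars) are made up for
illustration. A CharsetDecoder reports its worst-case expansion through
maxCharsPerByte(), so "n bytes give at most n chars" holds only for
charsets where that value is no greater than 1.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of Kaffe.
final class DecodeSizing {
    // Upper bound on the number of chars produced by decoding len bytes.
    static int worstCaseChars(CharsetDecoder decoder, int len) {
        return (int) Math.ceil(len * (double) decoder.maxCharsPerByte());
    }

    // One-shot decode of bytes[offset .. offset+len) into a StringBuffer.
    static StringBuffer decodeBytes(byte[] bytes, int offset, int len,
                                    CharsetDecoder decoder) {
        try {
            // decode(ByteBuffer) resets the decoder, decodes, and flushes.
            CharBuffer out = decoder.decode(ByteBuffer.wrap(bytes, offset, len));
            return new StringBuffer(out.length()).append(out);
        } catch (CharacterCodingException e) {
            // Malformed or unmappable input; surface it as unchecked.
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        byte[] bytes = "h\u00e9llo".getBytes(StandardCharsets.UTF_8);
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder();
        StringBuffer sbuf = decodeBytes(bytes, 0, bytes.length, decoder);
        System.out.println(bytes.length + " bytes, bound "
                + worstCaseChars(decoder, bytes.length)
                + " chars, actual " + sbuf.length() + " chars");
    }
}
```

If the bound exceeds len for some encoding, a char[len] buffer would be
too small and the flush loop (or a larger buffer) is still needed.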

    - Godmar

> 
> 
> On Mon, 28 Aug 2000, Artur Biesiadowski wrote:
> 
> > And why exactly couldn't the default converter be cached and the same
> > instance used for all conversions? I think it is a stateless class, so
> > it should be safe to enter the same object's method from various
> > threads with all state on the stack.
> 
> It depends on the encoding. Take a multibyte encoding, where several
> bytes encode a single character, like UTF-8 [1]. You can't guarantee
> that every byte array you want to decode into a char array ends on a
> character boundary. So you need to be able to save the state of your
> converter and pick up where you left off the next time your converter
> is called.
> 
> Imagine that you're reading in a UTF-8 encoded file, and get an
> IOException while you're reading it. You convert as much as you've
> read, but you can't decide on the last character, since your stream has
> been interrupted. The UTF-8 converter saves its state and waits for
> more bytes to convert to characters.
> 
> Now, imagine another thread tries to do some UTF-8 input
> conversion, too. If it used the first converter, it would get a
> corrupted result, since the first converter is still waiting for bytes
> to continue converting. So you have to use a fresh UTF-8 converter for
> that.
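
The split-character case described above can be sketched with the
java.nio.charset API. This is an illustration only, not Kaffe's
converter; the class and method names (SplitDecode, decodeChunks) are
hypothetical. Note one difference: with CharsetDecoder the incomplete
trailing bytes stay in the caller's input buffer rather than inside the
converter, but the decoder instance still tracks an ongoing decode
operation and is not meant to be shared between threads.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of Kaffe.
final class SplitDecode {
    // Decodes two chunks of one UTF-8 stream, carrying an incomplete
    // trailing byte sequence from the first chunk over to the second.
    static String decodeChunks(byte[] first, byte[] second) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder();
        // UTF-8 never yields more chars than bytes, so this is large enough.
        CharBuffer out = CharBuffer.allocate(first.length + second.length);
        ByteBuffer in = ByteBuffer.allocate(first.length + second.length);

        in.put(first);
        in.flip();
        decoder.decode(in, out, false); // false: more input may follow
        in.compact();                   // keep undecoded trailing bytes
        in.put(second);
        in.flip();
        decoder.decode(in, out, true);  // true: this is the last chunk
        decoder.flush(out);
        out.flip();
        return out.toString();
    }

    public static void main(String[] args) {
        // U+00E9 is the two bytes C3 A9 in UTF-8; split it across chunks.
        byte[] first = { 0x68, (byte) 0xC3 };
        byte[] second = { (byte) 0xA9, 0x69 };
        System.out.println(decodeChunks(first, second).length() + " chars");
    }
}
```

Decoding each chunk on its own, for example with new String(chunk,
StandardCharsets.UTF_8), would not reproduce the split character, which
is why the pending state has to be carried between calls.
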
> 
> You could say: "So? Kaffe uses ISO-Latin-1 as default encoding. That's
> stateless.". But unfortunately the default encoding comes from the
> file.encoding system property, which can be changed by the user [2].
> Don't rely on the default encoding being ISO-Latin-1.
> 
> Kaffe does some sort of caching already, but it instantiates
> a new converter every time one is needed, which is not necessary for
> stateless converters, as you've pointed out.
> 
> [1] If you have a Linux installation around, take a look at
> /usr/share/i18n/charmaps/UTF8. It might have a slightly different name
> on your installation, though, since character encodings usually have
> several aliases. 
> 
> [2] Well, sort of. While Java 2 allows system properties to be set,
> kaffe has not caught up with that yet, as far as I know. So the only
> way I know of to change the default encoding is to modify it in
> libraries/clib/native/System.c and recompile kaffe.
> 
> 