[kaffe] Slow byte to char conversion
Dalibor Topic
robilad at yahoo.com
Mon Aug 28 01:01:14 PDT 2000
oops! somehow an unfinished version of this e-mail escaped from my
computer. here's the full text:
Hi Godmar,
sorry for the delay, but I was on holiday last week, and away from my
mail.
On Sat, 19 Aug 2000, you wrote:
> From what I understand, and someone correct me if I'm wrong,
> there shouldn't be any reason not to include the change you suggest -
> if someone implements it, of course.
Done. I have a patched version of Encode.java. I'll
clean it up when a definite solution has stabilized.
> If I understand your proposal right, you'd use an array for
> the first 256 values and a hashtable or something like that
> for the rest. I don't think there would be a problem with changing
> it so that it would both serialize an array and a hashtable.
> One or two objects in *.ser shouldn't make a difference.
Yes. It should work nicely for ISO-8859 based encodings, and
then some.
Actually, for byte to char conversion you don't even
need a hash table, since all ISO-8859-X encodings assign unicode chars
(simply speaking) to byte values in the range 0-255.
For the reverse direction (char to byte conversion) I'd need to do some
experiments to figure out a better way. In most character to byte
encodings, the mapped characters don't all fall into a single contiguous
range from character x to character y, so a purely array-based approach
is space-inefficient. A combination of arrays and hash maps might be
interesting. But for the time being, I'm playing around with
java.io.InputStreamReader, so I'm trying to fix byte to char conversion
first.
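To make the array-plus-hash idea for char to byte conversion concrete, here's a
rough standalone sketch (untested against kaffe, and all names are invented, not
actual kaffe.io classes): a dense array covers one contiguous block of
characters, and a hash map catches the stragglers outside it.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: char-to-byte table with a dense array for one
// contiguous character block and a hash map fallback for everything else.
public class CharToByteTable {
    private final char blockStart;   // first char covered by the array
    private final byte[] block;      // dense mapping for the block
    private final Map<Character, Byte> sparse = new HashMap<Character, Byte>();

    public CharToByteTable(char blockStart, byte[] block) {
        this.blockStart = blockStart;
        this.block = block;
    }

    public void putSparse(char c, byte b) {
        sparse.put(c, b);
    }

    // Returns the encoded byte, or '?' if the char is unmappable.
    public byte convert(char c) {
        int offset = c - blockStart;
        if (offset >= 0 && offset < block.length) {
            return block[offset];            // fast array path
        }
        Byte b = sparse.get(c);              // slow hash path
        return (b != null) ? b.byteValue() : (byte) '?';
    }

    public static void main(String[] args) {
        // ASCII block 0x20-0x7E maps to itself; one sparse entry outside it.
        byte[] ascii = new byte[0x7F - 0x20];
        for (int i = 0; i < ascii.length; i++) {
            ascii[i] = (byte) (0x20 + i);
        }
        CharToByteTable t = new CharToByteTable((char) 0x20, ascii);
        t.putSparse('\u20AC', (byte) 0xA4);      // e.g. euro sign in ISO-8859-15
        System.out.println((int) t.convert('A'));        // 65
        System.out.println(t.convert('\u20AC') & 0xFF);  // 164
        System.out.println((int) t.convert('\u0001'));   // unmapped -> 63 ('?')
    }
}
```

The array stays small (one block, not 65536 entries), while the hash map only
pays its per-entry overhead for the characters that actually fall outside the
block.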
> You could even stick a flag at the beginning if the array shouldn't
> pay off for some encodings.
I'd prefer a more class hierarchy based approach. We already have
kaffe.io.ByteToCharHashBased. We could have ByteToCharArrayBased, too.
Something like this (warning: untested code ahead):
abstract public class ByteToCharArrayBased extends ByteToCharConverter {
    // map is used to map characters from bytes to chars. A byte
    // value b is mapped to character map[b & 0xFF].
    private final char[] map;

    public ByteToCharArrayBased(char[] chars) {
        map = chars;
    }

    public final int convert(byte[] from, int fpos, int flen,
                             char[] to, int tpos, int tlen) {
        // Since it's a one to one encoding, assume that
        // flen == tlen.
        for (int i = flen; i > 0; i--) {
            to[tpos++] = convert(from[fpos++]);
        }
        return flen;
    }

    public final char convert(byte b) {
        return map[b & 0xFF];
    }

    public final int getNumberOfChars(byte[] from, int fpos, int flen) {
        return flen;
    }
}
Now a (byte to char) conversion class has three choices:
a) it uses all byte values from 0-255 -> it extends
ByteToCharArrayBased, and makes the constructor use the
appropriate char array.
b) the encoded byte values are sparsely distributed through the range
of all legal byte values -> it extends ByteToCharHashBased due to
its space efficiency.
c) there is a huge block of bytes used in the encoding, but there are
also many bytes outside that block's range used in the encoding -> it
extends ByteToCharConverter and uses fields for ArrayBased as well as
HashBased conversion. The convert method checks whether a byte is
within the block and uses the array, or uses the hash table otherwise.
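Choice c) could look roughly like this (another untested standalone sketch with
invented names, not actual kaffe.io classes): the convert method tries the
dense array block first and falls back to a hash table for bytes outside it.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of choice c): one dense byte block served by an
// array, bytes outside the block served by a hash table.
public class ByteToCharHybrid {
    private final int blockStart;    // first byte value of the dense block
    private final char[] block;      // block bytes -> chars
    private final Map<Integer, Character> sparse =
        new HashMap<Integer, Character>();

    public ByteToCharHybrid(int blockStart, char[] block) {
        this.blockStart = blockStart;
        this.block = block;
    }

    public void putSparse(int b, char c) {
        sparse.put(Integer.valueOf(b & 0xFF), c);
    }

    public char convert(byte b) {
        int v = b & 0xFF;
        if (v >= blockStart && v < blockStart + block.length) {
            return block[v - blockStart];    // byte is within the block: array
        }
        Character c = sparse.get(Integer.valueOf(v));  // otherwise: hash table
        return (c != null) ? c.charValue() : '\uFFFD'; // replacement if unmapped
    }
}
```

The per-byte cost is one range check plus either an array index or a hash
lookup, so the common case (bytes inside the block) stays as cheap as the pure
array approach.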