[kaffe] Slow byte to char conversion

Dalibor Topic kaffe@rufus.w3.org
Mon, 28 Aug 2000 10:01:14 +0200


oops! somehow an unfinished version of this e-mail escaped from my
computer. here's the full text:

Hi Godmar,

sorry for the delay, but I was on holidays last week, and away from my
mail.

On Sat, 19 Aug 2000, you wrote:
> From what I understand, and someone correct me if I'm wrong,
> there shouldn't be any reason not to include the change you suggest -
> if someone implements it, of course.

Done. I have a patched version of Encode.java. I'll
clean it up once a definite solution has stabilized.

> If I understand your proposal right, you'd use an array for
> the first 256 values and a hashtable or something like that 
> for the rest.  I don't think there would be a problem with changing 
> it so that it would both serialize an array and a hashtable.
> One or two objects in *.ser shouldn't make a difference. 

Yes. It should work nicely for ISO-8859 based encodings, and
then some.

Actually, for byte to char conversion you don't even
need a hash table, since all ISO-8859-X assign unicode chars 
(simply speaking) to byte values in the range 0-255. 

For the reverse way (char to byte conversion) I'd need to do some
experiments to figure out a better way. In most character to byte
encodings, there is no single range from character x to character y
where all characters are mapped from. So the array based approach is
space-inefficient. A combination of arrays and hashmaps might be
interesting. But for the time being, I'm playing around with
java.io.InputStreamReader, so I'm trying to fix byte to char conversion
first.

> You could even stick a flag at the beginning if the array shouldn't
> pay off for some encodings.

I'd prefer a more class hierarchy based approach. We already have
kaffe.io.ByteToCharHashBased. We could have ByteToCharArrayBased, too.
Something like this (warning: untested code ahead):

abstract public class ByteToCharArrayBased extends ByteToCharConverter {

	// map is used to map bytes to chars. A byte
	// value b is mapped to the character map[b & 0xFF].
	private final char[] map;

	public ByteToCharArrayBased(char[] chars) {
		map = chars;
	}

	public final int convert(byte[] from, int fpos, int flen,
			char[] to, int tpos, int tlen) {
		// Since it's a one-to-one encoding, assume that
		// flen == tlen.
		for (int i = flen; i > 0; i--) {
			to[tpos++] = convert(from[fpos++]);
		}
		return flen;
	}

	public final char convert(byte b) {
		return map[b & 0xFF];
	}

	public final int getNumberOfChars(byte[] from, int fpos, int flen) {
		return flen;
	}
}
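To make the idea concrete, here is a self-contained toy version of the
same technique, outside kaffe's class hierarchy (the class and method
names are made up for this sketch). It builds the table for ISO-8859-1,
where each byte maps to the Unicode character with the same value:

```java
public class ArrayBasedDemo {

	// Toy ISO-8859-1 table: byte value i maps to the character (char) i.
	private static final char[] MAP = new char[256];
	static {
		for (int i = 0; i < 256; i++) {
			MAP[i] = (char) i;
		}
	}

	// Array lookup: constant time, no hashing, no boxing.
	public static char convert(byte b) {
		return MAP[b & 0xFF];
	}

	public static String convert(byte[] from) {
		StringBuilder to = new StringBuilder(from.length);
		for (byte b : from) {
			to.append(convert(b));
		}
		return to.toString();
	}
}
```

The whole table costs 256 chars (512 bytes), and a conversion is a
single bounds-checked array access.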

Now a (byte to char) conversion class has three choices:
a) it uses all byte values from 0-255 -> it extends
ByteToCharArrayBased, and makes the constructor use the
appropriate char array.
b) the encoded byte values are sparsely distributed through the range
of all legal byte values -> it extends ByteToCharHashBased due to
its space efficiency.
c) there is a huge block of bytes used in the encoding, but there are
also many bytes outside that block's range used in the encoding -> it
extends ByteToCharConverter and uses fields for ArrayBased as well as
HashBased conversion. The convert method checks whether a byte is
within the block and uses the array, or uses the hash table otherwise.

From my experience (converting ISO-8859-X encodings from being hash
based to being array based), for byte to char conversion option (a)
takes little memory (256 chars for the table) and is very fast. As I
explained in my previous post, it beats option (b) in time-efficiency.
I suppose it beats it in space-efficiency as well, as long as most
bytes are convertible into characters.

When there are only a few legal byte values that can be encoded into
characters, the hash based conversion could be more space-efficient. On
the other hand, the array based implementation doesn't waste much
memory in that case. Even in the worst case, a fictitious encoding that
uses only a single specific byte to encode some character, there are
255 * 2 = 510 bytes wasted. That's not much, and can be improved upon
by introducing range checks and similar techniques.

The choices really start to matter when you're going the other way
round, from chars to bytes. Take a look at ISO-8859-8 (a.k.a. Hebrew).
It encodes 220 characters. Of these 220, only 32 do *not*
map a byte value to itself. They are either mappings into the
range between \u05D0 and \u05EA, mappings within the first 256
characters, or mappings to a few special characters like LEFT-TO-RIGHT
MARK.

So what you could do is have two arrays for the ranges
\u0000-\u00FF and \u05D0-\u05EA, and a hashtable for the rest. Or you
could have one array (\u0000-\u00FF) and a hashtable. Or just a
hashtable. Or a massive array if all you care about is time-efficiency.

If the encoder is clever enough, it could realize there is a simple
relation between the characters in the hebrew part of the Unicode and
their byte values: byte value = unicode value - 0x04F0, and use that to
eliminate the need to explicitly store this part of the encoding. This
leaves us with 5 Unicode-to-byte mappings that need to be stored in a
very space-efficient implementation. Unless, of course, there is some
other relation which can be exploited.
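Such a tiered lookup might be sketched like this. The ranges and the
offset follow the ISO-8859-8 description above; the identity mapping of
the low range is a simplification, and the fallback table entry is an
invented placeholder, not real ISO-8859-8 data:

```java
import java.util.HashMap;
import java.util.Map;

public class TieredCharToByte {

	// Fallback table for the few characters with no simple relation.
	// The entry here is an invented placeholder.
	private static final Map<Character, Byte> SPECIALS = new HashMap<>();
	static {
		SPECIALS.put('\u200E', (byte) 0xFD); // e.g. LEFT-TO-RIGHT MARK
	}

	public static byte convert(char c) {
		if (c < 0x100) {
			return (byte) c;            // low range: identity (simplified)
		}
		if (c >= 0x05D0 && c <= 0x05EA) {
			// Hebrew block: byte value = unicode value - 0x04F0
			return (byte) (c - 0x04F0);
		}
		Byte b = SPECIALS.get(c);
		return (b != null) ? b : (byte) '?'; // unknown character
	}
}
```

The two range checks handle almost the whole encoding for free; only
the handful of leftover mappings pay for hash table storage.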

On a side note, it might be interesting to look at how GNU recode
handles such things for further inspiration, if you're looking for
more ideas.

I've come up with switch based converters, if you really need to squeeze
out as much space as possible:

public final static byte convert (char c) {
	// check if character c is in the part of Unicode used in
	// this encoding
	if ((this_encodings.first_range.MIN_VAL <= c
	  && c <= this_encodings.first_range.MAX_VAL)
	|| (this_encodings.second_range.MIN_VAL <= c
	  && ...)
	|| ...)
	{
		switch (c) {
		// Only create cases for characters that are not mapped
		// to themselves / via a simple relation.
		case '...':
			return ...;
		// The default case: c is mapped to itself,
		// or you could implement some other relation here.
		default:
			return (byte) c;
		}
	}
	else {
		return unknown_character;
	}
}

As an added bonus, object creation with switch based encoders is cheaper
than with other forms. You don't need fields, so you don't
need to serialize anything. The conversion methods can be static to
improve performance.
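A concrete toy instance of the pattern (the range and the single
special-case mapping are invented for illustration, not taken from any
real encoding):

```java
public class SwitchBasedDemo {

	private static final byte UNKNOWN = (byte) '?';

	// Static and field-free: nothing to serialize, nothing to
	// instantiate, no per-conversion object creation.
	public static byte convert(char c) {
		if (c <= 0x00FF) {
			switch (c) {
			// Only characters that are not mapped to
			// themselves get a case.
			case '\u00A4':                  // invented special case
				return (byte) 0xA8;
			default:
				return (byte) c;        // mapped to itself
			}
		}
		return UNKNOWN;
	}
}
```

The compiled switch carries only the exceptional mappings; everything
else falls through to the default relation.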

The question I'd like to ask is: What would you optimize for: time or
space?

> One would have to see what the actual sizes of the .ser files
> would be; keeping those small is certainly desirable.  From what
> I understand, they're more compact than any Java code
> representation.  Edouard would know more since he wrote that
> code, I think.

My patched version of Encode.java currently generates Java source code,
which gets regularly compiled with kjc. The generated file sizes are

  2433 Wed Aug 16 13:44:02 GMT+0:00 2000 kaffe/io/ByteToCharISO8859_2.class

vs.

  2274 Wed Aug 16 13:44:12 GMT+0:00 2000 kaffe/io/ByteToChar8859_2.ser

compiled using kaffe's current kjc with standard settings. Using -O
with kjc results in a 2277-byte class file. That's about the
same.

> On a related note, this whole conversion thing stinks.
> Why can't people stick to 7-bit ASCII?

The whole conversion issue is an example of historical cruft. Over
time, people toyed around with different encodings, because a) ASCII
left them 128 character slots to play with on 8-bit machines, and b)
ASCII is just not good enough if you want to write more demanding
languages (in terms of character sets), like Greek, Hebrew, German,
Chinese [1].

Whatever. The world is hopefully going towards Unicode now, but there
is a massive amount of legacy encoded files in whatever their native
encoding is. And people want to exchange them, subversive as they are
:) 

MIME has hooks for encodings, HTTP has hooks for them, and Java is the
first popular programming language that realizes that the world doesn't
*need* to write its files in Latin script to use computers, and it
provides somewhat convenient means to handle the issues.

Just be glad we don't have to handle various transliterations of
non-english scripts into ASCII, which are used by people without good
keyboard layouts. :) 

For example, how do you type the German letter
o-umlaut, which is an o with two dots on top, in ASCII? I've seen
o" (LaTeX), &ouml; (HTML), o: (arty people), and oe (usually found in
e-mail from people who can't be bothered learning another keyboard
layout's definitions for special characters). I could imagine 'o' or
°o° being used by somebody out there, too. :)

> For instance, the JVM98 jack benchmark calls PrintStream.print
> a whopping 296218 times in a single run.  Every call results in a
> new converter object being newinstanced, just to convert a bunch
> of bytes.  (The new converter was one of the changes done to make
> the charset conversion thread-safe.)  This is one of the reasons
> why we're on this test some 7 or 8 times slower than IBM.

If I understand the issues involved properly, you need to create new
instances of converter objects only if your converter has to conserve
its state between consecutive calls. That's only the case if you have a
multibyte encoding, like UTF-8, and you might not have provided enough
bytes to fully decode the character in the last call.
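To illustrate why such state has to survive between calls, here is a toy
decoder that handles only ASCII and 2-byte UTF-8 sequences, with no
error handling (this is purely a sketch, not kaffe's actual converter
code):

```java
import java.util.ArrayList;
import java.util.List;

// A 2-byte UTF-8 sequence may be split across two convert() calls,
// so the lead byte must be remembered in a field between calls.
public class StatefulUtf8 {

	private int pendingLead = -1; // lead byte awaiting its trail byte

	public List<Character> convert(byte[] in) {
		List<Character> out = new ArrayList<>();
		for (byte value : in) {
			int b = value & 0xFF;
			if (pendingLead >= 0) {
				// Complete the sequence started in an earlier call.
				out.add((char) (((pendingLead & 0x1F) << 6) | (b & 0x3F)));
				pendingLead = -1;
			} else if (b < 0x80) {
				out.add((char) b);  // plain ASCII
			} else {
				pendingLead = b;    // incomplete: keep for the next call
			}
		}
		return out;
	}
}
```

A one-to-one converter has no such field, so a single shared instance
(or a static method) would do.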

Unfortunately, kaffe's encoding loading mechanism in
kaffe.io.ByteToCharConverter and kaffe.io.CharToByteConverter
always creates a new instance of an encoding. It should not have to do
so for one-to-one encodings. I'd like to fix that, too.
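One way to fix it might be to cache one shared instance per stateless
encoding. The names below are made up for this sketch and do not
reflect kaffe's actual loading mechanism:

```java
import java.util.HashMap;
import java.util.Map;

public class ConverterCache {

	interface Converter {
		char convert(byte b);
	}

	// Stateless converters can be shared: one instance per encoding,
	// instead of one per conversion call.
	private static final Map<String, Converter> CACHE = new HashMap<>();

	public static synchronized Converter getConverter(String enc) {
		Converter c = CACHE.get(enc);
		if (c == null) {
			c = createConverter(enc);
			CACHE.put(enc, c);
		}
		return c;
	}

	// Stand-in for the real loading code; returns a toy identity
	// converter regardless of the encoding name.
	private static Converter createConverter(String enc) {
		return b -> (char) (b & 0xFF);
	}
}
```

Stateful (multibyte) converters would still need a fresh instance per
stream, so the loader would have to distinguish the two cases.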

> And that's not even using any of the serialized converters, just
> the default one (which is written in JNI).

Why was a JNI based conversion chosen for ISO-Latin-1 in the first
place? ISO-Latin-1 essentially maps each byte to the Unicode
character with the same value. So almost everything the conversion
routine does is copy the bytes around. Is kaffe's interpreter/jit/jit3
slower than the JNI call with all its overhead?

If you've made it all the way through, I'd like to hear from you :)

Dali

[1] Personally, I think people should use a phonetic encoding
instead.

Imagine that! A world without spellcheckers! You'd pronounce
the items on the menu of a Chinese restaurant properly!

Still, the current problem of converting the documents written in
traditional scripts, like ASCII, would remain. Back to square one.

