This project is dedicated to the memory of William Morris (aka Frags), who was the main contributor to the bounty but was unable to see the final result.


Monday, July 28, 2014

PPCJITBETA04b (Hundreds and XThousands)

You learn something new every day; I haven't done this before: a fast follow-up to the previous beta release.

Let me brighten up the lives of the suffering Amiga X1000 owners; here is a new build for them:


After some investigation it turned out that MickJT had been right all along: we don't need any special changes for the Amiga X1000 (for the P.A. Semi PA6T processor), because it is similar to the already supported G5 (PowerPC 970) processor.

So, I decided that we must not delay the beta for the Amiga X1000 any longer. Feature-wise this new build is exactly the same as the Beta #4 release that was already done for the other platforms, except that it is compiled for AmigaOS4 with the same minor changes that increase the performance on the G5 significantly.

Since the code base is the same I kept the same beta version tag with a postfix. For the first and the last time... On my Pioneer's word of honour! :)

Caveat emptor

This build is not fully optimized for the PA6T processor, since there is no support for that processor target in the AmigaOS4 SDK yet. It still runs much better than the previous generic builds.
Hopefully PA6T support will be added to the SDK sooner or later, and then I will add any necessary changes to the E-UAE build too.

In the meantime the binary for this build target will be added to the upcoming releases.

Credz

Actually, the credit goes to the following good people:
  • Tobias Netzel - who implemented the G5 support and helped MickJT with advice;
  • MickJT - who had been experimenting with the build for a long time already and pushed me to do this release;
  • Sven Ottemann & Sebastian Bauer - who helped me with the documentation for PA6T;
  • Tommysammy - who was kindly doing the testing.
Thanks guys!

Feedback

This was a blindfolded build again: I have no access to an Amiga X1000, so I couldn't test the build by myself.

I would like to invite all Amiga X1000 owners to give me some feedback regarding the performance or any issues that turn up with this special build.

You can find my email address at the end of the README file.

Wednesday, May 21, 2014

PPCJITBETA03 (Switch to Ludicrous Speed)

Welcome back, long time no see, my friend. Please be seated.

I have some good news for you again: here comes Beta #3!

Get it here if you must:


I held this release back for a while just to fix the emulated cache checksum feature, but I have been chasing a bug for two weeks without any success. So, that fix is postponed to the next beta; in the meantime you can enjoy some significant speed enhancement and increased stability.

Without going into the details regarding the changes (see the included README for all changes) I would like to mention the most important change:

Vroom-vroom

The major feature of this beta release is the register and flag optimization fix. You can turn it on in the configuration, just set comp_optimize to true.
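For instance, in your E-UAE configuration file it is a single line (shown here on its own; the rest of your configuration stays as it is):

comp_optimize=true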

If you are interested in the details, I already explained how the optimization works in an earlier post, but in case you are too lazy to read through that post, here is my old diagram (just because it is beautiful, you know):

Code translation flow diagram

Let me summarize it for you: the JIT compiler collects information about the data-flow dependencies between the various macroblocks and tries to remove the ones which won't have any effect on the outcome of a certain block of macroblocks.
This is not a new feature in the JIT implementation, but previously a few (tons of) bugs prevented it from working on anything more complex than my Mandelbrot test.

In this release I have fixed every issue I have found so far with the optimization and it seems to be working quite nicely. You can boot AmigaOS and it runs just fine; games and demos will benefit from this feature too.
I was planning to do a comparison video where the speedup is clearly shown, but I haven't had much time yet, so this is your job now, dear EUAEPPCJIT fans! Just post the links to the videos into the comments here. :)

PPC970 aka G5

Not everything is sunshine and happiness, though. Supporting the G5 processor architecture target turned out to be much more complicated than I thought, especially because I don't have any hardware to test on.

In the previous release the MacOSX G5 binary was not working properly on the G5 (nor on any other PowerPC, as a matter of fact). Thanks to Luigi Burdo for the report and to Tobias Netzel (again) for the help with the compiler. This is fixed in this release, hopefully. (Fingers crossed, I still don't have hardware to test on.)

The situation with the MorphOS G5 version is not that hunky-dory, though: it seems there is no official compiler with G5 support in the MorphOS SDK yet, and it is rather complicated and unreliable to compile any source for that processor. Until this situation improves, the G5 version for MorphOS won't be available among the beta binaries.

However, nothing stops you from compiling your own version from the sources, as these are always available at SourceForge.

Upcoming

As I mentioned: I postponed the fix for the block checksum to the next release and also picked up some things to do. You can find the planned list here:


I also had a look at what is planned for the first stable release and moved some items around between the various milestones. If you are curious, just click on the milestones on the SourceForge page.

Monday, November 4, 2013

On a collision course

So, where are we at with the recent update? 349 out of 387: no, it is not 100% yet, but a nice, fat 90.1%!

Quickly, here is what was done this time; let's jump to the conclusions right after that:
  • Implementation of BFINS Dx,Da{y,z}, BFFFO Dx{y,z},Da, BFEXTS Dx{y,z},Da, BFEXTU Dx{y,z},Da, BFCLR Dx{y,z}, BFTST Dx{y,z}, BFSET Dx{y,z}, BFCHG Dx{y,z}, ROXR.x Dy,Dz, ROXR.W mem, ROXR.x #imm,Dy, ROXL.x Dy,Dz, ROXL.W mem, ROXL.x #imm,Dy, ASR.W mem, ASL.W mem, LSR.W mem, LSL.W mem, ROR.W mem, ROL.W mem instructions.
  • Opcode compiler functions for different operation sizes were unified, unnecessary helper functions were removed.
  • Macroblock protos are generated from the opcode table file rather than using the manually prepared file.
  • Fixed wrong function name for the CNTLZW PowerPC instruction emitter.
  • Fixed missing input register in C and X flag extraction macroblock which is used in some shift instructions.
  • Fixed register dependency in ASR.x Dy,Dz and LSR.x Dy,Dz instructions.
(Slightly unrelated: I have found a bug in the original E-UAE code for BFINS and flags; if I have enough patience I will fix that. And another one in Petunia, that will be fixed too...)

Now, we are getting dangerously close to the beta stage. It is really hard not to give you guys promises that I cannot keep later on... (Summer is coming, you know... ;)

Anyway, my plans for the near future (read: when it is done) are:
  1. After I have finished implementing all the instructions that were anticipated earlier, I am going to stabilize the emulator a bit and clean up the code where needed.
  2. I will try to release a compiled beta version and put together some documentation for using/testing it.
  3. There are some outstanding issues; one of the most critical ones is fixing the problems with the macroblock optimizer.
So much for the crystal ball and now...

Something completely different

I have to admit that my posts have not been that interesting to read recently. Earlier I invested more effort into the posts and they were probably more fun to read, especially for developers.
I would like to bring back that tradition, so for this update I came up with a few interesting thoughts on:

How to avoid the decisions

I know that many developers love to procrastinate on anything, including (but not limited to) decisions. But what I am about to write is not about how lazy my fellow developers are, but rather about how to get around a situation where it is not ideal to do a comparison and branch according to the result.

Why would that be important at all?
There are a number of reasons why it is better to avoid branching; this article lists many examples and also explains the reasoning behind them.
In our case none of those reasons apply, but unfortunately branching is still not really possible, due to the static flow analysis on the macroblock list.

To put it simply: each macroblock depends on the results of the previously executed macroblocks; if we skip ahead, then it is impossible to tell whether the required macroblocks were executed before the dependent ones or not.

Why does this cause any trouble? Because sometimes it is pretty hard to avoid conditions inside the instruction implementations.

I already faced this issue earlier, when I started to work on the conditional branching and setting instructions (Bcc and Scc). There was no way to avoid the conditional branching for these instructions, due to their very nature.
I ended up creating a very specific macroblock which embeds the condition checking and branching, so the "inter-macroblock" flow analysis remains intact.

In the recent update you can find many bit field instructions which were complicated enough already and then the very same issue came up again:

For the bit field instructions a 0 (zero) bit field width actually means 32 bits. This doesn't sound that bad; however, sometimes the bit field width comes from an emulated data register instead of a statically encoded constant. In this case the decision must be made in the emulated code instead of at compile time.

Not in every case, but quite often it is possible to calculate the result instead of making a comparison and branching.

For the bit field instructions the solution was (written in C code, just because it is easier to understand):

temp = width - 1;
width = width | (temp & 32);

(There is one condition, though: the width must be between 0 (zero) and 31 before this operation starts.)

Now, think about it a little bit and figure out on your own why it works. I am not going to explain it. :)
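If you would like to convince yourself that it really behaves as advertised (without spoiling the "why"), a quick throwaway check in plain C could look like this:

#include <stdio.h>

int main(void)
{
    for (int width = 0; width < 32; width++) {
        int temp = width - 1;
        int adjusted = width | (temp & 32);
        /* expectation: 0 turns into 32, every other width stays untouched */
        printf("%2d -> %2d\n", width, adjusted);
    }
    return 0;
}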

See you soon.

Wednesday, October 2, 2013

Back on track

Finally, I am back on track with the development after the house move. There were two updates since the last post; one was a follow-up to the previous code refactoring; the other was a massive list of improvements:
  • Implementation of NEGX.x mem, NEGX.x Dy, SUBX.x -(Ay),-(Az), ADDX.x -(Ay),-(Az),  MOVE CCR,mem, MOVE CCR,Dx, EOR #imm,CCR, OR #imm,CCR, AND #imm,CCR, RTR, MOVE mem,CCR, MOVE Dx,CCR, MOVE #imm,CCR, ASR.x Dy,Dz, RTD, MULU.W mem,Dx, MULU.W #imm,Dx, MULS.W mem,Dx, MULS.W #imm,Dx, SUBX.x Dy,Dz, ADDX.x Dy,Dz, MOVEM.x regs,mem, MOVEM.x mem,regs, MOVEM.x (Ay)+,regs, CMPM.x (Ax)+,(Ay)+ instructions.
  • Added dependency tracking for non-volatile PowerPC registers.
  • Fixed X flag handling in register-based shifting instructions, previously the X flag was cleared together with the C flag if the shift steps were zero.
  • Removed RTM from the list of the potentially supported opcodes.
  • Added RTR back to the list of the potentially supported opcodes.
  • Optimized temporary register usage in MULU.W Dx,Dy and MULS.W Dx,Dy instructions.
  • Introduced tracking of the extension words after the instructions, it is needed for adjusting the PC before certain addressing modes are processed.
  • Fixed register dependency and order of register storage for MOVEM.x regs,-(Ay) when direct memory access is enabled.
  • Implemented stack-like concept for register saving into the context.
  • Code cleanup: removed unused reference, fixed some warnings regarding misformatted and unused code lines.
So, things are slowly coming together. The number of implemented instructions went up to 321 out of the planned 387! That is roughly 82%... Getting closer and closer... :)

Recently I faced an interesting problem which I am not quite sure how to solve: division by zero. My beloved math teacher told me that whoever tries to divide by zero is an idiot. (Well, that is not quite right, as we know.) Yet, some programs might try it.
Why is that a problem? Because it triggers an exception inside a compiled block. It also needs branching (skipping the exception triggering if the divisor is not zero, for a change), which contradicts the macroblock register flow tracking. Well, here is the challenge, but I am pretty sure I will solve it somehow.

'Til then the usual: watch this space.

Sunday, May 19, 2013

One small step for mankind, a giant leap for the project

I have no idea how I managed to achieve this much in this update, but it is certainly a confident step forward. This time the list is long and diversified:
  • Implementation of Bcc.x addr, BCHG.B Dx,mem, BCHG.L Dx,Dy, BCLR.B Dx,mem, BCLR.L Dx,Dy, BRA.x abs, BSET.B Dx,mem, BSET.L Dx,Dy, BSR.x abs, BTST.B #imm,mem, BTST.B Dx,#imm, BTST.B Dx,mem, BTST.L Dx,Dy, CMP.x #imm,mem, CMP.x mem,Dy, CMP.x reg,Dy, CMPA.L reg,Ax, CMPA.W reg,Ax, CMPA.x mem,Ay, DBF.W Dx,addr, EOR.x #imm,mem, JMP.L abs, JMP.L mem, JSR.L abs, JSR.L mem, NEG.x mem, NOT.x mem, RTS, TAS.B Dx, TAS.B mem instructions.
  • Cache invalidation fix for OSX 10.3.9 and below. (Thanks to Mike Blackburn again.)
  • Fixed mask handling in BCHG.B Dx,mem instruction.
  • Fixed missing register mapping in ASL.x #imm,Dy implementation.
  • Fixed input dependency overwriting in certain memory-related allocation functions.
  • Fixed dependency for destination memory pointer register in special memory reading.
  • Fixed post address handler for condition code addressing modes, previously it might crash or call some random handler from the other addressing modes.
  • Fixed instructions where temporary registers are allocated but not free'ed.
  • Optimized masking for register-to-register bit instructions.
  • Optimized the temporary register usage in helper_test_bit_register_register function.
  • Optimized flag extraction in several shifting operations.
  • Branch scheduling is more flexible: adding multiple interleaved branches is possible.
  • Comment on missing implementation for an exception on loading odd address into PC.

A few highlights

First of all, let me brag a little bit about the number of freshly implemented instructions. Right now 237 instructions are implemented out of 388, so a solid 61% is done. (Previously the ratio was ~46%.)

More MacOSX versions are supported now, Mike fixed up the cache flushing a little bit and added the pre-10.4 versions too. Please read the included README file regarding the compiling instructions.

While I was working on the instructions I discovered a few bugs and glitches, which are now fixed in this release thus improving the overall stability.

I have also managed to optimize the compiled code for some instructions. Together with the implementation of some yet missing instructions the results for the Mandelbrot test (mandel_though_hw.kick.gz among the test kick files) improved a bit compared to the previous results:

Interpretive: 108 seconds (no change there...);
JIT compiled without optimization: 44 seconds (previously it was 52 seconds);
JIT compiled with optimization: 27 seconds (previously it was 32 seconds).

That is enough self-congratulation for now, back to work...

Wednesday, February 6, 2013

Can I haz time machine?

Another busy month passed, here comes the new update with lots of bug fixes and some freshly implemented instructions:
  • Implementation of
    • AND.x reg,reg;
    • EOR.x reg,reg;
    • EOR.x reg,mem;
    • SUBA.x reg,reg;
    • SUBA.x #imm,reg;
    • BSET.L #imm,reg;
    • BCLR.L #imm,reg;
    • EXG.L reg,reg and
    • NOP :) instruction.
  • Reorganized MOVE.x mem,mem instruction: memory reading and writing is separated out into independently callable functions to support the implementation of other similar instructions.
  • Implementation of OR.x reg,reg instruction was adapted to a more generic form.
  • Fixed flag checking for AND(I).(W|B) #imm,reg instructions.
  • Removed confusing OR immediate macroblock and replaced by the already available OR low immediate.
  • Falling back from addressing mode d16(ax) to (ax) if the offset is zero.
  • Fixed dependency flag handling in result check helper function.
  • Cleaned up opcode description table. Removed instructions from the table and from source code which won't be supported by the JIT compiler. Added missing RTM instruction.
  • Temporary register storage slots were moved from the stack frame to the Regs structure (context).
  • Removed useless debug log that flooded the output.
  • Temporarily disabled MOVE.x mem,mem and EOR.x reg,mem instructions to let the Kickstart run.

I have good news and not-so-good news

Which one should we start with? Okay, let me choose for you: not-so-good news first.
At the moment (beside the yet unimplemented instructions) there are two bugs that prevent the OS from running properly:
  1. For some unknown reason, when an instruction reads and then writes memory, the Kickstart goes back to the well-known reboot loop (actually it is a crash; sometimes you can even see the Guru message).
    At the moment I don't have the slightest idea why this happens. I traced it back to the normal MOVE instruction, when it copies data from one address to another. There is nothing wrong with the instruction implementation itself - as far as I can tell. So, this is some weird sh*t again. Lovely.
  2. When the optimize flag is turned on (comp_optimize = true in the config) then the Kickstart crashes very early. To tell you the truth, I am not surprised. While I was trying to figure out the register dependencies for each macroblock of each instruction, I sometimes mixed things up, as I found out later on. Due to the wrong register dependency settings some macroblocks were accidentally removed, so some code got "optimized away". Most likely there are more mix-ups and missing dependencies in the code; finding them is just a matter of time.
Good news:
As you can see in the last item of the update: I temporarily disabled the two already implemented instructions which try to read and then write memory. The missing instructions are substituted by their interpretive counterparts.
Now, if you disable the optimize option in the configuration (comp_optimize=false) then the Kickstart seems to be working with some actual JIT compiled instructions! YAY! \o/ (I guess.)

Some more interesting problems

I have got some feedback about issues with running Kickstart 1.3 and some old Amiga 500 games when the JIT is enabled. This is interesting indeed, because even if the JIT was turned on it should not have touched this old code, since the cache is never turned on (there was no cache in the Motorola 68000 processors when these programs were written).
As you can read in the FAQ: the JIT compiling depends heavily on the cache emulation. So, this is one more question mark. I haven't had much time to investigate it yet.

But most importantly: the Summer Sun is shining, let's go surfing! (Oh, wait. I don't do surfing. Last time when I jumped onto a bodyboard on the beach I bruised my ribs. Embarrassing and totally geeky. Let's just surf the net, shall we?)

Tuesday, October 23, 2012

Address Unknown

Finally, all the normal addressing modes are done in this update. As I mentioned in the previous post: the final missing pieces were the 68020 (complex) addressing modes.
This took me a lot longer than I expected, not just because these addressing modes are complicated, but also because memory reading is involved.

Secretly I hoped that this would fix the issue with the OS, but no. I am not sure why I expected that. :P
The OS still behaves the same: reboot loop.

You can stop reading now, technical details will follow.

Why does memory access make everything much more complicated?

The answer is: every memory access has to go through the memory handling functions in the emulator. I already implemented the possibility of direct memory access, but we cannot depend on that all the time (or rather, most of the time we cannot).

Then why does that cause any headaches? Because these functions are outside of the translated code, they are written in C, and all assigned temporary registers will be gone by the time they return.

Previously, for the MOVE instructions, I worked around the situation by remapping and reloading only the required temporary registers as soon as the memory operation was done. This approach simply does not work for the addressing mode translation: the addressing modes are completely isolated subroutines which do not know any details of the already allocated and mapped registers (that would have to be restored).

For now I implemented a different workaround for the addressing modes, which is far less optimal: instead of dropping all the temporary registers, I save them on the stack and then restore them when the execution returns to the translated code.

Needless to say how much slower this is: saving and reloading all mapped temporary registers instead of dealing with only the absolutely required ones. I am not satisfied with this solution, but at least it is working. Unless the called function expects a proper stack frame, because there won't be any. (At least this is not the case for OS4, and most likely not for MorphOS either.) This solution is not strictly compatible with the SysV ABI.
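Just to illustrate the idea, here is a rough C sketch of the compile-time wrapper; every helper name below is made up for this sketch, the real code simply emits the corresponding PowerPC stores, the call-out and the loads into the translation buffer:

/* hypothetical emitter helpers -- stand-ins for the real PowerPC emitters */
extern void emit_store_word(int host_reg, int base_reg, int offset); /* stw */
extern void emit_load_word(int host_reg, int base_reg, int offset);  /* lwz */
extern void emit_call(void *func);                                   /* call out to C */

#define REG_SP 1   /* r1 is the stack pointer on PowerPC */

/* spill every mapped temporary register, call the memory handler, then reload
   them all; the registers are parked relative to the stack pointer without
   building a proper frame, which is exactly the shortcut mentioned above */
static void call_memory_handler(void *handler, const int *mapped_regs, int count)
{
    for (int i = 0; i < count; i++)
        emit_store_word(mapped_regs[i], REG_SP, i * 4);

    emit_call(handler);

    for (int i = 0; i < count; i++)
        emit_load_word(mapped_regs[i], REG_SP, i * 4);
}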

It would probably make more sense to store the temporary registers in the static context structure; there is no need to be reentrant, since there are no callbacks into the translated code. Maybe I can revisit this whole piece of code at a later stage.

Monday, October 8, 2012

Jinxed it (more Apples)

Speaking of apples. As it seems, I managed to upset the Gods with the previous post somehow: they unleashed their wrath on me. Apple released iOS6.

The funny part was: the developer pre-release was all hunky-dory with our app, but then came the final release and things changed dramatically overnight. Since I have been working as a mobile app developer recently, I had to spend a lot of time on it, even some weekend day(s). Okay, I'll stop grumbling; we finally managed to get everything under control. It was stressful and annoying.

So, you can probably guess why the recent update is so thin: when I finally managed to get home I was fully drained, I had no strength left to look at one more line of code.
I have implemented the three missing addressing modes, which are included in this update. I also started to work on the complex addressing modes (also called 68020 addressing modes), which are... well... really complex. I am less than halfway through with them, but in the meantime I didn't want to hold this back:

G5 support

While I suffered in deep agony, Tobias Netzel was luckily busy again and managed to fix up E-UAE for the G5 (PowerPC 970) processors. He got rid of the emulated, unsupported mcrxr instruction (see the chapter about mcrxr) for this processor type. He also did some optimizations regarding the microcoded instruction usage, but we will probably need to do more about that at some point.

The situation was very much like the 68060 and the unsupported FPU instructions that were emulated by the OS. If you remember how much Oxypatcher increased the speed of some floating-point-intensive apps back then, you can understand why E-UAE was so sluggishly slow on the G5 before: almost half of the emulated instructions make use of the mcrxr instruction for emulating some arithmetic flags.
With the recent changes this PPC instruction is not used at all if the emulator is compiled for the G5, and the fix helps a lot with the interpretive emulation too.

Some benchmarks from Tobias using the Mandelbrot test (G5 - 2.1 GHz):

Interpretive emulation:
  • using mcrxr: 5:02 secs
  • without mcrxr: 59 secs
JIT without flag optimization:
  • using mcrxr: 7:46 secs (yes, even slower than the interpretive...)
  • without mcrxr: 27.5 secs
JIT with flag optimization:
  • using mcrxr: 1:47 sec
  • without mcrxr: 18 sec
Well done, Tobias! The PPC Mac users will be grateful again.
By the way, these changes might have an effect on the Cell PPE and Xenon processors too. Any volunteering developers?

Bounty

Recently, I had a cautious look at the bounty page for the project and I was a bit shocked by the fact that Mr. William Morris donated 1000 EUR. That is as much as half of the previously collected bounty. Very generous of you.

Maybe this post is not the best opportunity for it, but thank you all for your support. I hope I will fulfill the expectations sooner rather than later.

Thursday, August 30, 2012

Optimize It

I have been back from the holidays for some time now, but I got swamped instantly by the work at my day job. I had very little energy left for the project in the evenings, and that was spent on two things: chasing that #@!% bug which is killing me, and implementing the data-flow optimization.

Needless to say, I failed to track down the bug again, just like every other time I spent hours looking at a couple of megabytes of log dumps. Next time I will try a new approach, suggested by Stephen Fellner: reducing the code complexity while the bug can still be reproduced.
We will see how it goes. The processor cache emulation certainly complicates things; that, at least, can be eliminated.

Having acknowledged the failure once again, I gathered all my previous thoughts and implemented the data-flow based optimization, which does a really nice job.

The update is small, but highly important again:
  • Implementation of code optimization.
  • A new configuration option was introduced: comp_optimize, to turn it on/off.
  • Bugs in the register input/output flag specifications are fixed.
  • Fixed a too-small string buffer in the 68k disassembler; previously the debugger crashed every time memory was disassembled to the screen.
  • Implementation of MOVEM.x regs,-(Ay) instruction.

What, how and why?

As I tried to explain earlier in this post, some of the compiled code is completely useless. An emulated instruction usually consists of the following parts:
  • Initialization of certain temporary registers;
  • Execution of the actual operation;
  • Altering the arithmetic flags according to the result;
  • Saving the result somewhere (into memory).
Some of these operations are not always needed, or can be done ahead for a series of instructions, like loading the previous data of the emulated registers into temporary registers, but there are parts that cannot be avoided if we examine the emulated instruction out of its context.

For example, the arithmetic flags are quite often overwritten by the following instructions, which means: if the next instruction(s) do not depend directly on the flag results, then we can remove the code that generates them.

Typically, if we (as mighty humans) look at the instructions in a block, we can easily identify the context; we can mostly tell which instructions produce usable results and which do not matter. As in most cases: simple for a human, but a really complex job for the compiler.

As I previously described: the emulated code is broken up into macroblocks by the precompiling. These blocks represent an "atomic" operation, like loading data into a temporary register, comparing two registers, or calculating certain arithmetic flags from the previous result.
When I started working on this concept I figured out that if I could identify exactly, for each macroblock, the previous result(s) it depends on and the result(s) it produces, then I could evaluate the dependencies between the subsequent macroblocks.
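To make this a little more tangible, here is a rough C sketch of what such dependency bookkeeping could look like. Every name, bit assignment and the macroblock breakdown below are made up for this illustration; the real structures in the source are more detailed:

/* hypothetical register/flag bit assignments, only for this sketch */
enum {
    RB_D0 = 1 << 0, RB_D1 = 1 << 1,     /* emulated 68k data registers */
    RB_T0 = 1 << 16, RB_T1 = 1 << 17,   /* temporary (host) registers  */
    RB_N = 1 << 24, RB_Z = 1 << 25, RB_V = 1 << 26, RB_C = 1 << 27, RB_X = 1 << 28
};

struct macroblock {
    unsigned int inputs;   /* what this macroblock reads  */
    unsigned int outputs;  /* what this macroblock writes */
    int removed;
};

/* a rough macroblock breakdown of ADD.W D0,D1 */
static struct macroblock add_w_d0_d1[] = {
    { RB_D0,         RB_T0,                     0 },  /* load D0 into a temporary     */
    { RB_D1,         RB_T1,                     0 },  /* load D1 into a temporary     */
    { RB_T0 | RB_T1, RB_T1,                     0 },  /* the actual addition          */
    { RB_T1,         RB_N | RB_Z | RB_V | RB_C, 0 },  /* arithmetic flags from result */
    { RB_C,          RB_X,                      0 },  /* X flag is copied from C      */
    { RB_T1,         RB_D1,                     0 },  /* write the result back to D1  */
};

If the next instruction overwrites N, Z, V, C and X without reading them first, the two flag macroblocks produce nothing that anyone needs - which is exactly what the pass sketched further below can detect.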

Going with the flow

For Petunia I already implemented a similar concept, but that one was limited to the flag usage and it is not able to split up the emulation of the arithmetic flags into individual flag registers: the flags are either emulated completely or removed completely. (Which means that even if we need only the C flag later, the rarely used X flag will be calculated too, usually after the C flag was already done. If you don't understand the reason behind this, don't worry - you would need to know more about 68k assembly.)

For E-UAE the implementation is radically different. For any macroblock there is the possibility of specifying each and every emulated register, flag or temporary register as a dependency (input) and/or as a result (output).

In the recent changes I implemented a rather simple solution for calculating the data-flow for each register after all the macroblocks have been collected for a block of instructions.

It is capable of finding out, for each macroblock, whether the produced results are relevant for the following instructions in the block or not. If not, then that macroblock can be eliminated from the compiled code, because no instruction depends on its results.
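Continuing the made-up sketch from above (again, only an illustration of the approach, not the actual implementation), the elimination itself can be a single backward pass over the collected macroblock list:

/* one backward pass over the macroblocks of a translated block;
   visible_at_end is the set of emulated registers/flags that must survive the block */
static void remove_dead_macroblocks(struct macroblock *mb, int count,
                                    unsigned int visible_at_end)
{
    unsigned int needed = visible_at_end;

    for (int i = count - 1; i >= 0; i--) {
        if ((mb[i].outputs & needed) == 0) {
            mb[i].removed = 1;        /* nothing later depends on its results */
        } else {
            needed &= ~mb[i].outputs; /* produced here, so not needed earlier */
            needed |= mb[i].inputs;   /* but its own inputs become needed     */
        }
    }
}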

 

How to use it

Although the optimization is meant to be completely safe (it won't remove any code that is essential), while the emulation is not stable enough there might still be some bugs in it. So, I introduced a new configuration option for turning it on/off, called: comp_optimize.
It replaces the comp_nf configuration option from the x86 JIT implementation, because it is not just about the flags (nf = no flags).

If it is set to true then the data-flow calculation is done and some macroblocks will be removed. By setting it to false the emulator compiles all the instructions fully into the buffer.

 

The results

And finally some speed tests... Compared to the previously published Mandelbrot test results the actual numbers are:

Interpretive: 108 seconds;
JIT compiled without optimization: 52 seconds;
JIT compiled with optimization: 32 seconds.

That is roughly 40% less run time in the case of this (heavily arithmetic) test.

The test system was: Micro AmigaOne (G3/800 MHz) - let's compare it to WinUAE that is running on my laptop (Intel Core i3 M350/2.27 GHz):

JIT compiled with all the possible optimizations turned on: 9 seconds.

It would be a tough job to compare these two computers, but I am pretty sure that my laptop is more than 3.5x faster than that poor old G3 machine.

I am really content with these results for now.