The Big E-UAE JIT blog: Lies, Damned Lies, Statistics and Benchmarks

Sunday, July 1, 2012

Lies, Damned Lies, Statistics and Benchmarks

Another nice batch of instructions was implemented in the recent update. This is an important milestone: most of the instructions used in the Mandelbrot test are now JIT compiled. Why is it important?

Because finally we can do some benchmarking!

Source: http://wumocomicstrip.com/2012/04/16

The update

So, the changes in this update are:

Implementation of
ADD.x #imm,Dy,
ADD.x regy,Dz,
ADDA.x regy,Az,
ADDI.x #immq,Dy,
AND.x #imm,Dy,
ANDI.x #immq,Dy,
ASL.x #imm,Dy,
ASR.x #imm,Dy,
BTST.L #imm,Dx,
CMP.x #imm,Dy/Ay,
CMPA.W #imm,Ay,
CMPI.x #immq,Dy,
EOR.x #imm,Dy,
EORI.x #immq,Dy,
LSR.x #imm,Dy,
MULS.W Dx,Dy,
MULU.W Dx,Dy,
ORI.x #immq,Dy,
ROL.x #imm,Dy,
SUB.x regy,Dz,
TST.x reg instructions.
Fixed OR.x reg,reg flag dependency.
Recycling of mapped temporary registers is implemented.
Added the possibility of inverting C flag after extracting.
CMP.L #imm,Ax and CMP.W #imm,Ax were removed from the table: there are no such instructions.

Phew, that was a nice long list.

Recycling of the temporary registers

While I was running amok implementing the selected instructions for the Mandelbrot test, I had to face the inevitable: the shortage of temporary registers. When I implemented the register allocation subsystem I tried to postpone the fix for this problem, but now that a lot more instructions are implemented, the contiguous translated chunks are getting longer. Suddenly, I got the error message that I wrote a couple of months ago:

“Error: JIT compiler ran out of free temporary registers”

This error probably needs some explanation. Let’s start with how the registers are handled in the JIT engine:

In Petunia the emulated 68k registers are statically mapped to PowerPC registers. This approach has its benefits: constructing the code that handles the emulated registers is much easier. But it also has some major issues, especially because all registers must be up to date at any time. Whenever execution leaves the compiled code, all registers must be saved and restored to the previous state. Needless to say how long that takes: more than 20 registers must be saved.

In E-UAE this would be a no-go. The compiled code consists of smaller contiguous chunks, and the support functions from the environment must be called quite often. The rest of the emulator was written in C and it expects the registers to be in a certain state at any given time (which is described in detail in the PowerPC SysV ABI document).

How to overcome these issues? First, I realized that there is no point in keeping the contents of all the emulated registers in real registers all the time. The usage of a given emulated register is mostly localized to a range of code that consists of closely tied instructions.

Then I figured out from the SysV ABI that I can safely use a range of registers that need no saving/restoring, and that I can also make use of the non-volatile registers, which I have to restore, because any other called function must restore them too.
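To make the volatile/non-volatile split concrete, here is a small sketch of how the general purpose registers are classified in the 32-bit PowerPC SysV ABI. The function and type names are made up for illustration, and the post does not say which of these registers the JIT actually hands out, so treat this as a reading aid rather than the allocator's real table.

```c
#include <assert.h>

/* Classification of the 32 GPRs per the 32-bit PowerPC SysV ABI.
 * Names below are hypothetical, chosen only for this sketch. */
typedef enum { GPR_VOLATILE, GPR_NONVOLATILE, GPR_DEDICATED } GprClass;

GprClass classify_gpr(int r)
{
    assert(r >= 0 && r <= 31);
    /* r1 = stack pointer, r2 = system reserved, r13 = small data pointer */
    if (r == 1 || r == 2 || r == 13)
        return GPR_DEDICATED;
    /* r0 and r3..r12 may be clobbered by any called function */
    if (r == 0 || (r >= 3 && r <= 12))
        return GPR_VOLATILE;
    /* r14..r31 must be preserved by every callee */
    return GPR_NONVOLATILE;
}

/* Count how many GPRs fall into the given class. */
int count_gprs(GprClass wanted)
{
    int r, n = 0;
    for (r = 0; r < 32; r++)
        if (classify_gpr(r) == wanted)
            n++;
    return n;
}
```

Volatile registers are free to use but do not survive a call into the C part of the emulator; non-volatile ones survive calls but have to be put back before returning to the caller.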

I ended up implementing a dynamically allocated register serving system that can handle the needs of the compiler.

There are two types of registers: the temporary registers, which are used within one instruction only for a specific role, and the mapped 68k registers, which are loaded once and kept until they must be released because the execution leaves the compiled code.

Back to the error message: eventually the free registers run out, because the compiler keeps the mapped 68k registers between instructions. I had to find a way to force the release of a previously mapped register when there are no free registers left. I also had to make sure that I wouldn’t release a register that was allocated just recently and is still needed for compiling the current instruction (or for the following instructions that might depend on the same 68k register).

This issue was resolved by a simple locking mechanism for the mapped registers: when a register is mapped, it is locked at the same time for the current instruction. At the end of an instruction all mapped 68k registers are unlocked but not released.
When a release has to be forced, I look for an unlocked mapped register and release it.
The search for the next unlocked register proceeds in a round-robin manner, to make sure I won’t release-allocate-release the same register all the time.
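The lock-and-recycle scheme can be sketched in a few lines of C. This is not the E-UAE source: the pool size, the structure layout and all function names are assumptions made for the example, and the place where a real JIT would emit a store of the evicted register is only marked with a comment.

```c
#include <assert.h>
#include <stdbool.h>

#define NUM_HOST_REGS 4    /* hypothetical pool size, kept tiny for the demo */
#define REG_FREE      (-1)

typedef struct {
    int  mapped_68k;       /* emulated 68k register held here, or REG_FREE */
    bool locked;           /* true while the current instruction needs it */
} HostReg;

typedef struct {
    HostReg regs[NUM_HOST_REGS];
    int     next_victim;   /* where the round-robin search starts */
} RegPool;

void pool_init(RegPool *p)
{
    int i;
    for (i = 0; i < NUM_HOST_REGS; i++) {
        p->regs[i].mapped_68k = REG_FREE;
        p->regs[i].locked = false;
    }
    p->next_victim = 0;
}

/* Map a 68k register to a host register slot and lock it for the current
 * instruction. Returns the slot index, or -1 if every slot is locked
 * (the "ran out of free temporary registers" case). */
int pool_map(RegPool *p, int m68k_reg)
{
    int i, n;

    /* Already mapped: just lock it again. */
    for (i = 0; i < NUM_HOST_REGS; i++) {
        if (p->regs[i].mapped_68k == m68k_reg) {
            p->regs[i].locked = true;
            return i;
        }
    }
    /* Prefer a slot that is still free. */
    for (i = 0; i < NUM_HOST_REGS; i++) {
        if (p->regs[i].mapped_68k == REG_FREE) {
            p->regs[i].mapped_68k = m68k_reg;
            p->regs[i].locked = true;
            return i;
        }
    }
    /* Forced release: take the next unlocked slot in round-robin order. */
    for (n = 0; n < NUM_HOST_REGS; n++) {
        i = (p->next_victim + n) % NUM_HOST_REGS;
        if (!p->regs[i].locked) {
            /* A real JIT would emit a store of the old 68k register here. */
            p->regs[i].mapped_68k = m68k_reg;
            p->regs[i].locked = true;
            p->next_victim = (i + 1) % NUM_HOST_REGS;
            return i;
        }
    }
    return -1;
}

/* End of one 68k instruction: unlock everything, release nothing. */
void pool_end_instruction(RegPool *p)
{
    int i;
    for (i = 0; i < NUM_HOST_REGS; i++)
        p->regs[i].locked = false;
}
```

Advancing `next_victim` past the slot that was just recycled is what keeps two consecutive forced releases from hitting the same host register.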

This system looks robust enough to handle the needs of the compiler.

Back to the lies… errr… benchmarks

As I already mentioned, the implementation has reached the point where almost all instructions needed by the Mandelbrot test are implemented. So let’s see how the compiled code performs.

The test system consists of a Micro AmigaOne (G3/800 MHz), a slightly outdated AmigaOS4 version and my beloved Galaxy S2, which acts as a stopwatch. (Yes, I measure the time with a stopwatch. I didn’t want to spend too much effort on figuring out how to measure the time more precisely.)

Running the Mandelbrot test, I got the following results:

Interpretive: ~6 seconds,
JIT compiled: ~3 seconds.

Not too precise, but it seems the JIT compiler is doing nicely: the compiled code finished twice as fast as the interpretive one.

To confirm the results I created a new version of the Mandelbrot test that needs a lot more processor power to complete. You can find it among the test kickstarts; it is called mandel_though_hw.kick. The code is essentially the same as the previous Mandelbrot test, but I tweaked the parameters a bit: I zoomed in and adjusted the drop-out threshold, so the picture is more detailed.

Honey! The Mandelbrot is almost done in the oven!

The results are:

Interpretive: 108 seconds,
JIT compiled: 52 seconds.

Yay! It is indeed twice as fast! Rejoice!
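For anyone who wants the ratio rather than the raw stopwatch readings, the arithmetic is simple enough to write down. The helper names are made up for this sketch; the only inputs are the timings quoted above.

```c
#include <assert.h>

/* Speedup factor: how many times faster the JIT run is. */
double speedup(double interpretive_sec, double jit_sec)
{
    assert(jit_sec > 0.0);
    return interpretive_sec / jit_sec;
}

/* Fraction of the run time that was saved (0.5 means half the time). */
double time_saved(double interpretive_sec, double jit_sec)
{
    assert(interpretive_sec > 0.0);
    return 1.0 - jit_sec / interpretive_sec;
}
```

With the numbers from the longer test, `speedup(108, 52)` comes out at roughly 2.08 and `time_saved(108, 52)` at roughly 0.52, which agrees with the rough "twice as fast" from the short test, as far as stopwatch precision allows.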

Disclaimer

Why we must not jump to any conclusions just yet regarding the speed...

First of all, not all instructions are implemented yet. Implementing the rest will most likely mean a further speed increase, because the JIT instructions are always faster than, or at least as fast as, the interpretive implementation. When the emulation hits an unimplemented instruction, it has to call the interpretive implementation, which essentially means storing all the emulated registers in memory and restarting the register mapping.
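The cost of an unimplemented instruction can be illustrated with a toy model. Everything here is hypothetical (the state structure, the per-instruction arrays, the function names); it only counts how often the mapped registers would have to be written back, it does not generate any code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct {
    unsigned mapped_mask;  /* bit n set: 68k register n lives in a host register */
    int      flush_count;  /* how many times mapped registers were written back */
} JitState;

/* Write back every mapped register and forget the mapping. */
static void flush_all(JitState *s)
{
    if (s->mapped_mask != 0) {
        s->mapped_mask = 0;
        s->flush_count++;
    }
}

/* Walk a block of instructions. has_jit_handler[i] says whether instruction i
 * can be compiled; uses_reg[i] is the 68k register (0..15) it touches.
 * Returns the number of instructions that fell back to the interpreter. */
int translate_block(JitState *s, const bool *has_jit_handler,
                    const int *uses_reg, size_t count)
{
    size_t i;
    int fallbacks = 0;

    for (i = 0; i < count; i++) {
        if (has_jit_handler[i]) {
            /* Compiled: the register stays mapped for later instructions. */
            s->mapped_mask |= 1u << uses_reg[i];
        } else {
            /* Not implemented: store everything, let the interpreter run
             * the instruction, then start mapping from scratch. */
            flush_all(s);
            fallbacks++;
        }
    }
    flush_all(s);  /* execution leaves the compiled code */
    return fallbacks;
}
```

In this model a single missing instruction in the middle of a block doubles the number of write-backs, which is why filling in the remaining instructions should help.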

Also, the register flow optimization is not implemented yet. That will certainly be a big speed boost, because it will eliminate lots of instructions from the compiled code.

On the other hand, this test relies mostly on rather simple, computation-heavy instructions. When the test code does much more memory access, the difference between the interpretive and the JIT compiled code will almost certainly be much smaller.

Anyway, I hope I have assured all of you that we are heading in the right direction.

Full speed ahead! (Don’t mind the mines.)

25 comments:

Anonymous, 02 July, 2012 06:31
I tried to compile it on MorphOS and after some configure hiccups I was able to do it. With JIT, the emulator crashes with a normal kickstart and a game. Maybe I can find out why this happens. Next time I will try your Mandelbrot test.
Nice implementation if you gained approx. 50% for now. Respect. Even if it shrinks to 30%, this is still great.

Timmo, 03 July, 2012 01:33
Very exciting!
A question for you - there are certain implementations of the PPC that have ever so slightly different instruction sets and registers. Is this an issue for your dynamic register serving implementation? (I couldn't work out if you're handling the PPC registers directly, or if you're leaving it up to the compiler)

I'm specifically talking about the Cell processor used in the PS3 - though I expect in most ways this doesn't present an issue.

Anonymous, 03 July, 2012 05:44
Okay, it seems that the initial "illegal instruction" makes the application hang/crash. If you switch to 68000 it will go through. On a 68020 it will crash without showing "illegal instruction", but B-Trap.
I found the code in newcpu.c and commented the B-Trap out for testing purposes. After that the emulator did not crash anymore, but the system just stopped booting. Everything else still worked. Maybe the bug is somewhere in the illegal op routines?
I found out that it went 10 times through the code you mentioned in compemu_support.c and did not crash within this code, but after leaving it for the 10th time.

Álmos Rajnai, 03 July, 2012 20:03
I just edited this post slightly. When I re-read it I found some annoying mistakes and way too complex sentences. I hope it is a bit clearer now.

(I know most of the readers are only interested in those four numbers in the middle... ;)

Anonymous, 04 July, 2012 22:20
Btw, your function ppc_cacheflush is not checked in yet. For my testing, I commented this line. The mandelbrot tests only show a colored frame but no graphics at all :( Hope I can fix it.

Anonymous, 06 July, 2012 23:41
Don't use the CacheClearE() function in MorphOS because it only clears 68k CPU caches (JIT). PPC caches are controlled using the following functions:

* CacheFlushDataArea
* CacheFlushDataInstArea
* CacheInvalidDataArea
* CacheInvalidInstArea
* CacheTrashDataArea

See the Exec autodocs for more information. CacheFlushDataInstArea is probably the call to make.

Anonymous, 07 July, 2012 19:46
Memory is allocated using MEMF_ANY because the NX bit is not used in MorphOS. To align the allocated code block you have to query the L1 instruction cache line size:

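/* The header names were lost from the original comment; these two are assumed. */
#include <stdio.h>
#include <proto/exec.h>

int main(void)
{
    ULONG cachelinesize;

    /* Ask exec for the L1 instruction cache line size of the PPC CPU. */
    if (NewGetSystemAttrsA(&cachelinesize, sizeof(cachelinesize), SYSTEMINFOTYPE_PPC_ICACHEL1LINESIZE, NULL))
    {
        printf("L1 instruction cache line size is %lu bytes.\n", (unsigned long)cachelinesize);
    }

    return 0;
}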

On my machine this would print "L1 instruction cache line size is 32 bytes.".

To allocate an aligned memory block you call AllocMemAligned() or AllocVecAligned() or even AllocPooledAligned():

void *ptr = AllocVecAligned(size, MEMF_ANY, l1_icache_linesize, 0);

The last parameter is the align offset, in case you need to store additional data before the aligned memory area begins.