The Big E-UAE JIT blog: Optimize It

Thursday, August 30, 2012

Optimize It

I am back from the holidays for sometime now, but I got swamped instantly by the work in my daytime job. I had very little energy on the project in the nights, which was spent on two things: chasing that #@!% bug which is killing me and implementing the data-flow optimization.

Needless to say that I failed to track down the bug again, just like every time I spent hours on looking at couple megabytes of log dumps. Next time I will try a new approach, suggested by Stephen Fellner: reducing the code complexity while the bug can be reproduced.
We will see how it goes. The processor cache emulation is certainly complicates things, that can be eliminated at least.

While I acknowledged the failure again, I gathered all my previous thoughts and implemented the data-flow based optimization which does a really nice job.

The update is small, but highly important again:

Implementation of code optimization.
New configuration was introduced: comp_optimize to turn it on/off.
Bugs in the register input/output flag specifications are fixed.
Fixed too small string buffer in the 68k disassembler, previously the debugger crashed every time when memory was disassembled to the screen.
Implementation of MOVEM.x regs,-(Ay)

What, how and why?

As I tried to explain earlier in this post some of the compiled code is completely useless. The emulated instruction consists of the following parts usually:

Initialization of certain temporary registers;
Executing the actual operation;
Alter the arithmetic flags according to the result;
Save the result somewhere (into memory)

Some of these operations are not common or can be done for a series of instructions ahead, like loading the previous data for the emulated registers into temporary registers, but there are parts that cannot be avoided if we examine the emulated instruction out of its context.

For example the arithmetic flags are overwritten quite often by the following instructions, what means: if the next instruction(s) are not depending directly on the flag results then we can remove the code that generates these.

Typically, if we (as mighty humans) look at the instructions in a block we can easily identify the context, we can tell mostly which instructions produce usable results and what is not important. Like in most of the cases simple for a human, but it is a really complex job for the code compiler.

As I previously described: the emulated code is broken up into macroblocks by the precompiling. These blocks represent an "atomic" operation, like load data into temporary register, compare two registers or calculate certain arithmetic flags from the previous result.
When I started working on this concept I figured out that if I would be able to identify exactly for each macroblocks the previous result(s) that it is depending on and the result(s) that it produces then I can evaluate the dependencies between the subsequent macroblocks.

Going with the flow

For Petunia I already implemented a similar concept, but that was limited to the flag usage and it is not able to split up the emulation of the arithmetic flags into individual flag registers: it is either emulated completely or removed completely. (Which means that even if we needed only the C register later the rarely used X register will be calculated too, usually after the C register was already done. If you don't understand the reason behind this: don't worry - you need to know more about 68k assembly.)

For E-UAE the implementation is radically different. For any macroblock there is the possibility of specifying each and every emulated register, flag or temporary register as dependecy (input) and/or as result (output).

In the recent changes I implemented the rather simple solution for calculating the data-flow for each registers after all macroblocks are collected for a block of instructions.

It is capable of finding out for each macroblock whether the produced results are relevant for the following instructions in the block or not. If not then that macroblock can be eliminated from the compiled code because no instruction is depending on its results.

How to use it

Although, the optimization is completely safe (it won't remove any code that is essential) while the emulation is not stable enough there might be some bugs. So, I introduced a new configuration for turning it on/off, called: comp_optimize.
It replaces the comp_nf configuration from the x86 JIT implementation, because it is not just about the flags (nf = no flags).

If it was set to true then the data-flow calculation is done and some macroblocks will be removed. By setting it to false the emulation compiles all the instructions fully into the buffer.

The results

And finally some speed tests... Compared to the previously published Mandelbrot test results the actual numbers are:

Interpretive: 108 seconds;
JIT compiled without optimization: 52 seconds;
JIT compiled with optimization: 32 seconds.

That is roughly 40% speed increase in the case of this (heavily arithmetic) test.

The test system was: Micro AmigaOne (G3/800 MHz) - let's compare it to WinUAE that is running on my laptop (Intel Core i3 M350/2.27 GHz):

JIT compiled with all the possible optimizations turned on: 9 seconds.

It would be a tough job to compare these two computers, but I am pretty sure that my laptop is more than 3.5x faster than that poor old G3 machine.

I am really content with these results for now.

12 comments:

Tobias31 August, 2012 11:33
Hi, I took the challenge to port your JIT to Mac OS X 10.5 (latest for PowerPC CPUs).
As the Darwin ABI is somewhat different to SysV I adapted prolog and epilog and also supplied the correct OS X cache flush function call.
Now the JIT actually compiles some code (seemingly ABI compliant) and calls it successfully. The code then branches successfully to custom_wput() and also returns from there correctly.
Unfortunately some branch targets are computed wrong (the one to custom_wput is obviously correct). Here the disassembled JIT code (stack handling is different on OS X):
0x074390c8: mflr r3
0x074390cc: stw r3,8(r1)
0x074390d0: stw r13,-4(r1)
0x074390d4: stw r14,-8(r1)
0x074390d8: stwu r1,-48(r1)
0x074390dc: lis r13,105
0x074390e0: ori r13,r13,43384
0x074390e4: lis r3,1819
0x074390e8: ori r3,r3,36890
0x074390ec: lwz r4,76(r13)
0x074390f0: cmplw r3,r4
0x074390f4: beq+ 0x7439118
0x074390f8: lwz r1,0(r1)
0x074390fc: lwz r0,8(r1)
0x07439100: mtlr r0
0x07439104: lwz r13,-4(r1)
0x07439108: lwz r14,-8(r1)
0x0743910c: lis r3,1819
0x07439110: ori r3,r3,36890
0x07439114: b 0x826ead0
0x07439118: lwz r14,64(r13)
0x0743911c: lwz r3,68(r13)
0x07439120: rlwimi r14,r3,16,26,26
0x07439124: lwz r3,4(r13)
0x07439128: li r4,1
0x0743912c: addco. r3,r3,r4
0x07439130: mcrxr cr2
0x07439134: mfcr r14
0x07439138: rlwimi r14,r14,16,26,26
0x0743913c: lwz r4,32(r13)
0x07439140: addi r5,r4,384
0x07439144: extsh. r0,r3
0x07439148: mfcr r6
0x0743914c: rlwimi r14,r6,0,0,2
0x07439150: rlwinm r14,r14,0,11,8
0x07439154: stw r3,4(r13)
0x07439158: mr r4,r3
0x0743915c: mr r3,r5
0x07439160: rlwinm r0,r3,18,14,29
0x07439164: lis r5,109
0x07439168: ori r5,r5,43688
0x0743916c: lwzx r5,r5,r0
0x07439170: lwz r5,16(r5)
0x07439174: mtlr r5
0x07439178: blrl
0x0743917c: stw r14,64(r13)
0x07439180: sth r14,68(r13)
0x07439184: lis r3,1819
0x07439188: ori r3,r3,36896
0x0743918c: stw r3,76(r13)
0x07439190: li r3,24824
0x07439194: lis r4,105
0x07439198: ori r4,r4,43384
0x0743919c: bl 0x80dc1f0
0x074391a0: stw r14,64(r13)
0x074391a4: sth r14,68(r13)
0x074391a8: li r3,6656
0x074391ac: bl 0x826ed40
0x074391b0: lwz r1,0(r1)
0x074391b4: lwz r0,8(r1)
0x074391b8: mtlr r0
0x074391bc: lwz r13,-4(r1)
0x074391c0: lwz r14,-8(r1)
0x074391c4: blr
End of assembler dump.

All the branches to absolute addresses are incorrect. Any ideas?

Thanks for your great effort so far!
Tobias
ReplyDelete
Replies
Tobias31 August, 2012 22:30
Well, not all the absolute branches are incorrect; the local branch "beq+ 0x7439118" is correct.
So it branches at 0x074390f4 to 0x7439118, continues until 0x07439178 where it branches to custom_wput. After returning from custom_wput it continues at 0x0743917c and continues until 0x0743919c where it branches to some code of some other dynamic library where sooner or later crashes.

The console log shows:
JIT: Flush icache soft (1/0/0x0)
JIT: Flush icache hard (soft cache flush/0/0x0)
JIT: Flush icache soft (1/0/0x0)
JIT: Flush icache hard (soft cache flush/0/0x0)
JIT: Flush icache soft (1/0/0x0)
JIT: Flush icache hard (soft cache flush/0/0x0)
JIT: Flush icache soft (1/0/0x0)
JIT: Flush icache hard (soft cache flush/0/0x0)
JIT: Flush icache soft (1/0/0x0)
JIT: Flush icache hard (soft cache flush/0/0x0)
JIT: Flush icache soft (1/0/0x0)
JIT: Flush icache hard (soft cache flush/0/0x0)
JIT: Flush icache soft (1/0/0x0)
JIT: Flush icache hard (soft cache flush/0/0x0)
JIT: Flush icache soft (1/0/0x0)
JIT: Flush icache hard (soft cache flush/0/0x0)
JIT: Flush icache soft (1/0/0x0)
JIT: Flush icache hard (soft cache flush/0/0x0)
Building CPU table for configuration: 68020 prefetch
1866 CPU functions
Building CPU function table, 45674 opcodes (2 1 0).
JIT: Building compiler function table.
JIT: Allocation of translation cache...
JIT: Translation cache size in prefs: 8192
JIT: Allocated 8192 KB translation cache.
JIT: Flush icache soft (1/0/0x7239000)
JIT: Flush icache hard (soft cache flush/0/0x7239000)
JIT: Flush icache soft (1/0/0x7239000)
JIT: Flush icache hard (soft cache flush/0/0x7239000)
JIT: Flush icache soft (1/0/0x7239000)
JIT: Flush icache hard (soft cache flush/0/0x7239000)
JIT: Flush icache soft (1/0/0x7239000)
JIT: Flush icache hard (soft cache flush/0/0x7239000)
JIT: Flush icache soft (1/0/0x7239000)
JIT: Flush icache hard (soft cache flush/0/0x7239000)
JIT: Flush icache soft (1/0/0x7239000)
JIT: Flush icache hard (soft cache flush/0/0x7239000)
JIT: Flush icache soft (1/0/0x7239000)
JIT: Flush icache hard (soft cache flush/0/0x7239000)
JIT: Flush icache soft (1/0/0x7239000)
JIT: Flush icache hard (soft cache flush/0/0x7239000)
JIT: Flush icache soft (1/0/0x7239000)
JIT: Flush icache hard (soft cache flush/0/0x7239000)
reset at 0
Kickstart version is below 1.3! Disabling autoconfig devices.
JIT: Flush icache soft (1/0/0x7239000)
JIT: Flush icache hard (soft cache flush/0/0x7239000)
JIT: Compiling reset
JIT: Change cache emulation: disabled
PAL mode, 50Hz (h=227 v=312)
chipmem cleared
JIT: Change cache emulation: enabled
JIT: Flush icache hard (cache state change/8/0x71b900e)
JIT: Flush icache soft (1/8/0x71b900e)
JIT: Flush icache hard (soft cache flush/8/0x71b900e)
JIT: JIT: compile code, pc: 00000008, block length: 7, total cycles: 15872
JIT: Compiling enabled, block length: 7
JIT: JIT: compile code, pc: 0000001a, block length: 3, total cycles: 6656
JIT: Compiling enabled, block length: 3
JIT: JIT: compile code, pc: 0000001a, block length: 3, total cycles: 6656
JIT: Compiling enabled, block length: 3
JIT: JIT: compile code, pc: 0000001a, block length: 3, total cycles: 6656
JIT: Compiling enabled, block length: 3
JIT: JIT: compile code, pc: 0000001a, block length: 3, total cycles: 6656
JIT: Compiling enabled, block length: 3
JIT: JIT: compile code, pc: 0000001a, block length: 3, total cycles: 6656
JIT: Compiling enabled, block length: 3
JIT: Init compiling
JIT: Compiled code start: 074390c8
JIT: Comp: 0000001a 5281 ADD.L #$00000001,D1
JIT: Comp: 0000001c 3141 0180 MOVE.W D1,(A0,$0180) == $00dff180
JIT: Comp: 00000020 60f8 BT.B #$fffffff8 == 0000001a (TRUE)
JIT: Unsupported opcode: 0x60f8
JIT: Done compiling
ReplyDelete
Replies
Joeled01 September, 2012 21:28
@tobias
Dont forget to donate to the bounty.
http://amigabounty.net/index.php?function=viewproject&projectid=35
ReplyDelete
Replies
Tobias02 September, 2012 00:35
Speed numbers are as follows on a PowerBook G4 12" 1.5 GHz, running OS X 10.5.8:
(Drawing has always been the bottleneck for UAE on OS X. The latest E-UAE came with experimental OpenGL support.)
SDL OpenGL
2:40 1:01 Interpretive
1:17 0:29 JIT compiled without optimization
0:50 0:18 JIT compiled with optimization

I had sound emulation turned off, cpu speed set to maximum, emulating a 68EC020, 1024 kB JIT cache.
Using a big cache size like 32 MB slows the JIT down significantly, maybe because then relative branches cannot be used?
ReplyDelete
Replies
Unknown28 June, 2013 11:00
This blog is really informative i really had fun reading it.
ReplyDelete
Replies

Add comment