This project is dedicated to the memory of William Morris (aka Frags), who was the main contributor to the bounty but was unable to see the final result.

Thursday, August 30, 2012

Optimize It

I am back from the holidays for sometime now, but I got swamped instantly by the work in my daytime job. I had very little energy on the project in the nights, which was spent on two things: chasing that #@!% bug which is killing me and implementing the data-flow optimization.

Needless to say that I failed to track down the bug again, just like every time I spent hours on looking at couple megabytes of log dumps. Next time I will try a new approach, suggested by Stephen Fellner: reducing the code complexity while the bug can be reproduced.
We will see how it goes. The processor cache emulation is certainly complicates things, that can be eliminated at least.

While I acknowledged the failure again, I gathered all my previous thoughts and implemented the data-flow based optimization which does a really nice job.

The update is small, but highly important again:
  • Implementation of code optimization.
  • New configuration was introduced: comp_optimize to turn it on/off.
  • Bugs in the register input/output flag specifications are fixed.
  • Fixed too small string buffer in the 68k disassembler, previously the debugger crashed every time when memory was disassembled to the screen.
  • Implementation of MOVEM.x regs,-(Ay)

What, how and why?

As I tried to explain earlier in this post some of the compiled code is completely useless. The emulated instruction consists of the following parts usually:
  • Initialization of certain temporary registers;
  • Executing the actual operation;
  • Alter the arithmetic flags according to the result;
  • Save the result somewhere (into memory)
Some of these operations are not common or can be done for a series of instructions ahead, like loading the previous data for the emulated registers into temporary registers, but there are parts that cannot be avoided if we examine the emulated instruction out of its context.

For example the arithmetic flags are overwritten quite often by the following instructions, what means: if the next instruction(s) are not depending directly on the flag results then we can remove the code that generates these.

Typically, if we (as mighty humans) look at the instructions in a block we can easily identify the context, we can tell mostly which instructions produce usable results and what is not important. Like in most of the cases simple for a human, but it is a really complex job for the code compiler.

As I previously described: the emulated code is broken up into macroblocks by the precompiling. These blocks represent an "atomic" operation, like load data into temporary register, compare two registers or calculate certain arithmetic flags from the previous result.
When I started working on this concept I figured out that if I would be able to identify exactly for each macroblocks the previous result(s) that it is depending on and the result(s) that it produces then I can evaluate the dependencies between the subsequent macroblocks.

Going with the flow

For Petunia I already implemented a similar concept, but that was limited to the flag usage and it is not able to split up the emulation of the arithmetic flags into individual flag registers: it is either emulated completely or removed completely. (Which means that even if we needed only the C register later the rarely used X register will be calculated too, usually after the C register was already done. If you don't understand the reason behind this: don't worry - you need to know more about 68k assembly.)

For E-UAE the implementation is radically different. For any macroblock there is the possibility of specifying each and every emulated register, flag or temporary register as dependecy (input) and/or as result (output).

In the recent changes I implemented the rather simple solution for calculating the data-flow for each registers after all macroblocks are collected for a block of instructions.

It is capable of finding out for each macroblock whether the produced results are relevant for the following instructions in the block or not. If not then that macroblock can be eliminated from the compiled code because no instruction is depending on its results.

 

How to use it

Although, the optimization is completely safe (it won't remove any code that is essential) while the emulation is not stable enough there might be some bugs. So, I introduced a new configuration for turning it on/off, called: comp_optimize.
It replaces the comp_nf configuration from the x86 JIT implementation, because it is not just about the flags (nf = no flags).

If it was set to true then the data-flow calculation is done and some macroblocks will be removed. By setting it to false the emulation compiles all the instructions fully into the buffer.

 

The results

And finally some speed tests... Compared to the previously published Mandelbrot test results the actual numbers are:

Interpretive: 108 seconds;
JIT compiled without optimization: 52 seconds;
JIT compiled with optimization: 32 seconds.

That is roughly 40% speed increase in the case of this (heavily arithmetic) test.

The test system was: Micro AmigaOne (G3/800 MHz) - let's compare it to WinUAE that is running on my laptop (Intel Core i3 M350/2.27 GHz):

JIT compiled with all the possible optimizations turned on: 9 seconds.

It would be a tough job to compare these two computers, but I am pretty sure that my laptop is more than 3.5x faster than that poor old G3 machine.

I am really content with these results for now.

12 comments:

  1. Hi, I took the challenge to port your JIT to Mac OS X 10.5 (latest for PowerPC CPUs).
    As the Darwin ABI is somewhat different to SysV I adapted prolog and epilog and also supplied the correct OS X cache flush function call.
    Now the JIT actually compiles some code (seemingly ABI compliant) and calls it successfully. The code then branches successfully to custom_wput() and also returns from there correctly.
    Unfortunately some branch targets are computed wrong (the one to custom_wput is obviously correct). Here the disassembled JIT code (stack handling is different on OS X):
    0x074390c8: mflr r3
    0x074390cc: stw r3,8(r1)
    0x074390d0: stw r13,-4(r1)
    0x074390d4: stw r14,-8(r1)
    0x074390d8: stwu r1,-48(r1)
    0x074390dc: lis r13,105
    0x074390e0: ori r13,r13,43384
    0x074390e4: lis r3,1819
    0x074390e8: ori r3,r3,36890
    0x074390ec: lwz r4,76(r13)
    0x074390f0: cmplw r3,r4
    0x074390f4: beq+ 0x7439118
    0x074390f8: lwz r1,0(r1)
    0x074390fc: lwz r0,8(r1)
    0x07439100: mtlr r0
    0x07439104: lwz r13,-4(r1)
    0x07439108: lwz r14,-8(r1)
    0x0743910c: lis r3,1819
    0x07439110: ori r3,r3,36890
    0x07439114: b 0x826ead0
    0x07439118: lwz r14,64(r13)
    0x0743911c: lwz r3,68(r13)
    0x07439120: rlwimi r14,r3,16,26,26
    0x07439124: lwz r3,4(r13)
    0x07439128: li r4,1
    0x0743912c: addco. r3,r3,r4
    0x07439130: mcrxr cr2
    0x07439134: mfcr r14
    0x07439138: rlwimi r14,r14,16,26,26
    0x0743913c: lwz r4,32(r13)
    0x07439140: addi r5,r4,384
    0x07439144: extsh. r0,r3
    0x07439148: mfcr r6
    0x0743914c: rlwimi r14,r6,0,0,2
    0x07439150: rlwinm r14,r14,0,11,8
    0x07439154: stw r3,4(r13)
    0x07439158: mr r4,r3
    0x0743915c: mr r3,r5
    0x07439160: rlwinm r0,r3,18,14,29
    0x07439164: lis r5,109
    0x07439168: ori r5,r5,43688
    0x0743916c: lwzx r5,r5,r0
    0x07439170: lwz r5,16(r5)
    0x07439174: mtlr r5
    0x07439178: blrl
    0x0743917c: stw r14,64(r13)
    0x07439180: sth r14,68(r13)
    0x07439184: lis r3,1819
    0x07439188: ori r3,r3,36896
    0x0743918c: stw r3,76(r13)
    0x07439190: li r3,24824
    0x07439194: lis r4,105
    0x07439198: ori r4,r4,43384
    0x0743919c: bl 0x80dc1f0
    0x074391a0: stw r14,64(r13)
    0x074391a4: sth r14,68(r13)
    0x074391a8: li r3,6656
    0x074391ac: bl 0x826ed40
    0x074391b0: lwz r1,0(r1)
    0x074391b4: lwz r0,8(r1)
    0x074391b8: mtlr r0
    0x074391bc: lwz r13,-4(r1)
    0x074391c0: lwz r14,-8(r1)
    0x074391c4: blr
    End of assembler dump.

    All the branches to absolute addresses are incorrect. Any ideas?

    Thanks for your great effort so far!
    Tobias

    ReplyDelete
  2. Well, not all the absolute branches are incorrect; the local branch "beq+ 0x7439118" is correct.
    So it branches at 0x074390f4 to 0x7439118, continues until 0x07439178 where it branches to custom_wput. After returning from custom_wput it continues at 0x0743917c and continues until 0x0743919c where it branches to some code of some other dynamic library where sooner or later crashes.

    The console log shows:
    JIT: Flush icache soft (1/0/0x0)
    JIT: Flush icache hard (soft cache flush/0/0x0)
    JIT: Flush icache soft (1/0/0x0)
    JIT: Flush icache hard (soft cache flush/0/0x0)
    JIT: Flush icache soft (1/0/0x0)
    JIT: Flush icache hard (soft cache flush/0/0x0)
    JIT: Flush icache soft (1/0/0x0)
    JIT: Flush icache hard (soft cache flush/0/0x0)
    JIT: Flush icache soft (1/0/0x0)
    JIT: Flush icache hard (soft cache flush/0/0x0)
    JIT: Flush icache soft (1/0/0x0)
    JIT: Flush icache hard (soft cache flush/0/0x0)
    JIT: Flush icache soft (1/0/0x0)
    JIT: Flush icache hard (soft cache flush/0/0x0)
    JIT: Flush icache soft (1/0/0x0)
    JIT: Flush icache hard (soft cache flush/0/0x0)
    JIT: Flush icache soft (1/0/0x0)
    JIT: Flush icache hard (soft cache flush/0/0x0)
    Building CPU table for configuration: 68020 prefetch
    1866 CPU functions
    Building CPU function table, 45674 opcodes (2 1 0).
    JIT: Building compiler function table.
    JIT: Allocation of translation cache...
    JIT: Translation cache size in prefs: 8192
    JIT: Allocated 8192 KB translation cache.
    JIT: Flush icache soft (1/0/0x7239000)
    JIT: Flush icache hard (soft cache flush/0/0x7239000)
    JIT: Flush icache soft (1/0/0x7239000)
    JIT: Flush icache hard (soft cache flush/0/0x7239000)
    JIT: Flush icache soft (1/0/0x7239000)
    JIT: Flush icache hard (soft cache flush/0/0x7239000)
    JIT: Flush icache soft (1/0/0x7239000)
    JIT: Flush icache hard (soft cache flush/0/0x7239000)
    JIT: Flush icache soft (1/0/0x7239000)
    JIT: Flush icache hard (soft cache flush/0/0x7239000)
    JIT: Flush icache soft (1/0/0x7239000)
    JIT: Flush icache hard (soft cache flush/0/0x7239000)
    JIT: Flush icache soft (1/0/0x7239000)
    JIT: Flush icache hard (soft cache flush/0/0x7239000)
    JIT: Flush icache soft (1/0/0x7239000)
    JIT: Flush icache hard (soft cache flush/0/0x7239000)
    JIT: Flush icache soft (1/0/0x7239000)
    JIT: Flush icache hard (soft cache flush/0/0x7239000)
    reset at 0
    Kickstart version is below 1.3! Disabling autoconfig devices.
    JIT: Flush icache soft (1/0/0x7239000)
    JIT: Flush icache hard (soft cache flush/0/0x7239000)
    JIT: Compiling reset
    JIT: Change cache emulation: disabled
    PAL mode, 50Hz (h=227 v=312)
    chipmem cleared
    JIT: Change cache emulation: enabled
    JIT: Flush icache hard (cache state change/8/0x71b900e)
    JIT: Flush icache soft (1/8/0x71b900e)
    JIT: Flush icache hard (soft cache flush/8/0x71b900e)
    JIT: JIT: compile code, pc: 00000008, block length: 7, total cycles: 15872
    JIT: Compiling enabled, block length: 7
    JIT: JIT: compile code, pc: 0000001a, block length: 3, total cycles: 6656
    JIT: Compiling enabled, block length: 3
    JIT: JIT: compile code, pc: 0000001a, block length: 3, total cycles: 6656
    JIT: Compiling enabled, block length: 3
    JIT: JIT: compile code, pc: 0000001a, block length: 3, total cycles: 6656
    JIT: Compiling enabled, block length: 3
    JIT: JIT: compile code, pc: 0000001a, block length: 3, total cycles: 6656
    JIT: Compiling enabled, block length: 3
    JIT: JIT: compile code, pc: 0000001a, block length: 3, total cycles: 6656
    JIT: Compiling enabled, block length: 3
    JIT: Init compiling
    JIT: Compiled code start: 074390c8
    JIT: Comp: 0000001a 5281 ADD.L #$00000001,D1
    JIT: Comp: 0000001c 3141 0180 MOVE.W D1,(A0,$0180) == $00dff180
    JIT: Comp: 00000020 60f8 BT.B #$fffffff8 == 0000001a (TRUE)
    JIT: Unsupported opcode: 0x60f8
    JIT: Done compiling

    ReplyDelete
    Replies
    1. Hey Tobias,

      Thanks for spending time on the Mac support. I hope you will succeed for the benefit of every PPC Mac user.

      The library function you are referring to is this one: do_cycles_callback(), it calculates the state of the used CPU cycles after the compiled block was executed. As far as I can tell that function does not do anything complex, but probably it is depending on some (more) registers than the custom_wput() function.

      Have you checked the register layout? If the ABI is different then you have to make sure that the non-volatile registers are preserved. Have a look on this file:
      http://sourceforge.net/p/euaeppcjit/code-0/18/tree/trunk/src/include/compemu.h
      From line 75 you can see the registers.

      If you turn on the comp_log_compiled option in the config file then you can get more details about the compiled code.

      Delete
    2. Ah, so "bl 0x80dc1f0" should branch to do_cycles_callback()? The calculated address is wrong; it ends up jumping into foreign code.

      I had also changed the register layout as you might see in the generated code as well; r13 is the first preserved register on Darwin and r2 is available for general use (that you don't see in the code).

      I'll investigate somewhat further, thanks!

      Delete
    3. iamalive.kick is working!
      The branch target address calculation was working for you because obviously on your OS dynamically allocated memory has lower addresses than the loaded executable files.
      I had to add a check for correcting the calculated address offset if allocated memory has higher addresses, like in the following snippet:

      //Calculate the offset to the target from the actual PC address
      uae_u32 offset = ((uae_u32) addr) - ((uae_u32) current_compile_p);
      + if ((uae_u32)addr < (uae_u32)current_compile_p)
      + offset = ~offset;

      So, iamalive is working with JIT and Kickstart 3.1 hangs after "B-Trap f201" with a completely pink screen (is that the expected behaviour until now?).
      BUT no matter if JIT is enabled or not (by setting cachesize to 0) mandel_hw only shows a completely green screen and mandel_though_hw a completely browny orange screen.
      In my uaerc I have only set cpu_type (to at least 60ec020) kickstart_rom_file and cachesize.

      Delete
    4. Mandels are now drawing correctly; they need AGA chipset.
      I'd like to compare benchmark results; what config settings are you using?

      Delete
    5. I am glad to hear that you managed fixing up the emulator.

      The settings wouldn't matter much, just give it enough cache and turn off the logging. I have tested whether the direct memory access speeds up the test, but it has not that much effect on it.

      I measured the time from the second when the first line appears until it is finished the last line.

      I would like to hear about the speed. :)

      Delete
  3. @tobias
    Dont forget to donate to the bounty.
    http://amigabounty.net/index.php?function=viewproject&projectid=35

    ReplyDelete
  4. Speed numbers are as follows on a PowerBook G4 12" 1.5 GHz, running OS X 10.5.8:
    (Drawing has always been the bottleneck for UAE on OS X. The latest E-UAE came with experimental OpenGL support.)
    SDL OpenGL
    2:40 1:01 Interpretive
    1:17 0:29 JIT compiled without optimization
    0:50 0:18 JIT compiled with optimization

    I had sound emulation turned off, cpu speed set to maximum, emulating a 68EC020, 1024 kB JIT cache.
    Using a big cache size like 32 MB slows the JIT down significantly, maybe because then relative branches cannot be used?

    ReplyDelete
    Replies
    1. Hm, the SDL version is really slow, but the OpenGL is doing very well.

      I have no explanation why would the 32 MB cache size slow down the emulation. Probably you are right about the relative branches: the allocated memory is farther away from the main executable. But my experience is that switching off the relative branches with address loading and branching has less impact on the overall speed of the PPC code. Probably it depends on a number of factors, like memory handling and processor type, though.

      We can test this by turning off the relative branching and examine the results. Just comment out lines between 2050-2055 and 2084-2090 in this source:

      http://sourceforge.net/p/euaeppcjit/code-0/18/tree/trunk/src/compemu_support.c

      Delete
    2. If you could send me a patch file then I would include the Mac support to the sources.

      Delete
  5. This blog is really informative i really had fun reading it.

    ReplyDelete