This project is dedicated to the memory of William Morris (aka Frags), who was the main contributor to the bounty but was unable to see the final result.

Sunday, July 1, 2012

Lies, Damned Lies, Statistics and Benchmarks

Another nice batch of instructions was implemented in the recent update. This is an important milestone: most of the instructions used in the Mandelbrot test are now JIT compiled. Why is it important?

Because finally we can do some benchmarking!

Source: http://wumocomicstrip.com/2012/04/16

 

The update


So, the changes in this update are:
  • Implementation of
    ADD.x #imm,Dy,
    ADD.x regy,Dz,
    ADDA.x regy,Az,
    ADDI.x #immq,Dy,
    AND.x #imm,Dy,
    ANDI.x #immq,Dy,
    ASL.x #imm,Dy,
    ASR.x #imm,Dy,
    BTST.L #imm,Dx,
    CMP.x #imm,Dy/Ay,
    CMPA.W #imm,Ay,
    CMPI.x #immq,Dy,
    EOR.x #imm,Dy,
    EORI.x #immq,Dy,
    LSR.x #imm,Dy,
    MULS.W Dx,Dy,
    MULU.W Dx,Dy,
    ORI.x #immq,Dy,
    ROL.x #imm,Dy,
    SUB.x regy,Dz,
    TST.x reg instructions.
  • Fixed OR.x reg,reg flag dependency.
  • Recycling of mapped temporary registers is implemented.
  • Added the possibility of inverting the C flag after extracting it.
  • CMP.L #imm,Ax and CMP.W #imm,Ax were removed from the table because no such instructions exist.

Phew, that was a nice long list.

Recycling of the temporary registers

While I was running amok implementing the selected instructions for the Mandelbrot test, I had to face the inevitable: the shortage of temporary registers. When I implemented the register allocation subsystem I tried to postpone the fix for this problem, but now that a lot more instructions are implemented, the contiguous translated chunks are getting longer. Suddenly, I got the error message that I wrote a couple of months ago:

“Error: JIT compiler ran out of free temporary registers”

Probably this error needs some explanation. Let’s start with how the registers are handled in the JIT engine:

In Petunia the emulated 68k registers are statically mapped to PowerPC registers. This approach has its benefits: constructing the code that handles the emulated registers is much easier. But it also has some major issues, mainly because all registers must be up to date at any time. Whenever the execution leaves the compiled code, all registers must be saved and restored to the previous state. Needless to say, that takes a long time: more than 20 registers must be saved.

In E-UAE this would be a no-go. The compiled code consists of smaller contiguous chunks, and the support functions from the environment must be called quite often. The rest of the emulator was written in C, and it expects the registers to be in a certain state at any given time (which is described in detail in the PowerPC SysV ABI document).

How to overcome these issues? First, I realized that there is no point in keeping the contents of all the emulated registers in real registers all the time. The usage of a given emulated register is localized to a certain range of the code that mostly consists of closely tied instructions.

Then I figured out from the SysV ABI that I can safely use a range of registers that need no saving/restoring, and I can also make use of the non-volatile registers that must be restored, because other called functions must restore those too.
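To make the distinction concrete, here is a minimal sketch of how the general purpose registers are classified by the 32-bit PowerPC SysV ABI. The type and function names are made up for this illustration, and the sketch does not show which registers the JIT actually picks.

```c
#include <assert.h>

/* Register classes of the 32-bit PowerPC SysV ABI. */
typedef enum {
    REG_VOLATILE,     /* caller-saved: any called function may clobber it */
    REG_NONVOLATILE,  /* callee-saved: a called function must restore it */
    REG_RESERVED      /* dedicated role, not available for allocation */
} reg_class;

/* Classify general purpose register r (0..31). Hypothetical helper. */
reg_class ppc_gpr_class(int r)
{
    assert(r >= 0 && r <= 31);
    if (r == 1 || r == 2 || r == 13)
        return REG_RESERVED;      /* r1 stack pointer, r2 system-reserved, r13 small data pointer */
    if (r == 0 || (r >= 3 && r <= 12))
        return REG_VOLATILE;      /* r0 and r3..r12 */
    return REG_NONVOLATILE;       /* r14..r31 */
}

/* Count how many registers fall into a class. */
int count_class(reg_class c)
{
    int n = 0;
    for (int r = 0; r <= 31; r++)
        if (ppc_gpr_class(r) == c)
            n++;
    return n;
}
```

The volatile registers are free to use between calls without any saving, while the non-volatile ones survive calls into the C part of the emulator because every called function restores them.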

I ended up implementing a dynamically allocated register serving system that can handle the needs of the compiling.

There are two types of registers: the temporary registers, which are used in one instruction only for a specific role, and the mapped 68k registers, which are loaded once and kept until they must be released because the execution leaves the compiled code.

Back to the error message: eventually the free registers run out because the compiler keeps the mapped 68k registers between instructions. I had to figure out a way to force the release of a previously mapped register when there are no more free registers left. Also, I had to make sure that I wouldn’t release a register that was allocated just recently and is needed for compiling the current instruction (or for the following instructions that might depend on the same 68k register).

This issue was resolved by a simple locking mechanism for the mapped registers: when a register is mapped, it is locked at the same time for the current instruction. At the end of an instruction all mapped 68k registers are unlocked but not released.
When a release is forced, I look for an unlocked mapped register and release it.
The search for the next unlocked register continues in a round-robin manner, to make sure I won’t release-allocate-release the same register all the time.

This system looks robust enough to handle the needs of the compiling.
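The locking and round-robin release described above could be sketched roughly like this. It is only an illustration under simplifying assumptions: the pool size, the structure layout and all names are hypothetical and are not taken from the E-UAE sources.

```c
#include <assert.h>
#include <stdbool.h>

#define NUM_HOST_REGS 4   /* hypothetical pool size, kept tiny for the example */
#define NO_REG (-1)

typedef struct {
    int  mapped_68k;      /* emulated register held here, or NO_REG if free */
    bool locked;          /* needed by the instruction being compiled */
} host_reg;

typedef struct {
    host_reg regs[NUM_HOST_REGS];
    int next_victim;      /* round-robin cursor for forced releases */
} reg_pool;

void pool_init(reg_pool *p)
{
    for (int i = 0; i < NUM_HOST_REGS; i++) {
        p->regs[i].mapped_68k = NO_REG;
        p->regs[i].locked = false;
    }
    p->next_victim = 0;
}

/* Map an emulated register and lock it; returns the slot index,
   or NO_REG if every slot is locked by the current instruction. */
int pool_map(reg_pool *p, int reg68k)
{
    assert(reg68k >= 0);
    /* Already mapped: just lock it again. */
    for (int i = 0; i < NUM_HOST_REGS; i++) {
        if (p->regs[i].mapped_68k == reg68k) {
            p->regs[i].locked = true;
            return i;
        }
    }
    /* Otherwise take a free slot if one exists. */
    for (int i = 0; i < NUM_HOST_REGS; i++) {
        if (p->regs[i].mapped_68k == NO_REG) {
            p->regs[i].mapped_68k = reg68k;
            p->regs[i].locked = true;
            return i;
        }
    }
    /* Forced release: first unlocked slot, starting at the cursor. */
    for (int n = 0; n < NUM_HOST_REGS; n++) {
        int i = (p->next_victim + n) % NUM_HOST_REGS;
        if (!p->regs[i].locked) {
            /* A real JIT would emit a store of the old value here. */
            p->regs[i].mapped_68k = reg68k;
            p->regs[i].locked = true;
            p->next_victim = (i + 1) % NUM_HOST_REGS;
            return i;
        }
    }
    return NO_REG;
}

/* End of an instruction: unlock everything, keep the mappings. */
void pool_end_instruction(reg_pool *p)
{
    for (int i = 0; i < NUM_HOST_REGS; i++)
        p->regs[i].locked = false;
}
```

Because the cursor moves past the slot that was just recycled, two forced releases in a row pick different slots, so a freshly loaded register is not thrown out again immediately.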

Back to the lies… errr… benchmarks

As I already mentioned, the implementation has reached the point where almost all instructions needed by the Mandelbrot test are implemented. So let’s see how the compiled code performs.

The test system consists of a Micro AmigaOne (G3/800 MHz), a slightly outdated AmigaOS4 version and my beloved Galaxy S2, which acts as a stopwatch. (Yes, I measure the time with a stopwatch. I didn’t want to spend too much effort on figuring out how to measure the time more precisely.)

Running the Mandelbrot test, I got the following results:
 
Interpretive: ~6 seconds,
JIT compiled: ~3 seconds.


Not too precise, but it seems the JIT compiler is doing nicely: the compiled code finished twice as fast as the interpretive one.

To confirm the results I created a new version of the Mandelbrot test that needs a lot more processor power to complete. You can find it among the test kickstarts; it is called mandel_though_hw.kick. The code is essentially the same as the previous Mandelbrot test, but I tweaked the parameters a bit: I zoomed in and adjusted the drop-out threshold, so the picture is more detailed.

Honey! The Mandelbrot is almost done in the oven!
The results are:

Interpretive: 108 seconds,
JIT compiled: 52 seconds.


Yay! It is indeed twice as fast! Rejoice!
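For those who like to check the arithmetic, here is a small sketch that computes the speedup from the stopwatch readings. The one second reading error used in the bounds is purely an assumption for illustration; it was not measured.

```c
#include <assert.h>

/* How many times faster the JIT run was than the interpretive run. */
double speedup(double interp_s, double jit_s)
{
    assert(jit_s > 0.0);
    return interp_s / jit_s;
}

/* Pessimistic bound, assuming each reading may be off by err_s seconds
   (the error figure is an assumption, not something measured). */
double speedup_low(double interp_s, double jit_s, double err_s)
{
    return speedup(interp_s - err_s, jit_s + err_s);
}

/* Optimistic bound under the same assumption. */
double speedup_high(double interp_s, double jit_s, double err_s)
{
    assert(jit_s > err_s);
    return speedup(interp_s + err_s, jit_s - err_s);
}
```

With the readings above, speedup(6, 3) is 2.0 and speedup(108, 52) is about 2.08. If each reading were off by one second, the short test could be anywhere between 1.25 and 3.5 times faster, while the long test stays between roughly 2.02 and 2.14, which is why the longer run is the more convincing one.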

Disclaimer

Why we must not jump to any conclusions just yet regarding the speed...

First of all, not all instructions are implemented yet. Implementing them most likely means a further speed increase, because the JIT-compiled instructions are always faster than, or at least as fast as, the interpretive implementation. When the emulation hits an unimplemented instruction, it has to call the interpretive implementation, which essentially means storing all the emulated registers in memory and restarting the register mapping.
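The cost of such a fallback can be pictured with a toy model. This is only a sketch: apart from the instr_handler field name, every type, function and counter here is hypothetical and much simpler than the real compiler.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    int mapped_regs;    /* emulated registers currently held in host registers */
    int regs_flushed;   /* write-backs caused by interpreter fallbacks so far */
    int jit_compiled;   /* instructions compiled natively */
    int interp_calls;   /* instructions handed to the interpretive emulation */
} jit_state;

typedef void (*instr_handler_fn)(jit_state *);

typedef struct {
    instr_handler_fn instr_handler;   /* NULL when the instruction has no JIT implementation */
} instr_props;

/* Compile one instruction, falling back to the interpreter if needed. */
void compile_one(jit_state *s, const instr_props *props)
{
    if (props->instr_handler != NULL) {
        props->instr_handler(s);          /* native code, mappings are kept */
        s->jit_compiled++;
    } else {
        s->regs_flushed += s->mapped_regs; /* store every mapped register to memory */
        s->mapped_regs = 0;                /* the mapping starts over afterwards */
        s->interp_calls++;
    }
}

/* Toy handler: pretends the instruction needs two mapped registers. */
void handle_two_operand(jit_state *s)
{
    if (s->mapped_regs < 2)
        s->mapped_regs = 2;
}
```

In this model every unimplemented instruction in the middle of a chunk throws away the current mappings, so the instructions after it have to load their registers again.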

Also, the register flow optimization is not implemented yet. That will certainly be a big speed boost: it will eliminate lots of instructions from the compiled code.

On the other hand, this test relies mostly on rather simple, computation-heavy instructions. When the test code does much more memory access, the difference between the interpretive and the JIT-compiled code will almost certainly be much smaller.

Anyway, I hope I have assured all of you that we are heading in the right direction.

Full speed ahead! (Don’t mind the mines.)

25 comments:

  1. I tried to compile it on MorphOS and after some configure hiccups I was able to do it. With JIT, the emulator crashes with a normal kickstart and a game. Maybe I can find out why this happens. Next time I will try your Mandelbrot test.
    Nice implementation if you gained approx. 50% for now. Respect. Even if it shrank to 30%, this would still be great.

    1. Yes, that is a known and long-outstanding issue.

      I would be very grateful if you could find any hint of why the crash/hang happens. I have spent lots of time on it already with no results at all.

      You might want to try turning off the instruction compiling; that way possible bugs in the compiled code can be ruled out.

      Just change this line (557) in compemu_support.c:

      if (props->instr_handler != NULL)

      to

      if (false)

      The compiler then treats all instructions as unsupported and compiles a simple function call to the interpretive emulation of that instruction instead.

    2. If I do that, I get:

      error: 'false' undeclared (first use in this function)

      I'm using the previous SVN commit at the moment. I'm supposing 0 and false should be the same?

    3. To clear up any confusion, I am not the "anonymous" above.

  2. Very exciting!
    A question for you - there are certain implementations of the PPC that have ever so slightly different instruction sets and registers. Is this an issue for your dynamic register serving implementation? (I couldn't work out if you're handling the PPC registers directly, or if you're leaving it up to the compiler)

    I'm specifically talking about the Cell processor used in the PS3 - though I expect in most ways this doesn't present an issue.

    1. The register allocation is flexible enough to cope with different register sets. The compiler is not involved directly in the code generation. I am not familiar with the Cell processor; maybe I can answer your question if you could elaborate on (some of) the differences.
      The actual implementation depends on the SysV ABI, as I mentioned in the post; for a different ABI it most likely needs at least some adjustment.

      There might be more problems if the flag handling is significantly different or some instructions are not available or behaving differently than what is defined in the generic PowerPC 32 bit specifications. But I guess the code emitter can be adjusted for certain needs with some work.

  3. Okay, it seems that the initial "illegal instruction" makes the application hang/crash. If you switch to 68000 it will go through. On a 68020 it will crash without showing "illegal instruction", but with a B-Trap.
    I found the code in newcpu.c and commented the B-Trap out for testing purposes. Then the emulator did not crash anymore, but the system just stopped booting. Everything else still worked. Maybe the bug is somewhere in the illegal op routines?
    I found out that it went through the code you mentioned in compemu_support.c 10 times and did not crash within this code, but after leaving it for the 10th time.

    1. There is no JIT for the 68000; it needs the 68k cache. Check out the FAQ, it is explained there. (You can find it among the links to the right.)

      The B-Trap is normal behavior: even if you turn off the JIT compiling (set the cachesize configuration to 0), you will still get the B-Trap. The Kickstart is trying to do some trick that involves a trap.

      It won't crash if you change the memory access emulation to indirect (configuration: comp_trustbyte, comp_trustword, comp_trustlong = indirect). There must be something wrong with the memory mapping.

      But even with these settings the booting stops: it is waiting for an event from the hardware emulation (CIA probably) that never arrives.

    2. Okay thank you for this hint.

  4. I just edited this post slightly. When I re-read it I found some annoying mistakes and overly complex sentences. I hope it is a bit clearer now.

    (I know most of the readers are only interested in those four numbers in the middle... ;)

  5. Btw, your function ppc_cacheflush is not checked in yet. For my testing, I commented out this line. The Mandelbrot tests only show a colored frame but no graphics at all :( Hope I can fix it.

    1. It is checked in, but only implemented for AmigaOS4, because I have no idea how it should be done on any other systems.
      Check out this line in the SVN:
      https://sourceforge.net/p/euaeppcjit/code-0/15/tree/trunk/src/od-amiga/memory.c#l36

      If you know how to implement it on other systems, please send me a patch for it. Thanks!

    2. Yes, thank you, the cacheflush code should also work for MorphOS; at least I can run the Mandelbrot with it. I noticed that custom chip access makes the JIT hang or crash. The iamalive.kick also works only without JIT, so I think writing to addresses like DFF180 or other custom chip addresses will crash it.

    3. Good news! I think I found your bug, or at least where it comes from!
      Check the code in newcpu.c at 1041, you will see your JIT define block. This block prevents the emulator from booting the kickstart. If you comment this block out, the emulator will boot. Also the iamalive.kick works then.
      The problem is using the cacr register. If there is a command Move ?,cacr then it will hang. Somewhere in this part is the problem (flush_icache or something like that?)

      Hope this information will let you fix the bug.

    4. The Move ?,cacr command turns on the cache, which activates the JIT. If you comment out those lines, the cache remains turned off and there is no JIT. This is why you were able to boot... :)
      But it is really interesting that iamalive.kick hangs for you. It sounds like the cache flushing is not working at all. Have you tried removing the check for the AmigaOS4 define around the PPC cache flush yet? Without flushing the code cache it is almost certain that the compiled code won't work.

    5. Okay, I didn't know that enabling the cache in the emulated machine enables the JIT. Does this mean the AROS kick file will not use the JIT until a program enables it?
      Very strange, I got random crashes. It often hangs with JIT, but not every time. It's not the ppc_cacheflush. The whole emulator crashes if I use a cachesize of e.g. 8192. If I use 1000, then it works with Mandelbrot. Maybe it's memory fragmentation?

    6. When the processor's instruction cache is not invalidated, the previous instructions remain in it. It is very hard to predict the outcome of a situation like that.
      Probably, when you specified a different cache size, the memory was allocated from a different address region that wasn't used before, so the instructions from that area were not cached yet.

  6. Don't use the CacheClearE() function in MorphOS because it only clears the 68k CPU caches (JIT). The PPC caches are controlled using the following functions:

    * CacheFlushDataArea
    * CacheFlushDataInstArea
    * CacheInvalidDataArea
    * CacheInvalidInstArea
    * CacheTrashDataArea

    See the Exec autodocs for more information. CacheFlushDataInstArea is probably the call to make.

    1. Thanks for the suggestion, it seems this is the solution we are looking for.

      Somebody, who has access to MorphOS and can build the app: could you please give it a try and tell us how it works? Thanks.
      You can find the required function ppc_cacheflush() at od-amiga/memory.c line: 36

      How about aligning the allocated code cache to the processor cache? (See od-amiga/memory.c line: 28 for the AmigaOS4 solution.)

      One more thing: you can make sure that the JIT is on by turning the JIT logs on. Add this to the config:

      comp_log=true

      When code is JIT compiled, it appears in the log.

    2. And one more thing: on AmigaOS4 the memory for executable code must be allocated with the MEMF_EXECUTABLE flag. I am almost certain that there must be some similar requirement on MorphOS.

  7. Memory is allocated using MEMF_ANY because the NX bit is not used in MorphOS. To align the allocated code block you have to query the L1 instruction cache line size:

    #include
    #include

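    int main(void)
    {
        ULONG cachelinesize;

        /* Query the L1 instruction cache line size from exec. */
        if (NewGetSystemAttrsA(&cachelinesize, sizeof(cachelinesize), SYSTEMINFOTYPE_PPC_ICACHEL1LINESIZE, NULL))
        {
            /* ULONG is printed as unsigned long to match the format specifier. */
            printf("L1 instruction cache line size is %lu bytes.\n", (unsigned long)cachelinesize);
        }

        return 0;
    }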

    On my machine this would print "L1 instruction cache line size is 32 bytes.".

    To allocate an aligned memory block you call AllocMemAligned() or AllocVecAligned() or even AllocPooledAligned():

    void *ptr = AllocVecAligned(size, MEMF_ANY, l1_icache_linesize, 0);

    The last parameter is the align offset, in case you need to store additional data before the aligned memory area begins.

    1. Thanks, this was helpful. I will prepare the functions and update the SVN repository.

    2. This was not me, this was a second Anonymous. I don't want to brag in his name :) So many thanks!
      I included the code into my test UAE for MorphOS and voilà, it works!
      For my system I hardcoded the 32 bytes, but for a release this should be determined at startup.
      The two functions to be edited are (note that I left out the OS4 parts):

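      void *cache_alloc(int size)
      {
          /* 32 is the L1 instruction cache line size on my system, hardcoded for testing. */
          return AllocVecAligned(size, MEMF_ANY, 32, 0);
      }

      void ppc_cacheflush(void *start, int length)
      {
          /* Make the freshly written code visible to the instruction fetch. */
          CacheFlushDataInstArea(start, length);
      }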

      This worked for mandelbrot and iamalive. Really good. Thank you!

    3. Thanks for the confirmation.

      Too many Anonymous are hanging around... Don't you guys want to be famous? :)

    4. Anonymous #1: Benefit for all is worth more than fame for one or two people :) I posted my name in the other thread, but only for "disanonymous" reasons, not for fame.
