This project is dedicated to the memory of William Morris (aka Frags), who was the main contributor to the bounty but was unable to see the final result.

Tuesday, May 15, 2012

Peeking behind the curtain


What's new

I have just uploaded a small update again to the Subversion repository. In this update you can find the following changes:
  • Implementation of all form of the following addressing modes for both as source and as destination addressing:
    • immediate long, word, byte;
    • indirect addressing: (Ax);
    • indirect addressing with pre-decrement: -(Ax);
    • indirect addressing with post-increment: (Ax)+;
  • Implementation of instructions:
    • MOVE.x #imm,Dy;
    • MOVEA.x #imm,Ay;
    • MOVEQ #imm,Rx;
    • MOVE.x Rx,Ry
  • Code was refactored to avoid repetition of the same code chunks in the addressing modes and instructions.
  • Implementation dumping the compiled code to the console.
  • New configuration (comp_log_compiled) is also added to turn it on/off.

Featuring...

It is always exciting to find out some magic details about the internal behavior of a complex system. I remember when I realized for the first time how the texts are stored in the Commodore Plus/4 games, back in the days. I was amazed how I can change the text that showed up on the bottom scroll of Tycoon Tex.

Let me offer some small excitement to you all: I just implemented a funny little feature in the E-UAE JIT engine. Now we can turn on dumping the compiled code to the console, together with the original Motorola 68k instruction that was compiled and the macroblocks that describe the intermediate translation form.

The purpose of this feature wasn’t (purely) the entertainment, but I was really fed up with the situation that the generated code cannot be debugged properly. Previously, I added a trap instruction (tw) into the translated code at some point, so I was able to have a look on the output from the Grim Reaper window (which was awesome, but let’s not mix it up with actual debugging).
Too bad that GDB is so limited: it cannot debug into any code segment that wasn’t loaded by DOS (like generated code). Not to mention how cumbersome the console interface is... (Or am I missing something? Enlight me please.)

I would like to thank Frank Wille the sources for the PowerPC disassembler that makes it possible to list the translated code.

How to turn on this feature: there are two settings that control the logging. These are:
  • comp_log – if it was set to “true” or “yes” then the JIT logging is turned on and dumped to the standard output.
  • comp_log_compiled – if it was set to “true” or “yes” then the compiled code is listed through the JIT logs.

Let’s see a small demonstration of this feature, shall we? (Not for the faint-hearted!)

The following list is the output from the very simple test code: iamalive.asm, slightly edited and formatted for educational purposes...
  1. M68k: ADD.L #$00000001,D1
    1. Mblk: load_memory_long
      Dism: lwz r15,64(r14)
      Mblk: load_memory_long
      Dism: lwz r3,68(r14)
      Mblk: rotate_and_copy_bits
      Dism: rlwimi r15,r3,16,26,26
    2. Mblk: load_memory_long
      Dism: lwz r3,4(r14)
    3. Mblk: load_register_long
      Dism: li r4,1
    4. Mblk: add_with_flags
      Dism: addco. r3,r3,r4
    5. Mblk: copy_nzcv_flags_to_register
      Dism: mcrxr cr2
      Dism: mfcr r15
      Mblk: rotate_and_copy_bits
      Dism: rlwimi r15,r15,16,26,26
  2. M68k: MOVE.W D1,(A0,$0180) == $00dff180
    1. Mblk: load_memory_long
      Dism: lwz r4,32(r14)
    2. Mblk: add_register_imm
      Dism: addi r5,r4,384
    3. Mblk: check_word_register
      Dism: extsh. r0,r3
      Mblk: copy_nz_flags_to_register
      Dism: mfcr r6
      Mblk: rotate_and_copy_bits
      Dism: rlwimi r15,r6,0,0,2
      Mblk: rotate_and_mask_bits
      Dism: rlwinm r15,r15,0,11,8
    4. Mblk: save_memory_long
      Dism: stw r3,4(r14)
    5. Mblk: save_memory_spec
      Dism: mr r4,r3
      Dism: mr r3,r5
      Dism: rlwinm r0,r3,18,14,29
      Dism: lis r5,27315
      Dism: ori r5,r5,23016
      Dism: lwzx r5,r5,r0
      Dism: lwz r5,16(r5)
      Dism: mtlr r5
      Dism: blrl
  3. M68k: BT.B #$fffffff8 == 0000001a (TRUE)
    1. Mblk: save_memory_long
      Dism: stw r15,64(r14)
      Mblk: save_memory_word
      Dism: sth r15,68(r14)
    2. Mblk: load_register_long
      Dism: lis r3,27606
      Dism: ori r3,r3,45096
      Mblk: save_memory_long
      Dism: stw r3,76(r14)
    3. Mblk: opcode_unsupported
      Dism: li r3,24824
      Dism: lis r4,27315
      Dism: ori r4,r4,21752
      Dism: bl 0x7f91acc0

  4. Done compiling
Colorful, isn't it? :)

Okay, let's try to understand what is going on.

I marked the three Motorola 68k instruction that was compiled here with orange color, the code roughly looks like this:

1. Increase register D0 by one;
2. Put the content of register D0 to the address that is calculated by using register A0 plus offset of 0x180 (A0 was initialized previously with the value: 0xDFF000, which is the base of the custom chipset memory area) - in layman terms: load it to the background color.
3. Go back to step 1.

Now, let's see the second level of the list:

First of all the prefix "Mblk:" marks the macroblocks (white), "Dism:" is the actual PowerPC code (yellow).
As I already mentioned earlier: some macroblocks can be optimized away (although it is not implemented yet), and a macroblock means at least one PowerPC instruction, but it can be a series of instructions also.

The steps can be interpreted roughly as:

1.1. Load the arithmetic flags from the memory where the interpretive emulator stores them.
1.2. Load the previous content of the emulated D0 register into a PPC register.
1.3. Load the constant for the add instruction (one) into a PPC register.
1.4. Add the second register to the first one (increase D0 by one).
1.5. Save the arithmetic flags after the operation.

2.1. Load the previous content of the emulated A0 register into a PPC register.
2.2. Add the offset (0x180) to the content of A0 and load it into a new PPC register.
2.3. Check the content of the emulated D0 register to set up the arithmetic flags according to it.
2.4. Save back the modified D0 register to the memory for the interpretive emulator.
2.5. Calculate the offset and load the function address for the memory write operation handler and call it (namely the custom chipset write handler). This is a function from the interpretive emulation and it was written in C, therefore we must store all volatile registers back to the memory, the C code won't preserve these. (This is why we stored the D0 register in step 2.4.)

3.1. Save the arithmetic flags back to memory where the interpretive emulator stores them. (These were kept in a non-volatile register, so these were preserved while we called the helper function in step 2.5.)
3.2. Update the emulated PC register to the current state for the following instructions.
3.3. Call the interpretive emulation for the branch instruction (because it is not implemented yet, so we reuse the interpretive implementation).

4. Done. Phew.

Funny, eh? :)
If you are not familiar with assmebly then don't stretch yourself too much by trying to understand this techno-blahblah.

For the rest: who can spot what can be optimized on the compiled code?

4 comments:

  1. Thumbs up for techno-blahblah !

    Btw, its me, or PPC code in end are much larger/longer in compare with 68k originals ?

    ReplyDelete
    Replies
    1. Usually it is longer, at least a couple instructions needed to emulate the exact same behavior of the 68k instruction. But don't forget how much faster the PPC is compared to the fastest 68k. Even if we assume that the 68k instruction can be executed in 1 CPU clock cycle, a 800 MHz PPC would be 16x faster than a 50 MHz 68k. This is not true of course, but there are other factors, like the speed of the memory access which is much better on a newer hardware than the old Amigas obviously, or the number and size of the caches in the processor.

      Delete
  2. Nice. Keep up with the good work.

    I know we are talking about UAE, but could be possible to realize an entire emulator (A sort of VM) that have all the chipset jit emulated? I mean: We have petutia... but what about to add the blitter and the other chipset components to act in the same way as petunia?

    Thank you very much.

    ReplyDelete
    Replies
    1. Interesting question. I have never investigated why the chipset emulation is slow. I assume it is the huge amount of data that must be copied to the screen for each frame. I don't think it is the implementation of the components.
      Why the JIT helps a lot in the processor emulation: for each instruction a significant amount of static analysis is needed before the exact operations. This is done by the JIT compiler once and the optimized operation can be reused next time.
      The other hardware components are fairly simple compared to the processor. I don't expect a major speed improvement by doing some form of JIT for these.

      But it all boils down to the implementation, and I am not aware of the details. There is some chance that it can be improved by some form of precompiled instructions, if it wasn't done previously.

      Delete