This project is dedicated to the memory of William Morris (aka Frags), who was the main contributor to the bounty but was unable to see the final result.

Wednesday, November 7, 2012

Flag yeah

I am slowly working my way through the unimplemented instructions. Another bunch is done, here are the details of the recent update:
  • Implementation of flag condition "addressing" modes.
  • Implementation of ST.B Dx, SF.B Dx, Scc.B Dx, ST.B mem, SF.B mem, Scc.B mem, PEA.x, MOVE.x #imm,mem, CLR.x Dy, CLR.x mem and LEA instructions.
  • Fixed BTST instruction: testing bits higher than 15 was wrong.
  • Merged multiple condition code-related lines in the 68k instruction descriptor table.
  • Removed unnecessary parameter load for cache_miss function from the translated code PC verification code.
  • Added cache miss check for normal execute handler.
  • Code cleanup: removed the TODO label from immediate addressing modes used as destination and added meaningful error message instead.
  • Removed ignored parameter from the unsupported opcode macroblock push function.
The highlight of this change set is the implementation of the flag checking "addressing" modes. These are not real addressing modes; more like a simple way of implementing the numerous conditional instructions which are checking the arithmetic flags.
This important change opens the gate for the conditional branching instructions (Bcc and DBcc), which are essential for ordinary loops and iterations. For now only the conditional set instructions are implemented, because these were much easier to test.
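To make the idea more concrete, here is a minimal Python sketch (not the emulator's C code) of how the sixteen 68k condition codes can be evaluated from the N, Z, V and C flags, and how Scc.B turns the outcome into a byte. The function names are made up for illustration only.

```python
# The 16 condition codes of the 68k, indexed by the 4-bit condition field
# that appears in the Scc, Bcc and DBcc opcodes.
CONDITIONS = ["T", "F", "HI", "LS", "CC", "CS", "NE", "EQ",
              "VC", "VS", "PL", "MI", "GE", "LT", "GT", "LE"]


def condition_true(cc, n, z, v, c):
    """Evaluate 68k condition number cc (0..15) against the N, Z, V, C flags."""
    tests = [
        True,               # T  - always true
        False,              # F  - always false
        not c and not z,    # HI - higher (unsigned)
        c or z,             # LS - lower or same (unsigned)
        not c,              # CC - carry clear
        c,                  # CS - carry set
        not z,              # NE - not equal
        z,                  # EQ - equal
        not v,              # VC - overflow clear
        v,                  # VS - overflow set
        not n,              # PL - plus
        n,                  # MI - minus
        n == v,             # GE - greater or equal (signed)
        n != v,             # LT - less than (signed)
        not z and n == v,   # GT - greater than (signed)
        z or n != v,        # LE - less or equal (signed)
    ]
    return bool(tests[cc])


def scc_byte(cc, n=False, z=False, v=False, c=False):
    """Result of Scc.B: 0xFF when the condition holds, 0x00 otherwise."""
    return 0xFF if condition_true(cc, n, z, v, c) else 0x00
```

The same condition evaluation would serve Bcc and DBcc too; only what happens with the outcome differs (set a byte, take a branch, or decrement and branch).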

I also took some weekday nights to deal with some simple instructions, like LEA, PEA and CLR. On working days I am too tired, so this is all I can afford.

There is still no fix for the OS booting issue. I have tried to trace it again, and at least now I have found out that it is not about the simulated cache manipulation, because no cache incoherency is detected. I am still puzzled by this whole problem.

Tuesday, October 23, 2012

Address Unknown

Finally all the normal addressing modes are done in this update. As I mentioned in the previous post: the final missing pieces were the 68020 (complex) addressing modes.
This took me a lot longer than I expected. Not just because these addressing modes are complicated, but also because memory reading is involved.

Secretly I hoped that this would fix the issue with the OS, but no. I am not sure why I expected that. :P
The OS still behaves the same: reboot loop.

You can stop reading now, technical details will follow.

Why does memory access make everything much more complicated?

The answer is: every memory access has to go through the memory handling functions in the emulator. I already implemented the possibility of direct memory access, but we cannot depend on that every time (or rather most of the time).

Then why is that causing any headaches? Because these functions live outside of the translated code and are written in C, so the contents of all assigned temporary registers are gone by the time they return.

Previously, for the MOVE instructions I worked around the situation by remapping and reloading only the required temporary registers as soon as the memory operation was done. This approach simply does not work for the addressing mode translation: the addressing modes are completely isolated subroutines that do not know any details of the already allocated and mapped registers (which should be restored).

For now I implemented a different workaround for the addressing modes, which is far less optimal: instead of dropping all the temporary registers, I save them on the stack and then restore them when the execution returns to the translated code.

Needless to say, saving and reloading all mapped temporary registers is much slower than dealing only with the registers that are absolutely required. I am not satisfied with this solution, but at least it is working. Unless the called function expects a proper stackframe, because there won't be any. (At least this is not the case for OS4, and most likely not for MorphOS either.) This solution is not strictly compatible with the SysV ABI.

It would probably make more sense to store the temporary registers in the static context structure. There is no need to be reentrant, since there are no callbacks into the translated code. Maybe I can revisit this whole piece of code at a later stage.

Monday, October 8, 2012

Jinxed it (more Apples)

Speaking of apples. It seems I managed to upset the Gods somehow with the previous post: they unleashed their wrath on me. Apple released iOS 6.

The funny part was: the developer pre-release was all hunky dory with our app, but then came the final release and things changed dramatically overnight. Since I am working as a mobile app developer these days, I had to spend a lot of time on it, even some weekend day(s). Okay, I will stop grumbling, we finally managed to get everything under control. It was stressful and annoying.

So, you might guess why the recent update is so thin: when I finally managed to get home I was fully drained, I had no strength to look at one more line of code.
I have implemented three missing addressing modes, which are included in this update. I also started to work on the complex addressing modes (also called 68020 addressing modes), which are... well... really complex. I am less than halfway through them, but in the meantime I didn't want to hold back this:

G5 support

While I suffered in deep agony, luckily Tobias Netzel was busy again and managed to fix up E-UAE for the G5 (PowerPC 970) processors. He got rid of the mcrxr instruction, which is unsupported on this processor type and has to be emulated (see the chapter about mcrxr). He also did some optimizations regarding the usage of microcoded instructions, but we probably need to do more about that at some point.

The situation was very similar to the 68060 and some unsupported FPU instructions which were emulated by the OS. If you remember how much Oxypatcher increased the speed of some floating point calculation intensive apps back then, you can understand why E-UAE was so sluggish on G5 before: almost half of the emulated instructions make use of the mcrxr instruction for emulating some arithmetic flags.
With the recent changes this PPC instruction is not used for anything if the emulator is compiled for G5, and the fix helps a lot with the interpretive emulation too.

Some benchmarks from Tobias using the Mandelbrot test (G5 - 2.1 GHz):

Interpretive emulation:
  • using mcrxr: 5:02 secs
  • without mcrxr: 59 secs
JIT without flag optimization:
  • using mcrxr: 7:46 secs (yes, even slower than the interpretive...)
  • without mcrxr: 27.5 secs
JIT with flag optimization:
  • using mcrxr: 1:47 sec
  • without mcrxr: 18 sec
Well done, Tobias! The PPC Mac users will be grateful again.
By the way, these changes might have an effect on the Cell PPE and Xenon processors too. Any volunteering developers?

Bounty

Recently, I had a cautious look at the bounty page for the project and I was a bit shocked by the fact that Mr. William Morris donated 1000 EUR. That was half of the previously collected bounty. Very generous of you.

Maybe this post is not the best opportunity to thank you all for your support. I hope I fulfill the expectations sooner rather than later.

Thursday, September 6, 2012

Apple from the Tree

I try to keep this post short. New update is available:
  • MacOSX Darwin PowerPC support is implemented.
  • Fixed address distance calculation for the PowerPC native relative branch instructions.
  • Refactored the boolean values to use TRUE/FALSE preprocessor defines.
Big thanks go to Tobias Netzel, who implemented the MacOSX support for the JIT compiling and helped me chase down one more sneaky bug.

Some details on the bug: previously the negative relative branch calculation was completely wrong, which caused jumps to invalid addresses in certain situations and made the application crash.
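For the curious, this is roughly what the distance calculation for the PowerPC unconditional relative branch (b/bl) has to get right. It is a minimal Python sketch and not the code from the emulator: the function names are hypothetical and the actual bug is not reproduced here. The point is that a negative distance must be truncated to the 24-bit LI field with its two's complement bits intact, and sign-extended again when read back.

```python
def encode_branch(from_addr, to_addr, link=False):
    """Encode a PowerPC relative branch (b, or bl when link is set)."""
    disp = to_addr - from_addr
    if disp % 4 != 0:
        raise ValueError("branch distance must be a multiple of 4")
    if not -(1 << 25) <= disp < (1 << 25):
        raise ValueError("target is out of range for a relative branch")
    # Primary opcode 18 in the top six bits, then the displacement bits,
    # AA (absolute address) bit left clear, LK bit as the lowest bit.
    return 0x48000000 | (disp & 0x03FFFFFC) | (1 if link else 0)


def decode_branch_target(from_addr, insn):
    """Recover the branch target from an encoded relative branch."""
    disp = insn & 0x03FFFFFC
    if disp & 0x02000000:       # sign bit of the 26-bit displacement
        disp -= 1 << 26
    return from_addr + disp
```

A backward jump by one instruction, for example, comes out as 0x4BFFFFFC: all the displacement bits are set except the lowest two.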

END-OF-TRANSMISSION

Sunday, September 2, 2012

Bug! *Splat*

Every developer knows the feeling when he/she finally finds a bug and slaps his/her forehead while mumbling: "How on earth has this thing ever been working?..."

Well, it just happened to me, I have fixed a bug that stopped the ROM from booting. It was a rather stupid mistake (as usual); for the details check out the update.
In this other minor update I have fixed one more nuance with the wrong addresses in the dumped PowerPC code log.

Right now the emulation advances even further in the booting process than before, when it stops with this cryptic message:

Compiling error: instruction or addressing mode is not implemented, but marked as implemented: 0x323b

Unfortunately, this is true: this is a move instruction with a complex (68020) addressing mode, which is not implemented yet.
Since the move instruction itself is marked as supported and all the addressing modes are listed in the descriptor, it busts me big time and calls me a liar. Fair enough.

I promise that I implement all the missing addressing modes soon. Honest.

I was so excited that I tried to run the ROM without compiling the instructions but in this case I got back to the previous problem: the reboot loop. :(

At least one bug was squashed again.

In the meanwhile Tobias managed to port the JIT to PowerPC MacOSX. For the speed check out his comment. I hope he sends me the changes soon and I can add it into the main source repository.

Thursday, August 30, 2012

Optimize It

I have been back from the holidays for some time now, but I got swamped instantly by the work in my daytime job. I had very little energy left for the project in the nights, and it was spent on two things: chasing that #@!% bug which is killing me and implementing the data-flow optimization.

Needless to say, I failed to track down the bug again, just like every other time I spent hours looking at a couple of megabytes of log dumps. Next time I will try a new approach, suggested by Stephen Fellner: reducing the code complexity while the bug can still be reproduced.
We will see how it goes. The processor cache emulation certainly complicates things; that, at least, can be eliminated.

While I acknowledged the failure again, I gathered all my previous thoughts and implemented the data-flow based optimization which does a really nice job.

The update is small, but highly important again:
  • Implementation of code optimization.
  • New configuration was introduced: comp_optimize to turn it on/off.
  • Bugs in the register input/output flag specifications are fixed.
  • Fixed too small string buffer in the 68k disassembler, previously the debugger crashed every time when memory was disassembled to the screen.
  • Implementation of MOVEM.x regs,-(Ay)

What, how and why?

As I tried to explain earlier in this post some of the compiled code is completely useless. The emulated instruction consists of the following parts usually:
  • Initialization of certain temporary registers;
  • Executing the actual operation;
  • Alter the arithmetic flags according to the result;
  • Save the result somewhere (into memory)
Some of these operations are not common or can be done for a series of instructions ahead, like loading the previous data for the emulated registers into temporary registers, but there are parts that cannot be avoided if we examine the emulated instruction out of its context.

For example the arithmetic flags are overwritten quite often by the following instructions, which means: if the next instruction(s) do not depend directly on the flag results then we can remove the code that generates these.

Typically, if we (as mighty humans) look at the instructions in a block, we can easily identify the context and can mostly tell which instructions produce usable results and which are not important. In most cases this is simple for a human, but it is a really complex job for the code compiler.

As I previously described: the emulated code is broken up into macroblocks by the precompiling. These blocks represent an "atomic" operation, like load data into temporary register, compare two registers or calculate certain arithmetic flags from the previous result.
When I started working on this concept I figured out that if I could identify, for each macroblock, exactly which previous result(s) it depends on and which result(s) it produces, then I could evaluate the dependencies between the subsequent macroblocks.

Going with the flow

For Petunia I already implemented a similar concept, but that was limited to the flag usage and it is not able to split up the emulation of the arithmetic flags into individual flag registers: it is either emulated completely or removed completely. (Which means that even if we needed only the C register later the rarely used X register will be calculated too, usually after the C register was already done. If you don't understand the reason behind this: don't worry - you need to know more about 68k assembly.)

For E-UAE the implementation is radically different. For any macroblock there is the possibility of specifying each and every emulated register, flag or temporary register as a dependency (input) and/or as a result (output).

In the recent changes I implemented a rather simple solution for calculating the data flow for each register after all macroblocks are collected for a block of instructions.

It is capable of finding out for each macroblock whether the produced results are relevant for the following instructions in the block or not. If not then that macroblock can be eliminated from the compiled code because no instruction is depending on its results.
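The idea can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not the actual E-UAE code: the Macroblock class, the side_effect marker and the macroblock names in the usage example are all hypothetical. The pass walks the collected macroblocks backwards, keeps track of which registers and flags are still needed, and drops every macroblock whose outputs nobody reads.

```python
from dataclasses import dataclass, field


@dataclass
class Macroblock:
    name: str
    inputs: set = field(default_factory=set)    # registers/flags it depends on
    outputs: set = field(default_factory=set)   # registers/flags it produces
    side_effect: bool = False                   # e.g. a memory write: never removed


def eliminate_dead(blocks, live_out):
    """Drop macroblocks whose results are not used later in the block.

    live_out lists everything that must be valid when the block is left.
    """
    live = set(live_out)
    kept = []
    for mb in reversed(blocks):
        if mb.side_effect or (mb.outputs & live):
            # The results are produced here, so earlier producers of the same
            # outputs are not needed on their account; the inputs are.
            live -= mb.outputs
            live |= mb.inputs
            kept.append(mb)
    kept.reverse()
    return kept


# Hypothetical example: an ADD followed by a MOVE that overwrites N, Z, V, C.
block = [
    Macroblock("add", {"D1"}, {"D1"}),
    Macroblock("add_flags", {"D1"}, {"N", "Z", "V", "C"}),
    Macroblock("move_flags", {"D1"}, {"N", "Z", "V", "C"}),
    Macroblock("store_word", {"D1", "A0"}, set(), side_effect=True),
]
optimized = eliminate_dead(block, live_out={"D1", "N", "Z", "V", "C"})
```

In this example the flag calculation of the ADD is removed, because the MOVE overwrites the same flags before anything reads them. Because each flag is tracked individually, a macroblock that produces only one still-needed flag survives while the others go.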

 

How to use it

Although the optimization is meant to be completely safe (it won't remove any code that is essential), there might be some bugs while the emulation is not stable enough. So, I introduced a new configuration option for turning it on/off, called: comp_optimize.
It replaces the comp_nf configuration from the x86 JIT implementation, because it is not just about the flags (nf = no flags).

If it is set to true then the data-flow calculation is done and some macroblocks will be removed. By setting it to false the emulation compiles all the instructions fully into the buffer.

 

The results

And finally some speed tests... Compared to the previously published Mandelbrot test results the actual numbers are:

Interpretive: 108 seconds;
JIT compiled without optimization: 52 seconds;
JIT compiled with optimization: 32 seconds.

That is a roughly 40% reduction in run time (52 seconds down to 32) in the case of this (heavily arithmetic) test.

The test system was: Micro AmigaOne (G3/800 MHz) - let's compare it to WinUAE that is running on my laptop (Intel Core i3 M350/2.27 GHz):

JIT compiled with all the possible optimizations turned on: 9 seconds.

It would be a tough job to compare these two computers, but I am pretty sure that my laptop is more than 3.5x faster than that poor old G3 machine.

I am really content with these results for now.

Friday, July 13, 2012

Summertime!

After a long-long time we are going to go on holiday with my beloved wife.

While I am AFK (pick one) please just sit patient and try to find the bug that prevents the OS from booting, will ya?

Thanks. :)

See ya in August.

Sunday, July 8, 2012

JIT Goes Blue

Although it was just a small update, it is highly important:
  • Thanks to Anonymous #1 Thore and Anonymous #2 itix (from the comments section for the previous post) MorphOS support for the JIT compiling is now implemented. (I had no possibility to test it, but fingers crossed...)
  • A bug is fixed in the memory read/write handling. It caused illegal memory access when the 3.x Kickstart was running, the stackframe was trashed due to a wrong offset calculation for the register saving.
    Unfortunately, this is not yet the fix that is needed to let AmigaOS boot, but at least it is one more baby step in that direction.
Enjoy!

P.S.: Anonymous MorphOS devs, don't you want to reveal yourselves? :)

Sunday, July 1, 2012

Lies, Damned Lies, Statistics and Benchmarks

Another nice batch of instructions is implemented in the recent update. This is an important milestone: most of the instructions that are used in the Mandelbrot test are JIT compiled now. Why is it important?

Because finally we can do some benchmarking!

Source: http://wumocomicstrip.com/2012/04/16

 

The update


So, the changes in this update are:
  • Implementation of
    ADD.x #imm,Dy,
    ADD.x regy,Dz,
    ADDA.x regy,Az,
    ADDI.x #immq,Dy,
    AND.x #imm,Dy,
    ANDI.x #immq,Dy,
    ASL.x #imm,Dy,
    ASR.x #imm,Dy,
    BTST.L #imm,Dx,
    CMP.x #imm,Dy/Ay,
    CMPA.W #imm,Ay,
    CMPI.x #immq,Dy,
    EOR.x #imm,Dy,
    EORI.x #immq,Dy,
    LSR.x #imm,Dy,
    MULS.W Dx,Dy,
    MULU.W Dx,Dy,
    ORI.x #immq,Dy,
    ROL.x #imm,Dy,
    SUB.x regy,Dz,
    TST.x reg instructions.
  • Fixed OR.x reg,reg flag dependency.
  • Recycling of mapped temporary registers is implemented.
  • Added the possibility of inverting C flag after extracting.
  • CMP.L #imm,Ax and CMP.W #imm,Ax were removed from the table, there are no such instructions.

Phew, that was a nice long list.

Recycling of the temporary registers

While I was running amok on implementing the selected instructions for the Mandelbrot test, I had to face the inevitable: the shortage of the temporary registers. When I implemented the register allocation subsystem I tried to postpone the fix for this problem, but since a lot more instructions are implemented the consistent translated chunks are getting longer. Suddenly, I got the error message that I wrote a couple months ago:

“Error: JIT compiler ran out of free temporary registers”

Probably this error needs some explanation. Let’s start with how the registers are handled in the JIT engine:

In Petunia the emulated 68k registers are statically mapped to PowerPC registers. This approach has its benefits: constructing the code that handles the emulated registers is much easier. But it also has some major issues, especially because all registers must be up-to-date at any time. Whenever the execution leaves the compiled code, all registers must be saved and then restored to the previous state. Needless to say how long that takes: more than 20 registers must be saved.

In E-UAE this would be a no-go. The compiled code consists of smaller contiguous chunks and the support functions from the environment must be called quite often. The rest of the emulator was written in C and it expects a certain state of the registers at any given time (which is described in detail in the PowerPC SysV ABI document).

How to overcome these issues? First, I realized that there is no point in keeping all the contents of the emulated registers in real registers all the time. The usage of certain emulated registers is localized to a certain range of the code that mostly consists of closely tied instructions.

Then I figured out from the SysV ABI that I can safely use a range of registers that needs no saving/restoring, and I can also make use of the non-volatile registers that must be restored, because other called functions must restore those too.

I ended up implementing a dynamically allocated register serving system that can handle the needs of the compiling.

There are two types of registers: the temporary registers that are used in one instruction only for a specific role and the mapped 68k registers that are loaded once and kept until the registers must be released because the execution leaves the compiled code.

Back to the error message: eventually the free registers are running out because the compiler keeps the mapped 68k registers between instructions. I had to figure out some solution for enforcing releasing a previously mapped register when there are no more free registers left. Also, I had to make sure that I wouldn’t release the register that was allocated just recently and needed for the actual instruction compiling (or for the following instructions that might depend on the same 68k register).

This issue was resolved by a simple locking mechanism for the mapped registers: when a register is mapped, it is locked at the same time for the actual instruction. At the end of an instruction all mapped 68k registers are unlocked but not released.
When a release is enforced, I look for an unlocked mapped register and release it.
I start searching for the next unlocked register in a round-robin manner, to make sure I won’t release-allocate-release the same register all the time.
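A minimal Python sketch of this locking scheme follows. It is only a model under simplifying assumptions: the class and method names are hypothetical, it handles only the mapped 68k registers (not the single-instruction temporaries), and where it simply forgets a mapping the real code would have to write the register content back first.

```python
class RegisterMapper:
    """Toy model of mapping 68k registers to a few host registers."""

    def __init__(self, host_regs):
        self.host_regs = list(host_regs)
        self.mapped = {}        # host register -> 68k register kept in it
        self.locked = set()     # host registers needed by the current instruction
        self.next_victim = 0    # round-robin position for enforced releases

    def map(self, m68k_reg):
        """Return the host register holding m68k_reg, mapping it if needed."""
        for host, guest in self.mapped.items():
            if guest == m68k_reg:
                self.locked.add(host)
                return host
        host = self._find_free()
        if host is None:
            host = self._release_one()
        self.mapped[host] = m68k_reg
        self.locked.add(host)
        return host

    def end_instruction(self):
        """Unlock everything; the mappings themselves are kept."""
        self.locked.clear()

    def _find_free(self):
        for host in self.host_regs:
            if host not in self.mapped:
                return host
        return None

    def _release_one(self):
        count = len(self.host_regs)
        for step in range(count):
            index = (self.next_victim + step) % count
            host = self.host_regs[index]
            if host not in self.locked:
                # Continue after this register next time, so the same
                # register is not released over and over again.
                self.next_victim = (index + 1) % count
                del self.mapped[host]
                return host
        raise RuntimeError("JIT compiler ran out of free temporary registers")
```

With two host registers and four 68k registers mapped one per instruction, the third mapping reuses the first host register and the fourth one reuses the second, instead of hammering the same register every time.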

This system looks robust enough to handle the needs of the compiling.

Back to the lies… errr… benchmarks

As I already mentioned: in the implementation I reached the point when almost all instructions are implemented that are needed by the Mandelbrot test. So let’s see how the compiled code performs.

The test system consists of a Micro AmigaOne (G3/800 MHz), a slightly outdated AmigaOS4 version and my beloved Galaxy S2, which acts as a stopwatch. (Yes, I measure the time by using a stopwatch. I didn’t want to spend too much effort on figuring out how to measure the time more precisely.)

Running the Mandelbrot test, I got the following results:
 
Interpretive: ~6 seconds,
JIT compiled: ~3 seconds.


Not too precise, but it seems the JIT compiler is doing nicely: the compiled code finished twice as fast as the interpretive.

To confirm the results I created a new version of the Mandelbrot test that needs a lot more processor power to complete. You can find it among the test kickstarts, it is called: mandel_though_hw.kick. The code is essentially the same as the previous Mandelbrot test, but I tweaked the parameters a bit: zoomed in and adjusted the drop out threshold, so the picture is more detailed.

Honey! The Mandelbrot is almost done in the oven!
The results are:

Interpretive: 108 seconds,
JIT compiled: 52 seconds.


Yay! It is indeed twice as fast! Rejoice!

Disclaimer

Why we must not jump to any conclusions just yet regarding the speed...

First of all, not all instructions are implemented yet. Implementing them most likely means more speed increase, because the JIT instructions are always faster than, or at least as fast as, the interpretive implementation. When the emulation hits an unimplemented instruction, it has to call the interpretive implementation, which essentially means storing all the emulated registers in memory and restarting the register mapping.

Also, the register flow optimization is not implemented yet, which will certainly be a big boost to the speed: it will eliminate lots of instructions from the compiled code.

On the other hand, this test relies more on the rather simple, mostly processing-demanding instructions. When the code under test does much more memory access, the difference between the interpretive and the JIT compiled will almost certainly be much smaller.

Anyway, I hope I assured all of you that we are heading in the right direction.

Full speed ahead! (Don’t mind the mines.)

Monday, June 4, 2012

God Save the Queen

Thank you, dear Queen Elizabeth II, that Your Majesty has a birthday, just like we puny humans do. Even if it is not actually on this day, we got a day off to celebrate in New Zealand too. (Well, logic is not a strong point of human society.)

So, I decided to celebrate Her Majesty's birthday by doing some coding.

The update is a little bit short and not too interesting, just a couple more instructions implemented:
  • Implementation of
    ADDQ.x #imm,Ay,
    ADDA.x #immq,Ay,
    ADDQ.W, ADDQ.B,
    OR.x reg,reg,
    MOVEA.x mem,mem,
    MOVEA.x mem,Ay,
    MOVE.x mem,Dy,
    SWAP instructions.
  • Added support for saving registers to the stack frame temporarily.
  • Fix for SWAP (and other not yet implemented instructions) with one register parameter only: use destination instead of source addressing mode.
  • Added missing object to the configure script for the PowerPC disassembler and moved the header file for it to the includes folder where it belongs.
Funny story, though... I thought I had found a bug in Petunia while I implemented the moving and adding instructions where the target is an address register. Then I realized that address registers are always treated as longword sized, no matter what the size of the operation is.
But I already knew that and implemented it properly in Petunia. I just forgot about it, because it was 11 years ago! :)
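In other words: a word-sized source is sign-extended to 32 bits first, and then the operation affects the whole address register. A minimal Python sketch of that rule (the function names are mine, for illustration only):

```python
def sign_extend_16(value):
    """Interpret the low 16 bits of value as a signed word."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def movea_w(source):
    """MOVEA.W: the word source is sign-extended, all 32 bits are written."""
    return sign_extend_16(source) & 0xFFFFFFFF


def adda_w(areg, source):
    """ADDA.W: the sign-extended word is added to the full 32-bit register."""
    return (areg + sign_extend_16(source)) & 0xFFFFFFFF
```

So MOVEA.W #$8000,A0 leaves $FFFF8000 in A0, not $00008000.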

Tuesday, May 15, 2012

Peeking behind the curtain


What's new

I have just uploaded a small update again to the Subversion repository. In this update you can find the following changes:
  • Implementation of all forms of the following addressing modes, both as source and as destination:
    • immediate long, word, byte;
    • indirect addressing: (Ax);
    • indirect addressing with pre-decrement: -(Ax);
    • indirect addressing with post-increment: (Ax)+;
  • Implementation of instructions:
    • MOVE.x #imm,Dy;
    • MOVEA.x #imm,Ay;
    • MOVEQ #imm,Rx;
    • MOVE.x Rx,Ry
  • Code was refactored to avoid repetition of the same code chunks in the addressing modes and instructions.
  • Implementation of dumping the compiled code to the console.
  • New configuration (comp_log_compiled) is also added to turn it on/off.

Featuring...

It is always exciting to find out some magic details about the internal behavior of a complex system. I remember when I realized for the first time how the texts are stored in the Commodore Plus/4 games, back in the days. I was amazed how I can change the text that showed up on the bottom scroll of Tycoon Tex.

Let me offer some small excitement to you all: I just implemented a funny little feature in the E-UAE JIT engine. Now we can turn on dumping the compiled code to the console, together with the original Motorola 68k instruction that was compiled and the macroblocks that describe the intermediate translation form.

The purpose of this feature wasn’t (purely) the entertainment, but I was really fed up with the situation that the generated code cannot be debugged properly. Previously, I added a trap instruction (tw) into the translated code at some point, so I was able to have a look at the output from the Grim Reaper window (which was awesome, but let’s not mix it up with actual debugging).
Too bad that GDB is so limited: it cannot debug into any code segment that wasn’t loaded by DOS (like generated code). Not to mention how cumbersome the console interface is... (Or am I missing something? Enlighten me please.)

I would like to thank Frank Wille for the sources of the PowerPC disassembler that makes it possible to list the translated code.

How to turn on this feature: there are two settings that control the logging. These are:
  • comp_log – if it was set to “true” or “yes” then the JIT logging is turned on and dumped to the standard output.
  • comp_log_compiled – if it was set to “true” or “yes” then the compiled code is listed through the JIT logs.

Let’s see a small demonstration of this feature, shall we? (Not for the faint-hearted!)

The following list is the output from the very simple test code: iamalive.asm, slightly edited and formatted for educational purposes...
  1. M68k: ADD.L #$00000001,D1
    1. Mblk: load_memory_long
      Dism: lwz r15,64(r14)
      Mblk: load_memory_long
      Dism: lwz r3,68(r14)
      Mblk: rotate_and_copy_bits
      Dism: rlwimi r15,r3,16,26,26
    2. Mblk: load_memory_long
      Dism: lwz r3,4(r14)
    3. Mblk: load_register_long
      Dism: li r4,1
    4. Mblk: add_with_flags
      Dism: addco. r3,r3,r4
    5. Mblk: copy_nzcv_flags_to_register
      Dism: mcrxr cr2
      Dism: mfcr r15
      Mblk: rotate_and_copy_bits
      Dism: rlwimi r15,r15,16,26,26
  2. M68k: MOVE.W D1,(A0,$0180) == $00dff180
    1. Mblk: load_memory_long
      Dism: lwz r4,32(r14)
    2. Mblk: add_register_imm
      Dism: addi r5,r4,384
    3. Mblk: check_word_register
      Dism: extsh. r0,r3
      Mblk: copy_nz_flags_to_register
      Dism: mfcr r6
      Mblk: rotate_and_copy_bits
      Dism: rlwimi r15,r6,0,0,2
      Mblk: rotate_and_mask_bits
      Dism: rlwinm r15,r15,0,11,8
    4. Mblk: save_memory_long
      Dism: stw r3,4(r14)
    5. Mblk: save_memory_spec
      Dism: mr r4,r3
      Dism: mr r3,r5
      Dism: rlwinm r0,r3,18,14,29
      Dism: lis r5,27315
      Dism: ori r5,r5,23016
      Dism: lwzx r5,r5,r0
      Dism: lwz r5,16(r5)
      Dism: mtlr r5
      Dism: blrl
  3. M68k: BT.B #$fffffff8 == 0000001a (TRUE)
    1. Mblk: save_memory_long
      Dism: stw r15,64(r14)
      Mblk: save_memory_word
      Dism: sth r15,68(r14)
    2. Mblk: load_register_long
      Dism: lis r3,27606
      Dism: ori r3,r3,45096
      Mblk: save_memory_long
      Dism: stw r3,76(r14)
    3. Mblk: opcode_unsupported
      Dism: li r3,24824
      Dism: lis r4,27315
      Dism: ori r4,r4,21752
      Dism: bl 0x7f91acc0

  4. Done compiling
Colorful, isn't it? :)

Okay, let's try to understand what is going on.

I marked the three Motorola 68k instructions that were compiled here with orange color; the code roughly does this:

1. Increase register D1 by one;
2. Put the content of register D1 to the address that is calculated by using register A0 plus an offset of 0x180 (A0 was initialized previously with the value 0xDFF000, which is the base of the custom chipset memory area) - in layman's terms: load it to the background color.
3. Go back to step 1.

Now, let's see the second level of the list:

First of all the prefix "Mblk:" marks the macroblocks (white), "Dism:" is the actual PowerPC code (yellow).
As I already mentioned earlier: some macroblocks can be optimized away (although it is not implemented yet), and a macroblock means at least one PowerPC instruction, but it can be a series of instructions also.

The steps can be interpreted roughly as:

1.1. Load the arithmetic flags from the memory where the interpretive emulator stores them.
1.2. Load the previous content of the emulated D1 register into a PPC register.
1.3. Load the constant for the add instruction (one) into a PPC register.
1.4. Add the second register to the first one (increase D1 by one).
1.5. Save the arithmetic flags after the operation.

2.1. Load the previous content of the emulated A0 register into a PPC register.
2.2. Add the offset (0x180) to the content of A0 and load it into a new PPC register.
2.3. Check the content of the emulated D1 register to set up the arithmetic flags according to it.
2.4. Save back the modified D1 register to the memory for the interpretive emulator.
2.5. Calculate the offset and load the function address for the memory write operation handler and call it (namely the custom chipset write handler). This is a function from the interpretive emulation and it was written in C, therefore we must store all volatile registers back to memory, because the C code won't preserve them. (This is why we stored the D1 register in step 2.4.)

3.1. Save the arithmetic flags back to memory where the interpretive emulator stores them. (These were kept in a non-volatile register, so these were preserved while we called the helper function in step 2.5.)
3.2. Update the emulated PC register to the current state for the following instructions.
3.3. Call the interpretive emulation for the branch instruction (because it is not implemented yet, so we reuse the interpretive implementation).

4. Done. Phew.
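The rlwinm and rlwimi instructions show up all over the listing, so here is a small Python model of them for those who want to follow the bit juggling. It is a sketch of the instruction semantics only; the helper names are mine. The reading of step 2.5 below it is an interpretation of the listing: it assumes that the emulated address space is split into 64 KB banks and that the table of bank pointers holds 4-byte entries.

```python
MASK32 = 0xFFFFFFFF


def rotl32(value, shift):
    """Rotate a 32-bit value left by shift bits."""
    shift &= 31
    return ((value << shift) | (value >> (32 - shift))) & MASK32


def ppc_mask(mb, me):
    """Ones from bit mb to bit me in IBM numbering (bit 0 is the MSB).

    When mb is greater than me the mask wraps around.
    """
    head = MASK32 >> mb
    tail = (MASK32 << (31 - me)) & MASK32
    return head & tail if mb <= me else head | tail


def rlwinm(rs, sh, mb, me):
    """Rotate left word immediate, then AND with mask."""
    return rotl32(rs, sh) & ppc_mask(mb, me)


def rlwimi(ra, rs, sh, mb, me):
    """Rotate left word immediate, then insert under mask into ra."""
    mask = ppc_mask(mb, me)
    return (rotl32(rs, sh) & mask) | (ra & ~mask & MASK32)


def bank_index(address):
    """Index of the 64 KB bank that an emulated address falls into."""
    return (address & MASK32) >> 16


# "rlwinm r0,r3,18,14,29" from step 2.5, applied to the background color
# register address: the result is the bank index multiplied by four.
offset = rlwinm(0x00DFF180, 18, 14, 29)
```

Under these assumptions offset is the byte offset of the bank's entry in the pointer table, which the following lwzx uses to fetch the bank, and the handler function is then loaded from that bank structure and called.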

Funny, eh? :)
If you are not familiar with assembly then don't stretch yourself too much by trying to understand this techno-blahblah.

For the rest: who can spot what can be optimized on the compiled code?

Tuesday, May 1, 2012

Deep-diving into memory

Since I just bought a new SSD and I am about to reinstall my laptop, it is time to do another update on the project: another batch is uploaded to the SourceForge repository. (I hope I won’t lose the sources… ;)

So, what made it into this update? A quick list of what happened in the last two weeks:

  • Direct memory writing support
  • Detecting of special memory accesses per instruction and calling the chipset emulation on memory writes from the compiled code
  • Implementation of some instructions:
    • ADDQ.L;
    • MOVE.x register to memory;
    • MOVE.x register to register;
  • Implementation of some addressing modes:
    • immediate quick, like: ADDQ.L #x,reg;
    • indirect addressing with post increment, like: MOVE.x reg,(Ay)+;
  • Misc fixed stuff:
    • proper logging for JIT (with a new configuration item: comp_log);
    • slightly reworked temporary register handling;
    • better handling of reloading/saving of flags and the base register;
    • reloading of the emulated PC before leaving the compiled code is added (probably this was responsible for the reboot-loop at booting the OS, see below).

The most important change was the handling of the memory writes. When I started to plan the project this was one of my concerns: how to figure out which memory access requires special handling and which one can hit the emulated memory directly. With this issue overcome, all three problems are resolved:

  • memory accesses are emulated as it is needed;
  • self-modifying code is handled by using the processor cache;
  • translated code lookup is managed by simulating the cache lines.
Luckily, the x86 JIT implementation had already solved this issue (too): every instruction is first executed by the interpretive emulator a couple of times, to prevent unnecessary translation of code pieces that won’t be executed often (like initialization/cleanup code). While doing that, it collects some information about the executed code that can be reused for the translation. One of these pieces of information is the memory access type. Very clever solution, I must admit.
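
To make the idea a bit more tangible, here is a minimal sketch in C of how such profiling could feed the translator. This is not the actual UAE code: every name in it (InsnProfile, profile_write, decide), the threshold of three runs and the address range check are my assumptions for the sake of illustration. Only the 0xDFF180 address is real: it is a custom chipset register (the first color register).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* All names and constants below are hypothetical, not taken from UAE. */
#define TRANSLATE_THRESHOLD 3u        /* assumed number of interpreted runs */
#define CHIPSET_BASE 0x00DFF000u      /* start of the custom chipset registers */
#define CHIPSET_END  0x00DFF200u      /* assumed end of the "special" range */

typedef struct {
    unsigned run_count;      /* how many times the interpreter executed it */
    bool special_access;     /* did any write hit non-plain memory? */
} InsnProfile;

typedef enum {
    KEEP_INTERPRETING,       /* not executed often enough yet */
    EMIT_DIRECT_WRITE,       /* compiled code may store straight into memory */
    EMIT_HANDLER_CALL        /* compiled code must call the C write handler */
} Decision;

bool is_special_address(uint32_t addr) {
    return addr >= CHIPSET_BASE && addr < CHIPSET_END;
}

/* Called by the interpreter each time it executes a memory-writing instruction. */
void profile_write(InsnProfile *p, uint32_t addr) {
    assert(p != NULL);
    p->run_count++;
    if (is_special_address(addr))
        p->special_access = true;   /* sticky: one special hit is enough */
}

/* Called by the translator when it considers compiling the instruction. */
Decision decide(const InsnProfile *p) {
    assert(p != NULL);
    if (p->run_count < TRANSLATE_THRESHOLD)
        return KEEP_INTERPRETING;
    return p->special_access ? EMIT_HANDLER_CALL : EMIT_DIRECT_WRITE;
}
```

With this sketch, an instruction that always wrote to plain RAM ends up with EMIT_DIRECT_WRITE after three interpreted runs, while one that wrote to 0xDFF180 ends up with EMIT_HANDLER_CALL. The flag is deliberately sticky, because a single special access is enough to make the direct write unsafe.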

I am still chasing the ghost-bug that prevents the OS from booting. The recent fixes for the PC register (program counter) reloading changed the behaviour. Instead of running around in a reboot-loop, the OS is waiting for something to happen, most likely some hardware to trigger something. So, it is still no good, but looks a bit better.

Where to go from here: more addressing modes and instructions will be implemented before I start working on the implementation of the optimizing routines. First, I want to make sure that most of the Mandelbrot test is executed using compiled code, so that it will be easier to test the optimization and we can finally see how much better the JIT performs than the interpretive emulation. Exciting!

You have probably noticed that the state of the project on SourceForge is still pre-alpha. I don’t want to change it while the OS cannot be run on the emulation, so this is the other important goal.

So much for today. While you are enjoying the sunny, warm summer, don’t forget about us who are suffering from the cold, rainy days: Winter is coming… :/

Monday, April 16, 2012

Foundations of the House

To show some progress I checked in a massive changeset into the SourceForge SVN that represents the implementation of the following features:
  • Macroblock collection and handling
  • Code generation from the macroblock buffer
  • Macroblocks for many basic high level instructions that are used for the code translation intermediate representation
  • Temporary register allocation/freeing, flushing
  • Emulated 68k register mapping to temporary registers, automatic loading/saving from/to the interpretive emulator Regs structure
  • Compiling of pre- and post-code for the translated block
  • Temporary PPC register, emulated M68k register and flag flow-tracking for macroblocks
  • Basic functions for flag emulation implementation
Sounds like quite a big chunk of work, and to tell you the truth, it really was. So far it seems my initial plans are working: the macroblocks are the implementation of the Microcode-VLIW-like approach I explained previously.
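
For the curious, the two-phase idea (collect macroblocks first, generate code from the buffer afterwards) can be sketched in C roughly like this. It is only an illustration, not the real implementation: the structure layout, the function names, the use of r14 as the base register for the Regs structure and the 4-byte register slots are all assumptions. The PowerPC encodings themselves (lwz, addi, stw in D-form) are the real ones.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names throughout; not the project's real data layout. */
#define MAX_MACROBLOCKS 64
#define REGS_BASE_PPC_REG 14u   /* assumption: r14 points to the Regs structure */

typedef enum { MB_LOAD_REG, MB_ADD_IMM, MB_SAVE_REG } MacroblockKind;

typedef struct {
    MacroblockKind kind;
    unsigned host_reg;   /* temporary PPC register (must not be r0 for addi) */
    unsigned m68k_reg;   /* emulated data register index, used by load/save */
    int16_t imm;         /* immediate operand, used by MB_ADD_IMM */
} Macroblock;

typedef struct {
    Macroblock blocks[MAX_MACROBLOCKS];
    size_t count;
} MacroblockBuffer;

/* Phase 1: collect a macroblock. Returns 0 on success, -1 if the buffer is full. */
int mb_push(MacroblockBuffer *buf, Macroblock mb) {
    assert(buf != NULL);
    if (buf->count >= MAX_MACROBLOCKS)
        return -1;
    buf->blocks[buf->count++] = mb;
    return 0;
}

/* Encode a PowerPC D-form instruction: opcode, RT/RS, RA, 16-bit displacement. */
uint32_t ppc_dform(uint32_t opcode, unsigned rt, unsigned ra, int16_t d) {
    return (opcode << 26) | ((uint32_t)rt << 21) | ((uint32_t)ra << 16) | (uint16_t)d;
}

/* Phase 2: walk the buffer and emit one PPC instruction word per macroblock.
   Returns the number of words written (stops early when out is full). */
size_t mb_generate(const MacroblockBuffer *buf, uint32_t *out, size_t out_cap) {
    assert(buf != NULL && out != NULL);
    size_t n = 0;
    for (size_t i = 0; i < buf->count && n < out_cap; i++) {
        const Macroblock *mb = &buf->blocks[i];
        int16_t offset = (int16_t)(mb->m68k_reg * 4u);  /* assumed slot size */
        switch (mb->kind) {
        case MB_LOAD_REG:   /* lwz host, offset(base) */
            out[n++] = ppc_dform(32u, mb->host_reg, REGS_BASE_PPC_REG, offset);
            break;
        case MB_ADD_IMM:    /* addi host, host, imm */
            out[n++] = ppc_dform(14u, mb->host_reg, mb->host_reg, mb->imm);
            break;
        case MB_SAVE_REG:   /* stw host, offset(base) */
            out[n++] = ppc_dform(36u, mb->host_reg, REGS_BASE_PPC_REG, offset);
            break;
        }
    }
    return n;
}
```

Pushing a load of D0 into r3, an add of 1 and a save back to D0, then calling mb_generate, yields the three words for lwz r3,0(r14), addi r3,r3,1 and stw r3,0(r14). Keeping the macroblocks in a buffer before emitting anything is what leaves room for the flow-tracking and later optimization passes to look at the whole block first.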

What is still missing for your happiness: there is no addressing mode and instruction implementation yet. So, basically this is still not too useful for the average users, thus there is no binary release yet.

But all of these are the foundations of the house and now I can go on with building the walls, which will be much more visible: addressing modes and instructions.

Unfortunately, I wasn't able to find the bug that blocks the Kickstart from running. Maybe another Kickstart file might help, we will see.

In the meantime, enjoy this beautiful diagram of the (oversimplified) method of compiling in this implementation:

Code translation flow diagram

Sunday, February 26, 2012

Test Kickstarts are kickin'

Finally, I managed to publish the fully working build environment for creating Kickstart files. You can find it in the project's SourceForge Subversion repository.

If you are looking for the binaries for these tests then check out the downloads.

If you would like to fool around with these files, please make sure that you first read the post about them here and the README files in the folders.

Some more tests are coming as soon as I have time to clean up the sources.

Tuesday, February 21, 2012

A year later...

Sooo... Here we are. *nervous giggle* The year just passed and probably you have guessed already that the project is not finished yet.

But worry not: things are moving forward (or rather crawling slowly). Well, this is the fate of the hobby projects, I guess.

About the recent state:
The environment for the compiling and the callback mechanism for the interpretive instruction emulation are close to being ready. Unfortunately, there is some glitch in the cache handling mechanism that prevents Kickstart from running under JIT compiled emulation. I have spent a lot of time trying to find out why it is not working, but I have not been able to find the issue yet.
On the other side: my Kickstart replacement test codes are working properly, so it is most likely just a matter of time.
The addressing modes and the instructions are not finished yet, but hopefully that won't take very long.

But let's not show up at the birthday party empty-handed: I have created a public SourceForge project, so now you can get the source code from there and try compiling it, if you don't have anything better to do. (I mean: really nothing better at all. Like combing the cat, folding paper unicorns or developing an iPhone application.)
I left out the half-baked instruction emulation code for now, so this published version is for compiling callback code to the interpretive emulation functions only. Let me know if you find out why the cache handling is not working... ;)

You can find the SourceForge project here:

From now on, as soon as something is mostly ready I am going to commit it into the Subversion repository. Later on, the AmigaOS4 binaries will be uploaded to the project site (or binaries for other OSes if volunteers do the compiling).

Thank you for your patience...

Before I forget, you need to change three things in your UAE config:
  • change the processor type to 68020: cpu_type=68020
  • specify some code cache size for the compiling (in KB): cachesize=8192
  • turn off constant jump direct compiling: comp_constjump=no