The Big E-UAE JIT blog: macroblock

Showing posts with label macroblock. Show all posts

Wednesday, May 21, 2014

PPCJITBETA03 (Switch to Ludicrous Speed)

Welcome back, long time no see, my friend. Please be seated.

I can tell you I have some good news to you again: here comes Beta #3!

Get it here if you must:

https://sourceforge.net/projects/euaeppcjit/files/0.8.29-PPCJITBETA03

I held this release back for a while just to fix the emulated cache checksum feature, but I have been chasing a bug for two weeks without any success. So, that fix is postponed to the next beta, in the meanwhile you can enjoy some significant speed enhancement and increased stability.

Without going into the details regarding the changes (see the included README for all changes) I would like to mention the most important change:

Vroom-vroom

The major feature of this beta release is the register and flag optimization fix. You can turn it on in the configuration, just set comp_optimize to true.

If you are interested in the details I explained it already how the optimization works in an earlier post, but in case you are too lazy to read through that post: here is my old diagram (just because it is beautiful, you know):

Code translation flow diagram

Let me summarize it for you: the JIT compiler is collecting information about data-flow dependencies between the various macroblocks and tries to remove the ones which won't have any effect on the outcome of a certain block of macroblocks.
This is not a new feature in the JIT implementation, but previously a few (tons of) bugs prevented it from working on more complex codes than my Mandelbrot test.

In this release I have fixed every issue I have found so far with the optimization and it seems working quite nicely. You can boot the AmigaOS and it runs just fine, also games and demos will benefit from this feature too.
I was planing to do a comparison video where the speedup is clearly shown, but I haven't had too much time yet, so this is your job now, dear EUAEPPCJIT fans! Just post the links to the videos into the comments here. :)

PPC970 aka G5

Not everything is sunshine and happiness, though. Supporting G5 processor architecture target turned to be much more complicated than I thought, especially because I don't have any hardware to test on.

In the previous release the MacOSX G5 binary was not working properly on G5 (neither on any other PowerPC as matter of fact). Thanks to Luigi Burdo for the report and Tobias Netzel (again) for the help with the compiler. This is fixed in this release, hopefully. (Fingers crossed, I still don't have hardware to test on.)

While the situation with the MorphOS G5 version is not that hunky-dory: as it seems there is no official compiler with G5 support yet in the MorphOS SDK and it is rather complicated and unreliable to compile any source for that processor. Until this situation is improved the G5 version for MorphOS won't be available from the beta binaries.

However, nothing stop you from compiling your own version from the sources, as these are always available at SourceForge.

Upcoming

As I mentioned: I postponed the fix for the block checksum to the next release and also picked up some things to do. You can find the planned list here:

https://sourceforge.net/p/euaeppcjit/tickets/milestone/PPCJITBETA04

I also had a look on what is planned for the first stable release and moved some items around the various milestones. If you are curious just click at the milestones on the Sourceforge page.

Monday, November 4, 2013

On a collision course

So, where are we at with the recent update? 349 out of 387: no, it is not 100% yet, but a nice, fat 90.1%!

Quickly, what was done this time and let's jump to conclusions right after that:

Implementation of BFINS Dx,Da{y,z}, BFFFO Dx{y,z},Da, BFEXTS Dx{y,z},Da, BFEXTU Dx{y,z},Da, BFCLR Dx{y,z}, BFTST Dx{y,z}, BFSET Dx{y,z}, BFCHG Dx{y,z}, ROXR.x Dy,Dz, ROXR.W mem, ROXR.x #imm,Dy, ROXL.x Dy,Dz, ROXL.W mem, ROXL.x #imm,Dy, ASR.W mem, ASL.W mem, LSR.W mem, LSL.W mem, ROR.W mem, ROL.W mem instructions.
Opcode compiler fucntions for different operation sizes are unified, unnecessary helper functions were removed.
Macroblock protos are generated from the opcode table file rather than used the manually prepared file.
Fixed wrong function name for the CNTLZW PowerPC instruction emitter.
Fixed missing input register in C and X flag extraction macroblock which is used in some shift instructions.
Fixed register dependency in ASR.x Dy,Dz and LSR.x Dy,Dz instructions.

(Slightly unrelated: I have found a bug in the original E-UAE code for BFINS and flags, if I will have enough patience I will fix that. And another one in Petunia, that will be fixed too...)

Now, we are getting dangerously close to the beta stage. It is really hard not to give you guys any promises what I cannot keep later on... (Summer is coming, you know... ;)

Anyway, my plans for the near future (read: when it is done) are:

After I have finished with the implementation of all instructions which were anticipated earlier I am going to stabilize the emulator a bit and clean up the code if needed.
I will try to release a compiled beta version and put together some documentation for using/testing it.
There are some outstanding issues, one of the most critical one is fixing the problems with the macroblock optimizer.

So much for the crystal ball and now...

Something completely different

I have to admit that my posts were not that interesting to read recently. Earlier I invested more effort into the posts and it was probably more fun to read, especially to the developers.
I would like to bring back that tradition, so for this update I came up with a few interesting thoughts on:

How to avoid the decisions

I know that many developers love to procrastinate anything, including (but not limited to) decisions. But what I am about to write is not how lazy my fellow developers are, but rather how to get around a situation when it is not ideal to do a comparison and branch according to the result.

Why would that be important at all?
There are a number of reasons why it is better to avoid branching, this article lists many examples for that and also explains the reasons. It basically boils down to the following reasons:

branching can cause cache misses;
the branching instruction is a useless overhead and it can be even slow on certain architectures;
conditional branching disrupts the pipelining of the execution (although there are pretty sophisticated techniques available to deal with that).

In our case none of these reasons apply, but unfortunately branching is not really possible due to the static flow analysis on the macroblock list.

To put it simply: each macroblock is depending on the results of the previously executed macroblocks, if we skip ahead then it is impossible to tell whether those required macroblocks were executed or not before the dependent macroblocks.

Why does this cause any trouble? Because sometimes it is pretty hard to avoid conditions inside the instruction implementations.

I already faced this issue earlier, when I started to work on the conditional branching and setting instructions (Bcc and Scc). There was no possible way to avoid the conditional branching for these instructions due to its very nature of the instructions.
I ended up creating a very specific macroblock which embeds the condition checking and branching, so the "inter-macroblock" flow analysis remains intact.

In the recent update you can find many bit field instructions which were complicated enough already and then the very same issue came up again:

For bit field instructions 0 (zero) bit field width means 32 bit actually. This doesn't sound that bad, however sometimes the bit field width is coming from an emulated data register instead of a statically encoded constant. In this case the decision must be done in the emulated code instead of in compiling time.

Not in every case, but quite often it is possible to calculate the result instead of making a comparison and branching.

For the bit field instructions the solution was (written in C code, just because it is easier to understand):

temp = width - 1;
width = width | (temp & 32);

(There is one condition, though: the width must be between 0 (zero) and 31 before this operation starts.)

Now, think about it a little bit and figure out on your own why does it work. I am not going to explain it. :)

See you soon.

Wednesday, October 2, 2013

Back on track

Finally, I am back on track with the development after the house move. There were two updates since the last post; one was a follow-up to the previous code refactoring; the other was a massive list of improvements:

Implementation of NEGX.x mem, NEGX.x Dy, SUBX.x -(Ay),-(Az), ADDX.x -(Ay),-(Az), MOVE CCR,mem, MOVE CCR,Dx, EOR #imm,CCR, OR #imm,CCR, AND #imm,CCR, RTR, MOVE mem,CCR, MOVE Dx,CCR, MOVE #imm,CCR, ASR.x Dy,Dz, RTD, MULU.W mem,Dx, MULU.W #imm,Dx, MULS.W mem,Dx, MULS.W #imm,Dx, SUBX.x Dy,Dz, ADDX.x Dy,Dz, MOVEM.x regs,mem, MOVEM.x mem,regs, MOVEM.x (Ay)+,regs, CMPM.x (Ax)+,(Ay)+ instructions.
Added dependency tracking for non-volatile PowerPC registers.
Fixed X flag handling in register-based shifting instructions, previously the X flag was cleared together with the C flag if the shift steps were zero.
Removed RTM from the list of the potentially supported opcodes.
Added RTR back to the list of the potentially supported opcodes.
Optimized temporary register usage in MULU.W Dx,Dy and MULS.W Dx,Dy instructions.
Introduced tracking of the extension words after the instructions, it is needed for adjusting the PC before certain addressing modes are processed.
Fixed register dependency and order of register storage for MOVEM.x regs,-(Ay) when direct memory access is enabled.
Implemented stack-like concept for register saving into the context.
Code cleanup: removed unused reference, fixed some warnings regarding misformatted and unused code lines.

So, things are getting together slowly. The number of the implemented instructions went up to 321 out of the planned 387! That is roughly 82%... Getting closer and closer... :)

Recently I faced an interesting problem, I am not quite sure how can I solve: division by zero. My beloved math teacher already told me: who is trying to divide by zero is an idiot. (Well, that is not quite right, as we know it.) Yet, some programs might try it.
Why is that a problem? Because it triggers an exception inside a compiled block. It also needs branching (skip the exception triggering if the divisor is not null, for a change) which contradicts the macroblock register flow tracking. Well, here is the challenge, but I am pretty sure I will solve it somehow.

'Til then the usual: watch this space.

Sunday, August 18, 2013

Moving stuff around

Long time no see! No wonder: we just moved to a new house, which usually means lots of things to deal with and boxes wherever I look. (I hate this soo much! I badly need my daily routine, but how can I find my stuff?!) Finally, we are getting settled again and I was able to dig up the good ol' Amiga again under the pile of boxes. And this new house is just awesome, so it worth all the pain we went through.

Interesting coincidence, but right before we started the whole move house craziness (seems like ages ago, BTW) I decided to refactor some code in E-UAE JIT. Namely the handling of the temporary registers.
It was such a bad design, or rather no design at all: I used the number of the allocated temporary registers to index a couple arrays with the relevant data in it. Can you believe this? In 2013? This was unacceptable even in the '70s. There are so many reasons why you shouldn't do that.
Also what is the point of using a strongly typed language, if we don't use distinct types? In this case all the temporary registers along with the actually mapped PowerPC processor registers are passed to the functions as integers. Very easy to misuse. Bad, bad, bad.

In the recent changes I have managed to introduce the concept of a temporary register-specific type structure, which is also able to carry around the mapped PowerPC processor register number and the register dependency map for the macroblocks.

This new structure can be used when the macroblocks are collected (so mostly in the helper functions), but the code emitters for the macroblocks are dealing with the mapped PowerPC registers only. Now, I still need to solve that one: at the moment these are still passed to the functions as integers.

And what is the visible output for you guys? Nothing, if we are lucky. ;)
Yea, I know. It is always hard to justify the time to the business owners what was spent on solving technical debts. But this was a long outstanding one and I always had the itch of fixing it.

So, enjoy your Summer while it lasts and bear with me.

Thursday, August 30, 2012

Optimize It

I am back from the holidays for sometime now, but I got swamped instantly by the work in my daytime job. I had very little energy on the project in the nights, which was spent on two things: chasing that #@!% bug which is killing me and implementing the data-flow optimization.

Needless to say that I failed to track down the bug again, just like every time I spent hours on looking at couple megabytes of log dumps. Next time I will try a new approach, suggested by Stephen Fellner: reducing the code complexity while the bug can be reproduced.
We will see how it goes. The processor cache emulation is certainly complicates things, that can be eliminated at least.

While I acknowledged the failure again, I gathered all my previous thoughts and implemented the data-flow based optimization which does a really nice job.

The update is small, but highly important again:

Implementation of code optimization.
New configuration was introduced: comp_optimize to turn it on/off.
Bugs in the register input/output flag specifications are fixed.
Fixed too small string buffer in the 68k disassembler, previously the debugger crashed every time when memory was disassembled to the screen.
Implementation of MOVEM.x regs,-(Ay)

What, how and why?

As I tried to explain earlier in this post some of the compiled code is completely useless. The emulated instruction consists of the following parts usually:

Initialization of certain temporary registers;
Executing the actual operation;
Alter the arithmetic flags according to the result;
Save the result somewhere (into memory)

Some of these operations are not common or can be done for a series of instructions ahead, like loading the previous data for the emulated registers into temporary registers, but there are parts that cannot be avoided if we examine the emulated instruction out of its context.

For example the arithmetic flags are overwritten quite often by the following instructions, what means: if the next instruction(s) are not depending directly on the flag results then we can remove the code that generates these.

Typically, if we (as mighty humans) look at the instructions in a block we can easily identify the context, we can tell mostly which instructions produce usable results and what is not important. Like in most of the cases simple for a human, but it is a really complex job for the code compiler.

As I previously described: the emulated code is broken up into macroblocks by the precompiling. These blocks represent an "atomic" operation, like load data into temporary register, compare two registers or calculate certain arithmetic flags from the previous result.
When I started working on this concept I figured out that if I would be able to identify exactly for each macroblocks the previous result(s) that it is depending on and the result(s) that it produces then I can evaluate the dependencies between the subsequent macroblocks.

Going with the flow

For Petunia I already implemented a similar concept, but that was limited to the flag usage and it is not able to split up the emulation of the arithmetic flags into individual flag registers: it is either emulated completely or removed completely. (Which means that even if we needed only the C register later the rarely used X register will be calculated too, usually after the C register was already done. If you don't understand the reason behind this: don't worry - you need to know more about 68k assembly.)

For E-UAE the implementation is radically different. For any macroblock there is the possibility of specifying each and every emulated register, flag or temporary register as dependecy (input) and/or as result (output).

In the recent changes I implemented the rather simple solution for calculating the data-flow for each registers after all macroblocks are collected for a block of instructions.

It is capable of finding out for each macroblock whether the produced results are relevant for the following instructions in the block or not. If not then that macroblock can be eliminated from the compiled code because no instruction is depending on its results.

How to use it

Although, the optimization is completely safe (it won't remove any code that is essential) while the emulation is not stable enough there might be some bugs. So, I introduced a new configuration for turning it on/off, called: comp_optimize.
It replaces the comp_nf configuration from the x86 JIT implementation, because it is not just about the flags (nf = no flags).

If it was set to true then the data-flow calculation is done and some macroblocks will be removed. By setting it to false the emulation compiles all the instructions fully into the buffer.

The results

And finally some speed tests... Compared to the previously published Mandelbrot test results the actual numbers are:

Interpretive: 108 seconds;
JIT compiled without optimization: 52 seconds;
JIT compiled with optimization: 32 seconds.

That is roughly 40% speed increase in the case of this (heavily arithmetic) test.

The test system was: Micro AmigaOne (G3/800 MHz) - let's compare it to WinUAE that is running on my laptop (Intel Core i3 M350/2.27 GHz):

JIT compiled with all the possible optimizations turned on: 9 seconds.

It would be a tough job to compare these two computers, but I am pretty sure that my laptop is more than 3.5x faster than that poor old G3 machine.

I am really content with these results for now.

Tuesday, May 15, 2012

Peeking behind the curtain

What's new

I have just uploaded a small update again to the Subversion repository. In this update you can find the following changes:

Implementation of all form of the following addressing modes for both as source and as destination addressing:

immediate long, word, byte;
indirect addressing: (Ax);
indirect addressing with pre-decrement: -(Ax);
indirect addressing with post-increment: (Ax)+;

Implementation of instructions:

MOVE.x #imm,Dy;
MOVEA.x #imm,Ay;
MOVEQ #imm,Rx;
MOVE.x Rx,Ry

Code was refactored to avoid repetition of the same code chunks in the addressing modes and instructions.
Implementation dumping the compiled code to the console.
New configuration (comp_log_compiled) is also added to turn it on/off.

Featuring...

It is always exciting to find out some magic details about the internal behavior of a complex system. I remember when I realized for the first time how the texts are stored in the Commodore Plus/4 games, back in the days. I was amazed how I can change the text that showed up on the bottom scroll of Tycoon Tex.

Let me offer some small excitement to you all: I just implemented a funny little feature in the E-UAE JIT engine. Now we can turn on dumping the compiled code to the console, together with the original Motorola 68k instruction that was compiled and the macroblocks that describe the intermediate translation form.

The purpose of this feature wasn’t (purely) the entertainment, but I was really fed up with the situation that the generated code cannot be debugged properly. Previously, I added a trap instruction (tw) into the translated code at some point, so I was able to have a look on the output from the Grim Reaper window (which was awesome, but let’s not mix it up with actual debugging).

Too bad that GDB is so limited: it cannot debug into any code segment that wasn’t loaded by DOS (like generated code). Not to mention how cumbersome the console interface is... (Or am I missing something? Enlight me please.)

I would like to thank Frank Wille the sources for the PowerPC disassembler that makes it possible to list the translated code.

How to turn on this feature: there are two settings that control the logging. These are:

comp_log – if it was set to “true” or “yes” then the JIT logging is turned on and dumped to the standard output.
comp_log_compiled – if it was set to “true” or “yes” then the compiled code is listed through the JIT logs.

Let’s see a small demonstration of this feature, shall we? (Not for the faint-hearted!)

The following list is the output from the very simple test code: iamalive.asm, slightly edited and formatted for educational purposes...

M68k: ADD.L #$00000001,D1

Mblk: load_memory_long
Dism: lwz r15,64(r14)
Mblk: load_memory_long
Dism: lwz r3,68(r14)
Mblk: rotate_and_copy_bits
Dism: rlwimi r15,r3,16,26,26
Mblk: load_memory_long
Dism: lwz r3,4(r14)
Mblk: load_register_long
Dism: li r4,1
Mblk: add_with_flags
Dism: addco. r3,r3,r4
Mblk: copy_nzcv_flags_to_register
Dism: mcrxr cr2
Dism: mfcr r15
Mblk: rotate_and_copy_bits
Dism: rlwimi r15,r15,16,26,26

M68k: MOVE.W D1,(A0,$0180) == $00dff180

Mblk: load_memory_long
Dism: lwz r4,32(r14)
Mblk: add_register_imm
Dism: addi r5,r4,384
Mblk: check_word_register
Dism: extsh. r0,r3
Mblk: copy_nz_flags_to_register
Dism: mfcr r6
Mblk: rotate_and_copy_bits
Dism: rlwimi r15,r6,0,0,2
Mblk: rotate_and_mask_bits
Dism: rlwinm r15,r15,0,11,8
Mblk: save_memory_long
Dism: stw r3,4(r14)
Mblk: save_memory_spec
Dism: mr r4,r3
Dism: mr r3,r5
Dism: rlwinm r0,r3,18,14,29
Dism: lis r5,27315
Dism: ori r5,r5,23016
Dism: lwzx r5,r5,r0
Dism: lwz r5,16(r5)
Dism: mtlr r5
Dism: blrl

M68k: BT.B #$fffffff8 == 0000001a (TRUE)

Mblk: save_memory_long
Dism: stw r15,64(r14)
Mblk: save_memory_word
Dism: sth r15,68(r14)
Mblk: load_register_long
Dism: lis r3,27606
Dism: ori r3,r3,45096
Mblk: save_memory_long
Dism: stw r3,76(r14)
Mblk: opcode_unsupported
Dism: li r3,24824
Dism: lis r4,27315
Dism: ori r4,r4,21752
Dism: bl 0x7f91acc0

Done compiling

Colorful, isn't it? :)

Okay, let's try to understand what is going on.

I marked the three Motorola 68k instruction that was compiled here with orange color, the code roughly looks like this:

1. Increase register D0 by one;

2. Put the content of register D0 to the address that is calculated by using register A0 plus offset of 0x180 (A0 was initialized previously with the value: 0xDFF000, which is the base of the custom chipset memory area) - in layman terms: load it to the background color.

3. Go back to step 1.

Now, let's see the second level of the list:

First of all the prefix "Mblk:" marks the macroblocks (white), "Dism:" is the actual PowerPC code (yellow).

As I already mentioned earlier: some macroblocks can be optimized away (although it is not implemented yet), and a macroblock means at least one PowerPC instruction, but it can be a series of instructions also.

The steps can be interpreted roughly as:

1.1. Load the arithmetic flags from the memory where the interpretive emulator stores them.

1.2. Load the previous content of the emulated D0 register into a PPC register.

1.3. Load the constant for the add instruction (one) into a PPC register.

1.4. Add the second register to the first one (increase D0 by one).

1.5. Save the arithmetic flags after the operation.

2.1. Load the previous content of the emulated A0 register into a PPC register.

2.2. Add the offset (0x180) to the content of A0 and load it into a new PPC register.

2.3. Check the content of the emulated D0 register to set up the arithmetic flags according to it.

2.4. Save back the modified D0 register to the memory for the interpretive emulator.

2.5. Calculate the offset and load the function address for the memory write operation handler and call it (namely the custom chipset write handler). This is a function from the interpretive emulation and it was written in C, therefore we must store all volatile registers back to the memory, the C code won't preserve these. (This is why we stored the D0 register in step 2.4.)

3.1. Save the arithmetic flags back to memory where the interpretive emulator stores them. (These were kept in a non-volatile register, so these were preserved while we called the helper function in step 2.5.)

3.2. Update the emulated PC register to the current state for the following instructions.

3.3. Call the interpretive emulation for the branch instruction (because it is not implemented yet, so we reuse the interpretive implementation).

4. Done. Phew.

Funny, eh? :)

If you are not familiar with assmebly then don't stretch yourself too much by trying to understand this techno-blahblah.

For the rest: who can spot what can be optimized on the compiled code?

Monday, April 16, 2012

Foundations of the House

To show some progress I checked in a massive changeset into the SourceForge SVN that represents the implementation of the following features:

Macroblock collection and handling
Code generation from the macroblock buffer
Macroblocks for many basic high level instructions that are used for the code translation intermediate representation
Temporary register allocation/freeing, flushing
Emulated 68k register mapping to temporary registers, automatic loading/saving from/to the interpretive emulator Regs structure
Compiling of pre- and post-code for the translated block
Temporary PPC register, emulated M68k register and flag flow-tracking for macroblocks
Basic functions for flag emulation implementation

Sounds quite a big chunk of work, and to tell you the truth it really was. As it seems so far my initial plans are working, the macroblocks are the implementation of the Microcode-VLIW-like approach I explained previously.

What is still missing for your happiness: there is no addressing mode and instruction implementation yet. So, basically this is still not too useful for the average users, thus there is no binary release yet.

But all of these are the foundations of the house and now I can go on with building the walls, which will be much more perceptive: addressing modes and instructions.

Unfortunately, I wasn't able to find the bug that blocks the Kickstart from running, maybe an other Kickstart file might help, we will see.

In the meanwhile enjoy this beautiful diagram of the (oversimplified) method of compiling in this implementation:

Code translation flow diagram