The Big E-UAE JIT blog: WinUAE

Monday, November 3, 2014

PPCJITBETA05 (Final Countdown)

Yes, yes, my dear friends: we are very close now! This is the final countdown indeed: the last beta before the final release of 1.0.

Are you excited? I bet you are. In the meantime, download and enjoy the new beta:

https://sourceforge.net/projects/euaeppcjit/files/0.8.29-PPCJITBETA05

To sum up what you will get with the new beta: mainly bugfixes. All the features are locked down for the release, and the ticket list for release 1.0 is practically empty. I am waiting for any bugs you find before the release.

So, please do report the bugs you find. But again: it is very important to try to verify the issue before you decide to report it. Please follow the steps documented in the README file.

Your efforts are greatly appreciated.

JIT compatibility diagnostics

I have rejected a few bug reports because the programs in question are not compatible with the current JIT compiler implementation. The reason is very simple: if a program modifies itself without flushing the instruction cache properly, then the modified code won't be recompiled and the program will misbehave. One example is Where Time Stood Still.

These programs might work on a real processor and still fail with the JIT compiler, because the cache handling is not emulated exactly as it behaves on a real processor. (Namely: the number of cache lines is much larger than on a real processor, so more code is "cached".)

Although this is not ideal, for now only a handful of programs depend on the cache size, so it won't cause too much trouble.

How could you tell that the program is compatible with the JIT compiler?

That is a very valid question. And here is the answer: there is an option for that! (At least there is now.)
I have wasted so much time chasing errors coming from this issue that I finally decided to implement a diagnostic configuration option for it. It is called:
comp_test_consistency

Usual disclaimer: read the documentation, and if you don't understand what it does then don't turn it on.

Actually, it is pretty simple: alongside every compiled block of instructions the compiler also emits a check which compares the memory content the block was compiled from to the current content. If they don't match then the emulation stops.

Basically, it can be used to verify whether a program misbehaves because it does not flush the instruction cache properly or for some other reason.

It is safe to keep it turned on, but it slows down the emulation (sometimes considerably), so use it only when it is needed.
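
A minimal sketch of the idea behind such a check, written in Python rather than the emulator's own code. The class and parameter names are made up for illustration; only the principle (keep a snapshot of the source bytes and compare before running) comes from the description above.

```python
# Hypothetical sketch: names and structure are illustrative, not E-UAE's code.

class ConsistencyError(Exception):
    """Raised when emulated memory no longer matches the compiled snapshot."""


class CompiledBlock:
    def __init__(self, memory, start, length, native):
        self.start = start
        # Snapshot of the 68k bytes this block was compiled from.
        self.snapshot = bytes(memory[start:start + length])
        self.native = native  # stand-in for the translated code

    def run(self, memory, test_consistency=False):
        if test_consistency:
            end = self.start + len(self.snapshot)
            if bytes(memory[self.start:end]) != self.snapshot:
                raise ConsistencyError(
                    "code at $%08X changed without a cache flush" % self.start)
        return self.native()


memory = bytearray(b"\x70\x01\x4e\x75")    # MOVEQ #1,D0 ; RTS
block = CompiledBlock(memory, 0, 4, native=lambda: 1)
block.run(memory, test_consistency=True)    # content matches, runs normally
memory[1] = 0x02                            # the program patches itself, no flush
```

Without the check the stale translation keeps running and silently returns the old result; with the check enabled the same call raises `ConsistencyError`, which is the "emulation stops" case.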

X1000

As I already announced in the previous post (a couple of months ago): the Amiga X1000 optimized version is available in the package. Please use only that version on the Amiga X1000; the generic build won't work properly there.

060

Due to "popular demand" (more than one request ;) I have fixed the cache handling of the emulated 68060 processor type.
Please note: the documentation of the original E-UAE states that 060 support is not implemented, and all I did was fix the cache bits to let the emulator turn on the caches, so the JIT can be activated. However, it is highly likely that there will be problems with some programs running on the 060, since other important aspects of the 060 are not implemented (like the proper stack frame).
So, watch your step while using it. (And please don't report bugs for it. kthxbai)

RuninUAE

ChrisH kindly implemented the support for the betas in RuninUAE. You can turn on the JIT emulation from the menu now. (You guys are too spoiled!) Thanks Chris!

WinUAE PPC

In case you haven't heard: WinUAE is capable of emulating PowerPC hardware through QEMU on Intel-compatible processors, running PPC apps and even AmigaOS4.

The obvious question: how much better would it be to run PowerPC programs on an actual PowerPC processor under hardware emulation? :)
Probably it would be possible to make use of the native PowerPC processor and it would be fast. Very fast.
Not to mention it would open the door for AmigaOS4 running on Macintosh iBooks, for example.

Do I have plans to implement it? No, sorry.
Thanks for asking.

Thanks

Finally, some thank-yous to the lovely people who helped me with this beta.

Big thanks go to: MickJT, Mike Blackburn, Chris Handley, Luigi Burdo, Samir Hawamdeh, Raziel and Cass.

Some guys just can't stay away as it seems. :)

See you soon.

Thursday, August 30, 2012

Optimize It

I have been back from the holidays for some time now, but I got swamped instantly by work at my day job. I had very little energy left for the project at night, and it was spent on two things: chasing that #@!% bug which is killing me, and implementing the data-flow optimization.

Needless to say, I failed to track down the bug again, just like every other time I spent hours looking at a couple of megabytes of log dumps. Next time I will try a new approach suggested by Stephen Fellner: reducing the code complexity for as long as the bug can still be reproduced.
We will see how it goes. The processor cache emulation certainly complicates things; that can be eliminated at least.

While I acknowledged the failure again, I gathered all my previous thoughts and implemented the data-flow based optimization which does a really nice job.

The update is small, but highly important again:

Implementation of code optimization.
New configuration was introduced: comp_optimize to turn it on/off.
Bugs in the register input/output flag specifications are fixed.
Fixed too small string buffer in the 68k disassembler, previously the debugger crashed every time when memory was disassembled to the screen.
Implementation of MOVEM.x regs,-(Ay)

What, how and why?

As I tried to explain earlier in this post some of the compiled code is completely useless. The emulated instruction consists of the following parts usually:

Initializing certain temporary registers;
Executing the actual operation;
Altering the arithmetic flags according to the result;
Saving the result somewhere (into memory).

Some of these steps are not always needed, or can be done once for a whole series of instructions (like loading the previous data of the emulated registers into temporary registers), but there are parts that cannot be avoided if we examine the emulated instruction out of its context.

For example, the arithmetic flags are overwritten quite often by the following instructions, which means: if the next instruction(s) do not depend directly on the flag results then we can remove the code that generates them.

Typically, if we (as mighty humans) look at the instructions in a block we can easily identify the context; we can mostly tell which instructions produce usable results and which are not important. In most cases this is simple for a human, but it is a really complex job for the compiler.

As I previously described: the emulated code is broken up into macroblocks by the precompiling. Each of these blocks represents an "atomic" operation, like loading data into a temporary register, comparing two registers or calculating certain arithmetic flags from the previous result.
When I started working on this concept I figured that if I could identify exactly, for each macroblock, the previous result(s) it depends on and the result(s) it produces, then I could evaluate the dependencies between subsequent macroblocks.

Going with the flow

For Petunia I already implemented a similar concept, but it was limited to the flag usage and it was not able to split up the emulation of the arithmetic flags into individual flag registers: the flags were either emulated completely or removed completely. (Which means that even if we needed only the C flag later, the rarely used X flag was calculated too, usually after the C flag was already done. If you don't understand the reason behind this: don't worry - you need to know more about 68k assembly.)

For E-UAE the implementation is radically different. For any macroblock it is possible to specify each and every emulated register, flag or temporary register as a dependency (input) and/or as a result (output).

In the recent changes I implemented a rather simple solution for calculating the data-flow of each register after all macroblocks are collected for a block of instructions.

It is capable of finding out for each macroblock whether the produced results are relevant for the following instructions in the block or not. If not, then that macroblock can be eliminated from the compiled code because no instruction depends on its results.
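
The elimination described above can be illustrated with a small backward pass. This is only a sketch in Python under simplifying assumptions: the `Macroblock` structure, the register names and the "everything is needed at the end of the block" rule are invented for the example and are not taken from the E-UAE sources.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Macroblock:
    name: str
    inputs: frozenset        # registers/flags this macroblock reads
    outputs: frozenset       # registers/flags this macroblock writes
    side_effect: bool = False  # e.g. a memory write: never removable


def eliminate_dead(blocks, live_out):
    """Walk backwards and keep only macroblocks whose outputs are still needed."""
    live = set(live_out)
    kept = []
    for mb in reversed(blocks):
        if mb.side_effect or (mb.outputs & live):
            live -= mb.outputs   # these values are produced here...
            live |= mb.inputs    # ...so whatever they read becomes needed
            kept.append(mb)
    kept.reverse()
    return kept


def mb(name, ins, outs):
    return Macroblock(name, frozenset(ins), frozenset(outs))


# ADD.L D1,D0 followed by MOVE.L D2,D3, with the flags split into macroblocks.
blocks = [
    mb("add_op", {"D0", "D1"}, {"D0"}),
    mb("add_flags_nzvc", {"D0"}, {"N", "Z", "V", "C"}),
    mb("add_flag_x", {"D0"}, {"X"}),
    mb("move_op", {"D2"}, {"D3"}),
    mb("move_flags", {"D3"}, {"N", "Z", "V", "C"}),
]
# Assume every register and flag may be needed after the block ends.
live_out = {"D0", "D1", "D2", "D3", "N", "Z", "V", "C", "X"}
kept = eliminate_dead(blocks, live_out)
```

In this example `add_flags_nzvc` is dropped because MOVE overwrites those flags before anybody reads them, while `add_flag_x` survives because MOVE does not touch X. That is the benefit of tracking each flag separately.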

How to use it

The optimization itself is completely safe (it won't remove any code that is essential), but while the emulation is not stable enough there might be some bugs. So, I introduced a new configuration option for turning it on/off, called: comp_optimize.
It replaces the comp_nf configuration option from the x86 JIT implementation, because it is not just about the flags (nf = no flags).

If it is set to true then the data-flow calculation is done and some macroblocks will be removed. If it is set to false, the emulation compiles all the instructions fully into the buffer.

The results

And finally some speed tests... Compared to the previously published Mandelbrot test results the actual numbers are:

Interpretive: 108 seconds;
JIT compiled without optimization: 52 seconds;
JIT compiled with optimization: 32 seconds.

That is roughly a 40% reduction in run time (about 1.6x faster) in the case of this (heavily arithmetic) test.

The test system was: Micro AmigaOne (G3/800 MHz) - let's compare it to WinUAE that is running on my laptop (Intel Core i3 M350/2.27 GHz):

JIT compiled with all the possible optimizations turned on: 9 seconds.

It would be a tough job to compare these two computers, but I am pretty sure that my laptop is more than 3.5x faster than that poor old G3 machine.

I am really content with these results for now.

Monday, May 2, 2011

Is it alive, Igor?

While I had my expectations for setting up the cross-compiling environment, I failed to fulfill my own agenda: it just doesn't want to work. I tried to set up a Cygwin environment, but the configuration script behaves oddly, and after fooling around for a few hours I lost the battle. Not a big issue yet, since I wasn't able to set up a net connection on the Amiga anyway, and transferring files on a flash drive isn't exactly comfortable.

So, instead of wasting more time on that, I started to create the test environment for the JIT.

To limit the scope of the must-be-implemented instructions, I figured that I could prepare a spoofed Kickstart ROM file that contains exactly what I want to test at the moment. After a while I found my age-old Mandelbrot test, which uses only Motorola 68000 instructions and was written in a sufficiently hardware-banging fashion: the best candidate.

The development environment for the test code: PhxAss for compiling, vlink (from the ominous VBCC package) for linking.

You might ask why vlink was necessary when PhxAss is able to do the linking. The answer lies in the Kickstart ROM file format: it is a plain raw binary dump that starts at $FFF80000 in memory. This format can be produced in a number of ways, but PhxAss is not exactly prepared for it.

I had never tried to create a raw binary from a relocatable Amiga executable before, so I had to tinker with the linker script for a little while first, but finally I managed it.

E-UAE has some restrictions on the size of the ROM file: it must match one of the legitimate sizes from the real hardware, either 256 KB or 512 KB. (There are other special sizes for the Amiga 1000 or the CD32, but generally these wouldn't help much.)
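
Padding a raw binary up to one of those sizes is simple enough to sketch in Python. The helper name and the 0xFF fill byte are assumptions made for this example; the post does not say how the test ROM was padded.

```python
# Hypothetical helper: pads a raw image to the next accepted ROM size.
VALID_ROM_SIZES = (256 * 1024, 512 * 1024)


def pad_rom(image: bytes, fill: int = 0xFF) -> bytes:
    """Return the image padded with `fill` bytes to 256 KB or 512 KB."""
    for size in VALID_ROM_SIZES:
        if len(image) <= size:
            return image + bytes([fill]) * (size - len(image))
    raise ValueError("image is larger than 512 KB")


small = pad_rom(b"\x4e\x71" * 100)   # 200 bytes of NOPs become a 256 KB image
```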

Finally, the ROM file was ready. I knew that it gets loaded to that high memory address ($FFF80000), and I vaguely remembered that it gets overlaid onto the CHIP RAM (at address zero of the memory address space, actually) by some hardware trickery right after reset.

The reason is quite simple: right after start-up the Motorola processor takes the first two longwords from memory as the initial supervisor stack pointer (SSP) and the initial program counter (PC), in that order. To put some meaningful data at these addresses, either the ROM has to start at the beginning of the address space (technically that would be address $00000000), or some hardware magic has to be done. The latter is the case on Amiga systems. Once the ROM code is running from its real address, the overlay is turned off by setting some of the custom registers... But which register exactly? Code that starts up an Amiga right from a clean reset is not exactly common... Luckily, others have already done the hard work: check out this link from the section "A Quick explanation of what happens and why".
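
To make the reset layout concrete: the 68000 reads the initial SSP from the longword at offset 0 and the initial PC from the longword at offset 4, both big-endian. The Python sketch below builds and reads back such a header; the function names and the example addresses are hypothetical and not taken from the test ROM.

```python
import struct


def make_reset_header(ssp: int, pc: int) -> bytes:
    """Pack the two reset vectors as big-endian longwords (SSP first, then PC)."""
    return struct.pack(">II", ssp, pc)


def reset_vectors(image: bytes):
    """Return (initial SSP, initial PC) from the first 8 bytes of a ROM image."""
    ssp, pc = struct.unpack(">II", image[:8])
    return ssp, pc


# Illustrative values only: stack somewhere in CHIP RAM, code right after the header.
header = make_reset_header(0x00040000, 0xFFF80008)
ssp, pc = reset_vectors(header)
```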

Everything seemed nice, however I never got around to figuring out how to turn on the AGA chipset after a plain reset. Until you run the SetPatch command, the system behaves as an OCS chipset does, and now I don't even have the SetPatch command, since I don't have any part of the system without the actual ROM. Not good, not good!

I spent a whole day finding out the answer. Time well spent indeed, by the way: I refreshed my memory about Amiga hardware and coding for the legacy machines. The result is included in the sources, in case you are curious. (The "magic" FMODE register.)

You can download this initial test package from the SourceForge page of the project; it comes in handy if you would like to start writing your own Amiga emulator (haha).

Please note again: you will need the PhxAss and the VBCC package for 68k target if you would like to rebuild. To do the compiling and linking just type: "build mandel_hw" into the command line and here we go. (If you have the compiling tools in your path somewhere, of course.)

Before you rush to shovel it into E-UAE: while WinUAE won't bother checking the Kickstart ROM, and you can basically load anything into it, E-UAE tries to verify a checksum on the binary image. I was too lazy to create the proper checksum, so I modified the E-UAE sources to skip the check instead. (Meaning: you won't be able to load the test into the usual E-UAE or its derivatives for now.)
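
For the curious, here is a hedged Python sketch of how such a checksum could be patched into an image instead. It assumes the scheme commonly described for Kickstart images: all big-endian longwords are added with end-around carry and the total must come out as $FFFFFFFF, with the checksum longword stored 24 bytes before the end of the file. Neither detail comes from this post, so treat the offset and the target value as assumptions to verify against the emulator sources.

```python
import struct


def rom_sum(image: bytes) -> int:
    """Add all big-endian longwords with end-around carry."""
    total = 0
    for (word,) in struct.iter_unpack(">I", image):
        total += word
        if total > 0xFFFFFFFF:
            total = (total & 0xFFFFFFFF) + 1
    return total


def fix_checksum(image: bytes, offset: int) -> bytes:
    """Store a longword at `offset` so that rom_sum() yields 0xFFFFFFFF."""
    data = bytearray(image)
    data[offset:offset + 4] = b"\x00\x00\x00\x00"
    value = 0xFFFFFFFF - rom_sum(bytes(data))
    data[offset:offset + 4] = struct.pack(">I", value)
    return bytes(data)


image = bytes(range(256)) * 4                    # 1 KB of dummy data
fixed = fix_checksum(image, len(image) - 24)     # assumed checksum position
```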

I have to add that I don't know exactly who created this Mandelbrot set calculation. It wasn't me, I got the source from a magazine some time ago. I guess the author won't be bothered that I published his/her(?) work again. So, let's say these sources are Public Domain for now.

How to proceed... Well, I still haven't managed to prepare the poor man's JIT compiling. It is not working yet; there are some issues with it. This is what comes next.

Sunday, March 6, 2011

Blueprints

It is time to celebrate: finally my Amiga configuration is complete. It took more than half a year after moving to the opposite side of the globe (from Hungary to New Zealand). I dragged my poor, old A1-XE with me in a suitcase; unfortunately it didn't survive the flight. :(

Stephen “Cobra” Fellner lent me a uA1 in a compact case; I couldn't be more grateful for it. (Not to mention the tons of help they gave us to start our new life here from scratch...)

Piece-by-piece the machine was completed; the final item was an old Philips monitor from TradeMe.

It is also time to reorganize the priorities in my (rather limited) free time. Cut back on beaches, bushwalks and especially on Facebook. :P

But enough of the whining, this is a technical blog after all, and autumn is coming rapidly with lotsa rain...

Let's scheme!

A JIT compiler is similar to a programming language compiler; the main difference is the source: while a programming language source is human-readable text (or that would be the goal, at least), the source in this case is machine code for a different processor.

The upside is that machine code has very strict rules, so it is easier to interpret. The downside is that it must be known precisely, in every minor detail, and there are undocumented features and side-effects that have to be implemented correctly. (How to find these out is a good question; usually with countless hours of debugging crashing applications.)

Why is it called just-in-time compiling? Because the program code is translated to natively executable code while it is being emulated. Every time the execution reaches a point in the program that wasn’t reached before, the compiler takes a chunk of the emulated code and translates it to native code, then the execution flow goes on. (Some JIT compilers are able to translate the whole executable right after loading, but that is only possible in special cases.)
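
The translate-on-first-visit loop can be sketched with a toy machine in Python. Everything here is invented for illustration (the three-instruction "ISA", the state dictionary, the function names); a real JIT emits native code, while this sketch just builds Python closures to show when translation happens.

```python
def translate(program, pc):
    """Translate the straight-line chunk starting at pc into one callable."""
    steps = []
    while True:
        op, arg = program[pc]
        pc += 1
        if op == "add":
            steps.append(arg)
        elif op == "loop":          # count down, jump back to arg while count > 0
            target, fallthrough, total = arg, pc, sum(steps)

            def chunk(state):
                state["acc"] += total
                state["count"] -= 1
                return target if state["count"] > 0 else fallthrough
            return chunk
        elif op == "halt":
            total = sum(steps)

            def chunk(state):
                state["acc"] += total
                return None
            return chunk


def run(program, entry, state):
    """Execute from entry; translate a chunk only the first time it is reached."""
    cache, translated, pc = {}, 0, entry
    while pc is not None:
        chunk = cache.get(pc)
        if chunk is None:
            chunk = cache[pc] = translate(program, pc)
            translated += 1
        pc = chunk(state)
    return translated


program = [("add", 1), ("add", 2), ("loop", 0), ("halt", None)]
state = {"acc": 0, "count": 3}
translated = run(program, 0, state)
```

The loop body runs three times but is translated only once; the second translation is the chunk reached after the loop falls through.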

The compilation process can be either fairly simple or overly complicated, depending on the actual method. The final result must be directly executable on the host system; otherwise it would be inefficient and probably useless. (It would require an interpreter to execute, which rarely makes any sense.)

There are different approaches to the compilation process; which one is the best choice depends on multiple factors.

Poor man’s JIT

For example, the most simplistic method produces a series of jump instructions for fetching the source data, executing the operation and storing the result. Everything else is done by a library of functions in the emulation environment. This solution gets rid of the interpretation of every instruction code at every (re-)execution, but it is next to impossible to do any optimization which would involve the source and destination handling together with the operation, not to mention a wider span of instructions.

Why would anybody try such an inefficient solution? If the interpreter is already implemented then it requires significantly less work to turn it into a JIT compiler this way. Also, translated code normally needs (lots of) memory, while such an implementation is lightweight on memory usage.
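
A rough Python analogue of this approach: the "translation" is nothing more than a pre-decoded list of calls into handler functions that the interpreter already has. The handler names and the register dictionary are hypothetical; in a native implementation the list would be a sequence of jump or call instructions.

```python
# Handlers that an existing interpreter would already provide (illustrative).
def op_move(regs, src, dst):
    regs[dst] = regs[src]


def op_add(regs, src, dst):
    regs[dst] = (regs[dst] + regs[src]) & 0xFFFFFFFF


HANDLERS = {"move": op_move, "add": op_add}


def translate(instructions):
    """Decode once: resolve every instruction to its handler up front."""
    return [(HANDLERS[name], src, dst) for name, src, dst in instructions]


def execute(translated, regs):
    """Re-execution skips decoding and only performs the calls."""
    for handler, src, dst in translated:
        handler(regs, src, dst)


regs = {"d0": 5, "d1": 7}
code = translate([("move", "d0", "d1"), ("add", "d0", "d1")])
execute(code, regs)
```

Each handler still works in isolation, which is exactly why cross-instruction optimization is out of reach here.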

Stamping Lil’ Roses and Rainbows

A slightly better approach is creating templates for certain addressing modes and operations that can be copied into the translated code (almost) directly. Not every instruction incarnation needs its own template; it is possible to fill in some gaps with the specifics of the actual instruction, such as the involved registers.

This is the most common approach, flexible and efficient if implemented properly. With some tweaking it is even possible to adjust the templates to handle special cases, such as optimizing the arithmetic flag emulation away when consecutive instructions would overwrite the flags anyway. This is how Petunia works and, as I found out recently, WinUAE implements something similar.

However, templates are a bit rigid sometimes, and it is still not too easy (or even impossible) to join together more than two instructions in a specific optimization which would make use of a certain aspect of the target processor. Creating truly flexible templates could result in a big mess in the translation functions.
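
A toy version of template stamping in Python, producing assembly-like text instead of machine code. The template table, the PowerPC-style mnemonics and the flag placeholder lines are all invented for this sketch; they do not reflect how Petunia or WinUAE actually store their templates.

```python
# Hypothetical templates: the mnemonics are illustrative only and the flag
# lines are placeholders, not real emitted code.
TEMPLATES = {
    "move": {
        "op": ["mr {dst},{src}"],
        "flags": ["; update N,Z from {dst}, clear V,C"],
    },
    "add": {
        "op": ["add {dst},{dst},{src}"],
        "flags": ["; update N,Z,V,C,X from the result in {dst}"],
    },
}


def stamp(name, src, dst, need_flags=True):
    """Copy a template and fill the gaps (registers) for one instruction."""
    template = TEMPLATES[name]
    lines = list(template["op"])
    if need_flags:
        lines += template["flags"]
    return [line.format(src=src, dst=dst) for line in lines]


# Flags of the first instruction are dropped in this toy example.
code = stamp("add", "r4", "r3", need_flags=False) + stamp("move", "r3", "r5")
```

The `need_flags` switch is the "special case" tweak: the decision is made per instruction, which hints at why optimizations spanning several instructions are awkward with plain templates.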

Big Planz I haz it

After finishing (the never-ever finished) Petunia I spent a couple of years playing around with scenarios in my mind where the emulated code builds up this way or that. Finally, I came to the conclusion that there is possibly a better way to implement JIT compiling, and it is similar to the microcode that is often used in CISC processors.

Microcode reduces the complexity of a machine code instruction by implementing it as a series of very simple “wired” instructions, and an “interpreter” executes the simple instructions one-by-one at each clock cycle.

Combining this technique with the VLIW approach, where the simple instructions can be executed out-of-order or even eliminated completely, the generated code might be a lot more optimal.

How do I intend to implement it? Similarly to the templates, each emulated instruction will be prepared as a series of virtual instructions. In the first round the compiler collects the virtual instructions for each emulated instruction into a buffer, for a defined chunk of the emulated code.

In the second round an optimizer runs through the buffer while trying to apply modifications to the virtual instructions according to predefined rules. At this level the virtual instructions and the original (emulated) instructions have no connection at all anymore; each virtual instruction can be handled, reorganized or eliminated on its own.

The third round is the code emitter: it turns the virtual instructions into natively executable code using actual code templates.
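
The three rounds can be sketched as three small Python functions. The virtual instruction format, the single optimization rule (skip a load when the register is already held in a temporary) and the textual "emitter" are assumptions chosen to keep the example short; they are not the planned E-UAE design.

```python
def collect(instructions):
    """Round 1: expand each emulated instruction into virtual instructions."""
    virtual = []
    for name, src, dst in instructions:
        virtual.append(("load", src))
        virtual.append(("load", dst))
        virtual.append((name, src, dst))
        virtual.append(("store", dst))
    return virtual


def optimize(virtual):
    """Round 2: one predefined rule, drop loads of already loaded registers."""
    loaded, out = set(), []
    for v in virtual:
        if v[0] == "load":
            if v[1] in loaded:
                continue
            loaded.add(v[1])
        out.append(v)
    return out


def emit(virtual):
    """Round 3: turn virtual instructions into 'code' (plain text here)."""
    return ["%s %s" % (v[0], ",".join(v[1:])) for v in virtual]


chunk = [("add", "d1", "d0"), ("sub", "d1", "d0")]
virtual = collect(chunk)
native = emit(optimize(virtual))
```

After the first round nothing in the buffer remembers which 68k instruction a virtual instruction came from, so the optimizer is free to work across instruction boundaries.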

I am sure I wasn’t the first to think of this solution, but I have never read about a similar approach in the case of JIT compiling. Programming language compilers do a similar code translation to keep the compiler portable between different processor architectures. (The code emitter has to be adapted; the rest of the compiler needs no modification.)

Predicted Roadblocks

I once heard: if you are not able to summarize the problem then you won't be able to find the solution either. Let's find out the possible problems with JIT compiling for a complete machine emulation then.

1. Memory access emulation

Unlike the 68k emulation in OS4, UAE is a complete machine emulation. While an application is running it reads and writes memory (no news to anybody, I guess). If the accessed memory is plain data then there is not much to do with it: the application can do whatever it was planned for.

Unfortunately there are two types of memory access that need special care: accessing hardware registers and writing into the executed code area. For the latter see self-modifying code below; the former is a lot easier to handle.

Solution: basically what is needed is to incorporate the functions that UAE calls for each memory access into the translated code.
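
A minimal Python sketch of routing memory writes: plain RAM is written directly, while writes into the custom chip register area (which starts at $DFF000 on the Amiga) go to a handler. The `Bus` class, the size of the register window and the logging handler are illustrative assumptions, not UAE's actual memory bank functions.

```python
CUSTOM_BASE = 0xDFF000   # start of the Amiga custom chip registers
CUSTOM_END = 0xDFF200    # assumed end of the register window for this sketch


class Bus:
    def __init__(self, ram_size):
        self.ram = bytearray(ram_size)
        self.hw_writes = []   # stand-in for real hardware emulation

    def write_byte(self, addr, value):
        if CUSTOM_BASE <= addr < CUSTOM_END:
            # Hardware register: must go through the emulation's handler.
            self.hw_writes.append((addr, value))
        else:
            # Plain data: nothing special to do.
            self.ram[addr] = value


bus = Bus(0x1000)
bus.write_byte(0x0100, 0x2A)       # ordinary RAM write
bus.write_byte(0xDFF180, 0x0F)     # write into the register area
```

Translated code has to make the same distinction on every access, either by calling such a function or by inlining an equivalent test.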

2. Self-modifying code

The 68k processors make no distinction between data and code, although AmigaOS itself is able to recognize which part of a loaded executable contains actual code. The distinction was never enforced on developers, with all of its up- and downsides; they found out the hard way what a can of worms was hidden there when processors with a cache (like the 68020) appeared in Amigas. Several self-modifying games and demos fail when running from cached code memory, because the code cache and the data cache are separate.

In AmigaOS4 I was able to get information from DOS.library regarding the loaded and removed code segments. By using this information I could tell which memory areas were cleared, and I simply dropped the translated code for those.

With UAE the situation is completely different: any byte in memory is a potential target for modification. It means the translated code must be dropped and retranslated; otherwise it would preserve the previously translated state.

Solution: I probably must extend the memory access checking and drop the translated code when it gets modified. I have to revisit this topic later on; checking for translated code at every memory access might be too slow.
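
The "drop on write" idea in a few lines of Python. The cache layout (a dictionary of start address to end address and code) and the method names are hypothetical. The linear scan over all blocks is deliberately naive and shows why doing this on every memory write could be too slow without a smarter index.

```python
class TranslationCache:
    def __init__(self):
        self.blocks = {}   # start address -> (end address, translated code)

    def add(self, start, length, code):
        self.blocks[start] = (start + length, code)

    def invalidate(self, addr):
        """Drop every translated block that covers the written address."""
        stale = [s for s, (end, _) in self.blocks.items() if s <= addr < end]
        for start in stale:
            del self.blocks[start]
        return len(stale)


cache = TranslationCache()
cache.add(0x1000, 16, "block A")
cache.add(0x2000, 8, "block B")
dropped = cache.invalidate(0x1004)   # a write lands inside block A
```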

3. Translated code lookup

When the execution jumps or branches to a new memory address, the emulator has to know whether there is already translated code for it, or the new address was never hit before and a new translation is needed. The translated parts sometimes follow each other, but often programs wander around in memory with no logic to follow.

Solution: Petunia had the same problem. I created a two-level look-up table for the translated code segments, covering each address in the address space; a bit memory-hungry, but very quick at finding the address of the translated code.
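
A two-level table of this kind can be sketched in Python as follows. The 16/16 bit split of the address and the use of word-aligned slots (68k instructions start on even addresses) are assumptions for the example; Petunia's actual table layout is not described here.

```python
HIGH_BITS = 16
LOW_SLOTS = 1 << 15   # word-aligned slots within one 64 KB page


class TwoLevelTable:
    def __init__(self):
        # The first level always exists; second-level pages are made on demand.
        self.top = [None] * (1 << HIGH_BITS)

    def insert(self, addr, code):
        page = self.top[addr >> 16]
        if page is None:
            page = self.top[addr >> 16] = [None] * LOW_SLOTS
        page[(addr & 0xFFFF) >> 1] = code

    def lookup(self, addr):
        page = self.top[addr >> 16]
        if page is None:
            return None          # nothing translated in this 64 KB page
        return page[(addr & 0xFFFF) >> 1]


table = TwoLevelTable()
table.insert(0x00F80008, "translated block")
hit = table.lookup(0x00F80008)
miss = table.lookup(0x00F8000A)
```

A lookup costs two index operations regardless of how many blocks exist, at the price of one full second-level page for every 64 KB region that contains translated code.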

As Michael Jackson would say: This is it. These are the initial plans; more details must follow, but first things first: let’s try to compile E-UAE… :)

Monday, February 21, 2011

Hello and welcome on the E-UAE JIT developer blog!

The purpose of this blog is to document the development of an extension to the E-UAE Amiga emulator: Motorola 680x0 processor emulation based on Just-In-Time compiling.

A short lesson of history

WinUAE, the Amiga emulator for Windows, has had JIT compiling for many years now. Unfortunately, it is closely tied to the Intel x86 architecture, because the most efficient way of implementing JIT compiling is kinda similar to an actual programming language compiler: the end result is machine code, which is executed directly. It is possible to implement a processor-independent JIT compiler, but squeezing more speed out of the executed code in a general compiling model is much more complex.

Recent Amiga(-like) computers use PowerPC processors, and porting the WinUAE solution to the PPC processor would be nearly as hard as implementing a brand new solution. Not to mention that there are special requirements from the emulation environment that cannot be simply resolved.

Since not many coders in the whole world have experience with JIT compiling, the line of applicants for creating JIT compiling for the PowerPC processor was pretty short for years.

On the other side, users needed JIT compiling in UAE for running all sorts of legacy applications which cannot be run on the recent incarnations of the AmigaOS systems (because they are buggy, not system-friendly, hit the hardware directly, and for a number of other reasons). The demand was so high that even a bounty was set up for this specific project on the AmigaBounty site, yet nobody wanted to take the job.

The insane with a rattle

You might ask why anybody would invest countless hours into this project. And I must say it is a valid question.

I already have experience in this field: I completed the similar component in AmigaOS4 (code name: Petunia). If you are interested in the fine details of JIT compiling, then I suggest checking out my Project Petunia web page.

There are a few other reasons: some folks have been bugging me with it for years, and it is a bit of a challenge. Also there is this typical feeling in every coder’s head when a project is finished: implementing it a second time, the outcome would be much better because of the previous experience.

Well, here is the time to see how well it goes…

Principles to follow

There are some ground rules I will try to keep to make as many future users happy as possible. These are:

The source files will be freely available to anybody under GPL license or directly from me with a different license on request.
While the development will be done on AmigaOS4, I won’t use any AmigaOS4-related feature, which makes it harder to port the solution to any platform. (It means my previous project won’t be used directly in any form.)
I will try to keep the solution as flexible as possible, to let others make use of it without E-UAE.
I won’t sacrifice compatibility for speed.
I will implement some possible adjustments to let the user fine-tune the JIT and therefore improve compatibility with the old applications.

That is all for today, more technical details will follow in the next post.