It is time to celebrate: finally my Amiga configuration is complete. It took more than half a year after moving to the opposite side of the globe (from Hungary to New Zealand). I dragged my poor, old A1-XE with me in a suitcase; unfortunately, it didn't survive the flight. :(
Stephen “Cobra” Fellner lent me a uA1 in a compact case; I couldn't be grateful enough for it. (Not to mention the tons of help they gave us to start our new life here from scratch...)
Piece-by-piece the machine was completed; the final item was an old Philips monitor from TradeMe.
It is also time to reorganize priorities in my (rather limited) free time. Cut back on beaches, bushwalks and especially on Facebook. :P
But enough of the whining; this is a technical blog after all, and autumn is coming rapidly with lotsa rain...
Let's scheme!
A JIT compiler is similar to a programming language compiler; the main difference is the source: while a programming language source is human-readable text (or that would be the goal, at least), the source in this case is machine code for a different processor.
The upside is that machine code follows very strict rules, so it is easier to interpret. The downside is that it must be known precisely in every minor detail, and there are undocumented features and side effects that have to be implemented correctly. (How to find these out: good question; usually with countless hours of debugging crashing applications.)
Why JIT is called just-in-time compiling: the program code is translated to natively executable code while it is emulated. This means that every time the execution reaches a point in the program that wasn't reached before, the compiler takes a chunk of the emulated code, translates it to native code, and then the execution flow goes on. (Some JIT compilers are able to translate the whole executable right after loading, but that is only possible in special cases.)
The compilation process can be either fairly simple or overly complicated, depending on the actual method. The final result must be directly executable on the host system; otherwise it would be inefficient and probably useless. (It would require an interpreter to execute, which rarely makes any sense.)
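To make this more tangible, here is a minimal sketch in C of such an execute-or-compile loop. It assumes a helper that looks up previously translated code and a compiler that translates a chunk on first visit; none of the names come from E-UAE or Petunia, they are invented purely for illustration.

    #include <stdint.h>
    #include <stddef.h>

    /* Hypothetical sketch only. A translated block executes a chunk of
       emulated instructions natively and returns the emulated PC where
       execution should continue. */
    typedef uint32_t (*translated_block)(void);

    translated_block lookup_translated(uint32_t m68k_pc); /* NULL if never reached */
    translated_block compile_chunk(uint32_t m68k_pc);     /* translate on first visit */

    void run_emulation(uint32_t m68k_pc)
    {
        for (;;) {
            translated_block block = lookup_translated(m68k_pc);
            if (block == NULL)
                block = compile_chunk(m68k_pc);
            m68k_pc = block();   /* run the native code, get the next emulated PC */
        }
    }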
There are different approaches to the compilation process; which one is the best choice depends on multiple factors.
Poor man’s JIT
For example, the most simplistic method produces a series of jump instructions that execute the source data fetching, the operation itself, and the result store. Everything else is done by a library of functions in the emulation environment. This solution gets rid of interpreting every instruction code at every (re-)execution, but it is next to impossible to do any optimization that would involve the source and destination handling together with the operation, not to mention spanning a wider range of instructions.
Why would anybody try such an inefficient solution? If the interpreter is already implemented, it requires significantly less work to turn it into a JIT compiler this way. Also, translated code needs (lots of) memory, and such an implementation is lightweight on memory usage.
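A rough sketch of the idea, with invented helper names (this is not code from any real emulator): the "compiler" merely emits one native call per step the interpreter would have performed anyway.

    #include <stdint.h>

    /* The interpreter already has small helpers like these: */
    void fetch_src_d0(void);     /* read the source operand into a temporary */
    void op_add_long(void);      /* perform the operation on the temporaries */
    void store_dst_d1(void);     /* write the result back to the destination */

    /* emit_call() would append one native call instruction targeting the
       given helper to the output buffer (host specific, not shown here). */
    void emit_call(uint8_t **buffer, void (*helper)(void));

    /* "Compiling" an emulated ADD.L D0,D1 is then nothing more than
       emitting three calls in a row: */
    void compile_add_l_d0_d1(uint8_t **buffer)
    {
        emit_call(buffer, fetch_src_d0);
        emit_call(buffer, op_add_long);
        emit_call(buffer, store_dst_d1);
    }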
Stamping Lil’ Roses and Rainbows
A slightly better approach is creating templates for certain addressing modes and operations that can be copied into the translated code (almost) directly. Not every instruction incarnation needs its own template; it is possible to fill in some gaps with the specifics of the actual instruction, such as the involved registers.
This is the most common approach, flexible and efficient if implemented properly. With some tweaking it is even possible to adjust the templates to handle special cases, such as optimizing the arithmetic flag emulation away when consecutive instructions would overwrite it anyway. This is how Petunia works and, as I found out recently, WinUAE implements something similar.
However, templates are sometimes a bit rigid, and it is still not easy (or even impossible) to join together more than two instructions in a specific optimization that would make use of certain aspects of the target processor. Creating truly flexible templates could result in a big mess in the translation functions.
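For illustration, a hedged sketch of how template stamping could work in C; the structure and names are made up and do not reflect how Petunia or WinUAE actually store their templates.

    #include <stdint.h>
    #include <stddef.h>
    #include <string.h>

    /* A template is a prebuilt piece of host machine code with a "hole"
       at a known offset where an instruction-specific detail is patched in. */
    typedef struct {
        const uint8_t *code;      /* the prebuilt native code            */
        size_t         length;    /* size of the template in bytes       */
        size_t         reg_patch; /* offset of the register field "hole" */
    } code_template;

    /* Copy the template into the output buffer, fill in the hole (here the
       number of the emulated register the instruction uses) and return the
       new end of the buffer. */
    uint8_t *stamp_template(uint8_t *out, const code_template *t, uint8_t reg)
    {
        memcpy(out, t->code, t->length);
        out[t->reg_patch] |= reg;   /* patch the register number into the hole */
        return out + t->length;
    }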
Big Planz I haz it
In the couple of years after finishing (the never-ever finished) Petunia I spent time playing around with scenarios in my mind where the emulated code builds up this or that way. Finally, I came to the conclusion that there is a possibly better way to implement JIT compiling, one that is similar to the microcode often used in CISC processors.
Microcode reduces the complexity of a machine code instruction by implementing it as a series of very simple “wired” instructions, and an “interpreter” executes these simple instructions one by one at each clock cycle.
Combining this technique with a VLIW-like approach, where the simple instructions can be executed out of order or even eliminated completely, the generated code might turn out a lot more optimal.
How do I intend to implement it? Similarly to the templates, each emulated instruction will be prepared as a series of virtual instructions. In the first round the compiler collects the virtual instructions for each emulated instruction into a buffer, covering a defined chunk of the emulated code.
In the second round an optimizer runs through the buffer, trying to apply modifications to the virtual instructions according to predefined rules. At this level the virtual instructions have no connection to the original (emulated) instructions anymore; each virtual instruction can be handled, reorganized or eliminated on its own.
The third round is the code emitter: it turns the virtual instructions into natively executable code using actual code templates.
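To give an impression of the plan, here is a hypothetical sketch of such a virtual instruction buffer and one possible second-round rule: removing the redundant flag updates mentioned earlier. The virtual instruction set and every name are invented for illustration, not taken from any existing compiler.

    #include <stddef.h>

    typedef enum {
        VI_LOAD_REG,      /* load an emulated register into a host register */
        VI_ADD,           /* add two host registers                         */
        VI_UPDATE_FLAGS,  /* recalculate the emulated arithmetic flags      */
        VI_STORE_REG      /* write a host register back to an emulated one  */
    } vi_opcode;

    typedef struct {
        vi_opcode op;
        int       arg1, arg2;
    } virtual_insn;

    /* Round 2 example rule: a VI_UPDATE_FLAGS is dead if another
       VI_UPDATE_FLAGS follows it later in the chunk. (In a real compiler
       the forward scan would also stop at any virtual instruction that
       reads the flags; this tiny instruction set has none.) */
    size_t remove_dead_flag_updates(virtual_insn *buf, size_t count)
    {
        size_t out = 0;
        for (size_t i = 0; i < count; i++) {
            if (buf[i].op == VI_UPDATE_FLAGS) {
                int dead = 0;
                for (size_t j = i + 1; j < count; j++)
                    if (buf[j].op == VI_UPDATE_FLAGS) { dead = 1; break; }
                if (dead)
                    continue;          /* drop the redundant flag update */
            }
            buf[out++] = buf[i];
        }
        return out;                    /* new number of virtual instructions */
    }

The point is that such a rule works purely on the virtual instructions; it does not need to know which emulated instructions produced them.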
I am sure I wasn't the first to think of this solution, but I have never read about a similar approach in the context of JIT compiling before. Programming language compilers do a similar code translation to keep the compiler portable between different processor architectures. (Only the code emitter has to be adapted; the rest of the compiler needs no modification.)
Predicted Roadblocks
I once heard: if you are not able to summarize the problem, you won't be able to find the solution either. Let's find out the possible problems with JIT compiling for a complete machine emulation then.
1. Memory access emulation
Unlike the 68k emulation in OS4, UAE is a complete machine emulation. While an application is running it reads and writes memory (no news to anybody, I guess). If the accessed memory is plain data then there is not much to do with it: the application can do whatever it was designed to do.
Unfortunately, there are two types of memory access that need special care: accessing hardware registers and writing into the executed code area. For the latter see self-modifying code below; the former is a lot easier to handle.
Solution: basically, what is needed is to incorporate the functions from UAE that are called for each memory access into the translated code.
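In other words, instead of emitting a direct host load or store, the translated code calls the same kind of handler the interpreter uses, so an access to a chipset register address still ends up in the hardware emulation. A simplified sketch; the names are only loosely modelled on UAE's memory handlers and are not the real ones.

    #include <stdint.h>

    /* The handler dispatches through the memory map, so a read or write to
       a hardware register address is routed to the chipset emulation
       instead of plain RAM. */
    uint32_t uae_get_long(uint32_t addr);
    void     uae_put_long(uint32_t addr, uint32_t value);

    /* The translated code for e.g. MOVE.L (A0),D0 would then contain a
       call equivalent to this: */
    void emulate_move_l_ind_a0_d0(uint32_t *a0, uint32_t *d0)
    {
        *d0 = uae_get_long(*a0);   /* the handler decides: RAM or hardware */
    }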
2. Self-modifying code
The 68k processors don't make a distinction between data and code, although AmigaOS itself is able to recognize which part of the loaded executable contains actual code. The distinction was never enforced on the developers, with all of its up- and downsides; they found out the hard way what a can of worms was hidden there when processors with caches (like the 68020) appeared in Amigas. Several self-modifying games and demos fail when run from cached code memory, because the code cache and the data cache are separate.
In AmigaOS4 I was able to get information from DOS.library regarding the loaded and removed code segments. Using this information I could tell which memory areas were cleared, and I simply dropped the translated code for those.
With UAE the situation is completely different: any byte in memory is a potential target for modification. This means the translated code must be dropped and retranslated; otherwise it would preserve the previously translated state.
Solution: I probably must extend the memory access checking and drop the translated code when it gets modified. I will have to revisit this topic later on; checking for translated code at every memory access might be too slow.
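One possible (and so far undecided) way to keep the check cheap would be a per-page flag telling whether any translated code exists for that page, tested in the write handler; only writes that hit such a page pay the cost of dropping translations. A sketch with invented names, assuming a 24-bit emulated address space and 4 KB pages for simplicity:

    #include <stdint.h>

    #define PAGE_SHIFT 12
    #define NUM_PAGES  (1u << (24 - PAGE_SHIFT))   /* 16 MB / 4 KB pages */

    static uint8_t page_has_code[NUM_PAGES];        /* set when code is translated there */

    void invalidate_translations_in_page(uint32_t page);  /* drop translated code */
    void raw_put_long(uint32_t addr, uint32_t value);     /* plain memory write   */

    void checked_put_long(uint32_t addr, uint32_t value)
    {
        uint32_t page = (addr & 0x00ffffffu) >> PAGE_SHIFT;
        if (page_has_code[page])
            invalidate_translations_in_page(page);  /* force retranslation later */
        raw_put_long(addr, value);
    }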
3. Translated code lookup
When the execution jumps or branches to a new memory address, the emulator has to know whether there is already translated code for it, or the address was never hit before and a new translation is needed. The translated parts sometimes follow each other, but often the programs are wandering around in memory with no logic to follow.
Solution: Petunia had the same problem; I created a two-level look-up table for the translated code segments, covering each address in the address space. It is a bit memory-hungry, but very quick at finding the address of the translated code.
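A sketch of what such a two-level table could look like in C; the exact layout used in Petunia may differ, and the split of the address bits here is only an example. The high bits of the emulated address select a lazily allocated second-level page, the low bits select the entry inside it.

    #include <stdint.h>
    #include <stdlib.h>

    #define L1_BITS 12
    #define L2_BITS 12                 /* 12 + 12 bits = 24-bit address space */
    #define L2_SIZE (1u << L2_BITS)

    typedef uint32_t (*translated_block)(void);   /* as in the earlier sketch */

    static translated_block *level1[1u << L1_BITS];   /* second levels allocated on demand */

    translated_block lookup_translated(uint32_t m68k_pc)
    {
        translated_block *page = level1[(m68k_pc >> L2_BITS) & ((1u << L1_BITS) - 1)];
        if (page == NULL)
            return NULL;                   /* nothing translated in this page yet */
        return page[m68k_pc & (L2_SIZE - 1)];
    }

    void register_translated(uint32_t m68k_pc, translated_block block)
    {
        uint32_t hi = (m68k_pc >> L2_BITS) & ((1u << L1_BITS) - 1);
        if (level1[hi] == NULL)
            level1[hi] = calloc(L2_SIZE, sizeof(translated_block));  /* error handling omitted */
        level1[hi][m68k_pc & (L2_SIZE - 1)] = block;
    }

Since the second-level pages are allocated lazily, the memory cost stays proportional to the address ranges that actually contain code.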
As Michael Jackson would say: this is it. These are the initial plans, more details will follow, but first things first: let's try to compile E-UAE… :)
Comments:

Excellent beginning Rachy.

Trevor: Thanks for doing this. :) Good luck with your project!

Good work, I hope the Sam440ep/667MHz will benefit from it for normal ECS/OCS speed. It would be a great thing!