I think this is related to the idea of Chained SYS functions.
There are a number of peculiarities of native code, which I think will be familiar to anyone who's written any:
- It can't be shared between virtual machines - code that is part of a vCPU or v6502 instruction is destined always to return to its dispatch loop, and SYS functions are all linked to vCPU for the same reason. I'd like to be able to expose SYS functions in my Forth, and it's possible, but it always involves switching to vCPU and then back again.
- Where code paths diverge in native code (think if/else), it's often easier to keep them separate than to merge them again if the path lengths differ significantly (if they're close they can be balanced by nops). If you only have one point that returns to the dispatch loop, you can only return one value. It is possible to adjust vTicks as you go. Often you can organise code to do common work up-front, and branch only at the end. Routines with lots of conditional code are simply not a good fit for a SYS function.
- Native routines can't easily call each other: SYS functions can restart themselves by winding back vPC, but to cause a different SYS function to be called next time they'd have to adjust the worst-case cost in the instruction stream (unless the required value is close to the current value), and change sysFn (which would be observable to the programmer, but probably fine if documented). Calling another routine has to happen in tail-position, as there's no way to return - there's no agreed call stack for native code. There are examples of cases where common native code is used by more than one routine: The right-shift table uses the vTmp variable to store a continuation address, and after lookup, returns to one of several "tails" in the following page. I've done similar things. v6502 has an instruction register and the SBC instruction rewrites it to cause ADC to be called instead.
- Related to the previous: long SYS functions are slower than you might expect, because they cause vCPU to stall while waiting for a long enough time slice. It's advantageous to split things up into smaller chunks. Some algorithms can naturally do part of the work, and leave parameters in a state where they pick up where they left off when restarted, but it's not so obvious how to do it in other cases (perhaps requiring a flag to be passed saying "this is the first call", which is then cleared).
- In the right-shift example mentioned above, each SYS function or instruction that makes use of the table needs a different tail in page 6 (because they all need to return different costs to their dispatch loop). These routines do not tend to include any branching; their best-case cost is the same as their worst-case cost. This page is getting more and more crowded, and I think there may come a time when no more code can easily be added.
- If we introduced a consistent notion of a continuation address as a variable, code could set this and then yield to the display loop - like a more general version of self-restarting SYS functions. This is somewhat similar to how vTmp is used by the right-shift table, specifying which code to run after an operation. Using sysFn for this might be ok. In many cases code will know precisely the time required for the next step, so perhaps we could provide that and avoid unnecessary waiting; alternatively, we could always allow code to run for maxTicks. We'd have shared code that looks at vTicks and resumes immediately if possible - almost like a new virtual machine (and certainly requiring a new value for vCpuSelect).
- If we introduced a page indirection when returning from SYS functions or similar (probably through vCpuSelect - which might need some code to be moved), code that has finished its job (rather than just yielding before continuing) could either return to a virtual machine (as in SYS functions), or to some other native code through the continuation address. It could turn SYS functions into reusable subroutines, especially if...
- We could take this further and have a limited stack, and routines to push and restore the continuation address - potentially very useful, even if it's a little slow. We'd certainly need to be able to save and restore vCpuSelect anyway. Where we put this would be up for debate. Up until now using the vCPU stack in the zero-page would seem the obvious thing to do, but with at67's upcoming changes, maybe not. I'm not sure that we need a big stack - perhaps enough to store vCpuSelect, and 3 or 4 other calls?
- It might be advantageous if native routines returned time saved against the worst-case cost, rather than the total runtime, i.e. save vTicks before running code. This is the bit I'm least certain about. I've been thinking of making this change in my Forth, as I think the vast majority of my code doesn't have uneven branch lengths and the worst-case cost is the same as the real cost (and, for now, Forth always knows the worst-case cost precisely). This could allow merging common tails, e.g. in page 6, but I'm unsure of the impact on backwards compatibility. Perhaps a new SYS instruction could help? This change might slow down existing code somewhat.
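
To make the continuation-address and small-stack ideas above a bit more concrete, here is a rough Python model of how the shared "resume if there's time" code might behave. It is only a sketch under assumptions: none of this is real Gigatron code, and every name (`State`, `set_next`, `push_return`, `return_from`, `run_slice`), the tick costs, the slice budget, and the stack depth are made up for illustration. Each step declares the worst-case cost of the next step, and the dispatcher keeps resuming steps while the remaining ticks allow it.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

MAX_TICKS = 14    # hypothetical slice budget, not the real Gigatron value
STACK_DEPTH = 4   # hypothetical: "vCpuSelect plus 3 or 4 other calls"

Step = Callable[["State"], None]

@dataclass
class State:
    ticks: int = 0                        # ticks left in the current slice
    continuation: Optional[Step] = None   # next native step, None = back to the VM
    cont_cost: int = 0                    # declared worst-case cost of that step
    stack: List[Tuple[Step, int]] = field(default_factory=list)
    counter: int = 0                      # example working variables
    acc: int = 0
    slices: int = 0                       # number of time slices used so far

def set_next(state: State, step: Step, cost: int) -> None:
    """Record which step runs next and how many ticks it needs."""
    state.continuation, state.cont_cost = step, cost

def push_return(state: State, step: Step, cost: int) -> None:
    """Save a continuation to come back to after a 'call'."""
    if len(state.stack) >= STACK_DEPTH:
        raise OverflowError("continuation stack full")
    state.stack.append((step, cost))

def return_from(state: State) -> None:
    """Resume the saved continuation, or return to the VM if none is saved."""
    if state.stack:
        state.continuation, state.cont_cost = state.stack.pop()
    else:
        state.continuation, state.cont_cost = None, 0

def run_slice(state: State, budget: int = MAX_TICKS) -> None:
    """Shared dispatcher: resume steps immediately while they still fit."""
    if state.cont_cost > budget:
        raise ValueError("step can never fit in a slice")
    state.ticks = budget
    state.slices += 1
    while state.continuation is not None and state.cont_cost <= state.ticks:
        step, cost = state.continuation, state.cont_cost
        state.continuation = None
        state.ticks -= cost
        step(state)

# Example routines (all hypothetical).
def add_chunk(state: State) -> None:
    """Reusable subroutine: does one unit of work per step, then re-arms itself."""
    state.acc += state.counter
    state.counter -= 1
    if state.counter > 0:
        set_next(state, add_chunk, 5)
    else:
        return_from(state)

def double_result(state: State) -> None:
    """Caller's tail: runs after add_chunk has finished."""
    state.acc *= 2
    return_from(state)

def start(state: State) -> None:
    """Caller: 'calls' add_chunk and asks to continue in double_result."""
    push_return(state, double_result, 3)
    set_next(state, add_chunk, 5)

def demo(n: int = 5, budget: int = MAX_TICKS) -> State:
    state = State(counter=n)
    set_next(state, start, 2)
    while state.continuation is not None:
        run_slice(state, budget)
    return state
```

With these made-up costs, `demo(5)` needs three slices: two `add_chunk` steps fit in each of the first two, and the last slice runs the final `add_chunk` and then `double_result` without waiting for another slice, which is the "resume immediately if possible" behaviour. Whether the bookkeeping would be cheap enough in native code is exactly the open question.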