Additional facilities for native code programming / a new SYS function calling convention

Post by **qwertyface** » 24 Jun 2021, 12:09

Hi everyone, I've been pondering a bit on the native-code / SYS function interface, and if and how it could be changed to support a more structured style of native-code programming. I'm largely just writing this to get these ideas out of my head (where they've been quite distracting).

I think this is related to the idea of Chained SYS functions.

There are a number of peculiarities of native code, which I think will be familiar to anyone who's written any:

It can't be shared between virtual machines - code that is part of a vCPU or v6502 instruction is destined always to return to its dispatch loop, and SYS functions are all linked to vCPU for the same reason. I'd like to be able to expose SYS functions in my Forth, and it's possible, but it always involves switching to vCPU and then back again.
Where code paths diverge in native code (think if/else), it's often easier to keep them separate than to merge them again if the path lengths differ significantly (if they're close they can be balanced by nops). If you only have one point that returns to the dispatch loop, you can only return one value. It is possible to adjust vTicks as you go. Often you can organise code to do common work up-front, and branch only at the end. Routines with lots of conditional code are simply not a good fit for a SYS function.
Native routines can't easily call each other: SYS functions can restart themselves by winding back vPC, but to cause a different SYS function to be called next time they'd have to adjust the worst-case cost in the instruction stream (unless the required value is close to the current value), and change sysFn (which would be observable to the programmer, but probably fine if documented). Calling another routine has to happen in tail-position, as there's no way to return - there's no agreed call stack for native code. There are examples of cases where common native code is used by more than one routine: The right-shift table uses the vTmp variable to store a continuation address, and after lookup, returns to one of several "tails" in the following page. I've done similar things. v6502 has an instruction register and the SBC instruction rewrites it to cause ADC to be called instead.
Related to the previous: long SYS functions are slower than you might expect, because they cause vCPU to stall while waiting for a long enough time slice. It's advantageous to split things up into smaller chunks. Some algorithms can naturally do part of the work, and leave parameters in a state where they pick up where they left off when restarted, but it's not so obvious how to do it in other cases (perhaps requiring a flag to be passed saying "this is the first call", which is then cleared).
In the right-shift example mentioned above, each SYS function or instruction that makes use of the table needs a different tail in page 6 (because they all need to return different costs to their dispatch loop). These routines do not tend to include any branching; their best case cost is the same as their worst case cost. This page is getting more and more crowded, and I think there may come a time when no more code can easily be added.

So I've been thinking about how any of these issues could be solved, and whether they're worth solving. I always find it hard to tell without having a go, but I thought it would be interesting to see what other people think. Doing any of these in a backwards compatible way might be hard or impossible. Perhaps best to regard this as a question of would you do it differently if starting from scratch!

If we introduced a consistent notion of a continuation address as a variable, code could set this and then yield to the display loop - like a more general version of self-restarting SYS functions. This is somewhat similar to how vTmp is used by the right-shift table, specifying which code to run after an operation. Using sysFn for this might be ok. In many cases code will know precisely the time required for the next step, so perhaps we could provide that, and avoid unnecessary waiting, alternatively always allow code to run for maxTicks. We'd have shared code that looked at vTicks and resumes immediately if possible - almost like a new virtual machine (and certainly requiring a new value for vCpuSelect).
If we introduced a page indirection when returning from SYS functions or similar (probably through vCpuSelect - which might need some code to be moved), code that has finished its job (rather than just yielding before continuing) could either return to a virtual machine (as in SYS functions), or to some other native code through the continuation address. It could turn SYS functions into reusable subroutines, especially if...
We could take this further and have a limited stack, and routines to push and restore the continuation address - potentially very useful, even if it's a little slow. We'd certainly need to be able to save and restore vCpuSelect anyway. Where we put this would be up for debate. Up until now using the vCPU stack in the zero-page would seem the obvious thing to do, but with at67's upcoming changes, maybe not. I'm not sure that we need a big stack - perhaps enough to store vCpuSelect, and 3 or 4 other calls?
It might be advantageous if native routines returned time saved against the worst-case cost, rather than the total runtime, i.e. save vTicks before running code. This is the bit I'm least certain about. I've been thinking of making this change in my Forth, as I think the vast majority of my code doesn't have uneven branch-lengths and the worst-case cost is the same as the real cost (and, for now, Forth always knows the worst-case cost precisely). This could allow merging common tails, e.g. in page 6, but I'm unsure of the impact on backwards compatibility. Perhaps a new SYS instruction could help? This change might slow down existing code somewhat.
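To make the last point concrete, here is a rough Python model of the two accounting schemes. It is only a sketch: the function names and the abstract tick units are my own assumptions, not anything in the ROM. It shows that charging the worst case up front and refunding the saving leaves the same budget as charging the actual cost on return, and that a branch-free routine always refunds zero.

```python
def charge_actual(v_ticks, actual_cost):
    """Current style: the routine reports its total cost when it returns."""
    return v_ticks - actual_cost


def charge_worst_then_refund(v_ticks, worst_cost, actual_cost):
    """Proposed style: pre-charge the worst case, then add back the saving."""
    v_ticks -= worst_cost               # done by the caller before the routine runs
    saved = worst_cost - actual_cost    # the only value the routine has to return
    return v_ticks + saved


def refund(worst_cost, actual_cost):
    """Value a routine would hand back under the proposed style."""
    return worst_cost - actual_cost


budget = 60
for actual in (20, 32, 40):
    old = charge_actual(budget, actual)
    new = charge_worst_then_refund(budget, worst_cost=40, actual_cost=actual)
    print(actual, old, new)             # both schemes leave the same budget
```

In this model a routine whose best case equals its worst case always returns a refund of 0, whatever its length, which is why routines of different lengths could share one tail.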

Post by **at67** » 04 Jul 2021, 01:16

qwertyface wrote: ↑24 Jun 2021, 12:09
It can't be shared between virtual machines - code that is part of a vCPU or v6502 instruction is destined always to return to its dispatch loop, and SYS functions are all linked to vCPU for the same reason. I'd like to be able to expose SYS functions in my Forth, and it's possible, but it always involves switching to vCPU and then back again.

If the epilogues of all SYS routines were modified in the following way, any dispatcher should theoretically be able to call them without having to switch to vCPU land and back. This obviously introduces backwards-compatibility issues, but it shouldn't be a problem for future SYS calls (I use this paradigm in a number of places within ROMvX0).

ld(hi('REENTER'),Y)             #35,
jmp(Y,'REENTER')                #36,
ld(-40/2)                       #37,

to:

ld([vCpuSelect])                #35 restore dispatch page
adda(1,Y)                       #36
jmp(Y,'NEXTY')                  #37
ld(-40/2)                       #38

Each new dispatch page requires this thunk:

# SYS calls and interrupts
fillers(until=0xca)
ld(-28/2)                       #25
bra('NEXT')                     #26 Return from SYS calls
ld([vPC+1],Y)                   #27
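As a rough illustration of what the modified epilogue computes, the following Python sketch models the jump target. The low-byte offset of NEXTY and the example page numbers are assumptions made purely for illustration; the listings above do not state them.

```python
# Hypothetical low-byte offset of the thunk's entry point inside a dispatch
# page. Where NEXTY actually lives is not stated above; 0xCA is illustrative.
NEXTY_OFFSET = 0xCA


def sys_return_target(v_cpu_select, offset=NEXTY_OFFSET):
    """Model of: ld([vCpuSelect]); adda(1,Y); jmp(Y,'NEXTY')."""
    page = (v_cpu_select + 1) & 0xFF    # adda(1,Y): 8-bit add, result goes to Y
    return (page << 8) | (offset & 0xFF)


# Made-up interpreter page values, only to show that the same epilogue lands
# in whichever dispatcher made the call.
for name, select in {"vCPU": 0x02, "Forth": 0x5F}.items():
    print(f"{name}: SYS routine returns to {sys_return_target(select):#06x}")
```

The point of the sketch is that the return address depends only on vCpuSelect, so the SYS routine itself no longer hard-codes the vCPU page.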

qwertyface wrote: ↑24 Jun 2021, 12:09
Where code paths diverge in native code (think if/else), it's often easier to keep them separate than to merge them again if the path lengths differ significantly (if they're close they can be balanced by nops). If you only have one point that returns to the dispatch loop, you can only return one value. It is possible to adjust vTicks as you go. Often you can organise code to do common work up-front, and branch only at the end. Routines with lots of conditional code are simply not a good fit for a SYS function.

I tend to un-nest native code conditionals into separate branch paths. Not only does it make the code easier to understand and maintain (sometimes substantially), it also usually saves a few cycles. The tradeoff is ROM space, but that is the one Gigatron resource that we have in abundance.

qwertyface wrote: ↑24 Jun 2021, 12:09
Native routines can't easily call each other: SYS functions can restart themselves by winding back vPC, but to cause a different SYS function to be called next time they'd have to adjust the worst-case cost in the instruction stream (unless the required value is close to the current value), and change sysFn (which would be observable to the programmer, but probably fine if documented). Calling another routine has to happen in tail-position, as there's no way to return - there's no agreed call stack for native code. There are examples of cases where common native code is used by more than one routine: The right-shift table uses the vTmp variable to store a continuation address, and after lookup, returns to one of several "tails" in the following page. I've done similar things. v6502 has an instruction register and the SBC instruction rewrites it to cause ADC to be called instead.

Agreed that this is not a trivial problem to solve. A naive solution would be to keep a LUT of SYS call addresses and SYS ticks in RAM, and then create a new dispatcher that is used only for chaining SYS calls (a null entry or invalid index could then switch back into vCPU land/the main dispatcher). You could also forgo the SYS ticks in the table and embed them within the SYS routine itself, thereby reducing the LUT entry size from 3 bytes to 2 bytes (this would not work with the current SYS instruction, and would only make sense in new SYS routines if you wanted backwards compatibility).
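A minimal Python model of such a chaining dispatcher might look like the following. Everything here is an assumption for illustration: the table layout, the null entry encoding, the addresses and the abstract tick units are invented, and the real thing would of course be native code.

```python
NULL_ENTRY = (0x0000, 0)    # hypothetical terminator: hand control back to vCPU


def run_chain(lut, routines, state, slice_ticks):
    """Walk a table of (address, ticks) entries; return the time slices used."""
    slices = 1
    remaining = slice_ticks
    for address, ticks in lut:
        if (address, ticks) == NULL_ENTRY:
            break                       # back to the main dispatcher
        if ticks > slice_ticks:
            raise ValueError("routine can never fit in one time slice")
        if ticks > remaining:
            slices += 1                 # not enough time left: wait for next slice
            remaining = slice_ticks
        routines[address](state)        # "jump" to the native routine
        remaining -= ticks
    return slices


def increment(state):
    state["x"] += 1


def double(state):
    state["x"] *= 2


# Made-up addresses standing in for SYS entry points.
routines = {0x1000: increment, 0x1010: double}
lut = [(0x1000, 30), (0x1010, 40), (0x1000, 30), NULL_ENTRY]

state = {"x": 1}
slices_used = run_chain(lut, routines, state, slice_ticks=74)
print(state["x"], slices_used)
```

Each entry here is an address plus a tick count, which corresponds to the 3-byte layout; the 2-byte variant would drop the tick count and have each routine report its own cost.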

qwertyface wrote: ↑24 Jun 2021, 12:09
Related to the previous: long SYS functions are slower than you might expect, because they cause vCPU to stall while waiting for a long enough time slice. It's advantageous to split things up into smaller chunks. Some algorithms can naturally do part of the work, and leave parameters in a state where they pick up where they left off when restarted, but it's not so obvious how to do it in other cases (perhaps requiring a flag to be passed saying "this is the first call", which is then cleared).

100% true. SYS routines should be coded to fit into one of the 28, 36, 48, or 74 cycle breakpoints to make maximum use of the typical non-vblank 148 cycle scanline. But an 80 cycle SYS routine that is 10 times faster than the equivalent vCPU routine, even with its only slightly higher than 50% efficiency, is still going to smack-a-doodle the equivalent vCPU routine in terms of execution speed. Perfect examples are the ROMvX0 signed 16-bit mult/div SYS routines, which are on average 350% to 400% faster than equivalent vCPU code (measured in applications that can trivially switch between SYS/vCPU).
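The arithmetic behind those breakpoints can be sketched in a few lines of Python. This is a deliberately simplified model that only asks how many back-to-back calls of a given cost fit in a 148 cycle budget; it ignores dispatch overhead and anything else that shares the scanline.

```python
SCANLINE_CYCLES = 148   # typical non-vblank budget quoted above


def slice_usage(cost, budget=SCANLINE_CYCLES):
    """Return (calls that fit in the budget, fraction of the budget used)."""
    calls = budget // cost
    return calls, calls * cost / budget


for cost in (28, 36, 48, 74, 80):
    calls, used = slice_usage(cost)
    print(f"{cost:3d} cycles: {calls} call(s) per scanline, {used:.0%} of the budget")
```

Under this model a 74 cycle routine fits exactly twice, while an 80 cycle routine fits only once and uses about 54% of the budget, which is where the "slightly higher than 50%" figure comes from.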

qwertyface wrote: ↑24 Jun 2021, 12:09
In the right-shift example mentioned above, each SYS function or instruction that makes use of the table needs a different tail in page 6 (because they all need to return different costs to their dispatch loop). These routines do not tend to include any branching; their best case cost is the same as their worst case cost. This page is getting more and more crowded, and I think there may come a time when no more code can easily be added.

I have already modified page 6 by removing and moving some of the SYS calls there (see one of my previous posts), and would eventually like to retire all the shift-right SYS calls in favour of a more space-efficient (if slightly longer in execution time) generic routine.

qwertyface wrote: ↑24 Jun 2021, 12:09

If we introduced a consistent notion of a continuation address as a variable, code could set this and then yield to the display loop - like a more general version of self-restarting SYS functions. This is somewhat similar to how vTmp is used by the right-shift table, specifying which code to run after an operation. Using sysFn for this might be ok. In many cases code will know precisely the time required for the next step, so perhaps we could provide that, and avoid unnecessary waiting, alternatively always allow code to run for maxTicks. We'd have shared code that looked at vTicks and resumes immediately if possible - almost like a new virtual machine (and certainly requiring a new value for vCpuSelect).

This sounds completely feasible.

qwertyface wrote: ↑24 Jun 2021, 12:09
If we introduced a page indirection when returning from SYS functions or similar (probably through vCpuSelect - which might need some code to be moved), code that has finished its job (rather than just yielding before continuing) could either return to a virtual machine (as in SYS functions), or to some other native code through the continuation address. It could turn SYS functions into reusable subroutines, especially if...
We could take this further and have a limited stack, and routines to push and restore the continuation address - potentially very useful, even if it's a little slow. We'd certainly need to be able to save and restore vCpuSelect anyway. Where we put this would be up for debate. Up until now using the vCPU stack in the zero-page would seem the obvious thing to do, but with at67's upcoming changes, maybe not. I'm not sure that we need a big stack - perhaps enough to store vCpuSelect, and 3 or 4 other calls?

Having a stack for native code SYS routines would certainly offer some major advantages for reusable subroutines, local vars and potentially even recursion.
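As a sketch of how small such a stack could be, here is a Python model of a bounded continuation stack. The class, its depth and the tagged entries are hypothetical; the only idea taken from the discussion is that the bottom slot holds the vCpuSelect value to return to, with a handful of native continuation addresses above it.

```python
class ContinuationStack:
    """Tiny fixed-depth stack of continuation entries (a model, not ROM code)."""

    def __init__(self, depth=5):
        self._depth = depth
        self._slots = []

    def push(self, entry):
        if len(self._slots) >= self._depth:
            raise OverflowError("native call stack is full")
        self._slots.append(entry)

    def pop(self):
        if not self._slots:
            raise IndexError("native call stack is empty")
        return self._slots.pop()


stack = ContinuationStack(depth=5)      # vCpuSelect plus 4 nested calls

# An interpreter (page value made up) calls routine A, which calls routine B.
stack.push(("interpreter", 0x02))       # where to go once A is completely done
stack.push(("native", 0x1234))          # A's continuation after B returns

first_return = stack.pop()              # B returns into the middle of A
second_return = stack.pop()             # A returns to its interpreter
print(first_return, second_return)
```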

Post by **qwertyface** » 05 Jul 2021, 13:04

Thanks for taking the time to reply - I'm glad that you don't think my thoughts are completely crazy! I still want to have a play with this idea, but it's about three down on my stack of Gigatron things I want to do.

I see a couple of major issues with the approach I described above:

If we imagine that we have a "yield" routine that checks vTicks, and potentially either resumes immediately or returns to the display loop, then there are three bits of information that routine needs - the ticks taken (or ticks saved) so far, the ticks required for the next period of execution, and the continuation address. We can write directly to a variable for the continuation address, and pass one of the other pieces in AC, but what about the other? I think this might become clearer with experimentation - I don't know if in practice we always need all of the information.
Variable space. Is there a backwards compatible place that we could put this information? I kinda wish Marcel had allocated some of page 0 as "for future use".
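For what it's worth, the decision such a "yield" routine has to make can be modelled in a few lines of Python. This is only a sketch of the idea: the function name, the argument order and the abstract tick units are assumptions, and it sidesteps the real question of where the three values would live.

```python
def native_yield(v_ticks, ticks_used, ticks_needed, continuation):
    """Decide whether native code may carry on within the current time slice.

    v_ticks      - budget left when the current step started
    ticks_used   - cost of the step that has just finished
    ticks_needed - worst-case cost of the next step
    continuation - address of the next step
    """
    v_ticks -= ticks_used
    if v_ticks >= ticks_needed:
        return ("resume", continuation, v_ticks)        # jump straight to the next step
    return ("display_loop", continuation, v_ticks)      # save the address, try next slice


print(native_yield(50, 20, 25, 0x1200))   # enough budget left, so it resumes
print(native_yield(50, 20, 35, 0x1200))   # not enough, so it yields
```

In the second case the continuation address has to survive until the next time slice, which is why it needs a variable of its own rather than a register.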

Post by **at67** » 19 Jul 2021, 09:41

qwertyface wrote: ↑05 Jul 2021, 13:04 Variable space. Is there a backwards compatible place that we could put this information? I kinda wish Marcel had allocated some of page 0 as "for future use".

I guess this is with future ROMs in mind; if so, you could just use the 0x3X space. Marcel reserved 0x30 to 0x33 for VBlank interrupts and I have extended that to 0x30 to 0x34 for ROMvX0.

As long as each ROM clearly defines how it uses this area of RAM and ROM authors collaborate, I don't see there being too many serious issues. Obviously old SW would need to be re-compiled/re-assembled with the new zero page limitations in mind, but using the GT1 bespoke ROM naming scheme that Marcel defined, I also don't see that as a real problem.
