
Re: New vCPU instructions 2.0

Posted: 03 May 2021, 20:56
by lb3361
I didn't just mean it for prefix instructions. I just observe that half of the page3 instructions waste three cycles adjusting vPC. For these instructions, vPC is adjusted twice: once in the instruction body (+/-1) and once in the dispatch code (+2). Three cycles, that's 10% of maxTick, an obvious target for optimisation. This can be fixed by having three different REENTERs in page 3, at the cost of maybe 15 instruction slots, which is sizeable.

For comparison, moving the opcode implementation out of page 3 costs, a priori, six cycles. You were able to reduce that to around two or three cycles on average with hand optimization, which is quite a feat. Recovering the three cycles lost adjusting vPC on half the instructions is worth about half of what you've already achieved.
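As a rough check of the figures above, the overhead can be worked out directly. This is only a sketch: the 30-cycle value for maxTick is an assumption inferred from the "10%" remark, and the names are made up for illustration.

```python
# Hypothetical figures: only the 3-cycle fix-up and the "10% of maxTick"
# ratio come from the post; the 30-cycle slot length is an assumption.
MAX_TICK_CYCLES = 30     # assumed length of the longest instruction slot
VPC_FIXUP_CYCLES = 3     # cycles spent re-adjusting vPC in the instruction body


def fixup_overhead(max_tick_cycles: int, fixup_cycles: int) -> float:
    """Fraction of the longest instruction slot spent on the vPC fix-up."""
    return fixup_cycles / max_tick_cycles


def average_saving(fixup_cycles: int, affected_fraction: float) -> float:
    """Average cycles recovered per instruction if only some are affected."""
    return fixup_cycles * affected_fraction


overhead = fixup_overhead(MAX_TICK_CYCLES, VPC_FIXUP_CYCLES)
saving = average_saving(VPC_FIXUP_CYCLES, 0.5)   # "half of the page3 instructions"
print(f"overhead: {overhead:.0%}, average saving: {saving} cycles")
```

With half of the instructions affected, the average gain is 1.5 cycles per instruction, which is the basis for the comparison with the three to four cycles already recovered by hand optimization.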

In any case, since you're the one doing this work, you not only decide what you're doing, but you most certainly will beat me to it. :-)

Re: New vCPU instructions 2.0

Posted: 04 May 2021, 11:10
by qwertyface
Just a couple of comments:

I see you've moved some code to page 0. If I ever complete and merge my Forth, I need a short routine in page 0 (currently 5 instructions). Could you try to leave at least that much space for my purposes? The location isn't important.

I see that you're making changes to vCpuSelect. In previous ROMs, this points to the page below the interpreter page to save space inside the interpreter, but the additional jump costs time on each entry. If you're moving to multiple interpreter pages, and have saved some space in page three, perhaps this could be changed to point at the page itself (with a defined layout for an interpreter entry page). As well as saving the jump on each entry, this could potentially allow sysFunctions to use this to jump to the right interpreter page, rather than always returning to vCPU, although I haven't thought carefully about the consequences of this.

Re: New vCPU instructions 2.0

Posted: 04 May 2021, 18:06
by at67
lb3361 wrote: 03 May 2021, 20:56 I didn't just mean it for prefix instructions. I just observe that half of the page3 instructions waste three cycles adjusting vPC. For these instructions, vPC is adjusted twice: once in the instruction body (+/-1) and once in the dispatch code (+2). Three cycles, that's 10% of maxTick, an obvious target for optimisation. This can be fixed by having three different REENTERs in page 3, at the cost of maybe 15 instruction slots, which is sizeable.
This was one of the first experiments I tried. I noticed very early on that a lot of my instructions required a vPC fix-up. I also noticed that, now that I don't have the same limitation as Marcel (i.e. prioritising page3 ROM space), I could effectively just calculate the correct vPC in each instruction's implementation. The big issues that I found were backwards code compatibility and timing issues with EXIT and RESYNC.

1) Emulators, .gt1 files, vCPU code and GCL code all require and expect vPC to be pre-incremented by 2 by the firmware, so they all adjust vPC by -2 before using any vPC-aware instructions. If we remove the pre-increment by 2 from dispatch and fetch the opcode and operands at the current vPC, how would we fix all the old software, apps and tools that would break?

2) EXIT and RESYNC seem to rely on a fixed cycle path that is matched with dispatch (in terms of timing). If we reduce the cycle time of dispatch by 3, how do we guarantee that EXIT and RESYNC will still produce correct timing results?
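The compatibility problem in point 1) can be illustrated with a small model. This is a sketch under stated assumptions, not the ROM's dispatch code: it assumes only the low byte of vPC is incremented (so vPC wraps within its page), and the opcode and operand byte values are placeholders.

```python
def dispatch(mem: dict, vpc: int):
    """Model of a dispatch that pre-increments vPC by 2 before fetching.

    Assumption: only the low byte is incremented, so vPC wraps within
    its 256-byte page. Returns (new_vpc, opcode, operand).
    """
    lo = (vpc + 2) & 0xFF
    vpc = (vpc & 0xFF00) | lo
    opcode = mem.get(vpc, 0)
    operand = mem.get((vpc & 0xFF00) | ((lo + 1) & 0xFF), 0)
    return vpc, opcode, operand


def dispatch_no_preincrement(mem: dict, vpc: int):
    """Hypothetical alternative: fetch at the current vPC, no pre-increment."""
    opcode = mem.get(vpc, 0)
    operand = mem.get((vpc & 0xFF00) | ((vpc + 1) & 0xFF), 0)
    return vpc, opcode, operand


def entry_vpc(target: int) -> int:
    """vPC value existing software stores so that execution starts at target."""
    return (target & 0xFF00) | ((target - 2) & 0xFF)


# Placeholder program bytes at 0x0200 (values are illustrative only).
program = {0x0200: 0x59, 0x0201: 0x2A}
start = entry_vpc(0x0200)                      # old software writes target - 2

vpc, opcode, operand = dispatch(program, start)
_, wrong_opcode, _ = dispatch_no_preincrement(program, start)
print(hex(start), hex(vpc), hex(opcode), hex(wrong_opcode))
```

In this model, software that stores `target - 2` works with the pre-incrementing dispatch but fetches from the wrong address if the pre-increment is removed, which is the breakage described in point 1).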

What we basically have now is page3 as a jump table to *fast* vCPU/commonly used instructions and PREFIX as an extension to slower/less commonly used instructions (but still potentially extremely useful in replacing old school vCPU streams that thrashed vAC). There are vast quantities of unused ROM space, so losing 15 instruction slots in page3 (which is 5 to 7 *fast* vCPU instructions) to save 3 words of ROM space per vCPU instruction doesn't seem worth it to me. I would rather just do the vPC fix-up in each instruction's implementation (and waste a tiny percentage of ROM space) than lose those 5 to 7 *fast* vCPU instructions.

P.S. If you or anyone else has answers to these questions or can spot flaws in my reasoning, please don't hesitate to post.

Re: New vCPU instructions 2.0

Posted: 04 May 2021, 18:34
by at67
qwertyface wrote: 04 May 2021, 11:10 I see you've moved some code to page 0. If I ever complete and merge my Forth, I need a short routine in page 0 (currently 5 instructions). Could you try to leave at least that much space for my purposes? The location isn't important.
I used a 3 slot launchpad to move SYS_Reset_88's implementation out of page0 before I moved ADDW into page0, so currently there is a contiguous segment of 31 unused nops within page0. I was planning on putting more *fast* SYS launch pads there, but we could just cordon them all off and leave them unused for FORTH and future expansion.
qwertyface wrote: 04 May 2021, 11:10 I see that you're making changes to vCpuSelect. In previous ROMs, this points to the page below the interpreter page to save space inside the interpreter, but the additional jump costs time on each entry. If you're moving to multiple interpreter pages, and have saved some space in page three, perhaps this could be changed to point at the page itself (with a defined layout for an interpreter entry page). As well as saving the jump on each entry, this could potentially allow sysFunctions to use this to jump to the right interpreter page, rather than always returning to vCPU, although I haven't thought carefully about the consequences of this.
ROMvX0 works in exactly the same way as previous ROMs with respect to vCpuSelect; the only thing that has been added is a fast path for interrupt re-syncing when interrupting the PREFIX instruction. That is, interrupt entry (vBlankFirst#82) saves vCpuSelect to 0x34, and interrupt exit (vRTI#15) restores it and jumps to the correct page it references. This allows the interrupt routine to sync back up into vCPU land on the same time-slice instead of having to follow the RESYNC path and sync on the next available scan-line's time-slice.

You should be able to return to the correct interpreter page in SYS calls in exactly the same way: save vCpuSelect somewhere, do your stuff, restore it, increment it, then jump to it (as seen in vRTI#15).
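The save, restore, increment and jump sequence can be modelled roughly as follows. This is a sketch of the vBlankFirst#82 and vRTI#15 listing in this post, not ROM code: the save slots 0x30 to 0x34 and the PREFX3 value 0x21 come from the listing, while the register dictionary, the function names, the 0x02 value for hi('ENTER') and the interrupt vector address are assumptions for illustration.

```python
def irq_enter(regs: dict, zp: dict, irq_vector: int, enter_page: int) -> None:
    """Save vPC, vAC and vCpuSelect to 0x30..0x34, then set up the handler."""
    zp[0x30] = regs["vPC"] & 0xFF
    zp[0x31] = regs["vPC"] >> 8
    zp[0x32] = regs["vAC"] & 0xFF
    zp[0x33] = regs["vAC"] >> 8
    zp[0x34] = regs["vCpuSelect"]
    # vPC is set to the vector minus 2 (low byte only), as in the listing.
    regs["vPC"] = (irq_vector & 0xFF00) | ((irq_vector - 2) & 0xFF)
    # The old vCpuSelect is handed to the handler in vAC's high byte.
    regs["vAC"] = regs["vCpuSelect"] << 8
    regs["vCpuSelect"] = enter_page          # back to the regular vCPU page


def irq_exit(regs: dict, zp: dict) -> int:
    """Restore the saved state and return the interpreter page to jump to."""
    regs["vPC"] = zp[0x30] | (zp[0x31] << 8)
    regs["vAC"] = zp[0x32] | (zp[0x33] << 8)
    regs["vCpuSelect"] = zp[0x34]
    return (regs["vCpuSelect"] + 1) & 0xFF   # restore, increment, jump


# Interrupt arrives between PREFX3 and the prefixed instruction.
regs = {"vPC": 0x0242, "vAC": 0x1234, "vCpuSelect": 0x21}
zp = {}
irq_enter(regs, zp, irq_vector=0x0600, enter_page=0x02)
during = dict(regs)
resume_page = irq_exit(regs, zp)
print(hex(resume_page), {k: hex(v) for k, v in regs.items()})
```

In this model the handler runs on the default page, and the exit path resumes on page 0x22, so the pending prefixed instruction is still decoded by the PREFX3 page.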

W.R.T. saving the branch at ENTER, I haven't thoroughly investigated this, but I would guess it would be quite difficult to remove it without re-writing a large portion of the vCPU instruction set and SYS calls (which probably wouldn't be all that difficult, just time consuming). I'm also not sure what effect it would have on timing in the rest of the system; at first glance it would seem that adjusting some of the overhead defines for the runVcpu macro would be all that is required.

Code:

label('vBlankFirst#82')
st([0x30])                      #82 Save vPC
ld([vPC+1])                     #83
st([0x31])                      #84
ld([vAC])                       #85 Save vAC
st([0x32])                      #86
ld([vAC+1])                     #87
st([0x33])                      #88
ld([vCpuSelect])                #89 Save vCpuSelect for PREFIX
st([0x34])                      #90
ld([Y,vIRQ_v5])                 #91 Set vPC to vIRQ
suba(2)                         #92
st([vPC])                       #93
ld([Y,vIRQ_v5+1])               #94
st([vPC+1])                     #95
ld([vCpuSelect])                #96 Handler must save this if needed
st([vAC+1])                     #97
ld(0)                           #98
st([vAC])                       #99
ld(hi('ENTER'))                 #100 Set vCpuSelect to ENTER (=regular vCPU)
st([vCpuSelect])                #101
runVcpu(186-102-extra,          #102 Application cycles (scan line 0)
    '---D line 0 timeout with irq',
    returnTo='vBlankFirst#186')
    
    
# Interrupt handler:
#       STW  $xx        -> optionally store vCpuSelect
#       ... IRQ payload ...
# either:
#       LDWI $400
#       LUP  0          -> vRTI and don't switch interpreter (immediate resume)
# or:
#       LDWI $400
#       LUP  $xx        -> vRTI and switch interpreter type as stored in [$xx]
fillers(until=251-13)
label('vRTI#15')
ld([0x30])                      #15 Continue with vCPU in the same timeslice (faster)
st([vPC])                       #16
ld([0x31])                      #17
st([vPC+1])                     #18
ld([0x32])                      #19
st([vAC])                       #20
ld([0x33])                      #21
st([vAC+1])                     #22
ld([0x34])                      #23 Restore vCpuSelect for PREFIX
st([vCpuSelect])                #24
adda(1,Y)                       #25 Jump to correct PREFIX page
jmp(Y,'REENTER')                #26
ld(-30/2)                       #27
# vRTI entry point
assert(pc()&255 == 251)         # The landing offset 251 for LUP trampoline is fixed
beq('vRTI#15')                  #13 vRTI sequence
adda(1,X)                       #14
ld(hi('vRTI#18'),Y)             #15 Switch and wait for end of timeslice (slower)
jmp(Y,'vRTI#18')                #16
st([vTmp])                      #17

Re: New vCPU instructions 2.0

Posted: 06 May 2021, 11:23
by lb3361
I am convinced by your argument about the risk of causing creeping bugs in old software.

Re: New vCPU instructions 2.0

Posted: 06 May 2021, 17:29
by at67
Here is what the implementation of PREFX3, its dispatch page and some corresponding instructions look like now. You'll note the following:
  • Reset to the default vCPU page is now handled in dispatch.
  • vPC fix up is handled in PREFX3.
  • Parsing of the 2nd instruction operand, (PREFX3 instructions always end up having 2 operands), is handled in PREFX3.
  • The 2nd operand is stored in sysArgs+7, this is a valid storage location as the only code that can modify sysArgs+7 between PREFX3 and the PREFIX instruction is an interrupt and interrupts are required to save and restore sysFn and sysArgs+n registers anyway.
  • The programmer must be aware that PREFX3 modifies sysArgs+7, which could be an issue if relying on sysArgs+7 to hold state and mixing PREFIX instructions, (this is bad programming practice, but I already do it in the gtBASIC runtime for a couple of the lower sysArgs registers).
  • PREFX3 has increased in cycle time from 22 to 26 cycles.
  • The above changes have allowed PREFX3 instructions to be substantially more dense, thus allowing instruction functionality, (i.e. MOVW), that was not possible before.
P.S. MOVW is 4 bytes in length, like an LDW/STW pair, and uses slightly more cycles than LDW/STW, so you may wonder what the point of it is: it doesn't destroy the contents of vAC, which can be extremely handy in some situations.
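The difference between MOVW and an LDW/STW pair can be sketched with a small model. The opcode values 0xC7 (PREFX3) and 0x19 (MOVW) are taken from the listing in this post; the byte order (prefix, opcode, first operand, second operand) is inferred from how prefx3#13 fetches the second operand, and the state dictionary, function names and variable addresses are assumptions for illustration.

```python
PREFX3 = 0xC7   # opcode of PREFX3 in page 3, from the listing
MOVW = 0x19     # MOVW's slot in the PREFX3 page, from the listing


def encode_movw(src: int, dst: int) -> bytes:
    """Assumed 4-byte layout: prefix, opcode, src address, dst address."""
    return bytes([PREFX3, MOVW, src & 0xFF, dst & 0xFF])


def exec_movw(state: dict, src: int, dst: int) -> None:
    """Copy a 16-bit zero page variable; vAC is left untouched."""
    zp = state["zp"]
    zp[dst] = zp[src]                                # dst.lo = src.lo
    zp[(dst + 1) & 0xFF] = zp[(src + 1) & 0xFF]      # dst.hi = src.hi


def exec_ldw_stw(state: dict, src: int, dst: int) -> None:
    """Same copy done through vAC, which overwrites vAC."""
    zp = state["zp"]
    state["vAC"] = zp[src] | (zp[(src + 1) & 0xFF] << 8)
    zp[dst] = state["vAC"] & 0xFF
    zp[(dst + 1) & 0xFF] = state["vAC"] >> 8


def fresh_state() -> dict:
    zp = bytearray(256)
    zp[0x50], zp[0x51] = 0x34, 0x12      # hypothetical source variable = 0x1234
    return {"zp": zp, "vAC": 0xBEEF}


a = fresh_state()
exec_movw(a, 0x50, 0x60)
b = fresh_state()
exec_ldw_stw(b, 0x50, 0x60)
print(hex(a["vAC"]), hex(b["vAC"]), encode_movw(0x50, 0x60).hex())
```

Both paths leave the same bytes in the destination variable; only the MOVW path keeps the previous value of vAC.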

Code:

# pc = 0x03c7, Opcode = 0xc7
# Instruction PREFX3: switches instruction page to 0x2200
# Original idea by lb3361, see https://forum.gigatron.io/viewtopic.php?p=2099#p2099
label('PREFX3')
ld(hi('prefx3#13'),Y)           #10 #12
jmp(Y,'prefx3#13')              #11
ld(0x21)                        #12 ENTER is at $(n-1)ff, where n = instruction page
    .
    .
    .
# PREFX3 implementation
# Original idea by lb3361, see https://forum.gigatron.io/viewtopic.php?p=2099#p2099
label('prefx3#13')
st([vCpuSelect])                #13
ld([vPC])                       #14 Advance vPC
adda(2)                         #15
st([vPC],X)                     #16
adda(1,X)                       #17
ld([vPC+1],Y)                   #18
ld([Y,X])                       #19
st([sysArgs+7])                 #20 Second operand
ld([vCpuSelect])                #21
adda(1,Y)                       #22
jmp(Y,'NEXTY')                  #23
ld(-26/2)                       #24
    .
    .
    .
#-----------------------------------------------------------------------
#       PREFX3 instruction page
#-----------------------------------------------------------------------
# Original idea by lb3361, see https://forum.gigatron.io/viewtopic.php?p=2099#p2099
#
bra('.next2')                   #0 Enter at '.next2' (so no startup overhead)
# --- Page boundary ---
align(0x100,size=0x100)
ld([vPC+1],Y)                   #1

# Fetch next instruction and execute it, but only if there are sufficient
# ticks left for the slowest instruction.
adda([vTicks])                  #0 Track elapsed ticks (actually counting down: AC<0)
blt('EXIT')                     #1 Escape near time out
st([vTicks])                    #2
ld([vPC])                       #3 PREFX3 is 1 byte, vPC has been incremented by 2
suba(1,X)                       #4
st(vCpuSelect,[vCpuSelect])     #5 Reset to default vCPU page
ld([Y,X])                       #6 Fetch opcode (actually a branch target)
st([Y,Xpp])                     #7 Just X++
bra(AC)                         #8 Dispatch
ld([Y,X])                       #9 Prefetch operand

# Resync with video driver and transfer control
adda(maxTicks)                  #3
bgt(pc()&255)                   #4 Resync
suba(1)                         #5
ld(hi('vBlankStart'),Y)         #6
jmp(Y,[vReturn])                #7 To video driver
ld(0)                           #8 AC should be 0 already. Still..
assert vCPU_overhead ==          9

# pc = 0x2211, Opcode = 0x11
# Instruction ST2: Store vAC.lo into 16bit immediate address, (26 cycles)
# Original idea by lb3361, see https://forum.gigatron.io/viewtopic.php?p=2135#p2135
label('ST2')
ld(hi('st2#13'),Y)              #10
jmp(Y,'st2#13')                 #11
ld(AC,X)                        #12

# pc = 0x2214, Opcode = 0x14
# Instruction STW2: Store vAC into 16bit immediate address, (28 cycles)
# Original idea by lb3361, see https://forum.gigatron.io/viewtopic.php?p=2135#p2135
label('STW2')
ld(hi('stw2#13'),Y)             #10
jmp(Y,'stw2#13')                #11
ld(AC,X)                        #12

# pc = 0x2217, Opcode = 0x17
# Instruction XCHG: Swap two zero page variables, (30 cycles)
label('XCHG')
ld(hi('xchg#13'),Y)             #10
jmp(Y,'xchg#13')                #11
# dummy                         #12
#
# pc = 0x2219, Opcode = 0x19
# Instruction MOVW: Move 16bits from src zero page var to dst zero page var, (30 cycles)
label('MOVW')
ld(hi('movw#13'),Y)             #10
jmp(Y,'movw#13')                #11
# dummy                         #12
#
    .
    .
    .
# ST2 implementation
# Original idea by lb3361, see https://forum.gigatron.io/viewtopic.php?p=2135#p2135
label('st2#13')
ld([sysArgs+7],Y)               #13 Second operand
ld([vAC])                       #14
st([Y,X])                       #15
ld(hi('NEXTY'),Y)               #16
jmp(Y,'NEXTY')                  #17
ld(-20/2)                       #18

# STW2 implementation
# Original idea by lb3361, see https://forum.gigatron.io/viewtopic.php?p=2135#p2135
label('stw2#13')
ld([sysArgs+7],Y)               #13 Second operand
ld([vAC])                       #14
st([Y,Xpp])                     #15
ld([vAC+1])                     #16
st([Y,X])                       #17
ld(hi('NEXTY'),Y)               #18
jmp(Y,'NEXTY')                  #19
ld(-22/2)                       #20

# XCHG implementation
label('xchg#13')
st([sysArgs+6])                 #13 1st var
ld([sysArgs+7],X)               #14 2nd var
ld([X])                         #15
st([vTmp])                      #16
ld([sysArgs+6],X)               #17
ld([X])                         #18
ld([sysArgs+7],X)               #19
st([X])                         #20
ld([sysArgs+6],X)               #21
ld([vTmp])                      #22
st([X])                         #23
ld(hi('NEXTY'),Y)               #24
jmp(Y,'NEXTY')                  #25
ld(-28/2)                       #26

# MOVW implementation
label('movw#13')
ld(AC,X)                        #13
adda(1)                         #14
st([vTmp])                      #15 address of src.hi
ld([X])                         #16 src.lo
ld([sysArgs+7],X)               #17 address of dst.lo
st([X])                         #18 dst.lo = src.lo
ld([vTmp],X)                    #19
ld([X])                         #20 src.hi
st([vTmp])                      #21
ld([sysArgs+7])                 #22
adda(1,X)                       #23 address of dst.hi
ld([vTmp])                      #24
st([X])                         #25 dst.hi = src.hi
ld(hi('NEXTY'),Y)               #26
jmp(Y,'NEXTY')                  #27
ld(-30/2)                       #28

Re: New vCPU instructions 2.0

Posted: 07 May 2021, 22:11
by lb3361
Very cool.

Re: New vCPU instructions 2.0

Posted: 10 May 2021, 11:24
by at67
Update:

I've added the following instructions to the PREFX3 page; the PREFX1 and PREFX2 pages are currently empty.
  • ADDWI <imm>, vAC += 16bit imm, 26+28 cycles.
  • SUBWI <imm>, vAC -= 16bit imm, 26+28 cycles.
  • ANDWI <imm>, vAC &= 16bit imm, 26+22 cycles.
  • XORWI <imm>, vAC ^= 16bit imm, 26+22 cycles.
  • ORWI <imm>, vAC |= 16bit imm, 26+22 cycles.
  • LDPX <xy var>, <colour var>, Load screen pixel using VTable indirection, 26+30 cycles.
  • STPX <xy var>, <colour var>, Store screen pixel using VTable indirection, 26+30 cycles.
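The five immediate instructions in the list can be sketched as operations on a 16-bit vAC. This is a model under assumptions, not the ROM implementation: it reads "26+N cycles" as the 26-cycle PREFX3 cost plus the instruction body, it assumes results wrap to 16 bits, and the table and function names are made up for illustration. LDPX and STPX are not modelled here.

```python
PREFX3_CYCLES = 26   # PREFX3 cost, as stated earlier in the thread

# name -> (operation on vAC and the 16-bit immediate, body cycles from the list)
IMMEDIATE_OPS = {
    "ADDWI": (lambda vac, imm: vac + imm, 28),
    "SUBWI": (lambda vac, imm: vac - imm, 28),
    "ANDWI": (lambda vac, imm: vac & imm, 22),
    "XORWI": (lambda vac, imm: vac ^ imm, 22),
    "ORWI":  (lambda vac, imm: vac | imm, 22),
}


def execute(name: str, vac: int, imm: int):
    """Return (new vAC, total cycles) for one immediate instruction.

    Assumption: the result is truncated to 16 bits.
    """
    operation, body_cycles = IMMEDIATE_OPS[name]
    result = operation(vac & 0xFFFF, imm & 0xFFFF) & 0xFFFF
    return result, PREFX3_CYCLES + body_cycles


for name, vac, imm in [
    ("ADDWI", 0xFFFF, 0x0001),
    ("SUBWI", 0x0000, 0x0001),
    ("ANDWI", 0x1234, 0x0FF0),
    ("XORWI", 0x00FF, 0x0F0F),
    ("ORWI", 0x1200, 0x0034),
]:
    result, cycles = execute(name, vac, imm)
    print(f"{name}: vAC={result:#06x}, {cycles} cycles")
```

Under this reading, ADDWI and SUBWI total 54 cycles and the three logical instructions total 48 cycles.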