New vCPU instructions 2.0

lb3361
Posts: 360
Joined: 17 Feb 2021, 23:07

Re: New vCPU instructions 2.0

Post by lb3361 »

Another question: Did you consider opcodes ST2 or STW2 that are the same as ST or STW but with two bytes of address?
Note that there is little need for the counterparts LD2/LDW2, because the combination LDWI+PEEK/DEEK is only four bytes long and exactly equivalent.

Context: I was trying to write python code to implement a pseudo-instruction _MOV(s,d) where s and d can be registers, addresses, or [AC].
Here is the best I could do. In this code, _LDI is either LDI(d) or LDWI(dd), _LDW is either LDW(d) or LDWI(dd);DEEK(), and T2/T3 are temporaries.
You'll note how DOKEA/DEEKA simplify some paths. See how MOV(anything,[AC]) or MOV(anything,zp) are compact, but MOV([AC],dddd) requires swapping AC and two scratch registers...


@vasm
def _MOV(s,d):
    '''Move word from reg/addr s to d.
       Also accepts [AC] for s or d.'''
    s = v(s)
    d = v(d)
    if s != d:
        if args.cpu > 5 and s == [AC] and is_zeropage(d):
            DEEKA(d)
        elif args.cpu > 5 and is_zeropage(s) and d == [AC]:
            DOKEA(s)
        elif d == [AC]:
            STW(T3)
            if s != AC:
                _LDW(s)
            DOKE(T3)
        elif is_zeropage(d):
            if s == [AC]:
                DEEK()
            elif s != AC:
                _LDW(s)
            if d != AC:
                STW(d)
        elif s == AC or s == [AC]:
            if s == [AC]:
                DEEK()
            STW(T3); _LDI(d)
            if args.cpu > 5:
                DOKEA(T3)
            else:
                STW(T2); LDW(T3); DOKE(T2)
        else:
            _LDI(d); STW(T2); _LDW(s); DOKE(T2)
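To make the size differences concrete, here is a rough byte-count sketch. The opcode sizes are assumptions inferred from the byte counts quoted in this thread (LDWI+DEEK is four bytes, DEEKA+ADDI+DEEKA is six), and `seq_size` is a hypothetical helper, not part of the assembler.

```python
# Assumed encoded sizes in bytes (opcode plus operands); illustrative only.
SIZE = {'LDI': 2, 'LDWI': 3, 'LDW': 2, 'STW': 2,
        'DEEK': 1, 'DOKE': 2, 'DEEKA': 2, 'DOKEA': 2}

def seq_size(*ops):
    '''Total size in bytes of a sequence of opcode names (hypothetical helper).'''
    return sum(SIZE[op] for op in ops)

# MOV(AC, dddd), where dddd is a 16-bit address outside the zero page:
old = seq_size('STW', 'LDWI', 'STW', 'LDW', 'DOKE')   # args.cpu <= 5
new = seq_size('STW', 'LDWI', 'DOKEA')                # args.cpu > 5

# MOV(zp, [AC]), without and with DOKEA:
zp_old = seq_size('STW', 'LDW', 'DOKE')
zp_new = seq_size('DOKEA')

print(old, new, zp_old, zp_new)
```

Under these assumed sizes, the store to a 16-bit address is where most of the bytes go, which is the case a two-byte-address store would target.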
For the long version, some combinations were so long that I prefer to call a generic routine...


def _LMOV(s,d):
    '''Move long from reg/addr s to d.
       Also accepts [AC] as s, and [AC] or [T2] as d.'''
    s = v(s)
    d = v(d)
    if s != d:
        if is_zeropage(d, 3):
            if is_zeropage(s, 3):
                _LDW(s); STW(d); _LDW(s+2); STW(d+2)      # 8 bytes
            elif args.cpu > 5:
                if s != [AC]:
                    _LDI(s)
                DEEKA(d); ADDI(2); DEEKA(d+2)             # 6-9 bytes
            elif s != [AC]:
                _LDW(s); STW(d); _LDW(s+2); STW(d+2)      # 12 bytes
            else:
                STW(T3); DEEK(); STW(d)
                _LDW(T3); ADDI(2); DEEK(); STW(d+2);      # 12 bytes
        elif is_zeropage(s, 3) and args.cpu > 5:
            if d == [T2]:
                _LDW(T2)
            elif d != [AC]:
                _LDI(d)                                   # destination address into AC
            DOKEA(s); ADDI(2); DOKEA(s+2)                 # 6-9 bytes
        else:
            if d == [AC]:
                STW(T2)
            if s == [AC]:
                STW(T3)
            if d != [AC] and d != [T2]:
                _LDI(d); STW(T2)
            if s != [AC] and s != [T3]:               # call sequence
                _LDI(s); STW(T3)                      # 5-13 bytes
            extern('_@_lcopy')
            _CALLI('_@_lcopy')  #   [T3..T3+4) --> [T2..T2+4)
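
For reference, the effect of the fallback routine, as described by the comment on the call, can be modelled on a flat byte array. This is only a sketch of the semantics: `lcopy`, `deek` and `doke` are hypothetical Python helpers, the T2/T3 addresses are made up for the example, and the real `_@_lcopy` is a vCPU routine whose implementation is not shown in this thread.

```python
T2, T3 = 0xC4, 0xC6          # hypothetical zero-page addresses of the scratch registers

def deek(mem, addr):
    '''Read a little-endian 16-bit word.'''
    return mem[addr] | (mem[addr + 1] << 8)

def doke(mem, addr, value):
    '''Write a little-endian 16-bit word.'''
    mem[addr] = value & 0xFF
    mem[addr + 1] = (value >> 8) & 0xFF

def lcopy(mem):
    '''Copy the 4 bytes at [T3..T3+4) to [T2..T2+4), pointers held in T3 and T2.'''
    src, dst = deek(mem, T3), deek(mem, T2)
    mem[dst:dst + 4] = mem[src:src + 4]

mem = bytearray(0x10000)
mem[0x0900:0x0904] = bytes([0x78, 0x56, 0x34, 0x12])   # long 0x12345678 at 0x0900
doke(mem, T3, 0x0900)                                   # source pointer
doke(mem, T2, 0x0A00)                                   # destination pointer
lcopy(mem)
```

After the call, the four bytes at 0x0A00 match those at 0x0900, which is what the call sequence above sets up by loading T3 and T2 before `_CALLI`.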
            
at67
Site Admin
Posts: 647
Joined: 14 May 2018, 08:29

Re: New vCPU instructions 2.0

Post by at67 »

Update:

I've managed to find another 14 instruction slots by rewriting ADDW, SUBW, XORI and BRA. The trick to getting ADDW to work was moving it to page0, so that it no longer needs a ld([vPC+1],Y) instruction.

I also used a launch pad to move the SYS_Exec_88 implementation out of page0, (to make room for the ADDW implementation).

There are two 3 slot launch pads available for two new instructions that I have left as spares for future additions. The only instruction left that can be moved out of page3 is SYS, but that's probably not possible due to timing limitations already built into current SYS calls.

There are 12 new instructions; a few of these are old ones that have been reinstated, i.e. LDNI, CMPI, COND. The JCC jump instructions are particularly handy, as they make the Gigatron's segmented memory map much simpler to deal with in ROMvX0.

* vAC = 16bits, .lo = 8bits, imm = 8bits
  • LDNI, vAC = -imm, 22 cycles.
  • COND, vAC = one of two imm's based on vAC = 0 or 1, 30 cycles.
  • ANDBK, vAC = var.lo & imm, 30 cycles.
  • ORBK, vAC = var.lo | imm, 30 cycles.
  • XORBK, vAC = var.lo ^ imm, 30 cycles.
  • CMPI, vAC = var.lo CMP imm, not a numerically correct 8bit unsigned subtraction, but good enough for test/branch/jump, 30 cycles.
  • JEQ, Jump to 16bit immediate address if vAC=0, 26 cycles.
  • JNE, Jump to 16bit immediate address if vAC!=0, 26 cycles.
  • JLT, Jump to 16bit immediate address if vAC<0, 24-26 cycles.
  • JGT, Jump to 16bit immediate address if vAC>0, 24-28 cycles.
  • JLE, Jump to 16bit immediate address if vAC<=0, 24-28 cycles.
  • JGE, Jump to 16bit immediate address if vAC>=0, 22-26 cycles.
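
As a quick reference for the conditions above, here is a small Python model of how the jump conditions read vAC as a signed 16-bit value. It is a sketch of the semantics listed in this post only, not of the ROM implementation; `to_signed16`, `jcc_taken` and `ldni` are hypothetical names.

```python
def to_signed16(vac):
    '''Interpret a 16-bit vAC value as a two's complement signed integer.'''
    vac &= 0xFFFF
    return vac - 0x10000 if vac & 0x8000 else vac

# Condition table for the JCC family, keyed by mnemonic.
JCC = {
    'JEQ': lambda v: v == 0,
    'JNE': lambda v: v != 0,
    'JLT': lambda v: v < 0,
    'JGT': lambda v: v > 0,
    'JLE': lambda v: v <= 0,
    'JGE': lambda v: v >= 0,
}

def jcc_taken(mnemonic, vac):
    '''True if the jump to the 16-bit immediate address would be taken.'''
    return JCC[mnemonic](to_signed16(vac))

def ldni(imm):
    '''LDNI: vAC = -imm for an 8-bit immediate, as a 16-bit value.'''
    return (-(imm & 0xFF)) & 0xFFFF

print(jcc_taken('JLT', 0xFFFF), jcc_taken('JGE', 0x7FFF), hex(ldni(1)))
```
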
P.S. I experimented with the BCC instructions to see if I could use a flag and 1 slot early execution to enable 3 byte and 2 byte versions of BCC, (3 byte for backwards compatibility and 2 byte for ROMvX0, thus saving a byte per BCC instruction in ROMvX0), and it is theoretically possible, but not with the current layout of instructions. The new JCC instructions have put this on the back burner for now.
at67
Site Admin
Posts: 647
Joined: 14 May 2018, 08:29

Re: New vCPU instructions 2.0

Post by at67 »

lb3361 wrote: 22 Apr 2021, 14:04 Another question: Did you consider opcodes ST2 or STW2 that are the same as ST or STW but with two bytes of address?
Note that there is little need for the counterparts LD2/LDW2, because the combination LDWI+PEEK/DEEK is only four bytes long and exactly equivalent.
There are two 3 slot launch pads available for 2 new instructions; we could fill them with these 2 candidates or use them to implement your earlier PREFIX instruction idea.

P.S. I've updated the first post in this thread with the current complete list of instructions and changed cycle times/names.
lb3361
Posts: 360
Joined: 17 Feb 2021, 23:07

Re: New vCPU instructions 2.0

Post by lb3361 »

You're the best person to make that call.
lb3361
Posts: 360
Joined: 17 Feb 2021, 23:07

Re: New vCPU instructions 2.0

Post by lb3361 »

About CMPHI/CMPHS not being much tested: it turns out that the complicated code to do this without CMPHI/CMPHS is even less reliable. See https://github.com/kervinck/gigatron-rom/issues/192. This deserves fixing...

I was working on the C runtime and I wanted to see how this was done in Tiny BASIC. After finding this bug, I checked that both the fix and the CMPHS/CMPHI method work by trying all 4 billion combinations of two 16-bit numbers. Good news: they work.
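
The underlying pitfall is that deciding a signed comparison from the sign of a 16-bit subtraction fails when the subtraction overflows. The sketch below illustrates that general problem and one common remedy (checking the operand signs first). It is an illustration under my own assumptions, not the Tiny BASIC code, the fix from the issue, or the CMPHS/CMPHI implementation, and it only checks boundary values rather than all 2^32 pairs.

```python
def s16(x):
    '''Two's complement signed value of a 16-bit word.'''
    x &= 0xFFFF
    return x - 0x10000 if x & 0x8000 else x

def naive_lt(a, b):
    '''Signed a < b decided from bit 15 of the 16-bit difference (buggy).'''
    return bool((a - b) & 0x8000)

def safe_lt(a, b):
    '''Signed a < b: if the signs differ the negative operand is smaller,
       otherwise the difference cannot overflow and its sign is reliable.'''
    if (a ^ b) & 0x8000:
        return bool(a & 0x8000)
    return bool((a - b) & 0x8000)

edge = [0x0000, 0x0001, 0x7FFE, 0x7FFF, 0x8000, 0x8001, 0xFFFE, 0xFFFF]
bad = [(a, b) for a in edge for b in edge
       if naive_lt(a, b) != (s16(a) < s16(b))]
ok = all(safe_lt(a, b) == (s16(a) < s16(b)) for a in edge for b in edge)
print(len(bad), ok)
```

The pair (0x7FFF, 0x8000) is a typical failure of the naive version: the difference wraps around and its sign bit claims that the largest positive number is smaller than the most negative one.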
at67
Site Admin
Posts: 647
Joined: 14 May 2018, 08:29

Re: New vCPU instructions 2.0

Post by at67 »

lb3361 wrote: 24 Apr 2021, 19:36 About CMPHI/CMPHS not being much tested: it turns out that the complicated code to do this without CMPHI/CMPHS is even less reliable. See https://github.com/kervinck/gigatron-rom/issues/192. This deserves fixing...
I saw that, but I just don't have time to explore it in any depth. If you have time to fix it and would like to do a pull request, please do whenever you get a chance.
lb3361 wrote: 24 Apr 2021, 19:36 I was working on the C runtime and I wanted to see how this was done in Tiny BASIC. After finding this bug, I checked that both the fix and the CMPHS/CMPHI method work by trying all 4 billion combinations of two 16-bit numbers. Good news: they work.
Very cool, thanks for testing them.
lb3361
Posts: 360
Joined: 17 Feb 2021, 23:07

Re: New vCPU instructions 2.0

Post by lb3361 »

Also, I just realized that the vSPH change does not only change PUSH and POP but also sys_Exec! There may be other places like that.

I just added support in my lcc for some of your instructions. Some of them are *very* useful. POKEI/DOKEI/POKEA/DOKEA save a lot of bytes.

I wasn't able to implement many of them because I do not know the order of the two operands, e.g. MOVB, MOVQ.

I still believe STW2 ST2 (STW and ST with 2 byte addresses) would be a win.

Thanks.


(I have a few busy weeks ahead of me. I may or may not have a showable compiler before that...)
at67
Site Admin
Posts: 647
Joined: 14 May 2018, 08:29

Re: New vCPU instructions 2.0

Post by at67 »

lb3361 wrote: 30 Apr 2021, 11:09 Also, I just realized that the vSPH change does not only change PUSH and POP but also sys_Exec! There may be other places like that.
sys_Exec is the only other area of code affected.
lb3361 wrote: 30 Apr 2021, 11:09 I just added support in my lcc for some of your instructions. Some of them are *very* useful. POKEI/DOKEI/POKEA/DOKEA save a lot of bytes.

I wasn't able to implement many of them because I do not know the order of the two operands, e.g. MOVB, MOVQ.
I'll update the original post's list of new instructions with a more complete description of parameters.
lb3361 wrote: 30 Apr 2021, 11:09 I still believe STW2 ST2 (STW and ST with 2 byte addresses) would be a win.
I've added the PREFIX instruction, and the first two instructions to make use of it are ST2 and STW2.

It all works very well, even with interrupts: you can seamlessly use the new PREFIX instructions within your main loop and within interrupts, and there are no race conditions or glitches.

It was surprisingly less drama-filled than I anticipated. There are two ways of returning from an interrupt: you can use the slow RESYNC way (vRTI#18), which waits for the next available scanline's time slice, or the much faster modified vRTI#15 way. I modified vRTI#15 to read a hard-coded zero page variable that contains a backup of vCpuSelect and use that to jump back to the correct page as well as restore vCpuSelect, e.g.:


# Interrupt handler:
#       STW  $xx        -> optionally store vCpuSelect
#       ... IRQ payload ...
# either:
#       LDWI $400
#       LUP  0          -> vRTI and don't switch interpreter (immediate resume)
# or:
#       LDWI $400
#       LUP  $xx        -> vRTI and switch interpreter type as stored in [$xx]
fillers(until=251-13)
label('vRTI#15')
ld([0x30])                      #15 Continue with vCPU in the same timeslice (faster)
st([vPC])                       #16
ld([0x31])                      #17
st([vPC+1])                     #18
ld([0x32])                      #19
st([vAC])                       #20
ld([0x33])                      #21
st([vAC+1])                     #22
#ld(hi('REENTER'),Y)             #23
ld([0xD5])                      #23
st([vCpuSelect])                #24
adda(1,Y)                       #25
jmp(Y,'REENTER')                #26
ld(-30/2)                       #27
# vRTI entry point
assert(pc()&255 == 251)         # The landing offset 251 for LUP trampoline is fixed
beq('vRTI#15')                  #13 vRTI sequence
adda(1,X)                       #14
ld(hi('vRTI#18'),Y)             #15 Switch and wait for end of timeslice (slower)
jmp(Y,'vRTI#18')                #16
st([vTmp])                      #17
I also swapped the two zero page variables (channel) and (vCpuSelect), so as to facilitate a fast reset of vCpuSelect, e.g.:


# old
0000      zeroConst     Constant value 0 (for arithmetic carry)
0001      memSize       Number of RAM pages detected at hard reset (64kB=0)
0002      (channel)     Sound channel update on current scanline
0003      (sample)      Accumulator for synthesizing next sound sample
0004      (reserved)    Reserved (Video extensions? MMU? v8808? ...?)
0005      (vCpuSelect)  Entry page of active interpreter (offset fixed to 255)

# new
0000      zeroConst     Constant value 0 (for arithmetic carry)
0001      memSize       Number of RAM pages detected at hard reset (64kB=0)
0002      (vCpuSelect)  Entry page of active interpreter (offset fixed to 255)
0003      (sample)      Accumulator for synthesizing next sound sample
0004      (reserved)    Reserved (Video extensions? MMU? v8808? ...?)
0005      (channel)     Sound channel update on current scanline
I first had to make sure that (channel) wasn't being restored/set/reset in the same way as (sample), otherwise it would be hard-coded to its old address like (sample) is.

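Swapping them now allows PREFIX instructions to reset vCpuSelect with one instruction instead of two, e.g. st(vCpuSelect,[vCpuSelect]). This works because the new address of vCpuSelect, 0x02, is also its default value (the default ENTER is at $02ff).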


# pc = 0x03c7, Opcode = 0xc7
# Instruction PREFIX: switches instruction page to 0x2200
label('PREFIX')
ld(hi('prefix#13'),Y)           #10 #12
jmp(Y,'prefix#13')              #11
ld(0x21)                        #12 ENTER is at $(n-1)ff, where n = instruction page
	.
	.
	.
# PREFIX implementation
label('prefix#13')
st([vCpuSelect])                #13
adda(1,Y)                       #14
ld([vPC])                       #15
suba(1)                         #16
st([vPC])                       #17
jmp(Y,'REENTER')                #18
ld(-22/2)                       #19
	.
	.
	.
# pc = 0x2214, Opcode = 0x14
# Instruction STW2: Store vAC into 16bit immediate address, (30 cycles)
label('STW2')
ld(hi('stw2#13'),Y)             #10
jmp(Y,'stw2#13')                #11
st(vCpuSelect,[vCpuSelect])     #12 reset to default vCPU page
	.
	.
	.
# STW2 implementation
label('stw2#13')
ld([vPC+1],Y)                   #13
st([vTmp])                      #14
st([Y,Xpp])                     #15 X++
ld([Y,X])                       #16
ld(AC,Y)                        #17
ld([vTmp],X)                    #18
ld([vAC])                       #19
st([Y,Xpp])                     #20
ld([vAC+1])                     #21
st([Y,X])                       #22
ld([vPC])                       #23
adda(1)                         #24
st([vPC])                       #25
ld(hi('NEXTY'),Y)               #26
jmp(Y,'NEXTY')                  #27
ld(-30/2)                       #28
PREFIX turned out to be an excellent addition and we now have roughly 80-100 new instructions available, with just one extra page as a jump table to implementations.
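
The dispatch idea in the listing above can be sketched as a toy Python model. The page numbers and opcodes (0xC7 for PREFIX, 0x14 for STW2) are taken from the comments in the listing; everything else (`step`, `op_prefix`, `op_stw2`, the `TABLES` dictionary) is a hypothetical illustration that ignores timing, vPC handling and the actual store.

```python
VCPU_SELECT = 0x02            # zero-page address of vCpuSelect after the swap
DEFAULT_PAGE = 0x03           # page holding the standard vCPU opcodes
PREFIX_PAGE = 0x22            # page holding the prefixed opcodes

zp = bytearray(256)
zp[VCPU_SELECT] = DEFAULT_PAGE - 1     # ENTER lives at $(n-1)ff

def op_prefix(zp):
    '''PREFIX: point the interpreter at the second opcode page.'''
    zp[VCPU_SELECT] = PREFIX_PAGE - 1

def op_stw2(zp):
    '''STW2 stub: it first restores the default page.
       Models st(vCpuSelect,[vCpuSelect]): the constant stored is the
       address itself, which only works because both equal 2.'''
    zp[VCPU_SELECT] = VCPU_SELECT

TABLES = {
    DEFAULT_PAGE: {0xC7: op_prefix},
    PREFIX_PAGE: {0x14: op_stw2},
}

def step(zp, opcode, trace):
    '''Dispatch one opcode through whichever page vCpuSelect selects.'''
    page = zp[VCPU_SELECT] + 1
    handler = TABLES[page][opcode]
    trace.append((page, handler.__name__))
    handler(zp)

trace = []
for opcode in (0xC7, 0x14):    # PREFIX followed by STW2
    step(zp, opcode, trace)
print(trace, zp[VCPU_SELECT])
```

In this model the prefixed page is active for exactly one opcode, and the interpreter is back on the default page as soon as the prefixed instruction starts executing.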

The other spare instruction slot is currently filled with a PEEKA+ instruction, but it can be moved to the PREFIX bank and we could create another PREFIX bank for even more instructions if needed.
lb3361
Posts: 360
Joined: 17 Feb 2021, 23:07

Re: New vCPU instructions 2.0

Post by lb3361 »

This is very cool.

st(vCpuSelect,[vCpuSelect]) :-O.
I suppose this one is worthy of Marcel.

In fact your Jcc instructions are no longer than the "short" Bcc ones, and they might even run as fast. This leaves very few reasons to use Bcc anymore.

Did you get a chance to try a different maxtick for the new page?

Why [$d5] and not [$34], for instance...
Last edited by lb3361 on 01 May 2021, 13:30, edited 1 time in total.
at67
Site Admin
Posts: 647
Joined: 14 May 2018, 08:29

Re: New vCPU instructions 2.0

Post by at67 »

lb3361 wrote: 01 May 2021, 13:18 In fact your Jcc instructions are no longer than the "short" Bcc ones, and they might even run as fast. This leaves very few reasons to use Bcc anymore.
I'm still going to try to shorten the BCC instructions to 2 bytes (whilst remaining backwards compatible). I did do a test using 0x80/0x00/bmi flag tests and it worked fine, but the current layout of the instructions means that there were unwanted pipeline overlaps, causing the flags to be overwritten for some of the instructions.

This will require a major re-organisation of page 3 and I am not sure it will 100% solve the issue, so I am going to leave BCC optimisation for the next ROM version.
lb3361 wrote: 01 May 2021, 13:18 Did you get a chance to try a different maxtick for the new page?
Not yet; I'll have a play with that tomorrow. I really wanted to release ROMvX0 this weekend and I may still do that, just with very little documentation (apart from source comments).

P.S. The original message contains updated parameter lists, name changes and the additional new instructions.