About a week ago I started experimenting with the idea of a 16-bit vCPU stack pointer in the native-code firmware. The reason is that the BASIC compiler needs a proper stack for local variables, proc parameters and recursion.
Currently the BASIC compiler has only 8 bytes of general-purpose stack space, located in the all-important zero page RAM, and nested procs must share it for their locals, params and recursion. As you can imagine, this is quite limiting and can make your code hard to write and debug when nesting procs that use any of those features.
I found an unused location within zero page (the reserved byte at 0x04), renamed it to vSPH, and proceeded to add the following code to all stack-aware instructions (PUSH, POP, LDLW, etc.):
Code:
ld(vSPH,Y)
ld([X]) --> ld([Y,X])
st([X]) --> st([Y,X])
Code:
ld(0) #17
st([vSP]) #18 vSP
st([vSPH]) #19 vSPH <-- new instruction
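To make the effect of vSPH concrete, here is a minimal Python sketch of a stack whose page is selected by a high byte, in the same spirit as ld(vSPH,Y) followed by an access to [Y,X]. It is a toy model, not the ROM code: the class name PagedStack, the chosen page 0x7F, the byte order and the assumption that the low byte wraps inside the page without carrying into vSPH are all illustrative guesses.

```python
# Toy model of a paged stack (illustrative only, not the ROM implementation).
class PagedStack:
    def __init__(self, page=0x00, sp=0x00):
        self.ram = bytearray(0x10000)  # full 64K address space
        self.vSP = sp & 0xFF           # low byte of the stack pointer
        self.vSPH = page & 0xFF        # high byte; 0 keeps the stack in zero page

    def _addr(self, offset=0):
        # Y (vSPH) selects the page, X (vSP) the offset inside it.
        # Assumption: the offset wraps within the page, no carry into vSPH.
        return (self.vSPH << 8) | ((self.vSP + offset) & 0xFF)

    def push(self, word):
        # Grow downwards by one 16-bit word, stored low byte first.
        self.vSP = (self.vSP - 2) & 0xFF
        self.ram[self._addr(0)] = word & 0xFF
        self.ram[self._addr(1)] = (word >> 8) & 0xFF

    def pop(self):
        # Read the word at the top of the stack, then shrink by one word.
        word = self.ram[self._addr(0)] | (self.ram[self._addr(1)] << 8)
        self.vSP = (self.vSP + 2) & 0xFF
        return word


# With the stack moved to its own page, deep nesting no longer eats zero page.
stack = PagedStack(page=0x7F)
for depth in range(100):
    stack.push(0x1000 + depth)
unwound = [stack.pop() for _ in range(100)]
```

With vSPH left at 0 the model behaves like the old zero-page stack, which is why clearing vSPH alongside vSP keeps existing programs working in this sketch.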
This is the list of instructions I have created, tested and added to an experimental ROM as well as to the assembler and BASIC compiler:
Code:
DEC <var> : (22 cycles), decrements a zero page variable's lower byte, borrow is ignored
DECW <var> : (28 cycles), decrements a zero page variable's 16bit value
INCW <var> : (26 cycles), increments a zero page variable's 16bit value
MOVQW <var>, imm : (30 cycles), loads a literal, (0..255), as a 16bit value into a zero page variable
MOVQ <var>, imm : (28 cycles), loads a literal, (0..255), as an 8bit value into a zero page variable
DBNE <var>, lab : (28 cycles), decrements and branches for !=0 on a zero page variable
DBGE <var>, lab : (30 cycles), decrements and branches for >=0 on a zero page variable
XCHG <var0>, <var1> : (30 cycles), exchanges bytes of any zero page variables
MOV <src>, <dst> : (28 cycles), copies a byte from src to dst, where src and dst are zero page variables
PEEKA <var> : (24 cycles), peek a byte from [vAC] to var, where var is a zero page variable
DEEKA <var> : (30 cycles), deek a word from [vAC] to var, where var is a zero page variable
POKEA <var> : (22 cycles), poke a byte from var to [vAC], where var is a zero page variable
DOKEA <var> : (30 cycles), doke a word from var to [vAC], where var is a zero page variable
NOTW <var> : (26 cycles), boolean inversion of any zero page variable
NEGW <var> : (28 cycles), arithmetic negate of any zero page variable
LSRB <var> : (28 cycles), logical shift right on any zero page byte
LSLV <var> : (26 cycles), logical shift left on any zero page word variable
PEEKV <var> : (28 cycles), read byte from an address within any zero page variable
DEEKV <var> : (28 cycles), read word from an address within any zero page variable
ADDB <var>, imm : (28 cycles), adds a literal, (0..255), to a zero page byte variable
SUBB <var>, imm : (28 cycles), subtracts a literal, (0..255), from a zero page byte variable
PEEK+ <var> : (30 cycles), read byte from an address within any zero page variable and increment var.lo
POKE+ <var> : (28 cycles), write byte to an address within any zero page variable and increment var.lo
POKEI imm : (20 cycles), write an immediate byte, (0..255), to an address contained in [vAC]
DOKEI imm : (28 cycles), write an immediate word, (-32768..32767), to an address contained in [vAC]
TEQ <var> : (28 cycles), tests a zero page variable for EQ
TNE <var> : (28 cycles), tests a zero page variable for NE
TGE <var> : (26 cycles), tests a zero page variable for GE
TLT <var> : (26 cycles), tests a zero page variable for LT
TGT <var> : (28 cycles), tests a zero page variable for GT
TLE <var> : (28 cycles), tests a zero page variable for LE
ADDBI <var>, imm : (28 cycles), var.lo += imm
SUBBI <var>, imm : (28 cycles), var.lo -= imm
ANDBI <var>, imm : (28 cycles), var.lo &= imm
ORBI <var>, imm : (28 cycles), var.lo |= imm
XORBI <var>, imm : (28 cycles), var.lo ^= imm
ANDBA <var> : (24 cycles), vAC &= var.lo
ORBA <var> : (22 cycles), vAC |= var.lo
XORBA <var> : (22 cycles), vAC ^= var.lo
NOTB <var> : (22 cycles), var.lo = ~var.lo
DEEK+ <var> : (30 cycles), deek word at address contained in var into vAC, var.lo += 2
DOKE+ <var> : (30 cycles), doke word in vAC to address contained in var, var.lo += 2
LDNI imm : (22 cycles), vAC = -imm
COND imm1, imm0 : (30 cycles), vAC = one of two imm's based on vAC = 0 or vAC != 0
ANDBK <var>, imm : (30 cycles), vAC = var.lo & imm
ORBK <var>, imm : (30 cycles), vAC = var.lo | imm
XORBK <var>, imm : (30 cycles), vAC = var.lo ^ imm
CMPI <var>, imm : (30 cycles), vAC = var.lo CMP imm, unsigned 8bit compare for TCC/BCC/JCC
JEQ imm : (26 cycles), Jump to 16bit immediate address if vAC=0
JNE imm : (26 cycles), Jump to 16bit immediate address if vAC!=0
JLT imm : (24-26 cycles), Jump to 16bit immediate address if vAC<0
JGT imm : (24-28 cycles), Jump to 16bit immediate address if vAC>0
JLE imm : (24-28 cycles), Jump to 16bit immediate address if vAC<=0
JGE imm : (22-26 cycles), Jump to 16bit immediate address if vAC>=0
PEEKA+ <var> : (26 cycles), peek byte at address contained in vAC into var, vAC.lo++
ST2 imm : (28 cycles), store vAC.lo into 16bit immediate address
STW2 imm : (30 cycles), store vAC into 16bit immediate address
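For anyone who wants to reason about a few of these instructions before trying the experimental ROM, the following Python sketch models INCW, DECW, LDNI, DEEK+ and DOKE+ as described in the list above. It only mirrors the one-line descriptions; the class VCpuSketch, the helper copy_words and the zero page addresses used in the test are hypothetical, and cycle counts are not modelled.

```python
# Behavioural sketch of a few of the listed instructions (not the ROM code).
class VCpuSketch:
    def __init__(self):
        self.ram = bytearray(0x10000)
        self.vAC = 0  # 16-bit accumulator

    def deek(self, addr):
        # Read a little-endian word from RAM.
        return self.ram[addr & 0xFFFF] | (self.ram[(addr + 1) & 0xFFFF] << 8)

    def doke(self, addr, word):
        # Write a little-endian word to RAM.
        self.ram[addr & 0xFFFF] = word & 0xFF
        self.ram[(addr + 1) & 0xFFFF] = (word >> 8) & 0xFF

    def incw(self, var):
        # INCW <var>: 16-bit increment, carry propagates into the high byte.
        self.doke(var, (self.deek(var) + 1) & 0xFFFF)

    def decw(self, var):
        # DECW <var>: 16-bit decrement, borrow propagates into the high byte.
        self.doke(var, (self.deek(var) - 1) & 0xFFFF)

    def ldni(self, imm):
        # LDNI imm: vAC = -imm, kept as a 16-bit two's complement value.
        self.vAC = (-imm) & 0xFFFF

    def deek_inc(self, var):
        # DEEK+ <var>: vAC = word at address held in var, then var.lo += 2.
        self.vAC = self.deek(self.deek(var))
        self.ram[var] = (self.ram[var] + 2) & 0xFF

    def doke_inc(self, var):
        # DOKE+ <var>: word at address held in var = vAC, then var.lo += 2.
        self.doke(self.deek(var), self.vAC)
        self.ram[var] = (self.ram[var] + 2) & 0xFF


def copy_words(cpu, src_var, dst_var, count):
    # Hypothetical helper: a word copy loop built from DEEK+ and DOKE+.
    for _ in range(count):
        cpu.deek_inc(src_var)
        cpu.doke_inc(dst_var)
```

Note that only var.lo is bumped by the "+" forms, as stated in the list, so in this model a pointer that crosses a page boundary wraps within its page.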
Code:
LDW ;(20 cycles) --> (24 cycles)
STW ;(20 cycles) --> (24 cycles)
ADDW ;(28 cycles) --> (32 cycles) --> (28 cycles)
CALL ;(26 cycles) --> (30 cycles)
POP ;(26 cycles) --> (30 cycles)
PUSH ;(26 cycles) --> (30 cycles)
LDWI ;(20 cycles) --> (24 cycles)
ST ;(16 cycles) --> (20 cycles)
LDI ;(16 cycles) --> (20 cycles)
CMPHU ;(28 cycles)
CMPHS ;(28 cycles)
ANDW ;(28 cycles) --> (26 cycles)
ORW ;(28 cycles) --> (26 cycles)
ADDI ;(28 cycles) --> (26 cycles)
SUBI ;(28 cycles) --> (26 cycles)
POKE ;(28 cycles) --> (26 cycles)
ANDI ;(22 cycles) --> (20 cycles)
INC ;(20 cycles) --> (16 cycles) --> (20 cycles)
LD ;(22 cycles) --> (18 cycles) --> (22 cycles) --> (18 cycles)
It was a lot of work unravelling and re-organising the vCPU interpreter, then optimising the old instructions (the ones that I could) and creating the new ones. Marcel had prioritised speed and ROM space when coding the original part of this firmware; I prioritised increased instruction slots over all else, and rather surprisingly this allowed for some free optimisations in the old code as well. Currently there are between 6 and 9 instruction slots free, so if anyone has suggestions for new vCPU instructions, please feel free to add them to this thread.
P.S. You may note that I have also raised the vCPU maxTicks count from 28 to 30, which allows a lot of instructions to be created that would not have been possible otherwise. On all the code I have thrown at it, maxTicks=30 has had zero effect on execution speed compared to maxTicks=28. Higher values start to have a more dominant effect, e.g. 32 reduces code speed by around 10% and 34 by about 15%.
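A rough way to see why a larger maxTicks eventually costs speed: the interpreter only starts another instruction if the time left in the current slice covers the worst-case instruction, so whatever is left below that threshold is wasted. The Python sketch below models just that rule. The function name, the slice length of 160 cycles and the instruction mix are made-up illustration values, not ROM figures, and the model is not meant to reproduce the percentages quoted above.

```python
def cycles_executed(slice_cycles, instr_costs, max_cycles):
    """Useful cycles executed in one time slice under a worst-case threshold.

    slice_cycles: cycles available to the interpreter in this slice (hypothetical).
    instr_costs:  repeating pattern of instruction costs in cycles (hypothetical).
    max_cycles:   the maxTicks value from the post, expressed in cycles.
    """
    if max(instr_costs) > max_cycles:
        raise ValueError("an instruction is longer than the worst-case budget")
    remaining = slice_cycles
    done = 0
    i = 0
    # Dispatch only while the worst-case instruction would still fit.
    while remaining >= max_cycles:
        cost = instr_costs[i % len(instr_costs)]
        remaining -= cost
        done += cost
        i += 1
    return done


# Compare thresholds for one made-up slice and instruction mix.
mix = [20, 28, 26, 16]
useful = {m: cycles_executed(160, mix, m) for m in (28, 30, 32, 34)}
```

In this model a higher threshold can only end a slice at the same point or earlier, never later, so small increases may cost nothing while larger ones start to throw away whole instructions per slice.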