New vCPU instructions 2.0

Post by **at67** » 10 Feb 2021, 22:09

TLDR: I added a bunch of new vCPU instructions and a 16bit stack pointer, (that is backwards compatible with all currently existing software), to an experimental ROM; there are now no instruction slots free, but if anyone has suggestions for new vCPU instructions or changes to the current ones, feel free to add them to this thread.

I started experimenting with the idea of a 16bit vCPU stack pointer in the native code firmware about a week ago, the reason being that the BASIC compiler needs a proper stack for local variables, proc parameters and for recursion.

Currently there are 8 bytes of general purpose stack space (for the BASIC compiler) in the all-important zero page RAM, which must be shared by nested procs' locals, params and recursion. As you can imagine, this is somewhat limiting and can make your code quite hard to write and debug when nesting procs with any of those features.

I found an unused location within zero page, (the reserved byte at 0x04), renamed it to vSPH and proceeded to add the following code to all stack aware instructions, PUSH, POP, LDLW, etc.

Code:

ld(vSPH,Y)
ld([X])	-->  ld([Y,X])
st([X])	-->  st([Y,X])

I also added the following to the reset routine, (SYS_Reset_88), for backwards compatibility, so that all current code would work with this change:

Code:

ld(0)		#17
st([vSP])	#18 vSP
st([vSPH])	#19 vSPH	<--	new instruction
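
To illustrate the idea, here is a minimal Python sketch of how a stack access could form its address from vSPH and vSP. The `Machine` class, `push_word`, `pop_word` and the 64K `ram` array are hypothetical names for this example, not the ROM's code, and the downward-growing word stack is an assumption. With vSPH cleared at reset, every access still resolves to zero page, which is why existing software keeps working.

```python
RAM_SIZE = 0x10000  # 64K address space, assumed for this sketch


class Machine:
    """Toy model of the vCPU stack registers; names are illustrative only."""

    def __init__(self):
        self.ram = bytearray(RAM_SIZE)
        self.vSP = 0   # low byte of the stack pointer (existing register)
        self.vSPH = 0  # high byte, cleared at reset for backwards compatibility

    def _addr(self, offset=0):
        # Equivalent of ld(vSPH,Y) followed by an access to [Y,X]:
        # the page comes from vSPH, the offset within the page from vSP.
        return (self.vSPH << 8) | ((self.vSP + offset) & 0xFF)

    def push_word(self, value):
        # Assumption: the stack grows downwards and holds 16bit words.
        self.vSP = (self.vSP - 2) & 0xFF
        self.ram[self._addr(0)] = value & 0xFF
        self.ram[self._addr(1)] = (value >> 8) & 0xFF

    def pop_word(self):
        value = self.ram[self._addr(0)] | (self.ram[self._addr(1)] << 8)
        self.vSP = (self.vSP + 2) & 0xFF
        return value


m = Machine()
m.push_word(0x1234)   # vSPH = 0: the word lands in zero page
legacy_addr = m._addr()
m.pop_word()

m.vSPH = 0x7F         # hypothetical page chosen for a larger stack
m.push_word(0xBEEF)
paged_addr = m._addr()
```

In this sketch only vSP changes on push and pop, so the stack wraps inside the page selected by vSPH. Whether the real firmware behaves the same way at a page boundary is not stated in the post.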

Then I wondered how many new instructions I could add to the vCPU interpreter while remaining backwards compatible with all current software and not affecting vCPU code execution time at all...

This is the list of instructions I have created, tested and added to an experimental ROM as well as to the assembler and BASIC compiler:

Code:

DEC	<var> 		: (22 cycles), decrements a zero page variable's lower byte, borrow is ignored
DECW	<var> 		: (28 cycles), decrements a zero page variable's 16bit value
INCW	<var> 		: (26 cycles), increments a zero page variable's 16bit value
MOVQW   <var>, imm 	: (30 cycles), loads a literal, (0..255), as a 16bit value into a zero page variable
MOVQ    <var>, imm 	: (28 cycles), loads a literal, (0..255), as an 8bit value into a zero page variable
DBNE	<var>, lab 	: (28 cycles), decrements and branches for !=0 on a zero page variable
DBGE	<var>, lab 	: (30 cycles), decrements and branches for >=0 on a zero page variable
XCHG	<var0>, <var1>	: (30 cycles), exchanges bytes of any zero page variables
MOV 	<src>, <dst>	: (28 cycles), copies a byte from src to dst, where src and dst are zero page variables
PEEKA	<var>		: (24 cycles), peek a byte from [vAC] to var, where var is a zero page variable 
DEEKA	<var>		: (30 cycles), deek a word from [vAC] to var, where var is a zero page variable 
POKEA	<var>		: (22 cycles), poke a byte from var to [vAC], where var is a zero page variable 
DOKEA	<var>		: (30 cycles), doke a word from var to [vAC], where var is a zero page variable 
NOTW	<var>		: (26 cycles), boolean inversion of any zero page variable
NEGW 	<var>		: (28 cycles), arithmetic negate of any zero page variable
LSRB	<var>		: (28 cycles), logical shift right on any zero page byte
LSLV	<var>		: (26 cycles), logical shift left any zero page word variable
PEEKV	<var>		: (28 cycles), read byte from an address within any zero page variable
DEEKV	<var>		: (28 cycles), read word from an address within any zero page variable
ADDB	<var>, imm	: (28 cycles), adds a literal, (0..255), to a zero page byte variable
SUBB	<var>, imm	: (28 cycles), subtracts a literal, (0..255), from a zero page byte variable
PEEK+   <var>		: (30 cycles), read byte from an address within any zero page variable and increment var.lo
POKE+   <var>		: (28 cycles), write byte to an address within any zero page variable and increment var.lo
POKEI   imm		: (20 cycles), write an immediate byte, (0..255), to an address contained in [vAC]
DOKEI   imm		: (28 cycles), write an immediate word, (-32768..32767), to an address contained in [vAC]
TEQ	<var>		: (28 cycles), tests a zero page variable for EQ
TNE	<var>		: (28 cycles), tests a zero page variable for NE
TGE	<var>		: (26 cycles), tests a zero page variable for GE
TLT	<var>		: (26 cycles), tests a zero page variable for LT
TGT	<var>		: (28 cycles), tests a zero page variable for GT
TLE	<var>		: (28 cycles), tests a zero page variable for LE
ADDBI   <var>, imm	: (28 cycles), var.lo += imm
SUBBI   <var>, imm	: (28 cycles), var.lo -= imm
ANDBI   <var>, imm	: (28 cycles), var.lo &= imm
ORBI    <var>, imm	: (28 cycles), var.lo |= imm
XORBI   <var>, imm	: (28 cycles), var.lo ^= imm
ANDBA   <var>		: (24 cycles), vAC &= var.lo
ORBA    <var>		: (22 cycles), vAC |= var.lo
XORBA   <var>		: (22 cycles), vAC ^= var.lo
NOTB    <var>		: (22 cycles), var.lo = ~var.lo
DEEK+   <var>		: (30 cycles), deek word at address contained in var into vAC, var.lo += 2
DOKE+   <var>		: (30 cycles), doke word in vAC to address contained in var, var.lo += 2
LDNI	imm		: (22 cycles), vAC = -imm
COND    imm1, imm0	: (30 cycles), vAC = one of two imm's based on vAC = 0 or vAC != 0
ANDBK   <var>, imm	: (30 cycles), vAC = var.lo & imm
ORBK    <var>, imm	: (30 cycles), vAC = var.lo | imm
XORBK   <var>, imm	: (30 cycles), vAC = var.lo ^ imm
CMPI    <var>, imm	: (30 cycles), vAC = var.lo CMP imm, unsigned 8bit compare for TCC/BCC/JCC
JEQ     imm		: (26 cycles), Jump to 16bit immediate address if vAC=0
JNE     imm		: (26 cycles), Jump to 16bit immediate address if vAC!=0
JLT     imm		: (24-26 cycles), Jump to 16bit immediate address if vAC<0
JGT     imm		: (24-28 cycles), Jump to 16bit immediate address if vAC>0
JLE     imm		: (24-28 cycles), Jump to 16bit immediate address if vAC<=0
JGE     imm		: (22-26 cycles), Jump to 16bit immediate address if vAC>=0
PEEKA+	<var>		: (26 cycles), peek byte at address contained in vAC into var, vAC.lo++
ST2	imm		: (28 cycles), store vAC.lo into 16bit immediate address
STW2	imm		: (30 cycles), store vAC into 16bit immediate address
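
For readers who prefer code to one-line descriptions, here is a small Python model of two entries from the list above, XCHG and POKE+. It is a sketch under assumptions: `ram`, `xchg`, `poke_inc` and `deek` are hypothetical helper names, the byte written by POKE+ is passed in explicitly because the list does not say where it comes from, and the zero page addresses are arbitrary.

```python
ram = bytearray(0x10000)  # toy 64K memory; zero page is ram[0x00..0xFF]


def xchg(var0, var1):
    """XCHG var0, var1: exchange one byte between two zero page addresses."""
    ram[var0], ram[var1] = ram[var1], ram[var0]


def poke_inc(var, value):
    """POKE+ var: write a byte to the address held in var, then bump var.lo."""
    addr = ram[var] | (ram[var + 1] << 8)
    ram[addr] = value & 0xFF
    # Only the low byte is incremented, so the pointer wraps inside its page.
    ram[var] = (ram[var] + 1) & 0xFF


def deek(var):
    """Helper for the examples: read a little-endian word from zero page."""
    return ram[var] | (ram[var + 1] << 8)


# Word swap built from two byte exchanges (addresses 0x30/0x32 are arbitrary).
A, B = 0x30, 0x32
ram[A:A + 2] = (0x1234).to_bytes(2, "little")
ram[B:B + 2] = (0xABCD).to_bytes(2, "little")
xchg(A, B)
xchg(A + 1, B + 1)

# Write three bytes through a pointer that starts near the end of a page.
PTR = 0x34
ram[PTR:PTR + 2] = (0x08FE).to_bytes(2, "little")
for byte in (1, 2, 3):
    poke_inc(PTR, byte)
```

The pointer example shows the consequence of incrementing only var.lo: the third write goes to 0x0800 rather than 0x0900, so buffers filled this way should not cross a page boundary.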

These are the timings I have modified for existing instructions. Some have gone up and some have gone down, hence the negligible overall difference in execution time. In my testing all my applications ran at exactly the same speed, but I would expect some small variance, either up or down, depending on the workload. Any instruction not listed is unchanged compared to ROMv5a.

Code:

LDW     ;(20 cycles)  -->  (24 cycles)
STW     ;(20 cycles)  -->  (24 cycles)
ADDW    ;(28 cycles)  -->  (32 cycles)  -->  (28 cycles)
CALL	;(26 cycles)  -->  (30 cycles)
POP	;(26 cycles)  -->  (30 cycles)
PUSH	;(26 cycles)  -->  (30 cycles)
LDWI	;(20 cycles)  -->  (24 cycles)
ST	;(16 cycles)  -->  (20 cycles)
LDI	;(16 cycles)  -->  (20 cycles)
CMPHU   ;(28 cycles)
CMPHS   ;(28 cycles)

ANDW	;(28 cycles)  -->  (26 cycles)
ORW	;(28 cycles)  -->  (26 cycles)
ADDI    ;(28 cycles)  -->  (26 cycles)
SUBI    ;(28 cycles)  -->  (26 cycles)
POKE	;(28 cycles)  -->  (26 cycles)
ANDI	;(22 cycles)  -->  (20 cycles)
INC	;(20 cycles)  -->  (16 cycles)  -->  (20 cycles)
LD	;(22 cycles)  -->  (18 cycles)  -->  (22 cycles)  -->  (18 cycles)

I also removed two instructions, CMPHS and CMPHU. Not only are they not used by any current code, they haven't been fully tested and verified for functionality. IMHO these two instructions were always a waste of valuable instruction slots, as they were trying to solve a corner case that is already solvable using standard vCPU code (see Marcel's handling of this issue in tinyBASIC.gcl). The programmer already knows that he has to deal with overflows/underflows, and how to do so, using the native signed 16bit arithmetic format of vCPU.

It was a lot of work unravelling and re-organising the vCPU interpreter, then optimising the old instructions (the ones that I could) and creating the new ones. Marcel had prioritised speed and ROM space when coding the original part of this firmware; I prioritised increased instruction slots over all else and, rather surprisingly, that allowed for some free optimisations in the old code as well. Currently there are between 6 and 9 instruction slots free, so if anyone has suggestions for new vCPU instructions, please feel free to add them to this thread.

P.S. You may note I have also modified the vCPU maxTicks count from 28 to 30; this allows a lot of instructions to be created that wouldn't have been possible otherwise. The effect of maxTicks=30 compared to maxTicks=28 on the execution of any code I have thrown at it has been zero. Higher values start to have a more dominant effect, e.g. 32 reduces code speed by around 10%, 34 by about 15%.
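
One way to picture why a larger maxTicks can cost speed is the toy Python model below. It assumes that the interpreter only starts an instruction when at least maxTicks cycles remain in the current time slice, whatever the instruction actually costs. That mechanism, the `instructions_per_slice` function and the 110 cycle slice length are assumptions made for this sketch, not figures from the firmware.

```python
def instructions_per_slice(slice_cycles, max_ticks, costs):
    """Count how many instructions run before the slice is abandoned."""
    remaining = slice_cycles
    executed = 0
    for cost in costs:
        # Assumption: an instruction is only dispatched if the worst-case
        # budget (max_ticks) still fits, regardless of its actual cost.
        if remaining < max_ticks:
            break
        remaining -= cost
        executed += 1
    return executed, remaining


SLICE = 110          # hypothetical slice length in cycles
stream = [20] * 10   # ten cheap 20 cycle instructions

at_28 = instructions_per_slice(SLICE, 28, stream)
at_34 = instructions_per_slice(SLICE, 34, stream)
```

In this model the cheap instruction stream loses one instruction per slice when the budget check rises from 28 to 34, even though no instruction got slower. The percentages quoted above are measurements from the post; the sketch only shows the direction of the effect.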

Post by **cde** » 11 Feb 2021, 10:38

Impressive work! Congratulations @at67

Post by **at67** » 12 Feb 2021, 00:36

@cde Cheers!

Update:
I've removed LARR and SWAP. LARR didn't offer enough of a benefit in real code to warrant its existence, and SWAP was a specific case of XCHG that also couldn't justify its own existence.

The three new instructions that replace them are:

LDWQ

Code:

; loads a literal 0..255 into a zero page word variable
LDWQ   var, 55

; this replaces the very common sequence of
LDI    55
STW    var

MOVB

Code:

; move/copy a byte from src to dst, this lets you peek and poke a byte anywhere in zero page
MOVB   src, dst

; replaces this common sequence and doesn't destroy the contents of vAC
LD     src
ST     dst

MOVBA

Code:

; move/copy a byte from [vAC] to dst, this lets you peek a byte anywhere in 16bit address space and save it into zero page
MOVBA  dst

; replaces this common sequence
PEEK
ST     dst

TEQ TNE TLT TGT TLE TGE
I've also added 6 new condition code testing instructions that allow boolean conditions to be much more efficient, e.g.
Old assembly for e = a < b

Code:

LDW    _a
SUBW   _b
CALL   convertLtOpAddr
STW    _e
HALT

convertLtOp   BLT   convertLt_1
              LDI   0
              RET
convertLt_1   LDI   1
              RET

New assembly for e = a < b

Code:

LDW    _a
SUBW   _b
TLT    _vAC
STW    _e
HALT
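
The behaviour of the six test instructions can be sketched in Python as follows. This assumes each one replaces vAC with 1 when the condition holds for the signed 16bit value in vAC and with 0 otherwise, which matches what the old convertLtOp helper returns. The names `to_signed16`, `TESTS`, `tcc` and `less_than` are hypothetical.

```python
def to_signed16(value):
    """Interpret the low 16 bits of value as a two's complement number."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


# One predicate per condition code instruction named in the post.
TESTS = {
    "TEQ": lambda v: v == 0,
    "TNE": lambda v: v != 0,
    "TLT": lambda v: v < 0,
    "TGT": lambda v: v > 0,
    "TLE": lambda v: v <= 0,
    "TGE": lambda v: v >= 0,
}


def tcc(op, vac):
    """Return the new vAC: 1 if the condition holds for the old vAC, else 0."""
    return 1 if TESTS[op](to_signed16(vac)) else 0


def less_than(a, b):
    """Model of LDW _a / SUBW _b / TLT / STW _e."""
    vac = (a - b) & 0xFFFF  # SUBW wraps to 16 bits
    return tcc("TLT", vac)
```

Because the comparison is built on a wrapping 16bit subtraction, operands that are far apart can give the wrong answer; for example `less_than(0x7FFF, 0x8000)` yields 1 in this model. That is the overflow corner case mentioned in the first post.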

Post by **at67** » 13 Feb 2021, 06:00

Update:

maxTicks was increased from 30 to 32; the decreases in instruction execution cycles I have achieved mostly offset the increase in overall execution time from increasing maxTicks above 30.
ADDW was moved from page0 to page1; this increased its execution cycles from 28 to 32
XCHG was changed from using [vAC] and [operand0] to [operand0] and [operand1]; this allows byte swaps in any zero page variables with 1 instruction and word swaps in any zero page variables with 2 instructions, (normally 6 instructions for both byte and word swaps without XCHG and 2 and 4 with the old XCHG).
ADDI execution cycles were decreased from 28 to 26
SUBI execution cycles were decreased from 28 to 26
Added the COND instruction, 32 cycles

Code:

; COND is a ternary operator that chooses one of two 0..255 constant operands based on [vAC] <> 0
LDI    0
COND   0x55, 0xAA    ; [vAC] = 0x00AA

LDI    1
COND   0x55, 0xAA    ; [vAC] = 0x0055
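
In higher-level terms, the two examples above amount to a ternary expression. A minimal Python equivalent, where `cond` is a hypothetical name and the operand order follows the examples (first operand for non-zero, second for zero):

```python
def cond(vac, imm1, imm0):
    """COND imm1, imm0: pick imm1 if vAC is non-zero, otherwise imm0."""
    return imm1 if (vac & 0xFFFF) != 0 else imm0


result_for_zero = cond(0, 0x55, 0xAA)
result_for_one = cond(1, 0x55, 0xAA)
```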

Post by **at67** » 14 Feb 2021, 03:01

Update:

LD was moved from page0 to page1; this increased its execution cycles from 18 to 22
Added the CMPI instruction, 32 cycles

Code:

; CMPI subtracts 0..255 from a zero page variable, storing the sign and result in [vAC]
; the result is not numerically correct, but is valid for branching and testing with the BCC and TCC instructions

; CMPI replaces this
LDW    var
SUBI   10

; with this
CMPI   var, 10

Post by **at67** » 14 Feb 2021, 04:52

Update:

DEC was moved from page0 to page1; this increased its execution cycles from 16 to 20
Added the LDQ instruction, 26 cycles

Code:

; LDQ loads 0..255 into a zero page variable byte

; LDQ replaces this
LDI    10
ST     var + 1

; with this
LDQ    var + 1, 10
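
The difference between LDQ and the earlier LDWQ is easy to miss, so here is a short Python sketch of both. It assumes LDWQ writes the literal as a full word (high byte cleared, as the LDI/STW sequence would) while LDQ touches a single byte. `zero_page`, `ldwq`, `ldq` and the address 0x30 are hypothetical.

```python
zero_page = bytearray(256)


def ldwq(var, imm):
    """LDWQ var, imm: store imm as a 16bit value (high byte becomes 0)."""
    zero_page[var] = imm & 0xFF
    zero_page[var + 1] = 0


def ldq(var, imm):
    """LDQ var, imm: store imm into a single byte, neighbours untouched."""
    zero_page[var] = imm & 0xFF


VAR = 0x30  # arbitrary zero page address for the example

zero_page[VAR:VAR + 2] = b"\xEE\xEE"
ldwq(VAR, 55)
after_ldwq = bytes(zero_page[VAR:VAR + 2])

zero_page[VAR:VAR + 2] = b"\xEE\xEE"
ldq(VAR + 1, 10)
after_ldq = bytes(zero_page[VAR:VAR + 2])
```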

Post by **at67** » 14 Feb 2021, 20:39

Update:

ST was moved from page0 to page1; this increased its execution cycles from 16 to 20
Added the ADDB instruction, 28 cycles
Added the SUBB instruction, 28 cycles

Code:

; ADDB/SUBB replace this
LD     		var
ADDI/SUBI	10
ST     		var

; with this
ADDB/SUBB	var, 10
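
As a sketch of what the single instruction does to memory, the Python below models ADDB and SUBB as byte-wide read-modify-write operations. It assumes the result wraps modulo 256 and the neighbouring byte is left alone, which is what the LD / ADDI / ST sequence produces. `zero_page`, `addb`, `subb` and the address 0x30 are hypothetical.

```python
zero_page = bytearray(256)


def addb(var, imm):
    """ADDB var, imm: add a literal 0..255 to one zero page byte."""
    zero_page[var] = (zero_page[var] + imm) & 0xFF


def subb(var, imm):
    """SUBB var, imm: subtract a literal 0..255 from one zero page byte."""
    zero_page[var] = (zero_page[var] - imm) & 0xFF


VAR = 0x30                # arbitrary zero page address for the example
zero_page[VAR] = 250
zero_page[VAR + 1] = 0x12  # high byte of the same word, should not change

addb(VAR, 10)
after_add = zero_page[VAR]

subb(VAR, 5)
after_sub = zero_page[VAR]
```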

Post by **at67** » 18 Feb 2021, 07:12

Update:

maxTicks was reduced from 32 cycles back to 30; the extra functionality provided by the three 32 cycle instructions was not worth the increase in average app execution time.
Removed COND
Removed CMPI
Reverted XCHG from XCHG [var0] [var1] to XCHG [var1] where [vAC] = var0
Reverted ADDW cycles from 32 back to 28
Added LSLV [var] left shift 16bits of any zero page var
Added PEEKV [var] peeks a byte from an address stored in any zero page var
Added LSRB [var] right shifts 8bits of any zero page var
Increased ALLOC's cycle count from 14 to 20
Increased LDI's cycle count from 16 to 20
Increased DEC's cycle count from 16 to 22
Increased INC's cycle count from 16 to 20

Post by **lb3361** » 21 Feb 2021, 22:25

The 16 bits stack is a biggie. Will you make it available in the dev rom soon?

Post by **at67** » 21 Feb 2021, 23:18

lb3361 wrote: ↑21 Feb 2021, 22:25 The 16 bits stack is a biggie. Will you make it available in the dev rom soon?

I am currently testing an experimental ROM with the new instructions and 16bit stack. There is only one problem left: the VBlank cursor flash for the Apple-1 emulator causes a timing glitch and hence a loss of monitor sync.

The 6502 emulation itself works harmoniously and correctly with the 16bit vSP, maxTicks=30 and the new instructions, it's just the vCPU, ("Apple-1_v2.gcl : line 910"), cursor flash VBlank handler that is a current problem.

P.S. I'm finishing some documentation, so it will take a couple of days to get back to the experimental ROM.
