New vCPU instructions 2.0

lb3361
Posts: 360
Joined: 17 Feb 2021, 23:07

Re: New vCPU instructions 2.0

Post by lb3361 »

... (continued) ...

Of course one can have multiple pages of additional instructions with different prefixes. I mention it because I just realized that one can also make pages with special dispatchers, for instance a page with a jump table instead of bra(['AC']), or a page specializing in instructions that need more than the maximum 28 or 30 cycles of the page 3 instructions.
at67
Site Admin
Posts: 647
Joined: 14 May 2018, 08:29

Re: New vCPU instructions 2.0

Post by at67 »

lb3361 wrote: 06 Apr 2021, 09:22 Here is an idea that you might like. I think this one is useful:
This is also very cool, and if I understand it correctly it will produce instruction streams similar to this:

Code:

;page 3 instructions
p3:i(0)
p3:i(1)
p3:i(2)

;page 4 instructions
PREFIX	0x04
p4:i(0)
PREFIX	0x04
p4:i(1)

;page 5 instructions
PREFIX	0x05
p5:i(0)
PREFIX	0x05
p5:i(1)

;back to page 3 instructions
p3:i(3)
p3:i(4)
p3:i(5)
Why not change it to be more like this (and rename it to SET Instruction Page, SETIP):

Code:

;page 3 instructions
p3:i(0)
p3:i(1)
p3:i(2)

;page 4 instructions
SETIP	0x04
p4:i(0)
p4:i(1)

;page 5 instructions
SETIP	0x05
p5:i(0)
p5:i(1)

;back to page 3 instructions
SETIP	0x03
p3:i(3)
p3:i(4)
p3:i(5)
Each instruction would return to its own page's dispatch rather than the page3 dispatch as in your original idea, and the only way to move between pages or to get back to page3 would be with the SETIP instruction.

Worst case scenario, SETIP would function exactly like PREFIX (except that you would need an extra SETIP to get back to page 3). Best case scenario, for long runs of same page instructions, you end up saving a substantial amount of code space and code cycles, and a smart compiler could try to batch same page instructions as much as possible.

This would probably require a statistical analysis of compiler generated code, using the output of that analysis to allocate the instructions to pages so as to produce the longest possible runs of same page instructions.

*Edit* Thinking about this some more, my change to your idea would probably be substantially worse in practice, as code would likely be tightly interleaved between page 3 and the other pages, thus not allowing for any appreciable runs of same page instructions. Even more importantly, my original conjecture that worst case SETIP would be the same as PREFIX is wrong: it could be much worse if, for example, you had an instruction stream that oscillated between page 3 and another page.
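To make the code-size trade-off concrete, here is a rough Python sketch (not part of the ROM) that counts the extra bytes each scheme adds to a stream of instructions tagged by page. It assumes the figures quoted later in this thread, 1 byte per PREFIX and 2 bytes per SETIP; the function names and the two example streams are made up for illustration.

```python
def prefix_overhead(pages, prefix_bytes=1):
    """Extra bytes when every non-page-3 instruction carries its own prefix."""
    return sum(prefix_bytes for p in pages if p != 3)


def setip_overhead(pages, setip_bytes=2, start_page=3):
    """Extra bytes when a SETIP is emitted only when the page changes."""
    current = start_page
    extra = 0
    for p in pages:
        if p != current:
            extra += setip_bytes
            current = p
    # Return to the start page at the end so following code sees the default page.
    if current != start_page:
        extra += setip_bytes
    return extra


# Hypothetical stream with one long run on page 4.
batched = [3, 3, 4, 4, 4, 4, 4, 4, 3, 3]
# Hypothetical stream that oscillates between page 3 and page 4.
interleaved = [3, 4, 3, 4, 3, 4, 3, 4, 3, 4]

print(prefix_overhead(batched), setip_overhead(batched))
print(prefix_overhead(interleaved), setip_overhead(interleaved))
```

Long runs favour SETIP, while the oscillating stream pays for a SETIP on almost every instruction, which is the failure mode described in the edit above.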
lb3361

Re: New vCPU instructions 2.0

Post by lb3361 »

I am afraid that SETIP creates additional state. Whenever you jump to a label, you need to make sure you have the right page selected. In practice this means that every jump must be accompanied by a SETIP (or every label must be followed by a SETIP). The 65816 had states that affected whether instructions were operating on 8 or 16 bits (and that was a mess). This is why I would revert to page 3 after each instruction. Basically this introduces one byte prefixes to form two byte opcodes (one for the prefix, one for the opcode in the page announced by the prefix.)
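The state problem can be shown with a toy decoder in Python. This is only a sketch of the idea, not the real vCPU: the `decode` function, the `("SETIP", page)` / `("OP", opcode)` tuples and the opcode values are hypothetical. The same bytes at a label execute as different instructions depending on which page was selected by the code that jumped there.

```python
def decode(stream, start_page=3):
    """Decode (kind, value) items under a sticky SETIP-style page register.

    Returns the list of (page, opcode) pairs that would actually execute.
    """
    page = start_page
    executed = []
    for kind, value in stream:
        if kind == "SETIP":
            page = value
        else:  # kind == "OP"
            executed.append((page, value))
    return executed


# Hypothetical code at a label, written assuming page 3 is selected.
at_label = [("OP", 0x21), ("OP", 0x2B)]

# Fall-through entry: page 3 is selected, the opcodes mean what was intended.
print(decode(at_label, start_page=3))
# Entry via a jump from code that left page 0x15 selected: same bytes, other instructions.
print(decode(at_label, start_page=0x15))
# Defensive fix: every label starts with a SETIP, at the cost of extra bytes and cycles.
print(decode([("SETIP", 3)] + at_label, start_page=0x15))
```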

But I really like the idea that certain prefixes change the maximum instruction time. That solves both the problem of limited instruction slots and the problem of the maximum instruction execution time. The main drawback is that we lose ~12 cycles to process the prefix and switch page. Compare this to the ~6 cycles one needs to jump to code located in a different page and back. On the other hand, this could be less than what we lose by increasing the execution time of frequent instructions in order to open new instruction slots in page 3.

In fact I would even argue for implementing the page X dispatcher with jump tables, even if this costs additional cycles. My main concern here is to give more implementation flexibility while keeping the same opcodes. The kind of gymnastics you had to do in page 3 to change the code while keeping the same opcodes seems unwise. Even fixing a minor bug could be a major problem.
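A small Python model of the difference, under the assumption that a direct `bra(AC)` dispatch uses the opcode byte as the low byte of the handler address, while a jump table maps opcode to address. The addresses and the opcode 0x50 are hypothetical, and a real implementation would of course be native code.

```python
# Direct dispatch: the opcode byte IS the low byte of the handler address.
def opcode_direct(handler_address):
    return handler_address & 0xFF


# Hypothetical handler that moves by 2 bytes after a fix in the code before it.
old_address, new_address = 0x0450, 0x0452

# Direct dispatch: the opcode changes with the address, breaking existing programs.
print(hex(opcode_direct(old_address)), hex(opcode_direct(new_address)))

# Table dispatch: opcode 0x50 is kept; only the table entry is updated.
table = {0x50: old_address}
table[0x50] = new_address
print(hex(table[0x50]))
```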
at67

Re: New vCPU instructions 2.0

Post by at67 »

lb3361 wrote: 06 Apr 2021, 22:29 I am afraid that SETIP creates additional state. Whenever you jump to a label, you need to make sure you have the right page selected. In practice this means that every jump must be accompanied by a SETIP (or every label must be followed by a SETIP). The 65816 had states that affected whether instructions were operating on 8 or 16 bits (and that was a mess). This is why I would revert to page 3 after each instruction. Basically this introduces one byte prefixes to form two byte opcodes (one for the prefix, one for the opcode in the page announced by the prefix.)
I understand the part about the prefix opcode. What I don't understand is why you can't do this:

vCPU code

Code:

LDW     0x30        ; page 3 
ADDI    1           ; page 3 
STW     0x30        ; page 3 
SETIP   0x15        ; page 3 
LSLB    0x30        ; page 0x15
MULU    0x30, 0x31  ; page 0x15, (imaginary instruction)
SETIP   0x03        ; page 3, now back to normal execution
SETIP native code

Code:

# pc = 0x031c, Opcode = 0x1c, (example address/opcode)
# Instruction SETIP <page>: Set instruction page to <0-255>, 16 cycles
label('SETIP')
ld(AC,Y)	    #10
st([vCpuSelect])    #11 needed for interrupts
jmp(Y,'REENTER')    #12 the REENTER/NEXT/NEXTY labels only exist in page3, we extract the low byte of REENTER and combine it with <page>
ld(-16/2)           #13
Now that we are in a new page (after running PREFIX/SETIP), we have a duplicate of the dispatch code, i.e. 'NEXTY', 'NEXT' and 'EXIT'. The labels are not redefined, as they are already defined in page3 and we only use their low bytes so that we can branch correctly within any page:

Code:

bra('.next2')                   #0 Enter at '.next2' (so no startup overhead)
# --- Page boundary ---
align(0x100,size=0x100)
ld([vPC+1],Y)                   #1

# Fetch next instruction and execute it, but only if there are sufficient
# ticks left for the slowest instruction.
adda([vTicks])                  #0 Track elapsed ticks (actually counting down: AC<0)
blt('EXIT')                     #1 Escape near time out
st([vTicks])                    #2
ld([vPC])                       #3 Advance vPC
adda(2)                         #4
st([vPC],X)                     #5
ld([Y,X])                       #6 Fetch opcode (actually a branch target)
st([Y,Xpp])                     #7 Just X++
bra(AC)                         #8 Dispatch
ld([Y,X])                       #9 Prefetch operand

# Resync with video driver and transfer control
adda(maxTicks)                  #3
bgt(pc()&255)                   #4 Resync
suba(1)                         #5
ld(hi('vBlankStart'),Y)         #6
jmp(Y,[vReturn])                #7 To video driver
ld(0)                           #8 AC should be 0 already. Still..
assert vCPU_overhead ==          9
And we also have a duplicate of the 'REENTER' instructions at 0xXXCB:

Code:

# pc = 0xXXCB, duplicate of 'REENTER'
bra('NEXT')                     #XX+0 Return from SYS calls
ld([vPC+1],Y)                   #XX+1
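The address arithmetic behind reusing the page 3 labels in other pages can be sketched in Python. The helper name `in_page` is made up; the REENTER address 0x03CB is taken from the 0xXXCB comment above.

```python
def in_page(page, label_address):
    """Address reached by jmp(Y, label) when Y holds `page`.

    The high byte comes from Y, the low byte from the label.
    """
    return ((page & 0xFF) << 8) | (label_address & 0xFF)


REENTER = 0x03CB  # page 3 address, per the 0xXXCB comment above

print(hex(in_page(0x15, REENTER)))
```

This only works if every instruction page places its copy of the dispatch and REENTER code at the same low-byte offsets as page 3.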
So an example instruction that does a byte left shift could be:

LSLB native code

Code:

# pc = 0x1550, Opcode = 0x50, (example address/opcode)
# Instruction LSLB <var>: Logical shift left byte var, var.lo <<= 1, 16 cycles
label('LSLB')
ld(AC,X)	    #10
ld([X])		    #11
adda([X])	    #12
bra('NEXTY')        #13
ld(-16/2)           #14
Unless I am missing something, there is no state to be saved or restored (apart from vCpuSelect for vertical blank interrupts). You could in fact have two versions of SETIP: SETIPI at 16 cycles for when interrupts are enabled, and SETIP at 14 cycles for when interrupts are disabled; the compiler/programmer could choose depending on their use case.

Advantages:
  • The only state needed to be saved/restored is vCpuSelect and that costs 1 native instruction cycle per SETIP.
  • There is no vPC fix up needed, allowing instructions to be 3 cycles more densely packed.
  • There is no loss of cycles getting into or out of the new instruction, allowing instructions to be more densely packed.
  • SETIP only needs to be called when switching pages, PREFIX needs to be called for every instruction not in page 3.
Disadvantages:
  • It costs 2 bytes of code-space for SETIP, but only 1 byte for PREFIX
  • Highly page interleaved code would cause a compiler to continually oscillate SETIP between page3 and other pages.
I haven't tested this code and haven't 100% thought through the entire process, so I may have missed something crucial, please let me know if I did.
lb3361 wrote: 06 Apr 2021, 22:29 But I really like the idea that certain prefixes change the maximum instruction time. That solves both the problem of limited instruction slots and the problem of the maximum instruction execution time. The main drawback is that we lose ~12 cycles to process the prefix and switch page. Compare this to the ~6 cycles one needs to jump to code located in a different page and back. On the other hand, this could be less than what we lose by increasing the execution time of frequent instructions in order to open new instruction slots in page 3.
I don't follow how you can change maxTicks in any meaningful way for longer instructions. 'maxTicks' is a global definition that the runVcpu macro (and a bunch of other code and macros) uses to define a maximum slot size limit. Once defined within the source code, it can't be changed again after compile time.

I experimented with values 28, 30 and 32. 28 was the original value and didn't allow for crucial instructions such as 'DEEKX'. 30 allows these crucial instructions to exist and incurs about a 5% overall performance penalty when using 28 as a baseline. But because I was able to move most of the instructions out of page3 into other pages and re-code them taking advantage of the copious amounts of ROM space available, some of the original instructions decreased in cycle count, and therefore 30 cycle execution compared to 28 cycle execution is statistically within +/- 2% on all the applications I tested.

32 cycles on the other hand incurs about a 15% performance penalty across the board and even though 32 allows even more complex instructions to exist I deemed the extra functionality not worth the performance hit.

So I am interested in exactly how you would go about increasing 'maxTicks' for some instructions given the above.
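For anyone following along, here is a deliberately simplified Python model of why a larger maximum costs throughput. It assumes the dispatcher only starts an instruction while at least the maximum instruction time remains in the slice, and it works in cycles rather than ticks. The slice length of 148 cycles and the fixed 20 cycle instruction cost are arbitrary illustration values, not real ROM numbers, and the model is not expected to reproduce the percentages above.

```python
def instructions_per_slice(slice_cycles, max_cycles, cost_cycles):
    """Count how many fixed-cost instructions run in one time slice.

    An instruction is only dispatched while at least max_cycles remain,
    loosely mirroring the 'blt EXIT' check in the dispatcher.
    Returns (instructions executed, cycles left unused).
    """
    remaining = slice_cycles
    executed = 0
    while remaining >= max_cycles:
        remaining -= cost_cycles
        executed += 1
    return executed, remaining


for max_cycles in (28, 30, 32):
    print(max_cycles, instructions_per_slice(148, max_cycles, 20))
```

The cycles left unused at the end of each slice grow with the maximum, regardless of how cheap the instructions actually executed were.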
lb3361 wrote: 06 Apr 2021, 22:29 In fact I would even argue to implement the page X dispatcher with jump tables, even if this costs additional cycles. My main concern here is to give more implementation flexibility while keeping the same opcodes. The kind of gymnastics you had to do in page 3 to change the code while keeping the same opcodes seems unwise. Even fixing a minor bug could be a major problem.
I did implement an indirection table in page 3 as my first attempt. It wasn't video cycle error free, but it did perform the task required, and what I found was that instruction cycle times ballooned out by an extra 16-20 cycles. It would be interesting to revisit this at some stage and see if it could be done more efficiently.

I actually didn't have to perform any gymnastics in page3 to move instructions to other pages. What I had to do was unravel Marcel's magnificent gymnastics and then rewrite all the instructions knowing I had vast amounts of ROM space to play with.

i.e. Marcel originally wrote the code balancing these 3 constraints: page3 byte usage, instruction execution time and number of instructions. This resulted in him producing some amazing code that satisfied all 3 constraints about as well as any Earthly programmer could have probably achieved IMHO.

I, on the other hand, decided to only prioritise the number of instruction slots, so I had to painstakingly unravel Marcel's code, provide simple launch pads into other pages of memory for each old and new instruction, and then re-code the old instructions using the advantage of massive amounts of ROM space. As you know, this led to some old instructions actually executing more quickly, but my implementations of the old instructions are usually 50% to 100% bigger in byte size compared to Marcel's versions.

Here is what my page3 now looks like. You'll notice there are only 6 instructions that still fully exist within page3: ADDW, SUBW, LUP, SYS, XORI and BRA (RET doesn't count as it starts at 0x03FF and spills into page 4).
  • ADDW can't be moved out of page3 and remain under 30 cycles, (not by me anyway, I tried and failed multiple times, my attempts are left as comments in the source code).
  • SUBW as above and it contains the REENTER and REENTER_28 labels.
  • LUP would only free up 1 native instruction slot if it was moved, you need a minimum of 2 for a launchpad.
  • SYS can't be moved due to cycle time limitations, (but I said the same thing about BCC), so it may be worth revisiting at some stage.
  • XORI would free up 1 vCPU instruction slot, but it would overlap with the st([vPC]) instruction of the following BRA instruction, which is messy unless you require loading AC into vPC for your new instruction.
  • BRA would only free up 1 native instruction slot if it was moved, but if it was moved it would allow the one vCPU instruction slot in XORI to become available without the modification of vPC caveat, (I am leaving this as a last resort for an instruction worthy enough, as BRA and XORI are very common instructions both executing in 14 cycles, moving them would cause a significant increase in their execution times, so the end result would have to be worth it).
P.S. native bugs are now trivial to fix as the implementation code for each vCPU instruction is simpler to understand (it's usually just sequential code with multiple paths for each branch) and has no start or size constraints.
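As a quick check of the "minimum of 2 for a launchpad" point, this Python sketch computes how many native instruction slots each entry occupies from the gaps between opcodes. The opcode values are copied from the first few entries of the page 3 listing in this post; the `slot_sizes` helper is made up for illustration.

```python
# (name, opcode) pairs copied from the start of the page 3 listing in this post.
opcodes = [
    ("LDWI", 0x11), ("DEC", 0x14), ("MOVQ", 0x16), ("LSRB", 0x18),
    ("LD", 0x1A), ("SEXT", 0x1C), ("CMPHS", 0x1F),
]


def slot_sizes(entries):
    """Native instruction slots each entry occupies before the next opcode starts."""
    return {
        name: next_op - op
        for (name, op), (_, next_op) in zip(entries, entries[1:])
    }


print(slot_sizes(opcodes))
```

Most launch pads take 2 slots (load the target page into Y, then jmp), and, as the "#dummy ... Overlap" comments in the listing suggest, the jump's delay slot is simply the first instruction of the next launch pad. Entries that need a specific instruction in the delay slot, such as LDWI and SEXT, take 3.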

ROMvX0 PAGE 3

Code:

# pc = 0x0311, Opcode = 0x11
# Instruction LDWI: Load immediate word constant (vAC=D), 24 cycles
label('LDWI')
ld(hi('ldwi#13'),Y)             #10
jmp(Y,'ldwi#13')                #11
ld([vPC+1],Y)                   #12

# pc = 0x0314, Opcode = 0x14
# Instruction DEC: Decrement byte var ([D]--), 22 cycles
label('DEC')
ld(hi('dec#13'),Y)              #10
jmp(Y,'dec#13')                 #11
#dummy                          #12 Overlap
#
# pc = 0x0316, Opcode = 0x16
# Instruction MOVQ: Load a byte var with a small constant 0..255, 28 cycles
label('MOVQ')
ld(hi('movq#13'),Y)             #10 #12
jmp(Y,'movq#13')                #11
#dummy                          #12 Overlap
#
# pc = 0x0318, Opcode = 0x18
# Instruction LSRB: Logical shift right on a byte var, 28 cycles
label('LSRB')
ld(hi('lsrb#13'),Y)             #10 #12
jmp(Y,'lsrb#13')                #11
#dummy                          #12 Overlap

# pc = 0x031a, Opcode = 0x1a
# Instruction LD: Load byte from zero page (vAC=[D]), 22 cycles
label('LD')
ld(hi('ld#13'),Y)               #10 #12
jmp(Y,'ld#13')                  #11
#dummy                          #12 Overlap

# pc = 0x031c, Opcode = 0x1c
# Instruction SEXT: Sign extend vAC based on a variable mask, 28 cycles
label('SEXT')
ld(hi('sext#13'),Y)             #10, #12
jmp(Y,'sext#13')                #11
st([vTmp])                      #12 sign mask

# pc = 0x031f, Opcode = 0x1f
# Instruction CMPHS: Adjust high byte for signed compare (vACH=XXX), 28 cycles
label('CMPHS_v5')
ld(hi('cmphs#13'),Y)            #10
jmp(Y,'cmphs#13')               #11
#dummy                          #12 Overlap, not dependent on ld(AC,X) anymore

# pc = 0x0321, Opcode = 0x21
# Instruction LDW: Load word from zero page (vAC=[D]+256*[D+1]), 24 cycles
label('LDW')
ld(hi('ldw#13'),Y)              #10
jmp(Y,'ldw#13')                 #11
#dummy                          #12 Overlap
# 
# pc = 0x0323, Opcode = 0x23
# Instruction PEEKX: Peek byte at address contained in var, inc var, 30 cycles
label('PEEKX') 
ld(hi('peekx#13'),Y)            #10 #12
jmp(Y,'peekx#13')               #11
#dummy                          #12 Overlap
#
# pc = 0x0325, Opcode = 0x25
# Instruction POKEI: Poke immediate byte into address contained in [vAC], 20 cycles
label('POKEI') 
ld(hi('pokei#13'),Y)            #10 #12
jmp(Y,'pokei#13')               #11
#dummy                          #12 Overlap
# 
# pc = 0x0327, Opcode = 0x27
# Instruction LSLV: Logical shift left word var, 28 cycles
label('LSLV')
ld(hi('lslv#13'),Y)             #10 #12
jmp(Y,'lslv#13')                #11
#dummy                          #12 Overlap
#
# pc = 0x0329, Opcode = 0x29
# Instruction ADDBA: vAC += var.lo, 28 cycles
label('ADDBA')
ld(hi('addba#13'),Y)            #10 #12
jmp(Y,'addba#13')               #11
#dummy                          #12 Overlap
# 
# pc = 0x032b, Opcode = 0x2b
# Instruction STW: Store word in zero page ([D],[D+1]=vAC&255,vAC>>8), 24 cycles
label('STW')
ld(hi('stw#13'),Y)              #10 #12
jmp(Y,'stw#13')                 #11
#dummy                          #12 Overlap
#
# pc = 0x032d, Opcode = 0x2d
# Instruction ADDBI: Add a constant 0..255 to byte var, 28 cycles
label('ADDBI') 
ld(hi('addbi#13'),Y)            #10 #12
jmp(Y,'addbi#13')               #11
#dummy                          #12 Overlap
#
# pc = 0x032f, Opcode = 0x2f
# Instruction XCHG: Exchange byte of [vAC] and [var], 28 cycles
label('XCHG')
ld(hi('xchg#13'),Y)             #10 #12
jmp(Y,'xchg#13')                #11
ld([vPC+1],Y)                   #12
#
# pc = 0x0332, Opcode = 0x32
# Instruction DBNZ:  Decrement byte var and branch if not zero then 26 cycles, 28 cycles on zero
label('DBNZ')
ld(hi('dbnz#13'),Y)             #10
jmp(Y,'dbnz#13')                #11
ld([vPC+1],Y)                   #12 vPC.hi
#
# pc = 0x0335, Opcode = 0x35
# Instruction BCC: Test AC sign and branch conditionally, variable, (24-26), cycles
label('BCC')
bra(AC)                         #10 AC is the conditional operand
st([Y,Xpp])                     #11 X++

# pc = 0x0337, Opcode = 0x37
# Instruction DOKEI: Doke immediate word into address contained in [vAC], 30 cycles
label('DOKEI') 
ld(hi('dokei#13'),Y)            #10
jmp(Y,'dokei#13')               #11
#dummy                          #12 Overlap

# pc = 0x0339, Opcode = 0x39
# Instruction PEEKV: Read byte from address contained in var, 30 cycles
label('PEEKV')
ld(hi('peekv#13'),Y)            #10
jmp(Y,'peekv#13')               #11
#dummy                          #12 Overlap

# pc = 0x033b, Opcode = 0x3b
# Instruction DEEKV: Read word from address contained in var, 28 cycles
label('DEEKV')
ld(hi('deekv#13'),Y)            #10 #12
jmp(Y,'deekv#13')               #11
#dummy                          #12 Overlap

# pc = 0x033d, Opcode = 0x3d
# Instruction XORBI: var.lo ^= imm, 28 cycles
label('XORBI')
ld(hi('xorbi#13'),Y)            #10 #12
jmp(Y,'xorbi#13')               #11
#dummy                          #12 Overlap

# pc = 0x033f, Opcode = 0x3f
# Conditional EQ: Branch if zero (if(vACL==0)vPCL=D)
ld(hi('beq#15'),Y)              #12 #12
jmp(Y,'beq#15')                 #13
ld([vPC+1],Y)                   #14 vPC.hi

# pc = 0x0342, Opcode = 0x42
# Instruction ANDBA: vAC &= var.lo, 24 cycles
label('ANDBA')
ld(hi('andba#13'),Y)            #10 #12
jmp(Y,'andba#13')               #11
#dummy                          #12 Overlap

# pc = 0x0344, Opcode = 0x44
# Instruction ORBA: vAC |= var.lo, 22 cycles
label('ORBA')
ld(hi('orba#13'),Y)             #10 #12
jmp(Y,'orba#13')                #11
#dummy                          #12 Overlap

# pc = 0x0346, Opcode = 0x46
# Instruction XORBA: vAC ^= var.lo, 22 cycles
label('XORBA')
ld(hi('xorba#13'),Y)            #10 #12
jmp(Y,'xorba#13')               #11
#dummy                          #12 Overlap

# pc = 0x0348, Opcode = 0x48
# Instruction NOTB: var.lo = ~var.lo, 22 cycles
label('NOTB') 
ld(hi('notb#13'),Y)             #10 #12
jmp(Y,'notb#13')                #11
#dummy                          #12 Overlap

# pc = 0x034a, Opcode = 0x4a
# Instruction DOKEX: doke word in vAC to address contained in var, var += 2, 30 cycles
label('DOKEX') 
ld(hi('dokex#13'),Y)            #10 #12
jmp(Y,'dokex#13')               #11
ld(AC,X)                        #12

# pc = 0x034d, Opcode = 0x4d
# Conditional GT: Branch if positive (if(vACL>0)vPCL=D)
ld(hi('bgt#15'),Y)              #12
jmp(Y,'bgt#15')                 #13
ld([vPC+1],Y)                   #14 vPC.hi

# pc = 0x0350, Opcode = 0x50
# Conditional LT: Branch if negative (if(vACL<0)vPCL=D)
ld(hi('blt#15'),Y)              #12
jmp(Y,'blt#15')                 #13
ld([vPC+1],Y)                   #14 vPC.hi

# pc = 0x0353, Opcode = 0x53
# Conditional GE: Branch if positive or zero (if(vACL>=0)vPCL=D)
ld(hi('bge#15'),Y)              #12
jmp(Y,'bge#15')                 #13
ld([vPC+1],Y)                   #14 vPC.hi

# pc = 0x0356, Opcode = 0x56
# Conditional LE: Branch if negative or zero (if(vACL<=0)vPCL=D)
ld(hi('ble#15'),Y)              #12
jmp(Y,'ble#15')                 #13
ld([vPC+1],Y)                   #14 vPC.hi

# pc = 0x0359, Opcode = 0x59
# Instruction LDI: Load immediate small positive constant (vAC=D), 20 cycles
label('LDI')
ld(hi('ldi#13'),Y)              #10
jmp(Y,'ldi#13')                 #11
#dummy                          #12 Overlap
#
# pc = 0x035b, Opcode = 0x5b
# Instruction MOVQW: Load a word var with a small constant 0..255, 30 cycles
label('MOVQW')
ld(hi('movqw#13'),Y)            #10 #12
jmp(Y,'movqw#13')               #11
ld([vPC+1],Y)                   #12 vPC.hi

# pc = 0x035e, Opcode = 0x5e
# Instruction ST: Store byte in zero page ([D]=vAC&255), 20 cycles
label('ST')
ld(hi('st#13'),Y)               #10
jmp(Y,'st#13')                  #11
#dummy                          #12 Overlap
#
# pc = 0x0360, Opcode = 0x60
# Instruction DEEKX: Deek word at address contained in var, var += 2, 30 cycles
label('DEEKX') 
ld(hi('deekx#13'),Y)            #10 #12
jmp(Y,'deekx#13')               #11
ld(0,Y)                         #12

# pc = 0x0363, Opcode = 0x63
# Instruction POP: Pop address from stack (vLR,vSP==[vSP]+256*[vSP+1],vSP+2), 30 cycles
label('POP')
ld(hi('pop#13'),Y)              #10
jmp(Y,'pop#13')                 #11
#dummy                          #12 Overlap
#
# pc = 0x0365, Opcode = 0x65
# Instruction MOV: Moves a byte from src var to dst var, 28 cycles
label('MOV')
ld(hi('mov#13'),Y)              #10
jmp(Y,'mov#13')                 #11
#dummy                          #12 Overlap
#
# pc = 0x0367, Opcode = 0x67
# Instruction PEEKA: Peek a byte from [AC] to var, 24 cycles
label('PEEKA') 
ld(hi('peeka#13'),Y)             #10 #12
jmp(Y,'peeka#13')                #11
#dummy                          #12 Overlap
#
# pc = 0x0369, Opcode = 0x69
# Instruction POKEA: Poke a byte from var to [vAC], 22 cycles
label('POKEA') 
ld(hi('pokea#13'),Y)             #10 #12
jmp(Y,'pokea#13')                #11
#dummy                          #12 Overlap

# pc = 0x036b, Opcode = 0x6b
# Instruction TEQ: Test for EQ, returns 0x0000 or 0x0101 in vAC, 28 cycles
label('TEQ')
ld(hi('teq#13'),Y)              #10 #12
jmp(Y,'teq#13')                 #11
#dummy                          #12 Overlap
#
# pc = 0x036d, Opcode = 0x6d
# Instruction TNE: Test for NE, returns 0x0000 or 0x0101 in vAC, 28 cycles
label('TNE')
ld(hi('tne#13'),Y)              #10 #12
jmp(Y,'tne#13')                 #11
#dummy                          #12 Overlap
#
# pc = 0x036f, Opcode = 0x6f
# Instruction DEEKA: Move a word from [AC] to var, 30 cycles
label('DEEKA')
ld(hi('deeka#13'),Y)            #10, #12
jmp(Y,'deeka#13')               #11
st([vTmp])                      #12 mask

# pc = 0x0372, Opcode = 0x72
# Conditional NE: Branch if not zero (if(vACL!=0)vPCL=D)
ld(hi('bne#15'),Y)              #12
jmp(Y,'bne#15')                 #13
ld([vPC+1],Y)                   #14 vPC.hi

# pc = 0x0375, Opcode = 0x75
# Instruction PUSH: Push vLR on stack ([vSP-2],v[vSP-1],vSP=vLR&255,vLR>>8,vLR-2), 30 cycles
label('PUSH')
ld(hi('push#13'),Y)             #10
jmp(Y,'push#13')                #11
#dummy                          #12 Overlap
#
# pc = 0x0377, Opcode = 0x77
# Instruction SUBBA: vAC -= var.lo, 28 cycles
label('SUBBA')
ld(hi('subba#13'),Y)            #10 #12
jmp(Y,'subba#13')               #11
#dummy                          #12 Overlap
#
# pc = 0x0379, Opcode = 0x79
# Instruction INCW: Increment word var, 26 cycles
label('INCW')
ld(hi('incw#13'),Y)             #10
jmp(Y,'incw#13')                #11
#dummy                          #12 Overlap
#
# pc = 0x037b, Opcode = 0x7b
# Instruction DECW: Decrement word var, 26 cycles
label('DECW')
ld(hi('decw#13'),Y)             #10 #12
jmp(Y,'decw#13')                #11
#dummy                          #12 Overlap
#
# pc = 0x037d, Opcode = 0x7d
# Instruction DOKEA: Doke a word from var to [vAC], 30 cycles
label('DOKEA') 
ld(hi('dokea#13'),Y)            #10 #12
jmp(Y,'dokea#13')               #11
#dummy                          #12 Overlap

# pc = 0x037f, Opcode = 0x7f
# Instruction LUP: ROM lookup (vAC=ROM[vAC+D]), 26 cycles
label('LUP')
ld([vAC+1],Y)                   #10
jmp(Y,251)                      #11 Trampoline offset
adda([vAC])                     #12

# pc = 0x0382, Opcode = 0x82
# Instruction ANDI: Logical-AND with small constant (vAC&=D), 20 cycles
label('ANDI')
ld(hi('andi#13'),Y)             #10
jmp(Y,'andi#13')                #11
anda([vAC])                     #12

# pc = 0x0385, Opcode = 0x85
# Instruction CALLI: Goto immediate address and remember vPC (vLR,vPC=vPC+3,$HHLL-2), 28 cycles
label('CALLI_v5')
ld(hi('calli#13'),Y)            #10
jmp(Y,'calli#13')               #11
ld([vPC])                       #12

# pc = 0x0388, Opcode = 0x88
# Instruction ORI: Logical-OR with small constant (vAC|=D), 20 cycles
label('ORI')
ld(hi('ori#13'),Y)              #10
jmp(Y,'ori#13')                 #11
#dummy                          #12 Overlap
#
# pc = 0x038a, Opcode = 0x8a
# Instruction NOTW: Boolean invert var
label('NOTW')
ld(hi('notw#13'),Y)             #10
jmp(Y,'notw#13')                #11
#dummy                          #12 Overlap
#
# pc = 0x038c, Opcode = 0x8c
# Instruction XORI: Logical-XOR with small constant (vAC^=D), 14 cycles
label('XORI')
xora([vAC])                     #10 #12
st([vAC])                       #11
bra('NEXT')                     #12
ld(-14/2)                       #13

# pc = 0x0390, Opcode = 0x90
# Instruction BRA: Branch unconditionally (vPC=(vPC&0xff00)+D), 14 cycles
label('BRA')
st([vPC])                       #10 #12
bra('NEXTY')                    #11
ld(-14/2)                       #12

# pc = 0x0393, Opcode = 0x93
# Instruction INC: Increment zero page byte ([D]++), 20 cycles
label('INC')
ld(hi('inc#13'),Y)              #10
jmp(Y,'inc#13')                 #11
#dummy                          #12 Overlap
#
# pc = 0x0395, Opcode = 0x95
# Instruction ORBI: OR immediate byte with byte var, result in byte var, 28 cycles
label('ORBI')
ld(hi('orbi#13'),Y)             #10 #12
jmp(Y,'orbi#13')                #11
#dummy                          #12 Overlap
#
# pc = 0x0397, Opcode = 0x97
# Instruction CMPHU: Adjust high byte for unsigned compare (vACH=XXX), 28 cycles
label('CMPHU_v5')
ld(hi('cmphu#13'),Y)            #10
jmp(Y,'cmphu#13')               #11
#dummy                          #12 Overlap, not dependent on ld(AC,X) anymore
#
# pc = 0x0399, Opcode = 0x99
# Instruction ADDW: Word addition with zero page (vAC+=[D]+256*[D+1]), 28 cycles
label('ADDW')
# The non-carry paths could be 26 cycles at the expense of (much) more code.
# But a smaller size is better so more instructions fit in this code page.
# 28 cycles is still 4.5 usec. The 6502 equivalent takes 20 cycles or 20 usec.
ld(AC,X)                        #10,12 Address of low byte to be added
adda(1)                         #11
st([vTmp])                      #12 Address of high byte to be added
ld([vAC])                       #13 Add the low bytes
adda([X])                       #14
st([vAC])                       #15 Store low result
bmi('.addw#18')                 #16 Now figure out if there was a carry
suba([X])                       #17 Gets back the initial value of vAC
bra('.addw#20')                 #18
ora([X])                        #19 Carry in bit 7
label('.addw#18')
anda([X])                       #18 Carry in bit 7
nop()                           #19
label('.addw#20')
anda(0x80,X)                    #20 Move carry to bit 0
ld([X])                         #21
adda([vAC+1])                   #22 Add the high bytes with carry
ld([vTmp],X)                    #23
adda([X])                       #24
st([vAC+1])                     #25 Store high result
bra('NEXT')                     #26
ld(-28/2)                       #27

# pc = 0x0399, Opcode = 0x99
# Instruction ADDW: Word addition with zero page (vAC+=[D]+256*[D+1]), 30 cycles
#label('ADDW')
#ld(hi('addw#13'),Y)             #10 #12
#jmp(Y,'addw#13')                #11
#ld(0,Y)                         #12
#
#fillers(until=0xad)

# pc = 0x03ad, Opcode = 0xad
# Instruction PEEK: Read byte from memory (vAC=[vAC]), 26 cycles
label('PEEK')
ld(hi('peek#13'),Y)             #10
jmp(Y,'peek#13')                #11
#ld([vPC])                      #12 Overlap
#
# pc = 0x03b4, Opcode = 0xb4
# Instruction SYS: Native call, <=256 cycles (<=128 ticks, in reality less)
#
# The 'SYS' vCPU instruction first checks the number of desired ticks given by
# the operand. As long as there are insufficient ticks available in the current
# time slice, the instruction will be retried. This will effectively wait for
# the next scan line if the current slice is almost out of time. Then a jump to
# native code is made. This code can do whatever it wants, but it must return
# to the 'REENTER' label when done. When returning, AC must hold (the negative
# of) the actual consumed number of whole ticks for the entire virtual
# instruction cycle (from NEXT to NEXT). This duration may not exceed the prior
# declared duration in the operand + 28 (or maxTicks). The operand specifies the
# (negative) of the maximum number of *extra* ticks that the native call will
# need. The GCL compiler automatically makes this calculation from gross number
# of cycles to excess number of ticks.
# SYS functions can modify vPC to implement repetition. For example to split
# up work into multiple chunks.
label('.sys#13')
ld([vPC])                       #13,12 Retry until sufficient time
suba(2)                         #14
st([vPC])                       #15
bra('REENTER')                  #16
ld(-20/2)                       #17
label('SYS')
adda([vTicks])                  #10
blt('.sys#13')                  #11
ld([sysFn+1],Y)                 #12
jmp(Y,[sysFn])                  #13
#dummy()                        #14 Overlap
#
# pc = 0x03b8, Opcode = 0xb8
# Instruction SUBW: Word subtract with zero page (AC-=[D]+256*[D+1]), 28 cycles
# All cases can be done in 26 cycles, but the code will become much larger
label('SUBW')
ld(AC,X)                        #10,14 Address of low byte to be subtracted
adda(1)                         #11
st([vTmp])                      #12 Address of high byte to be subtracted
ld([vAC])                       #13
bmi('.subw#16')                 #14
suba([X])                       #15
st([vAC])                       #16 Store low result
bra('.subw#19')                 #17
ora([X])                        #18 Carry in bit 7
label('.subw#16')
st([vAC])                       #16 Store low result
anda([X])                       #17 Carry in bit 7
nop()                           #18
label('.subw#19')
anda(0x80,X)                    #19 Move carry to bit 0
ld([vAC+1])                     #20
suba([X])                       #21
ld([vTmp],X)                    #22
suba([X])                       #23
st([vAC+1])                     #24
label('REENTER_28')
ld(-28/2)                       #25
label('REENTER')
bra('NEXT')                     #26 Return from SYS calls
ld([vPC+1],Y)                   #27

#
# The instructions below are all implemented in the second code page. Jumping
# back and forth makes each 6 cycles slower, but it also saves space in the
# primary page for the instructions above. Most of them are in fact not very
# critical, as evidenced by the fact that they weren't needed for the first
# Gigatron applications (Snake, Racer, Mandelbrot, Loader). By providing them
# in this way, at least they don't need to be implemented as a SYS extension.
#
# pc = 0x03cd, Opcode = 0xcd
# Instruction DEF: Define data or code (vAC,vPC=vPC+2,(vPC&0xff00)+D), 26 cycles
label('DEF')
ld(hi('def#13'),Y)              #10
jmp(Y,'def#13')                 #11
#dummy                          #12 Overlap
#
# pc = 0x03cf, Opcode = 0xcf
# Instruction CALL: Goto address and remember vPC (vLR,vPC=vPC+2,[D]+256*[D+1]-2), 30 cycles
label('CALL')
ld(hi('call#13'),Y)             #10, #12
jmp(Y,'call#13')                #11
#dummy                          #12 Overlap
#
# pc = 0x03d1, Opcode = 0xd1
# Instruction POKEX: Poke byte in vAC to address contained in var, inc var, 30 cycles
label('POKEX') 
ld(hi('pokex#13'),Y)            #10 #12
jmp(Y,'pokex#13')               #11
#dummy                          #12 Overlap
#
# pc = 0x03d3, Opcode = 0xd3
# Instruction NEGW: Arithmetic negate var
label('NEGW')
ld(hi('negw#13'),Y)             #10, #12
jmp(Y,'negw#13')                #11
#dummy                          #12 Overlap
#
# pc = 0x03d5, Opcode = 0xd5
# Instruction TGE: Test for GE, returns 0x0000 or 0x0101 in vAC, 26 cycles
label('TGE')
ld(hi('tge#13'),Y)              #10 #12
jmp(Y,'tge#13')                 #11
#dummy                          #12 Overlap
#
# pc = 0x03d7, Opcode = 0xd7
# Instruction TLT: Test for LT, returns 0x0000 or 0x0101 in vAC, 26 cycles
label('TLT')
ld(hi('tlt#13'),Y)             #10 #12
jmp(Y,'tlt#13')                #11
#dummy                         #12 Overlap
#
# pc = 0x03d9, Opcode = 0xd9
# Instruction TGT: Test for GT, returns 0x0000 or 0x0101 in vAC, 28 cycles
label('TGT')
ld(hi('tgt#13'),Y)             #10 #12
jmp(Y,'tgt#13')                #11
#dummy                         #12 Overlap
#
# pc = 0x03db, Opcode = 0xdb
# Instruction TLE: Test for LE, returns 0x0000 or 0x0101 in vAC
label('TLE')
ld(hi('tle#13'),Y)             #10 #12
jmp(Y,'tle#13')                #11
#dummy                         #12 Overlap
#
# pc = 0x03dd, Opcode = 0xdd
# Instruction ANDBI: And immediate byte with byte var, result in byte var, 28 cycles
label('ANDBI')
ld(hi('andbi#13'),Y)           #10 #12
jmp(Y,'andbi#13')              #11
#dummy                         #12 Overlap
#
# pc = 0x03df, Opcode = 0xdf
# Instruction ALLOC: Create or destroy stack frame (vSP+=D), 20 cycles
label('ALLOC')
ld(hi('alloc#13'),Y)           #10
jmp(Y,'alloc#13')              #11
#dummy                         #12 Overlap
#
# pc = 0x03e1, Opcode = 0xe1
# Instruction SUBBI: Subtract a constant 0..255 from a byte var, 28 cycles
label('SUBBI')
ld(hi('subbi#13'),Y)            #10 #12
jmp(Y,'subbi#13')               #11
#dummy                          #12 Overlap
#
# pc = 0x03e3, Opcode = 0xe3
# Instruction ADDI: Add small positive constant (vAC+=D), 26 cycles
label('ADDI')
ld(hi('addi#13'),Y)             #10 #12
jmp(Y,'addi#13')                #11
st([vTmp])                      #12

# pc = 0x03e6, Opcode = 0xe6
# Instruction SUBI: Subtract small positive constant (vAC-=D), 26 cycles
label('SUBI')
ld(hi('subi#13'),Y)             #10
jmp(Y,'subi#13')                #11
st([vTmp])                      #12

# pc = 0x03e9, Opcode = 0xe9
# Instruction LSLW: Logical shift left (vAC<<=1), 28 cycles
# Useful, because ADDW can't add vAC to itself. Also more compact.
label('LSLW')
ld(hi('lslw#13'),Y)             #10
jmp(Y,'lslw#13')                #11
ld([vAC])                       #12

# pc = 0x03ec, Opcode = 0xec
# Instruction STLW: Store word in stack frame ([vSP+D],[vSP+D+1]=vAC&255,vAC>>8), 24 cycles
label('STLW')
ld(hi('stlw#13'),Y)             #10
jmp(Y,'stlw#13')                #11
#dummy()                        #12 Overlap
#
# pc = 0x03ee, Opcode = 0xee
# Instruction LDLW: Load word from stack frame (vAC=[vSP+D]+256*[vSP+D+1]), 24 cycles
label('LDLW')
ld(hi('ldlw#13'),Y)             #10,12
jmp(Y,'ldlw#13')                #11
#dummy()                        #12 Overlap
#
# pc = 0x03f0, Opcode = 0xf0
# Instruction POKE: Write byte in memory ([[D+1],[D]]=vAC&255), 26 cycles
label('POKE')
ld(hi('poke#13'),Y)             #10,12
jmp(Y,'poke#13')                #11
st([vTmp])                      #12

# pc = 0x03f3, Opcode = 0xf3
# Instruction DOKE: Write word in memory ([[D+1],[D]],[[D+1],[D]+1]=vAC&255,vAC>>8), 28 cycles
label('DOKE')
ld(hi('doke#13'),Y)             #10
jmp(Y,'doke#13')                #11
st([vTmp])                      #12

# pc = 0x03f6, Opcode = 0xf6
# Instruction DEEK: Read word from memory (vAC=[vAC]+256*[vAC+1]), 28 cycles
label('DEEK')
ld(hi('deek#13'),Y)             #10
jmp(Y,'deek#13')                #11
#dummy()                        #12 Overlap
#
# pc = 0x03f8, Opcode = 0xf8
# Instruction ANDW: Word logical-AND with zero page (vAC&=[D]+256*[D+1]), 28 cycles
label('ANDW')
ld(hi('andw#13'),Y)             #10,12
jmp(Y,'andw#13')                #11
#dummy()                        #12 Overlap
#
# pc = 0x03fa, Opcode = 0xfa
# Instruction ORW: Word logical-OR with zero page (vAC|=[D]+256*[D+1]), 28 cycles
label('ORW')
ld(hi('orw#13'),Y)              #10,12
jmp(Y,'orw#13')                 #11
#dummy()                        #12 Overlap
#
# pc = 0x03fc, Opcode = 0xfc
# Instruction XORW: Word logical-XOR with zero page (vAC^=[D]+256*[D+1]), 28 cycles
label('XORW')
ld(hi('xorw#13'),Y)             #10,12
jmp(Y,'xorw#13')                #11
ld(AC,X)                        #12

# pc = 0x03ff, Opcode = 0xff
# Instruction RET: Function return (vPC=vLR-2), 16 cycles
label('RET')
ld([vLR])                       #10
assert pc()&255 == 0
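
The "Carry in bit 7" steps in the SUBW listing above can be checked with a small plain-Python model. This is only a sketch, not ROM code: the function names are made up for illustration, and it assumes (as in the stock ROM) that zero-page address 0x00 holds 0 and address 0x80 holds 1, so that `suba([X])` after `anda(0x80,X)` subtracts the borrow.

```python
def subw_borrow(low, operand):
    """Borrow out of the low-byte subtraction, derived as in the SUBW listing."""
    result = (low - operand) & 0xFF
    if low & 0x80:                  # bmi taken: low byte of vAC has bit 7 set
        carry = result & operand    # anda([X])
    else:
        carry = result | operand    # ora([X])
    # anda(0x80,X) selects address 0x00 or 0x80, assumed to hold 0 and 1
    return (carry & 0x80) >> 7


def subw(vac, value):
    """16-bit subtract built from two byte subtractions plus the borrow."""
    low = ((vac & 0xFF) - (value & 0xFF)) & 0xFF
    borrow = subw_borrow(vac & 0xFF, value & 0xFF)
    high = ((vac >> 8) - borrow - (value >> 8)) & 0xFF
    return (high << 8) | low
```

Looping over all 65536 byte pairs shows that the bit-7 trick agrees with the plain comparison `low < operand`.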
lb3361
Posts: 360
Joined: 17 Feb 2021, 23:07

Re: New vCPU instructions 2.0

Post by lb3361 »

Sorry for my lack of clarity. Page switching with SETIP would surely work. I just meant to say that I think it is dangerous. With SETIP, you can no longer interpret or even disassemble a piece of vCPU code without knowing which page should be active when you enter it. Whenever you jump to a label, you must either know which page is expected (is it the same page as that of the jump instruction?) or be certain that the code will immediately SETIP the right page. This is why I believe that resetting to page 3 after each instruction is preferable, even if it costs a couple of cycles to do so.
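
To make the disassembly problem concrete, here is a toy sketch in plain Python. Everything in it is hypothetical: the opcode values, the mnemonics and the one-byte instruction format are invented for illustration and do not match the real page 3 or page 4 encodings.

```python
# Hypothetical opcode tables: the same byte means something else in each page.
PAGES = {
    3: {0x10: "P3_A", 0x11: "P3_B"},
    4: {0x10: "P4_A", 0x11: "P4_B"},
}
PREFIX = 0xC7  # hypothetical opcode: next instruction only comes from page N
SETIP = 0xC8   # hypothetical opcode: all following instructions come from page N


def disasm_prefix(code):
    """Context free: an instruction is in page 3 unless PREFIX precedes it."""
    out, i = [], 0
    while i < len(code):
        if code[i] == PREFIX:
            page, op = code[i + 1], code[i + 2]
            i += 3
        else:
            page, op = 3, code[i]
            i += 1
        out.append(PAGES[page][op])
    return out


def disasm_setip(code, entry_page):
    """Stateful: the result depends on the page that is active on entry."""
    out, i, page = [], 0, entry_page
    while i < len(code):
        if code[i] == SETIP:
            page = code[i + 1]
            i += 2
            continue
        out.append(PAGES[page][code[i]])
        i += 1
    return out


body = bytes([0x10, 0x11])
print(disasm_prefix(body))    # one answer, whatever the caller was doing
print(disasm_setip(body, 3))  # answer if the caller was in page 3
print(disasm_setip(body, 4))  # different answer if the caller was in page 4
```

With PREFIX the two bytes at a label always decode the same way. With SETIP the decoder needs an extra `entry_page` argument, which is exactly the information a disassembler does not have when it lands on an arbitrary label.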
lb3361
Posts: 360
Joined: 17 Feb 2021, 23:07

Re: New vCPU instructions 2.0

Post by lb3361 »

About maxTicks.

The code that really determines the maximum execution time is the dispatcher test that branches to EXIT when the tick count turns negative. What runVcpu does is initialize the tick count in such a way that it turns negative when there are fewer than maxTicks ticks left. But suppose you branch to EXIT when the tick count drops below 2 instead of zero. This means that you won't start an instruction unless there are at least maxTicks+2 ticks left. Therefore instructions can safely last maxTicks+2 ticks. The v6502 does things like that, as I recall...
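
A rough model of that argument in plain Python, in case it helps. This is a sketch under simplifying assumptions, not the ROM logic: the function and parameter names are invented, every instruction has the same cost, dispatcher overhead is ignored, and max_ticks=14 (28 cycles) is only an example value.

```python
def run_time_slice(total_ticks, max_ticks, cost, exit_below=0):
    """Simulate one time slice of the dispatcher.

    The counter starts at total_ticks - max_ticks, as runVcpu is described
    to do. An instruction is started only while counter >= exit_below.
    Returns (instructions executed, true ticks left over); a negative
    leftover means the slice was overrun.
    """
    counter = total_ticks - max_ticks
    executed = 0
    while counter >= exit_below:
        counter -= cost  # the instruction reports the ticks it consumed
        executed += 1
    return executed, counter + max_ticks


# Instructions that last max_ticks + 2 ticks overrun with the usual test...
print(run_time_slice(94, 14, 16, exit_below=0))  # (6, -2)
# ...but fit once the dispatcher exits when the counter drops below 2.
print(run_time_slice(94, 14, 16, exit_below=2))  # (5, 14)
```

The price is that the slice sometimes ends one instruction earlier, since the last couple of ticks are held in reserve.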
Last edited by lb3361 on 11 Apr 2021, 18:18, edited 1 time in total.
at67
Site Admin
Posts: 647
Joined: 14 May 2018, 08:29

Re: New vCPU instructions 2.0

Post by at67 »

lb3361 wrote: 08 Apr 2021, 22:45 About maxTicks.

The code that really determines the maximum execution time is the dispatcher test that branches to EXIT when the tick count turns negative. What runVcpu does is initialize the tick count in such a way that it turns negative when there are fewer than maxTicks ticks left. But suppose you branch to EXIT when the tick count drops below 2 instead of zero. This means that you won't start an instruction unless there are at least maxTicks+2 ticks left. Therefore instructions can safely last maxTicks+2 ticks. The v6502 does things like that, as I recall...
This is interesting; when I get time I'll have a play with this idea.
lb3361
Posts: 360
Joined: 17 Feb 2021, 23:07

Re: New vCPU instructions 2.0

Post by lb3361 »

Reading your page3. It seems that the Bcc instructions no longer need three bytes (e.g. BEQ = 35 3f xx). One can get the same effect without the 35 prefix.

Also a question: do PEEKX/POKEX do an INC or an INCW after peeking or poking?
at67
Site Admin
Posts: 647
Joined: 14 May 2018, 08:29

Re: New vCPU instructions 2.0

Post by at67 »

lb3361 wrote: 12 Apr 2021, 11:21 Reading your page3. It seems that the Bcc instructions no longer need three bytes (e.g. BEQ = 35 3f xx). One can get the same effect without the 35 prefix.
Yeah but that would break compatibility unfortunately.
lb3361 wrote: 12 Apr 2021, 11:21 Also a question: do PEEKX/POKEX do an INC or an INCW after peeking or poking?
PEEKX/DEEKX are 30 cycles using INC
POKEX is 28 using INC
DOKEX is 30 using INC

Code: Select all

# PEEKX implementation
label('peekx#13')
ld(0,Y)                         #13
ld(AC,X)                        #14
ld([X])                         #15 low byte peek address
st([vTmp])                      #16
adda(1)                         #17
st([Y,Xpp])                     #18
ld([X])                         #19 high byte peek address
ld(AC,Y)                        #20
ld([vTmp],X)                    #21
ld([Y,X])                       #22
st([vAC])                       #23
ld(0)                           #24
st([vAC+1])                     #25
ld(hi('NEXTY'),Y)               #26
jmp(Y,'NEXTY')                  #27
ld(-30/2)                       #28

# DEEKX implementation
label('deekx#13')
ld(AC,X)                        #13
ld([X])                         #14 low byte deek address
st([vTmp])                      #15
adda(2)                         #16
st([Y,Xpp])                     #17
ld([X])                         #18 high byte deek address
ld(AC,Y)                        #19
ld([vTmp],X)                    #20
ld([Y,X])                       #21
st([Y,Xpp])                     #22 X++
st([vAC])                       #23
ld([Y,X])                       #24
st([vAC+1])                     #25
ld(hi('NEXTY'),Y)               #26
jmp(Y,'NEXTY')                  #27
ld(-30/2)                       #28

# POKEX implementation
label('pokex#13')
ld(AC,X)                        #13 Operand
ld(0,Y)                         #14    
ld([X])                         #15 low byte poke address
st([vTmp])                      #16
adda(1)                         #17
st([Y,Xpp])                     #18
ld([X])                         #19 high byte poke address
ld(AC,Y)                        #20
ld([vTmp],X)                    #21
ld([vAC])                       #22
st([Y,X])                       #23
ld(hi('NEXTY'),Y)               #24
jmp(Y,'NEXTY')                  #25
ld(-28/2)                       #26

# DOKEX implementation
label('dokex#13')
ld(0,Y)                         #13
ld([X])                         #14 low byte poke address
st([vTmp])                      #15
adda(2)                         #16
st([Y,Xpp])                     #17
ld([X])                         #18 high byte poke address
ld(AC,Y)                        #19
ld([vTmp],X)                    #20
ld([vAC])                       #21
st([Y,Xpp])                     #22
ld([vAC+1])                     #23
st([Y,X])                       #24
ld(hi('REENTER'),Y)             #25
jmp(Y,'REENTER')                #26
ld(-30/2)                       #27
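
For readers who do not want to trace the native code, the INC behaviour of PEEKX and POKEX can be summarised with a small plain-Python model. This is a sketch based on my reading of the listings above, not the implementation itself: `mem` is a hypothetical 64K bytearray, `var` is the zero-page address of the pointer variable, and the helper name is invented.

```python
def _pointer_then_inc(mem, var, step):
    """Read the 16-bit pointer stored at var, then bump only its low byte."""
    lo, hi = mem[var], mem[(var + 1) & 0xFF]
    mem[var] = (lo + step) & 0xFF  # INC, not INCW: no carry into the high byte
    return (hi << 8) | lo          # the access uses the address before the bump


def peekx(mem, var):
    """Value left in vAC: the byte at the old pointer, high byte cleared."""
    return mem[_pointer_then_inc(mem, var, 1)]


def pokex(mem, var, vac):
    """Store the low byte of vAC at the old pointer."""
    mem[_pointer_then_inc(mem, var, 1)] = vac & 0xFF


mem = bytearray(0x10000)
mem[0x30], mem[0x31] = 0xFF, 0x08  # pointer variable at 0x30 holds 0x08FF
mem[0x08FF] = 0x42
print(hex(peekx(mem, 0x30)))       # 0x42
print(hex(mem[0x31]), hex(mem[0x30]))  # 0x8 0x0: wrapped within page 0x08
```

Because only the low byte is incremented, a pointer that reaches the end of a 256-byte page wraps back to the start of the same page instead of moving on to the next one.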