Hi All,
I'm starting to think about implementing a Tiny Forth on the Gigatron vCPU.
Forth requires 2 stacks - the data stack and the return stack. If we use the usual vCPU stack with its stack pointer at 0x1C as the data stack, is there a way of creating a second stack for the return stack, with its own return stack pointer at a different zero-page location?
For this to work, we will need indirect addressing with one location acting as a pointer to the word we want to access.
With DEEK and DOKE we can access a word that is referenced by the vAC. Is it a case of first priming the vAC with the contents of the required pointer?
Is there a neater way of doing this with the vCPU?
Any help appreciated.
Ken
Indirect Addressing?
Re: Indirect Addressing?
I believe that's the primary way. If you need to reduce code size, it can be hidden in subroutines for the primitives. Another approach could be to swap vSP's contents before using the instructions that operate on it. I haven't thought that through.
A completely different approach is to replace/augment vCPU with one optimised for FORTH operations. Maybe in a distant future though... (Perhaps there's a middle road and put some primitives in SYS functions that can be called from vCPU.)
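To make the "prime vAC with the pointer, then DEEK/DOKE" idea concrete, here is a minimal host-side sketch in C++ rather than vCPU assembly. It models only the handful of instructions involved and pushes and pops a return stack through a one-byte pointer kept in zero page. The addresses RSP and TMP, the Machine struct and the helper names rpush/rpop are assumptions made for this illustration, not anything defined by the Gigatron ROM.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical zero-page layout, chosen only for this sketch.
constexpr uint8_t RSP = 0x22;  // return stack pointer (low byte; 0x23 stays 0)
constexpr uint8_t TMP = 0x24;  // scratch word used while the pointer is adjusted

struct Machine {
    uint8_t  ram[256] = {};    // zero page only, which is all the sketch needs
    uint16_t vAC = 0;          // 16-bit accumulator
};

// Minimal models of the vCPU instructions used below.
void LD  (Machine& m, uint8_t d) { m.vAC = m.ram[d]; }
void ST  (Machine& m, uint8_t d) { m.ram[d] = m.vAC & 0xFF; }
void LDW (Machine& m, uint8_t d) { m.vAC = uint16_t(m.ram[d] | (m.ram[uint8_t(d + 1)] << 8)); }
void STW (Machine& m, uint8_t d) { m.ram[d] = m.vAC & 0xFF; m.ram[uint8_t(d + 1)] = m.vAC >> 8; }
void ADDI(Machine& m, uint8_t d) { m.vAC = uint16_t(m.vAC + d); }
void SUBI(Machine& m, uint8_t d) { m.vAC = uint16_t(m.vAC - d); }

// DEEK: vAC = word at the address currently held in vAC.
void DEEK(Machine& m) {
    assert(m.vAC < 0xFF);      // the sketch only models zero page
    uint16_t a = m.vAC;
    m.vAC = uint16_t(m.ram[a] | (m.ram[a + 1] << 8));
}

// DOKE d: store vAC at the address held in the zero-page word at d.
void DOKE(Machine& m, uint8_t d) {
    uint16_t a = uint16_t(m.ram[d] | (m.ram[uint8_t(d + 1)] << 8));
    assert(a < 0xFF);
    m.ram[a] = m.vAC & 0xFF;
    m.ram[a + 1] = m.vAC >> 8;
}

// Push vAC on the return stack: bump the pointer, then store through it.
void rpush(Machine& m) {
    STW(m, TMP);                          // park the value
    LD(m, RSP); ADDI(m, 2); ST(m, RSP);   // RSP += 2
    LDW(m, TMP); DOKE(m, RSP);            // [RSP] = value
}

// Pop the return stack into vAC: prime vAC with the pointer, DEEK, then step back.
void rpop(Machine& m) {
    LD(m, RSP); DEEK(m); STW(m, TMP);     // value = [RSP]
    LD(m, RSP); SUBI(m, 2); ST(m, RSP);   // RSP -= 2
    LDW(m, TMP);                          // result back in vAC
}
```

With RSP initialised to some free zero-page address, two rpush calls followed by two rpop calls return the values in reverse order and leave RSP where it started. Each helper costs six or seven modelled instructions, which is why hiding them in subroutines is attractive.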
Re: Indirect Addressing?
Hi All,
Now that we have a CPU capable of 12.5 MHz (at least) and the addition of 200 vCPU cycles on each line of video, I decided that it was time to start looking again at some sort of Forth-like interpreted language.
First off, I need to be able to simulate the vCPU behaviour - so I have started on a simple instruction simulator written in Arduino code. The main reason for choosing the Arduino is that the IDE is widely available and opens up a whole range of processors, not just AVR. Another reason is that using the millis() and micros() functions, blocks of simulated instructions can be timed accurately.
The simulator is a work in progress, and there are still half a dozen instructions to code (LUP, SYS, DEF, ALLOC, CALL, RET), but so far it's useful enough to test snippets of vCPU assembly to make sure they implement the correct stack behaviour needed for a stack-based language like Forth.
Forth is often implemented as a small set of primitive instructions - coded up in the assembly language of the target processor. These primitives perform stack manipulation, arithmetic and logic functions, memory access, I/O, and program flow structures - and usually consist of short, self-contained snippets of assembly language. This simulator is intended as a test bed for these isolated code snippets - so that they can be individually tested for correct operation, before being assembled into a larger program.
Initially I am concerned solely with stack operations, arithmetic, logic and memory access. The only I/O for the moment is via a serial terminal. Once the basics are in place, the language can be extended to include the Gigatron video generation.
The other idea I wish to implement is a set of pseudo registers in zero-page RAM. If vAC is register R0, then I see no reason why further 16-bit registers cannot be implemented in RAM - and given names (e.g. R1-R15) to make working in assembly language easier. This was inspired by Steve Wozniak's "Sweet16" and the regular register structure of the PDP-11, MSP430 etc. They won't be the fastest registers to access - but at least they will be memorable.
As mentioned earlier in this thread, the vCPU instruction set is not ideal for stack manipulation and the code to take two numbers off the stack, add them together and return the sum to the stack is something like 9 instructions (16 bytes) - see code window below:
If anyone would like to play about with simulating the vCPU - I have attached the draft code below, which simulates the stack addition above and prints out the address and data values of RAM. The data stack pointer is implemented at address 0x20, and in the example the top of stack sits at address 0x30 with the next item below it at 0x2E. When a bit more polished, the code will appear on GitHub.
Code:
0x1A, // LD DSP DSP = Data stack pointer 0x20 1A 20
0x20,
0xF6, // DEEK vAC = [DSP] + 256*[DSP+1] F6
0x2B, // STW TOS TOS = Top of Stack 2B 30
0x30,
0x1A, // LD DSP Get DSP 1A 20
0x20,
0xE6, // SUBI 02 Subtract 2 E6 02
0x02,
0x5E, // ST DSP Store back 5E 20
0x20,
0xF6, // DEEK vAC = second value on stack NOS F6
0x99, // ADDW TOS ADD TOS and NOS 99 30
0x30,
0xF3, // DOKE DSP Store sum to new stack top F3 20
0x20,
0x00, // addr 0x10
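As a cross-check for the byte sequence above, the same Forth "+" can be written directly on the host. This is only a reference sketch in C++ under the layout used in this post (a one-byte data stack pointer in zero page that points at the low byte of the top cell, popping by subtracting 2). The function name forth_add is made up for the illustration.

```cpp
#include <cassert>
#include <cstdint>

// Host-side reference for Forth "+" on a byte-addressed memory image.
// dsp_addr is the zero-page location holding the one-byte data stack pointer.
// Returns the 16-bit sum and leaves it as the new top of stack.
uint16_t forth_add(uint8_t* mem, uint8_t dsp_addr) {
    uint8_t sp = mem[dsp_addr];
    assert(sp >= 2 && sp < 0xFF);  // need a second cell below the top one

    uint16_t tos = uint16_t(mem[sp] | (mem[sp + 1] << 8));  // top of stack
    sp -= 2;                                                // pop one cell
    uint16_t nos = uint16_t(mem[sp] | (mem[sp + 1] << 8));  // next on stack

    uint16_t sum = uint16_t(tos + nos);                     // wraps at 16 bits
    mem[sp]     = sum & 0xFF;                               // overwrite NOS with the sum
    mem[sp + 1] = sum >> 8;
    mem[dsp_addr] = sp;                                     // write the pointer back
    return sum;
}
```

With the values from the memory image in this post (0x0110 at 0x30 and 0x0220 at 0x2E, pointer 0x30 stored at 0x20), the function returns 0x0330 and leaves the pointer at 0x2E, which is what the simulated sequence should also produce.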
Code:
// Gigatron vCPU Simulator
// Ken Boak April 28th 2019
// This attempts to simulate the vCPU instructions - so that short snippets of VCPU code may
// be written and tested - updating the vCPU accumulator vAC and any of the zero-page and other
// memory locations.
// Memory will be defined as an 8-bit array with the opcode in byte 1 and data in byte 2
// The following registers will be defined:
// 0016-0017 vPC Interpreter program counter, points into RAM
// 0018-0019 vAC Interpreter accumulator, 16-bits
// 001a-001b vLR Return address, for returning after CALL
// 001c vSP Stack pointer
// 001d (vTmp) Scratch storage location for vCPU
// 001e (vReturn) Return address (L) from vCPU into the loop (H is fixed)
//
/* List of Gigatron vCPU opcodes
"LDWI" 0x11 LDWI $DDDD Load immediate arbitrary constant (vAC=D)
"LD" 0x1A LD $DD Load byte from zero page (vAC=[D])
"ST" 0x5E ST $DD Store byte in zero page ([D]=vAC)
"LDW" 0x21 LDW $DD Word load from zero page (vAC=[D]+256*[D+1])
"STW" 0x2B STW $DD Store word into zero page ([D]=vAC&255,[D+1]=vAC>>8)
"STLW" 0xEC STLW $DD Store word in stack frame (vSP[D],vSP[D+1]=vAC&255,vAC>>8)
"LDLW" 0xEE LDLW $DD Load word from stack frame (vAC=vSP[D]+256*vSP[D+1])
"PEEK" 0xAD PEEK - Read byte from memory (vAC=[vAC])
"POKE" 0xF0 POKE $DD Write byte in memory ([[D+1],[D]]=vAC&255)
"DEEK" 0xF6 DEEK - Read word from memory (vAC=[vAC]+256*[vAC+1])
"DOKE" 0xF3 DOKE $DD Write word in memory ([[D+1],[D]],[[D+1],[D]+1]=vAC&255,vAC>>8)
"INC" 0x93 INC $DD Increment zero page byte ([D]++)
"BRA" 0x90 BRA $DD Branch unconditionally (vPC=(vPC&0xff00)+D)
"BCC" 0x35 BCC $CC $DD Test vAC and branch conditionally. CC can be EQ,NE,LT,GT,LE,GE
"EQ" 0x3F
"GT" 0x4D
"LT" 0x50
"GE" 0x53
"LE" 0x56
"NE" 0x72
"LDI" 0x59 LDI $DD Load immediate small positive constant (vAC=D)
"ADDI" 0xE3 ADDI $DD Add small positive constant (vAC+=D)
"SUBI" 0xE6 SUBI $DD Subtract small positive constant (vAC-=D)
"ANDI" 0x82 ANDI $DD Logical-AND with constant (vAC&=D)
"ORI" 0x88 ORI $DD Logical-OR with constant (vAC|=D)
"XORI" 0x8C XORI $DD Logical-XOR with constant (vAC^=D)
"ADDW" 0x99 ADDW $DD Word addition with zero page (vAC+=[D]+256*[D+1])
"SUBW" 0xB8 SUBW $DD Word subtraction with zero page (vAC-=[D]+256*[D+1])
"ANDW" 0xF8 ANDW $DD Word logical-AND with zero page (vAC&=[D]+256*[D+1])
"ORW" 0xFA ORW $DD Word logical-OR with zero page (vAC|=[D]+256*[D+1])
"XORW" 0xFC XORW $DD Word logical-XOR with zero page (vAC^=[D]+256*[D+1])
"POP" 0x63 POP - Pop address from stack into vLR (vLR=[vSP]+256*[vSP+1],vSP+=2)
"PUSH" 0x75 PUSH - Push vLR on stack ([vSP-2]=vLR&255,[vSP-1]=vLR>>8,vSP-=2)
"LUP" 0x7F LUP $DD ROM lookup (vAC=ROM[D,AC])
"SYS" 0xB4 SYS $DD Native function call using at most 2*T cycles, D=270-max(14,T)
"DEF" 0xCD DEF $DD Define data or code (vAC,vPC=vPC+2,D+256*(vPC>>8))
"ALLOC" 0xDF ALLOC $DD Create or destroy stack frame (vSP+=D)
"LSLW" 0xE9 LSLW - Shift left (because 'ADDW vAC' will not work!) (vAC+=vAC)
"CALL" 0xCF CALL $DD Goto address but remember vPC (vLR,vPC=vPC+2,[D]+256*[D+1]-2)
"RET" 0xFF RET - Leaf return (vPC=vLR-2)
With the simulation model in place it would be good to be able to evaluate small snippets of code
// LD DSP // DSP = Data stack pointer 1A 20
// DEEK // vAC = [DSP] + 256*[DSP+1] F6
// STW TOS // TOS = Top of Stack 2B 30
// LD DSP // Get DSP 1A 20
// SUBI 02 // Subtract 2 E6 02
// ST DSP // Store back 5E 20
// DEEK // vAC = second value on stack NOS F6
// ADDW TOS // ADD TOS and NOS 99 30
// DOKE DSP // Store sum to new stack top F3 20
0x1A,
0x20,
0xF6,
0x2B,
0x30,
0x1A,
0x20,
0xE6,
0x02,
0x5E,
0x20,
0xF6,
0x99,
0x30,
0xF3,
0x20,
*/
#define MEMSIZE 1024 // RAM sized for smallest Arduino
byte M[MEMSIZE] = {
0x1A, // LD DSP DSP = Data stack pointer 0x20 1A 20
0x20,
0xF6, // DEEK vAC = [DSP] + 256*[DSP+1] F6
0x2B, // STW TOS TOS = Top of Stack 2B 30
0x30,
0x1A, // LD DSP Get DSP 1A 20
0x20,
0xE6, // SUBI 02 Subtract 2 E6 02
0x02,
0x5E, // ST DSP Store back 5E 20
0x20,
0xF6, // DEEK vAC = second value on stack NOS F6
0x99, // ADDW TOS ADD TOS and NOS 99 30
0x30,
0xF3, // DOKE DSP Store sum to new stack top F3 20
0x20,
0x00, // addr 0x10
0x00,
0x00,
0x00,
0x00,
0x00,
0x00, // addr 0x16 vPC L
0x00, // addr 0x17 vPC H
0x00, // addr 0x18 vAC L
0x00, // addr 0x19 vAC H
0x00, // addr 0x1a vLR L
0x00, // addr 0x1b vLR H
0x00, // addr 0x1c vSP
0x00,
0x00,
0x00,
0x30, // addr 0x20 DSP points to top of stack
0x00,
0x00,
0x00,
0x00,
0x00,
0x00,
0x00,
0x00,
0x00,
0x00,
0x00,
0x00,
0x00,
0x20, // NOS
0x02,
0x10, // addr 0x30 TOS
0x01,
0x40,
0x04,
0x00,
0x00,
0x00,
0x00,
0x00,
0x00,
0x00,
0x00,
0x00,
0x00,
0x00,
0x00,
};
int vPC;
int vAC;
int vLR;
int addr; // The address
int IR; // The Instruction register
int DSP = 0x20; // Data stack pointer
int TOS;
// byte M; // The contents of memory address pointed to by the PC
byte D; // The data part of the instruction
byte DD; // 2nd byte of data
byte vSP;
int vTmp;
byte vReturn;
void fetch()
{
IR = M[vPC];
D = M[vPC+1]; // get the data
DD = M[vPC+2]; // get 2nd byte of data
vSP = M[0x1c]; // get stack pointer
vPC ++ ;
vPC &= (MEMSIZE-1) ;
}
void execute()
{
int op = IR; // get the opcode
switch (op) {
case 0x00: vTmp = M[DSP]; vTmp =M[vTmp] +256*M[vTmp+1]; Serial.print("TOS="); Serial.println(vTmp, HEX); break; // HALT - and print TOS
case 0x11: vAC = D + 256 * DD; vPC=vPC+2; break; // LDWI $DDDD Load immediate arbitrary constant (vAC=D)
case 0x1A: vAC=M[D]; vPC ++ ; break; // LD $DD Load byte from zero page (vAC=[D])
case 0x5E: M[D]=vAC; vPC ++ ; break; // ST $DD Store byte in zero page ([D]=vAC)
case 0x21: vAC=M[D]+256*M[D+1]; vPC ++ ; break; // LDW $DD Word load from zero page (vAC=[D]+256*[D+1])
case 0x2B: M[D]=vAC&255; M[D+1]=vAC>>8; vPC ++ ; break; // STW $DD Store word into zero page ([D]=vAC&255,[D+1]=vAC>>8)
case 0xEC: M[vSP+D]=vAC&255; M[vSP+D+1]=vAC>>8; vPC ++ ; break; // STLW $DD Store word in stack frame (vSP[D],vSP[D+1]=vAC&255,vAC>>8)
case 0xEE: vAC=M[vSP+D]+256*M[vSP+D+1]; vPC ++ ; break; // LDLW $DD Load word from stack frame (vAC=vSP[D]+256*vSP[D+1])
case 0xAD: vAC=M[vAC]; break; // PEEK - Read byte from memory (vAC=[vAC])
case 0xF0: addr=M[D]+256*M[D+1]; M[addr]=vAC&255; vPC ++ ; break; // POKE $DD Write byte in memory ([[D+1],[D]]=vAC&255). $DD is a one-byte zero-page operand holding the target address
case 0xF6: vAC=M[vAC] +256*M[vAC+1]; break; // DEEK - Read word from memory (vAC=[vAC]+256*[vAC+1])
case 0xF3: addr=M[D]+256*M[D+1]; M[addr]=vAC&255; M[addr+1]=vAC>>8; vPC ++ ; break; // DOKE $DD Write word in memory ([[D+1],[D]],[[D+1],[D]+1]=vAC&255,vAC>>8). Uses the full 16-bit pointer at [D],[D+1]
case 0x93: M[D]= M[D]+ 1; vPC ++ ; break; // INC $DD Increment zero page byte ([D]++)
case 0x59: vAC = D; vPC ++ ; break; // LDI $DD Load immediate small positive constant (vAC=D)
case 0xE3: vAC += D; vPC ++ ; break; // ADDI $DD Add small positive constant (vAC+=D)
case 0xE6: vAC -= D; vPC ++ ; break; // SUBI $DD Subtract small positive constant (vAC-=D)
case 0x82: vAC &= D; vPC ++ ; break; // ANDI $DD Logical-AND with constant (vAC&=D)
case 0x88: vAC |= D; vPC ++ ; break; // ORI $DD Logical-OR with constant (vAC|=D)
case 0x8C: vAC ^= D; vPC ++ ; break; // XORI $DD Logical-XOR with constant (vAC^=D)
case 0x99: vAC+= M[D]+256*M[D+1]; vPC ++ ; break; // ADDW $DD Word addition with zero page (vAC+=[D]+256*[D+1])
case 0xB8: vAC-= M[D]+256*M[D+1]; vPC ++ ; break; // SUBW $DD Word subtraction with zero page (vAC-=[D]+256*[D+1])
case 0xF8: vAC&= M[D]+256*M[D+1]; vPC ++ ; break; // ANDW $DD Word logical-AND with zero page (vAC&=[D]+256*[D+1])
case 0xFA: vAC|= M[D]+256*M[D+1]; vPC ++ ; break; // ORW $DD Word logical-OR with zero page (vAC|=[D]+256*[D+1])
case 0xFC: vAC^= M[D]+256*M[D+1]; vPC ++ ; break; // XORW $DD Word logical-XOR with zero page (vAC^=[D]+256*[D+1])
case 0xE9: vAC+=vAC; break; // LSLW - Shift left (because 'ADDW vAC' will not work!) (vAC+=vAC)
case 0x90: vPC=(vPC&0xff00)+((D+2)&0xff); break; // BRA $DD Branch unconditionally. The operand is target-2 because the real vCPU adds 2 to vPC before the next fetch
case 0x35: break; // BCC $CC $DD Test vAC and branch conditionally. The condition byte CC is dispatched as the next opcode below, with D holding the branch operand
case 0x3F: if ((int16_t)vAC == 0) {vPC=(vPC&0xff00)+((D+2)&0xff);} else {vPC ++ ;} break; // EQ
case 0x4D: if ((int16_t)vAC > 0) {vPC=(vPC&0xff00)+((D+2)&0xff);} else {vPC ++ ;} break; // GT
case 0x50: if ((int16_t)vAC < 0) {vPC=(vPC&0xff00)+((D+2)&0xff);} else {vPC ++ ;} break; // LT
case 0x53: if ((int16_t)vAC >= 0) {vPC=(vPC&0xff00)+((D+2)&0xff);} else {vPC ++ ;} break; // GE
case 0x56: if ((int16_t)vAC <= 0) {vPC=(vPC&0xff00)+((D+2)&0xff);} else {vPC ++ ;} break; // LE
case 0x72: if ((int16_t)vAC != 0) {vPC=(vPC&0xff00)+((D+2)&0xff);} else {vPC ++ ;} break; // NE
case 0x63: vLR=M[vSP]+256*M[vSP+1]; vSP+=2; M[0x1c]=vSP; break; // POP Pop address from stack into vLR (vLR=[vSP]+256*[vSP+1],vSP+=2), writing vSP back to 0x1c
case 0x75: M[--vSP]=vLR>>8; M[--vSP]=vLR&255; M[0x1c]=vSP; break; // PUSH Push vLR on stack, low byte at the lower address so that POP reads it back correctly
case 0x7F: break; // LUP $DD ROM lookup (vAC=ROM[D,AC])
case 0xB4: break; // SYS $DD Native function call using at most 2*T cycles, D=270-max(14,T)
case 0xCD: break; // DEF $DD Define data or code (vAC,vPC=vPC+2,D+256*(vPC>>8))
case 0xDF: break; // ALLOC $DD Create or destroy stack frame (vSP+=D)
case 0xCF: break; // CALL $DD Goto address but remember vPC (vLR,vPC=vPC+2,[D]+256*[D+1]-2)
case 0xFF: vPC=vLR; break; // RET - Leaf return. The real vCPU sets vPC=vLR-2 and then adds 2 before the next fetch; here vPC already points at the next opcode
}
}
void setup() {
Serial.begin(115200);
vPC=0;
vAC=0;
M[0x1c]=0x80; // vSP lives at 0x1c; fetch() reloads vSP from there on every instruction
}
void loop() {
do {
fetch();
execute();
Serial.print(" vPC="); Serial.print(vPC, HEX); Serial.print(" IR="); Serial.print(IR, HEX); Serial.print(" vAC="); Serial.println(vAC, HEX);
} while (IR != 0x00); // stop only after HALT (opcode 0x00) has executed, so that TOS gets printed
for (int i =0; i<=64; i++) {
Serial.print("ADDR=");Serial.print(i, HEX); Serial.print(" DATA=");Serial.println(M[i], HEX);
}
while(1) {} ;
}
Re: Indirect Addressing?
I was pondering, perhaps it's more efficient to use the vCPU stack as the data stack instead of call stack. We have ALLOC to grow/shrink it by any offset, and we have LDLW and STLW for R/W access at offsets in the stack. The call stack doesn't need these operations. It can be simulated by explicit vCPU sequences for push and pop of vLR.
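As a rough illustration of this suggestion, the sketch below models ALLOC, LDLW and STLW on the host in C++ and uses them for a Forth-style literal push and "+". It assumes the data stack grows downward from a starting vSP of 0x80 and that a scratch word T is free at 0x30; both values, and the helper names push_literal and plus, are made up for the example.

```cpp
#include <cassert>
#include <cstdint>

constexpr uint8_t STACK_BASE = 0x80;  // hypothetical initial vSP (empty stack)
constexpr uint8_t T = 0x30;           // hypothetical zero-page scratch word

struct VCpu {
    uint8_t  ram[256] = {};           // zero page, where the vCPU stack lives
    uint16_t vAC = 0;
    uint8_t  vSP = STACK_BASE;        // single-byte stack pointer
};

uint16_t word_at(const VCpu& c, uint8_t a) {
    return uint16_t(c.ram[a] | (c.ram[uint8_t(a + 1)] << 8));
}
void set_word(VCpu& c, uint8_t a, uint16_t v) {
    c.ram[a] = v & 0xFF;
    c.ram[uint8_t(a + 1)] = v >> 8;
}

// Models of the instructions mentioned in the post.
void LDWI (VCpu& c, uint16_t v) { c.vAC = v; }
void STW  (VCpu& c, uint8_t d)  { set_word(c, d, c.vAC); }
void ADDW (VCpu& c, uint8_t d)  { c.vAC = uint16_t(c.vAC + word_at(c, d)); }
void ALLOC(VCpu& c, uint8_t d)  { c.vSP = uint8_t(c.vSP + d); }            // vSP += D, modulo 256
void LDLW (VCpu& c, uint8_t d)  { c.vAC = word_at(c, uint8_t(c.vSP + d)); }
void STLW (VCpu& c, uint8_t d)  { set_word(c, uint8_t(c.vSP + d), c.vAC); }

// Push a literal: make room for one cell, then store at offset 0.
void push_literal(VCpu& c, uint16_t v) {
    LDWI(c, v);
    ALLOC(c, 0xFE);   // -2 as a byte
    STLW(c, 0);
}

// Forth "+": six modelled instructions instead of nine.
void plus(VCpu& c) {
    assert(c.vSP <= STACK_BASE - 4);  // at least two cells on the stack
    LDLW(c, 0);       // vAC = TOS
    STW(c, T);        // park it
    LDLW(c, 2);       // vAC = NOS
    ADDW(c, T);       // vAC = NOS + TOS
    ALLOC(c, 2);      // drop one cell
    STLW(c, 0);       // sum becomes the new TOS
}
```

Pushing 0x0110 and 0x0220 and then calling plus leaves 0x0330 on top with vSP at 0x7E. The return stack would still need explicit push and pop sequences for vLR, which this sketch does not model.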
Re: Indirect Addressing?
Marcel,
This sounds like an interesting alternative.
I'll have to look more carefully at ALLOC, LDLW and STLW and get them coded into the simulator.
BTW - I have ordered a 13 MHz crystal, 10 ns 128Kx8 RAMs and SOP32 to DIP32 adaptor boards. I hope to get the expansion board built up and include a fast RAM upgrade.
Re: Indirect Addressing?
A few thoughts:
> The other idea I wish to implement is a set of pseudo registers in zero-page RAM. If vAC is register R0, then I see no reason why further 16-bit registers cannot be implemented in RAM - and given names (eg R1-R15) to make working in assembly language easier.
This is the approach taken by the C compiler. ZP locations 0x30-0x4e are reserved for 15 virtual registers named r1 through r15. Fifteen virtual registers may turn out to be too many depending on where things land w.r.t. a calling convention. As it stands, I am treating all registers as callee-save in order to save on space (callee-saves only require handling in function prologs/epilogs; caller-saves typically require handling at each call site). This means that each function has to either save registers ad-hoc (which is fast, but takes extra space) or call through a generic helper (which can be slow, but saves space if more than one register is used by the function).
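For anyone who wants to mirror that register file in a simulator, a small C++ sketch of the address mapping follows. It assumes the fifteen registers are packed as consecutive little-endian words starting at 0x30, so r15 occupies 0x4C-0x4D; the exact packing and the names reg_addr, read_reg and write_reg are assumptions of this example rather than something taken from the compiler source.

```cpp
#include <cassert>
#include <cstdint>

constexpr uint8_t  REG_BASE  = 0x30;  // zero-page address of r1 (from the post)
constexpr unsigned REG_COUNT = 15;    // r1 .. r15

// Zero-page address of the low byte of register rN (assumed packing: 2 bytes each).
uint8_t reg_addr(unsigned n) {
    assert(n >= 1 && n <= REG_COUNT);
    return uint8_t(REG_BASE + 2 * (n - 1));
}

// Read register rN from a zero-page image (little-endian word).
uint16_t read_reg(const uint8_t* zp, unsigned n) {
    uint8_t a = reg_addr(n);
    return uint16_t(zp[a] | (zp[a + 1] << 8));
}

// Write register rN into a zero-page image.
void write_reg(uint8_t* zp, unsigned n, uint16_t value) {
    uint8_t a = reg_addr(n);
    zp[a]     = value & 0xFF;
    zp[a + 1] = value >> 8;
}
```

On the vCPU itself each of these registers is reached with plain LDW and STW on the computed address, so the mapping costs nothing at run time.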
> I was pondering, perhaps it's more efficient to use the vCPU stack as the data stack instead of call stack. We have ALLOC to grow/shrink it by any offset, and we have LDLW and STLW for R/W access at offsets in the stack. The call stack doesn't need these operations. It can be simulated by explicit vCPU sequences for push and pop of vLR.
IMO we would be better off adding a parameter to ALLOC, LDLW, and STLW that refers to a 2-byte ZP location that is used as the base address for the stack. This would allow e.g. the C compiler to save some time and space when manipulating the stack by using these instructions rather than helper calls. With this approach, ALLOC [D] would add to the stack pointer stored at [D], LDLW [D] would add to the stack pointer stored at [D] and then load the word stored at the calculated address into vAC, and STLW [D] would add to the stack pointer stored at [D] and then store the word in vAC at the calculated address.
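To show one possible reading of that proposal, here is a host-side C++ sketch. None of these instruction variants exist in the vCPU; the names ALLOC_at, LDLW_at and STLW_at, the operand order, and the idea of passing the offset as a second operand are all assumptions made purely for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

struct Machine {
    std::vector<uint8_t> ram = std::vector<uint8_t>(65536, 0);  // full 64K address space
    uint16_t vAC = 0;
};

uint16_t word_at(const Machine& m, uint16_t a) {
    return uint16_t(m.ram[a] | (m.ram[uint16_t(a + 1)] << 8));
}
void set_word(Machine& m, uint16_t a, uint16_t v) {
    m.ram[a] = v & 0xFF;
    m.ram[uint16_t(a + 1)] = v >> 8;
}

// Hypothetical ALLOC variant: adjust the 16-bit stack pointer stored at zero-page address d.
void ALLOC_at(Machine& m, uint8_t d, int8_t delta) {
    assert(d < 0xFF);  // the pointer word must fit inside zero page
    set_word(m, d, uint16_t(word_at(m, d) + delta));
}

// Hypothetical LDLW variant: vAC = word at (pointer stored at d) + offset.
void LDLW_at(Machine& m, uint8_t d, uint8_t offset) {
    m.vAC = word_at(m, uint16_t(word_at(m, d) + offset));
}

// Hypothetical STLW variant: word at (pointer stored at d) + offset = vAC.
void STLW_at(Machine& m, uint8_t d, uint8_t offset) {
    set_word(m, uint16_t(word_at(m, d) + offset), m.vAC);
}
```

Because the base is a full word, a stack addressed this way could live outside zero page, and several independent stacks (for example a Forth data stack and return stack) could share the same three instructions.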
Re: Indirect Addressing?
> IMO we would be better off adding a parameter to ALLOC, LDLW, and STLW that refers to a 2-byte ZP location that is used as the base address for the stack. This would allow e.g. the C compiler to save some time and space when manipulating the stack by using these instructions rather than helper calls. With this approach, ALLOC [D] would add to the stack pointer stored at [D], LDLW [D] would add to the stack pointer stored at [D] and then load the word stored at calculated address into vAC, and STLW [D] would add to the stack pointer stored at [D] and then store the word in vAC at the calculated address.
Is this not what is currently being done? ALLOC, LDLW and STLW already have a parameter that offsets into the zero page from the value held in vSP (address 0x1C).
Are you suggesting a more general case where we are not tied to a fixed vSP, but can index off any zero-page location?
Question for Marcel - when using ALLOC, is $DD a signed integer so that $DD=0x02 increments the stack pointer by 2 and $DD=0xFE decrements it by 2?
Re: Indirect Addressing?
Correct, because vSP is a single-byte register.
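The reason a single-byte vSP makes this work can be checked with a few lines of C++: adding the operand as an unsigned byte and letting the sum wrap modulo 256 gives the same result as treating the operand as a signed two's-complement offset. The function names below are made up for the illustration.

```cpp
#include <cassert>
#include <cstdint>

// ALLOC as plain byte arithmetic: the sum wraps modulo 256.
uint8_t alloc_wrapped(uint8_t vSP, uint8_t d) {
    return uint8_t(vSP + d);
}

// The same operand read as a signed two's-complement offset (-128..127).
uint8_t alloc_signed(uint8_t vSP, uint8_t d) {
    return uint8_t(vSP + int8_t(d));
}

// Exhaustively compare both views over every vSP and operand value.
bool models_agree() {
    for (unsigned sp = 0; sp < 256; ++sp)
        for (unsigned d = 0; d < 256; ++d)
            if (alloc_wrapped(uint8_t(sp), uint8_t(d)) != alloc_signed(uint8_t(sp), uint8_t(d)))
                return false;
    return true;
}

// Spot checks for the two cases asked about in the thread.
void check_examples() {
    assert(alloc_wrapped(0x80, 0x02) == 0x82);  // $DD=0x02 moves vSP up by 2
    assert(alloc_wrapped(0x80, 0xFE) == 0x7E);  // $DD=0xFE moves vSP down by 2
}
```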
Programs typically park values on the stack with these instructions. They were originally squeezed in to make the recursive Search() function in Queens.gcl possible. This function doesn't manipulate the stack variables directly, but uses the stack to save and restore zero page variables. These zero page variables are then in turn used as if they are locals.
While ALLOC has some wiggle room for patching, both LDLW and STLW are at 26 cycles already. The limit for vCPU instructions is 28 cycles before they must become SYS extensions. So I fear this vCPU instruction set is pretty much what it is. But there should be possibilities to add new vCPU architectures (and perhaps even do cooperative multithreading between them).
I briefly checked if it's possible to make the address of vSP itself a variable (not really sure if that helps). But I don't immediately see how to do that in 2 cycles.
Re: Indirect Addressing?
Staring at LCC's generated code, and looking at vCPU again, I think we could squeeze in one or two (or three) new vCPU instructions in a new ROM, if we really want, without breaking compatibility with existing vCPU programs.
At first glance, the existing ANDI and INC can each be patched to provide the landing space for new opcodes. This comes at the expense of slowing down the originals by 6 cycles (0.96 µs), because they must be rerouted to another ROM page and back. (ALLOC looks a bit too short for patching BTW, but maybe...).
As one new candidate instruction “THUNK $DD” comes to mind: basically a “BRA” into the next code page, replacing Pat's thunk functions at the end of each segment. That makes page hopping much faster and frees up some zero page real estate. It can also replace “LDWI $DDDD / CALL vAC” in many cases, saving 3 bytes (and not clobbering vAC).