lb3361 wrote: ↑10 Mar 2021, 12:09
I was mostly thinking about what's needed to resurrect the C compiler to be honest. The main issue for a C compiler is to make sure we can move things around easily, and in particular easily enough to deal with real stack frames. Otherwise the needs are similar to those of Basic I believe.
Personally I have no technical interest in the C compiler; I think it's a very poor fit for the Gigatron's default memory map and limited resources, but that doesn't mean I don't want it to succeed. The more options we have for software development on the Gigatron, the better, so if someone is willing to take up the mantle and see it through to completion, and I can make their life a little easier, then I am up for that.
lb3361 wrote: ↑10 Mar 2021, 12:09
Summary
If you had to pick only one of my proposals in addition to what you already have planned, please do the MOVWA one (the word version of your MOVBA). Next is MOVW (the word version of your MOVB). I am sure you'll find them useful for Basic as well.
MOVWA is possible in 30 ticks or less outside of page3, so I can add that one, (but I think your understanding of how it works is backwards: it replaces PEEK, not POKE).
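To make the PEEK-replacement reading concrete, here is a hedged sketch, assuming MOVWA mirrors my MOVBA, i.e. it stores the word addressed by vAC into a zero page variable:

Code: Select all
; load the word addressed by vAC into r1, today (2 instructions):
DEEK
STW   r1
; with the proposed MOVWA (1 instruction):
MOVWA r1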
MOVW is just not possible in any page other than page3, (MOVB already takes 28 cycles outside of page3), and due to the current vCPU opcode usage, (the branch targets of the vCPU interpreter), the only way to find 28 words of space in page3 for a 28 cycle instruction is to move one of the existing 28 cycle instructions, like ADDW or SUBW, out of page3, (without breaking vCPU opcode compatibility).
Initially I was fixated on having an XCHG r0,r1 instruction, (so that I could replace 6 instructions with 1 when swapping byte variables, and 6 with 2 when swapping word variables). To make this work I had to increase maxticks from 28 to 32 and move ADDW from page3, (28 cycles), to an external page, (32 cycles), so that I could just barely fit a 28 cycle XCHG into page3.
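For anyone wondering where the 6-to-1 saving comes from, a sketch, (mnemonics are standard vCPU; tmp is an assumed zero page scratch location):

Code: Select all
; swapping the byte variables r0 and r1 today (6 instructions):
LD   r0
ST   tmp
LD   r1
ST   r0
LD   tmp
ST   r1
; with the proposed XCHG (1 instruction):
XCHG r0,r1
; and a word swap would take 2 (one per byte):
XCHG r0,r1
XCHG r0+1,r1+1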
It wasn't worth it: all code ran around 15% slower, because the maxticks change leaves fewer vCPU slots available per scanline, a consequence of the firmware's simple, (but fast), vCPU slot allocation mechanism.
In the end I settled on maxticks=30 as a good compromise between speed and room for new instructions. This allowed me to move almost all of the instructions out of page3, at the cost of 2 or 3 instruction prologues and 3 instruction, (instead of 2 instruction), epilogues, (the important instructions, ADDW/SUBW/BCC etc, can't be moved).
The prologues and epilogues add an extra burden of 3 or 4 cycles to your cycle allocation, and if you create an instruction with 2 operands, (like MOVB), you have to add extra cycles to parse the 2nd operand and then another 3 cycles to fix up vPC. All of these extra cycles are spent before you've even begun to implement your instruction, (which must fit in a maximum of 30 cycles).
I actually spent the first week trying to re-write the entire vCPU interpreter so that all opcodes had an extra level of indirection: page3 would be used purely as a jump table and contain no actual instruction code. I also tried to remove the automatic +2 to vPC within the vCPU dispatch and have each instruction do the vPC fixup itself, which would now be possible because page3 space limitations were no longer a factor, (given the above).
Both these ideas require a complete rethink of how the vCPU interpreter works, and I am not sure they are even possible whilst retaining 100% software compatibility: apps, emulators and existing code all rely on the -2 vPC fixup, as vPC is pre-incremented by 2 on each vCPU dispatch.
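For readers unfamiliar with that convention, a sketch of the contract being described, (my reading of the dispatch loop, not stated in the post):

Code: Select all
; every dispatch does vPC := vPC + 2 BEFORE fetching the opcode,
; so a branch must write (target - 2) into vPC for the next
; pre-increment to land exactly on target:
loop:
    ANDI 1
    BEQ  loop      ; operand is encoded as (loop - 2) & 0xff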
lb3361 wrote: ↑10 Mar 2021, 12:09
1 - Moving things around without damaging vAC.
This was one of my main goals in creating new instructions.
lb3361 wrote: ↑10 Mar 2021, 12:09
Modern code generators need registers. The only way to do so in the gigatron is to reserve a part of the page zero and call them "registers". Then we need to move things around. I believe the operations we need most are the following. I use the prefix MOV for all instructions that move things from/into zero page variables, keeping LD and ST for the ones that affect vAC, but that's purely cosmetic.
I like this formality a lot, I will adopt it and rename the affected instructions appropriately.
lb3361 wrote: ↑10 Mar 2021, 12:09
2- Instruction to implement indexed addressing.
This was another area I spent a considerable amount of time on. I tried everything I could think of to come up with some sort of useful indexing, (array), instruction; none of my ideas were possible outside of page3 with maxticks=30, given the cycle limitations noted above.
An optimised ADDW in page3 takes 28 cycles. You could get that down to 26 cycles by having two separate code paths for the carry and borrow code, but once again there is no room if you wish to remain opcode compatible, (it's a shame that the hardware carry/borrow is thrown away and not made accessible to native code; having to calculate it in SW each time you need it really stings).
So 16bit arithmetic outside of page3 for indexed modes is not possible. Indexed modes with 8 bit operands are possible: I initially implemented an instruction to do so and found it mostly useless in my experiments, though I'm sure in a bespoke application it could redeem itself.
lb3361 wrote: ↑10 Mar 2021, 12:09
Pretty much all modern CPUs rely on loads and stores with indexed addressing. For us this means true 16 bits address calculation. I believe it is okay to explicitly compute addresses in vAC. For the equivalent of a load from disp(r1) to r2, we can do "LDWI disp; ADDW r1; DEEK; ST r2". But the equivalent of a store from r2 to disp(r1) is very inconvenient: "LDWI disp; ADDW r1; STW tmp; LDW r2; DOKE". Basically POKE and DOKE work backwards. Instead of instructions to store vAC at addresses found in page zero, one needs instructions to store something found in page zero at the address contained in vAC.
Just a quick addition for anyone following along, I assume you meant:
Code: Select all
LDWI disp
ADDW r1
STW tmp
LDW r2
DOKE tmp
lb3361 wrote: ↑10 Mar 2021, 12:09
Code: Select all
MOVBA r1   Replaces STW tmp; LD r1; POKE tmp    Your MOVBA in fact
MOVWA r1   Replaces STW tmp; LDW r1; DOKE tmp   Painful to do with two MOVBAs. ==> Please consider this one!
This is where I think the misunderstanding is: my 'MOVBA tmp' replaces 'PEEK; ST tmp', it doesn't replace the yukky POKE example you gave above. But we could add a POKEA.
This would change this instruction carnage:
Code: Select all
LDWI disp
ADDW r1
STW tmp
LDW r2
POKE tmp
To the more manageable:
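The shortened sequence appears to have been dropped from the post; given the POKEA just proposed, (which would store a zero page variable at the address in vAC), it would presumably read:

Code: Select all
LDWI  disp
ADDW  r1
POKEA r2

Five instructions down to three, and tmp is no longer needed at all.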
lb3361 wrote: ↑10 Mar 2021, 12:09
3- True 16 bits stack
I can see two ways to go.
- The first approach tries to use LDLW/STLW as much as possible. When allocating a stack frame, the function prologue must now detect that [...]
- The second approach is to totally ignore the VCPU stack and instead implement a new one. This is made easier by the instructions that help [...]
I believe that solution (1) would be a bit faster but much more complicated. If we completely ignore vSP/vSPH, solution (2) does not require any new opcodes. If we want to still use vSP/vSPH to maximize interoperability, we might consider an opcode that computes a local variable address like this:
Code: Select all
LDLA imm Replaces MOVB vSP,vAC, MOVB vSPH,vAC+1, ADDI imm
then rely on PEEK/DEEK/MOVBA/MOVWA to access the variable themselves.
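A quick usage sketch for anyone following along, (the offset 4 is arbitrary, and I'm assuming LDLA leaves the computed address in vAC as described):

Code: Select all
; load the local word at offset 4 in the current stack frame:
LDLA 4        ; vAC = (vSPH:vSP) + 4
DEEK          ; vAC = word at that address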
I like option 2. For the BASIC compiler a 256 byte stack page is more than enough: the only real stack requirement it has is for CALL/CALLI returns, and thus far I have been making do with 16 bytes for that, (and 8 bytes for parameters and local variables within procedures, in a separate part of zero page). So 256 bytes of stack space is like flying space unicorns farting out gold coins into my lap.
You could also keep the vSP/vSPH 256 byte stack for hardware CALLs, (128 levels of nested functions/recursion, probably not enough for diehard C coders), and implement the rest of the stack functionality as a SW stack, as you suggested.
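As a footnote to solution (2) needing no new opcodes, here is a hedged sketch of such a SW stack using only existing vCPU instructions, (sp is an assumed zero page word holding the stack pointer):

Code: Select all
; push r0:
LDW  sp
SUBI 2
STW  sp       ; sp -= 2
LDW  r0
DOKE sp       ; word at [sp] = r0
; pop into r0:
LDW  sp
DEEK
STW  r0       ; r0 = word at [sp]
LDW  sp
ADDI 2
STW  sp       ; sp += 2

It's not fast, (which is where MOVWA/POKEA style instructions would help), but it works today and places the stack anywhere in RAM.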