New vCPU instructions 2.0

lb3361
Posts: 109
Joined: 17 Feb 2021, 23:07

Re: New vCPU instructions 2.0

Post by lb3361 »

I just ran some experiments with a batch of indirect-indexed instructions.

The encoding is as follows:

PREFIX VAR OPCODE OFFSET

where PREFIX = $B1 (which is at67's PREFX1) and OPCODE is one of LD/LDW/ST/STW/ADDW/SUBW/ANDW/ORW/XORW. Instead of accessing a 16-bit variable at address OFFSET in page zero, these instructions now access [ [VAR] + OFFSET ]. This is useful in the C compiler to access local variables allocated on the stack, e.g., LDW([SP,offset]), and also to access fields in a structure pointed to by a register variable, e.g., ANDW([StructPtr,FieldOffset]).
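To make the addressing mode concrete, here is a minimal Python sketch of [ [VAR] + OFFSET ]. It assumes a flat 64K little-endian memory, and it treats OFFSET as an unsigned byte, which the post does not actually specify. The helper names (deek, doke, effective_address, ldw_indexed, stw_indexed) and the page-zero address 0x80 are made up for illustration; they are not part of any ROM or tool.

```python
MEM = bytearray(0x10000)  # assumed flat 64K RAM model

def deek(addr):
    """Read a little-endian 16-bit word from memory."""
    return MEM[addr & 0xFFFF] | (MEM[(addr + 1) & 0xFFFF] << 8)

def doke(addr, value):
    """Write a little-endian 16-bit word to memory."""
    MEM[addr & 0xFFFF] = value & 0xFF
    MEM[(addr + 1) & 0xFFFF] = (value >> 8) & 0xFF

def effective_address(var, offset):
    """[VAR] + OFFSET: var is a page-zero address holding a 16-bit pointer."""
    return (deek(var) + offset) & 0xFFFF

def ldw_indexed(var, offset):
    """Model of LDW([var,offset]): returns the new vAC."""
    return deek(effective_address(var, offset))

def stw_indexed(var, offset, vac):
    """Model of STW([var,offset]): stores vAC at the computed address."""
    doke(effective_address(var, offset), vac)

SP = 0x80                      # hypothetical page-zero slot for a stack pointer
doke(SP, 0x7F00)               # the pointer itself lives in page zero
stw_indexed(SP, 4, 0x1234)     # store to the local at offset 4
value = ldw_indexed(SP, 4)     # read it back through the same pointer
```

The same two helpers cover the structure-field case: VAR holds the struct pointer and OFFSET is the field offset.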

This comes at a cost of an additional 42-44 cycles, which can be split in various ways: the PREFIX instruction does the full address calculation if it has enough time; otherwise it delegates the addition to a restart. Once the address is computed (and stored in vLR), a final restart runs the actual instruction. This overhead is quite good because it is the same as computing the address with LDI(offset);ADDW(var). The code size benefit is quite small with LD/LDW, because one could do LDI(offset);ADDW(var);PEEK()/DEEK() instead, but much more significant with STW or ADDW, because one replaces things like LDI(offset);ADDW(var);STW(tmpvar); <compute-something-in-vAC> ; DOKE(tmpvar) with a simple <compute-something-in-vAC> ; STW([var,offset]).
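The code size argument can be checked with a little arithmetic. The sketch below tallies bytes using the v5a encodings of the instructions named above (LDI, ADDW, STW and DOKE take one operand byte; PEEK and DEEK take none) and the four-byte PREFIX VAR OPCODE OFFSET form. The table and function names are illustrative only, and the <compute-something-in-vAC> part is left out because it is the same on both sides.

```python
# Bytes per instruction: opcode plus operand bytes (ROM v5a encodings).
SIZE = {"LDI": 2, "ADDW": 2, "STW": 2, "DOKE": 2, "PEEK": 1, "DEEK": 1}

# PREFIX VAR OPCODE OFFSET is four bytes regardless of the opcode.
PREFIXED_SIZE = 4

def sequence_size(ops):
    """Total bytes for a sequence of plain v5a instructions."""
    return sum(SIZE[op] for op in ops)

# Load: LDI(offset);ADDW(var);DEEK()  versus  LDW([var,offset])
load_saving = sequence_size(["LDI", "ADDW", "DEEK"]) - PREFIXED_SIZE

# Store: LDI(offset);ADDW(var);STW(tmpvar); ... ;DOKE(tmpvar)  versus  STW([var,offset])
store_saving = sequence_size(["LDI", "ADDW", "STW", "DOKE"]) - PREFIXED_SIZE

print("bytes saved per load:", load_saving)
print("bytes saved per store:", store_saving)
```

So a load saves one byte while a store saves four, and the store also no longer needs a temporary page-zero variable.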

The total gain with the C compiler is about 3-5% extra reduction with respect to at67's new instruction set. This is smaller than I expected because the C compiler often finds a way to use DEEKA/DOKEA/DEEKV/DOKE relatively efficiently and aggressively promotes local variables to registers. When it fails to promote, it resorts to using stack variables in a manner that costs a lot of opcodes. So indirect-indexed addressing helps a lot there. But when the compiler works well, or when the programmer uses the keyword 'register' smartly, the gain is more limited.

Another question is the potential gain with respect to the v5a instruction set. Without the competition of DOKEA/DEEKA/DEEKV, the benefits of indirect-indexed addressing are a lot more obvious.

Overall I believe this is a good idea. The implementation might have to be refined. In particular, I am not sure at67 would like the idea of completely taking over the PREFX1 instruction page for just 8 instructions. I need to sleep on this...

After mulling over these results, I concluded that this would be a nice improvement over ROM v5a, but a much less compelling one over at67's ROM, once released.
at67
Posts: 435
Joined: 14 May 2018, 08:29

Re: New vCPU instructions 2.0

Post by at67 »

lb3361 wrote: 07 Jul 2021, 02:51
I just ran some experiments with a batch of indirect-indexed instructions.

The encoding is as follows:

PREFIX VAR OPCODE OFFSET

I'm going to use this format (as you suggested) for PREFX3 to save a few cycles.

lb3361 wrote: 07 Jul 2021, 02:51
Overall I believe this is a good idea. The implementation might have to be refined. In particular, I am not sure at67 would like the idea of completely taking over the PREFX1 instruction page for just 8 instructions. I need to sleep on this...

After mulling over these results, I concluded that this would be a nice improvement over ROM v5a, but a much less compelling one over at67's ROM, once released.

We could just move one of the infrequently used page3 instructions (like I did with SEXT) and create a new PREFX instruction page that supports this format and performs the offset calculation as part of PREFX, if possible. If that is not possible, or if the page wastage is too great, then using the modified PREFX3 (as above) may be an alternative.

P.S. I predict a lot of potential new instructions that could use a signed 8 bit offset, so I think there would eventually be a lot more than 8, making the page wastage a moot point hopefully.
at67

Re: New vCPU instructions 2.0

Post by at67 »

Update:

I've updated and/or added the following instructions to ROMvX0. I've set myself a deadline of releasing the ROM by next weekend.

PAGE3
  • LSRB <var>, logical shift right on a zero page byte var, 28 cycles.
  • LSRV <var>, logical shift right on a zero page word var, 52 cycles.
  • LSLV <var>, logical shift left on a zero page word var, 28 cycles.
  • ADDVI <var>, <imm>, add 8bit immediate to 16bit zero page var, var += imm, vAC = var, 50 cycles.
  • SUBVI <var>, <imm>, subtract 8bit immediate from 16bit zero page var, var -= imm, vAC = var, 50 cycles.
  • ADDVW <var dst>, <var src>, add 16bit zero page vars, dst += src, vAC = dst, 54 cycles.
  • SUBVW <var dst>, <var src>, subtract 16bit zero page vars, dst -= src, vAC = dst, 54 cycles.
  • DJNE <var>, <16bit imm>, decrement word var and jump if not equal to zero, 46 cycles.
  • DJGE <var>, <16bit imm>, decrement word var and jump if greater than or equal to zero, 42 cycles.
PREFX1
  • NOTE, vAC = ROM:[NotesTable + vAC.lo*2], 22 + 28 cycles.
  • MIDI, vAC = ROM:[NotesTable + (vAC.lo - 11)*2], 22 + 30 cycles.
PREFX2
  • LSLN <imm n>, vAC <<= n, (16bit shift), 22 + 30*n + 20 cycles.
  • FREQM <var chan>, [(((chan & 3) + 1) <<8) | 0x00FC] = vAC, chan = [0..3], 22 + 26 cycles.
  • FREQA <var chan>, [((((chan - 1) & 3) + 1) <<8) | 0x00FC] = vAC, chan = [1..4], 22 + 26 cycles.
  • FREQZ <imm chan>, [(((chan & 3) + 1) <<8) | 0x00FC] = 0, chan = [0..3], 22 + 22 cycles.
  • VOLM <var chan>, [(((chan & 3) + 1) <<8) | 0x00FA] = vAC.low, chan = [0..3], 22 + 24 cycles.
  • VOLA <var chan>, [((((chan - 1) & 3) + 1) <<8) | 0x00FA] = 63 - vAC.low + 64, chan = [1..4], 22 + 26 cycles.
  • MODA <var chan>, [((((chan - 1) & 3) + 1) <<8) | 0x00FB] = vAC.low, chan = [1..4], 22 + 24 cycles.
  • MODZ <imm chan>, [(((chan & 3) + 1) <<8) | 0x00FA] = 0x0200, chan = [0..3], 22 + 24 cycles.
  • SMPCPY <var addr>, copies 64 packed 4bit samples from [vAC] to the interlaced address in addr, vAC += 32, 22 + 31*58 + 52 cycles, (if vAC overflows a 256 byte boundary then 22 + 30*58 + 60 + 52 cycles).
  • CMPWS <var>, vAC = vAC CMPWS var, combines CMPHS and SUBW into one instruction, 22 + 46 cycles.
  • CMPWU <var>, vAC = vAC CMPWU var, combines CMPHU and SUBW into one instruction, 22 + 46 cycles.
  • LEEKA <var>, var[0..3] = PEEK([vAC+0...vAC+3]), peeks a long from [vAC] to [var], 22 + 44 cycles.
  • LOKEA <var>, POKE vAC[0..3], var[0..3], pokes a long from [var] to [vAC], 22 + 44 cycles.
  • FEEKA <var>, var[0..4] = PEEK([vAC+0...vAC+4]), peeks a float, (5 bytes), from [vAC] to [var], 22 + 48 cycles.
  • FOKEA <var>, POKE vAC[0..4], var[0..4], pokes a float, (5 bytes), from [var] to [vAC], 22 + 48 cycles.
  • MEEKA <var>, var[0..7] = PEEK([vAC+0...vAC+7]), peeks 8 bytes from [vAC] to [var], 22 + 64 cycles.
  • MOKEA <var>, POKE vAC[0..7], var[0..7], pokes 8 bytes from [var] to [vAC], 22 + 64 cycles.
PREFX3
  • STB2 <16bit imm>, store vAC.lo into 16bit immediate address, 22 + 20 cycles.
  • STW2 <16bit imm>, store vAC into 16bit immediate address, 22 + 22 cycles.
  • XCHGB <var0>, <var1>, exchange two zero page byte variables, 22 + 28 cycles.
  • ADDWI <16bit imm>, vAC += immediate 16bit value, 22 + 28 cycles.
  • SUBWI <16bit imm>, vAC -= immediate 16bit value, 22 + 28 cycles.
  • ANDWI <16bit imm>, vAC &= immediate 16bit value, 22 + 22 cycles.
  • ORWI <16bit imm>, vAC |= immediate 16bit value, 22 + 22 cycles.
  • XORWI <16bit imm>, vAC ^= immediate 16bit value, 22 + 22 cycles.
  • LDPX <var addr>, <colour var>, load pixel, <addr>, <colour>, 22 + 30 cycles, (respects VTable).
  • STPX <var addr>, <colour var>, store pixel, <addr>, <colour>, 22 + 30 cycles, (respects VTable).
  • CONDI <imm0>, <imm1>, chooses immediate operand based on condition, (vAC == 0), 22 + 26 cycles.
  • CONDB <var0 byte>, <var1 byte>, chooses byte variable based on condition, (vAC == 0), 22 + 26 cycles.
  • CONDIB <imm0>, <var byte>, chooses between immediate operand and byte variable based on condition, (vAC == 0), 22 + 26 cycles.
  • CONDBI <var byte>, <imm0>, chooses between byte variable and immediate operand based on condition, (vAC == 0), 22 + 26 cycles.
  • XCHGW <var0>, <var1>, exchanges two zero page word variables, 22 + 46 cycles, (destroys vAC).
  • SWAPB <var0 addr>, <var1 addr>, swaps two bytes in memory, 22 + 46 cycles.
  • SWAPW <var0 addr>, <var1 addr>, swaps two words in memory, 22 + 58 cycles.
  • NEEKA <var addr>, <imm n>, var[0..n] = PEEK([vAC+0...vAC+n]), peeks n bytes from [vAC] to [var], 22 + 34*n + 24 cycles.
  • NOKEA <var addr>, <imm n>, POKE vAC[0..n], var[0..n], pokes n bytes from [var] to [vAC], 22 + 34*n + 24 cycles.
  • OSCPX <var wave addr>, <var index>, read sample from wave-table address and format it into a screen pixel at address in [vAC], 22 + 42 cycles.
In total I've added almost 100 new instructions and around 30 new SYS calls for supporting sprites, scrolling, arithmetic, memory transfers, sorting, etc.
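
For anyone puzzling over the channel address expressions in the FREQ/VOL/MOD entries above, here is a small Python sketch that simply evaluates them. The function names are invented for illustration; the expressions themselves are copied from the list, and nothing here models timing or the actual ROM code.

```python
def channel_page(chan):
    """((chan & 3) + 1) << 8: the page of a sound channel, chan in 0..3."""
    return ((chan & 3) + 1) << 8

def freqm_addr(chan):
    """Address written by FREQM, channel numbered 0..3."""
    return channel_page(chan) | 0x00FC

def freqa_addr(chan):
    """Address written by FREQA, channel numbered 1..4."""
    return channel_page(chan - 1) | 0x00FC

def volm_addr(chan):
    """Address written by VOLM, channel numbered 0..3."""
    return channel_page(chan) | 0x00FA

def vola_value(vac_low):
    """Value stored by VOLA, 63 - vAC.low + 64, for vAC.low in 0..63."""
    return 63 - vac_low + 64

for chan in range(4):
    print(chan, hex(freqm_addr(chan)), hex(volm_addr(chan)))
```

The M variants take a 0-based channel and the A variants a 1-based one, so FREQM with chan = 0 and FREQA with chan = 1 hit the same address.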
Last edited by at67 on 17 Oct 2021, 11:27, edited 4 times in total.
Hans61
Posts: 48
Joined: 29 Dec 2020, 16:15
Location: Saxonia

Re: New vCPU instructions 2.0

Post by Hans61 »

great work, thanks!
bmwtcu
Posts: 65
Joined: 01 Nov 2018, 12:02

Re: New vCPU instructions 2.0

Post by bmwtcu »

Nice! Looking forward to it!
at67

Re: New vCPU instructions 2.0

Post by at67 »

Update:

I've updated/added the following instructions to ROMvX0.

PAGE3
  • CMPHS: Reinstated.
  • CMPHU: Reinstated.
  • LOKEI: Loke immediate long into address contained in [vAC], 42 cycles, (5 byte instruction).
PREFX2
  • LSLVL: Logical shift left var long, 22 + 56 cycles
  • LSRVL: Logical shift right var long, 22 + 104 cycles
PREFX3
  • ADDVL: Add two 32bit zero page vars, dst += src, 22 + 78 cycles
  • SUBVL: Subtract two 32bit zero page vars, dst -= src, 22 + 74 cycles
  • ANDVL: And two 32bit zero page vars, dst &= src, 22 + 46 cycles
  • ORVL: Or two 32bit zero page vars, dst |= src, 22 + 46 cycles
  • XORVL: Xor two 32bit zero page vars, dst ^= src, 22 + 46 cycles
  • JCCL: Jump to address based on long CC, (address of long in vAC), 22 + (40 to 44) cycles
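
As a rough illustration of what the 32bit zero page operations compute, here is a Python sketch of ADDVL/SUBVL/ANDVL/ORVL/XORVL acting on a 256-byte page zero. It assumes longs are stored little-endian in four consecutive bytes and that results wrap modulo 2^32; neither is stated in the list above. Side effects on vAC and cycle counts are not modelled, and the helper names and the addresses 0x30/0x34 are hypothetical.

```python
ZP = bytearray(256)  # assumed model of page zero

def load32(addr):
    """Read an assumed little-endian 32-bit value from page zero."""
    return int.from_bytes(ZP[addr:addr + 4], "little")

def store32(addr, value):
    """Write a 32-bit value to page zero, wrapping modulo 2**32."""
    ZP[addr:addr + 4] = (value & 0xFFFFFFFF).to_bytes(4, "little")

def addvl(dst, src):
    store32(dst, load32(dst) + load32(src))   # dst += src

def subvl(dst, src):
    store32(dst, load32(dst) - load32(src))   # dst -= src

def andvl(dst, src):
    store32(dst, load32(dst) & load32(src))   # dst &= src

def orvl(dst, src):
    store32(dst, load32(dst) | load32(src))   # dst |= src

def xorvl(dst, src):
    store32(dst, load32(dst) ^ load32(src))   # dst ^= src

A, B = 0x30, 0x34          # hypothetical page-zero addresses of two longs
store32(A, 0xFFFFFFFF)
store32(B, 2)
addvl(A, B)                # wraps around past 2**32
after_add = load32(A)
subvl(A, B)                # and back again
after_sub = load32(A)
```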