
Re: New vCPU instructions 2.0

Posted: 07 Jul 2021, 02:51
by lb3361
I just ran some experiments with a batch of indirect-indexed instructions.

The encoding is as follows:

PREFIX VAR OPCODE OFFSET

where PREFIX = $B1 (which is at67's PREFX1) and OPCODE is one of LD/LDW/ST/STW/ADDW/SUBW/ANDW/ORW/XORW. Instead of accessing a 16 bit variable at address OFFSET in page zero, these instructions now use [ [VAR] + OFFSET ], that is, the location whose address is the pointer stored in VAR plus OFFSET. This is useful in the C compiler to access local variables allocated on the stack, e.g., LDW([SP,offset]), and also to access fields in a structure pointed to by a register variable, e.g., ANDW([StructPtr,FieldOffset]).
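To make the addressing mode concrete, here is a toy Python model of [ [VAR] + OFFSET ]. It is only a sketch: the flat 64 KiB RAM, the unsigned treatment of OFFSET, the 16-bit wraparound, the zero-page location chosen for SP, and all function names are assumptions for illustration, not the ROM implementation.

```python
# Toy model of indirect-indexed addressing: effective address = [VAR] + OFFSET.
# Assumptions (not from the post): flat 64 KiB RAM, little-endian words,
# OFFSET treated as an unsigned byte, 16-bit wraparound on the addition.

ram = bytearray(0x10000)

def peek16(addr):
    """Read a little-endian 16-bit word, wrapping at the top of memory."""
    return ram[addr & 0xFFFF] | (ram[(addr + 1) & 0xFFFF] << 8)

def poke16(addr, value):
    """Write a little-endian 16-bit word."""
    ram[addr & 0xFFFF] = value & 0xFF
    ram[(addr + 1) & 0xFFFF] = (value >> 8) & 0xFF

def effective_address(var, offset):
    """[VAR] + OFFSET, where VAR is a zero-page word holding a pointer."""
    return (peek16(var) + (offset & 0xFF)) & 0xFFFF

def ldw_indexed(var, offset):
    """Hypothetical LDW([var, offset]): returns the new vAC."""
    return peek16(effective_address(var, offset))

def stw_indexed(var, offset, vac):
    """Hypothetical STW([var, offset]): stores vAC at the computed address."""
    poke16(effective_address(var, offset), vac)

SP = 0x30                    # hypothetical zero-page slot used as a stack pointer
poke16(SP, 0x7F00)           # the stack frame starts at 0x7F00
stw_indexed(SP, 4, 0x1234)   # store a local variable at frame offset 4
```

Reading the local back with ldw_indexed(SP, 4) returns the stored word, and the same two helpers cover the struct-field case by passing the pointer variable instead of SP.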

This comes at a cost of an additional 42-44 cycles, which can be split in various ways: the PREFIX instruction does the full address calculation if it has enough time; otherwise it delegates the addition to a restart. Once the address is computed (stored in vLR), a final restart runs the actual instruction. This overhead is quite good because it is the same as computing the address with LDI(offset);ADDW(var). The code size benefit is quite small with LD/LDW because one could do LDI(offset);ADDW(var);PEEK()/DEEK(), but it is much more significant with STW or ADDW because one replaces things like LDI(offset);ADDW(var);STW(tmpvar); <compute-something-in-vAC> ; DOKE(tmpvar) with a simple <compute-something-in-vAC> ; STW([var,offset]).
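A rough byte tally of the two cases above, as a sketch. It assumes the usual vCPU instruction sizes (LDI, ADDW, STW and DOKE take 2 bytes each; PEEK and DEEK take 1) and takes the 4-byte size of the indexed form from the PREFIX VAR OPCODE OFFSET encoding. The <compute-something-in-vAC> part is left out because it is common to both versions.

```python
# Byte sizes of the classic vCPU instructions involved (opcode plus operands).
# These sizes are an assumption based on the standard vCPU encoding.
SIZE = {"LDI": 2, "ADDW": 2, "STW": 2, "DOKE": 2, "PEEK": 1, "DEEK": 1}

INDEXED = 4  # PREFIX VAR OPCODE OFFSET

def size(seq):
    """Total size in bytes of a sequence of classic instructions."""
    return sum(SIZE[op] for op in seq)

# Load case: LDI(offset);ADDW(var);DEEK() versus LDW([var,offset])
load_classic = size(["LDI", "ADDW", "DEEK"])

# Store case: LDI(offset);ADDW(var);STW(tmpvar); ... ;DOKE(tmpvar)
# versus ... ;STW([var,offset])
store_classic = size(["LDI", "ADDW", "STW", "DOKE"])

load_saving = load_classic - INDEXED    # bytes saved per indexed load
store_saving = store_classic - INDEXED  # bytes saved per indexed store
```

Under these assumptions the load saves a single byte while the store saves four, and the store also no longer needs the temporary variable.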

The total gain with the C compiler is about 3-5% extra reduction with respect to at67's new instruction set. This is smaller than I expected because the C compiler often finds a way to use DEEKA/DOKEA/DEEKV/DOKE relatively efficiently and aggressively promotes local variables to registers. When it fails to promote, it resorts to using stack variables in a manner that costs a lot of opcodes. So indirect-indexed addressing helps a lot there. But when the compiler works well, or when the programmer uses the keyword 'register' smartly, the gain is more limited.

Another question is the potential gain with respect to the v5a instruction set. Without the competition of DOKEA/DEEKA/DEEKV, the benefits of indirect-indexed addressing are a lot more obvious.

Overall I believe this is a good idea. The implementation might have to be refined. In particular I am not sure at67 would like the idea of completely taking over the PREFX1 instruction page for just 8 instructions. I need to sleep on this...

After mulling these results, I concluded that this would be a nice improvement over rom v5a, but a much less compelling one over at67's rom, once released.

Re: New vCPU instructions 2.0

Posted: 19 Jul 2021, 09:51
by at67
lb3361 wrote: 07 Jul 2021, 02:51 I just ran some experiments with a batch of indirect-indexed instructions.

The encoding is as follows:

PREFIX VAR OPCODE OFFSET
I'm going to use this format, (as you suggested), for PREFX3 to save a few cycles.
lb3361 wrote: 07 Jul 2021, 02:51 Overall I believe this is a good idea. The implementation might have to be refined. In particular I am not sure at67 would like the idea of completely taking over the PREFX1 instruction page for just 8 instructions. I need to sleep on this...

After mulling these results, I concluded that this would be a nice improvement over rom v5a, but a much less compelling one over at67's rom, once released.
We could just move one of the page3 instructions that is infrequently used (like I did with SEXT) and create a new PREFX instruction page that supports this format and performs the offset calculation as part of PREFX, if possible. If this is not possible, or if the page wastage is too great, then using the modified PREFX3 (as above) may be an alternative.

P.S. I predict a lot of potential new instructions that could use a signed 8 bit offset, so I think there would eventually be a lot more than 8, making the page wastage a moot point hopefully.
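For reference, a signed 8 bit offset would be decoded along these lines. This is only a sketch; the function names are made up for illustration and nothing here is taken from the ROM source.

```python
def sign_extend8(b):
    """Interpret the low 8 bits of b as a two's complement value (-128..127)."""
    b &= 0xFF
    return b - 0x100 if b & 0x80 else b

def indexed_address(base, offset_byte):
    """Add a signed 8 bit offset to a 16 bit base address, wrapping at 64 KiB."""
    return (base + sign_extend8(offset_byte)) & 0xFFFF
```

With this decoding an offset byte of $FF reaches one byte below the base instead of 255 bytes above it, so the reachable window becomes base-128 to base+127.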

Re: New vCPU instructions 2.0

Posted: 17 Oct 2021, 06:02
by at67
Update:

I've updated and/or added the following instructions to ROMvX0. I've set myself a deadline of releasing the ROM by next weekend.

PAGE3
  • LSRB <var>, logical shift right on a zero page byte var, 28 cycles.
  • LSRV <var>, logical shift right on a zero page word var, 52 cycles.
  • LSLV <var>, logical shift left on a zero page word var, 28 cycles.
  • ADDVI <var>, <imm>, add 8bit immediate to 16bit zero page var, var += imm, vAC = var, 50 cycles.
  • SUBVI <var>, <imm>, subtract 8bit immediate from 16bit zero page var, var -= imm, vAC = var, 50 cycles.
  • ADDVW <var dst>, <var src>, add 16bit zero page vars, dst += src, vAC = dst, 54 cycles.
  • SUBVW <var dst>, <var src>, subtract 16bit zero page vars, dst -= src, vAC = dst, 54 cycles.
  • DJNE <var>, <16bit imm>, decrement word var and jump if not equal to zero, 46 cycles
  • DJGE <var>, <16bit imm>, decrement word var and jump if greater than or equal to zero, 42 cycles
PREFX1
  • NOTE, vAC = ROM:[NotesTable + vAC.lo*2], 22 + 28 cycles.
  • MIDI, vAC = ROM:[NotesTable + (vAC.lo - 11)*2], 22 + 30 cycles.
PREFX2
  • LSLN <imm n>, vAC <<= n, (16bit shift), 22 + 30*n + 20 cycles.
  • FREQM <var chan>, [(((chan & 3) + 1) <<8) | 0x00FC] = vAC, chan = [0..3], 22 + 26 cycles.
  • FREQA <var chan>, [((((chan - 1) & 3) + 1) <<8) | 0x00FC] = vAC, chan = [1..4], 22 + 26 cycles.
  • FREQZ <imm chan>, [(((chan & 3) + 1) <<8) | 0x00FC] = 0, chan = [0..3], 22 + 22 cycles.
  • VOLM <var chan>, [(((chan & 3) + 1) <<8) | 0x00FA] = vAC.low, chan = [0..3], 22 + 24 cycles.
  • VOLA <var chan>, [((((chan - 1) & 3) + 1) <<8) | 0x00FA] = 63 - vAC.low + 64, chan = [1..4], 22 + 26 cycles.
  • MODA <var chan>, [((((chan - 1) & 3) + 1) <<8) | 0x00FB] = vAC.low, chan = [1..4], 22 + 24 cycles.
  • MODZ <imm chan>, [(((imm & 3) + 1) <<8) | 0x00FA] = 0x0200, imm = [0..3], 22 + 24 cycles.
  • SMPCPY <var addr>, copies 64 packed 4bit samples from [vAC] to the interlaced address in addr, vAC += 32, 22 + 31*58 + 52 cycles, (if vAC overflows a 256 byte boundary then 22 + 30*58 + 60 + 52 cycles).
  • CMPWS <var>, vAC = vAC CMPWS var, combines CMPHS and SUBW into one instruction, 22 + 46 cycles.
  • CMPWU <var>, vAC = vAC CMPWU var, combines CMPHU and SUBW into one instruction, 22 + 46 cycles.
  • LEEKA <var>, var[0..3] = PEEK([vAC+0...vAC+3]), peeks a long from [vAC] to [var], 22 + 44 cycles.
  • LOKEA <var>, POKE vAC[0..3], var[0..3], pokes a long from [var] to [vAC], 22 + 44 cycles.
  • FEEKA <var>, var[0..4] = PEEK([vAC+0...vAC+4]), peeks a float, (5 bytes), from [vAC] to [var], 22 + 48 cycles.
  • FOKEA <var>, POKE vAC[0..4], var[0..4], pokes a float, (5 bytes), from [var] to [vAC], 22 + 48 cycles.
  • MEEKA <var>, var[0..7] = PEEK([vAC+0...vAC+7]), peeks 8 bytes from [vAC] to [var], 22 + 64 cycles.
  • MOKEA <var>, POKE vAC[0..7], var[0..7], pokes 8 bytes from [var] to [vAC], 22 + 64 cycles.
PREFX3
  • STB2 <16bit imm>, store vAC.lo into 16bit immediate address, 22 + 20 cycles.
  • STW2 <16bit imm>, store vAC into 16bit immediate address, 22 + 22 cycles.
  • XCHGB <var0>, <var1>, exchange two zero page byte variables, 22 + 28 cycles.
  • ADDWI <16bit imm>, vAC += immediate 16bit value, 22 + 28 cycles.
  • SUBWI <16bit imm>, vAC -= immediate 16bit value, 22 + 28 cycles.
  • ANDWI <16bit imm>, vAC &= immediate 16bit value, 22 + 22 cycles.
  • ORWI <16bit imm>, vAC |= immediate 16bit value, 22 + 22 cycles.
  • XORWI <16bit imm>, vAC ^= immediate 16bit value, 22 + 22 cycles.
  • LDPX <var addr>, <colour var>, load pixel, <addr>, <colour>, 22 + 30 cycles, (respects VTable).
  • STPX <var addr>, <colour var>, store pixel, <addr>, <colour>, 22 + 30 cycles, (respects VTable).
  • CONDI <imm0>, <imm1>, chooses immediate operand based on condition, (vAC == 0), 22 + 26 cycles.
  • CONDB <var0 byte>, <var1 byte>, chooses byte variable based on condition, (vAC == 0), 22 + 26 cycles.
  • CONDIB <imm0>, <var byte>, chooses between immediate operand and byte variable based on condition, (vAC == 0), 22 + 26 cycles.
  • CONDBI <var byte>, <imm0>, chooses between byte variable and immediate operand based on condition, (vAC == 0), 22 + 26 cycles.
  • XCHGW <var0>, <var1>, exchanges two zero page word variables, 22 + 46 cycles, (destroys vAC).
  • SWAPB <var0 addr>, <var1 addr>, swaps two bytes in memory, 22 + 46 cycles.
  • SWAPW <var0 addr>, <var1 addr>, swaps two words in memory, 22 + 58 cycles.
  • NEEKA <var addr>, <imm n>, var[0..n] = PEEK([vAC+0...vAC+n]), peeks n bytes from [vAC] to [var], 22 + 34*n + 24 cycles.
  • NOKEA <var addr>, <imm n>, POKE vAC[0..n], var[0..n], pokes n bytes from [var] to [vAC], 22 + 34*n + 24 cycles.
  • OSCPX <var wave addr>, <var index>, read sample from wave-table address and format it into a screen pixel at address in [vAC], 22 + 42 cycles.
In total I've added almost 100 new instructions and around 30 new SYS calls for supporting sprites, scrolling, arithmetic, memory transfers, sorting, etc.
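The channel address expressions in the PREFX2 audio instructions above can be checked with a few lines of Python. This is a sketch that only evaluates the formulas as listed; the function names are made up, and restricting the VOLA input to 0..63 is an assumption.

```python
def freqm_address(chan):
    """Address written by FREQM/FREQZ, chan numbered 0..3."""
    return (((chan & 3) + 1) << 8) | 0x00FC

def freqa_address(chan):
    """Address written by FREQA, chan numbered 1..4."""
    return ((((chan - 1) & 3) + 1) << 8) | 0x00FC

def vola_value(vac_low):
    """Value stored by VOLA as listed: 63 - vAC.low + 64.
    Assumption: vac_low is a volume in the range 0..63."""
    return 63 - vac_low + 64

# Both numbering schemes land on the same four addresses, one per page 1..4.
m_addrs = [freqm_address(c) for c in range(4)]
a_addrs = [freqa_address(c) for c in range(1, 5)]
```

So the M variants take a zero-based channel and the A variants a one-based channel, and both resolve to $01FC, $02FC, $03FC and $04FC.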

Re: New vCPU instructions 2.0

Posted: 17 Oct 2021, 06:51
by Hans61
great work, thanks!

Re: New vCPU instructions 2.0

Posted: 17 Oct 2021, 15:38
by bmwtcu
Nice! Looking forward to it!

Re: New vCPU instructions 2.0

Posted: 21 Oct 2021, 18:24
by at67
Update:

I've updated/added the following instructions to ROMvX0.

PAGE3
  • CMPHS: Reinstated.
  • CMPHU: Reinstated.
  • LOKEI: Loke immediate long into address contained in [vAC], 42 cycles, (5 byte instruction).
PREFX2
  • LSLVL: Logical shift left var long, 22 + 56 cycles
  • LSRVL: Logical shift right var long, 22 + 104 cycles
PREFX3
  • ADDVL: Add two 32bit zero page vars, dst += src, 22 + 78 cycles
  • SUBVL: Subtract two 32bit zero page vars, dst -= src, 22 + 74 cycles
  • ANDVL: And two 32bit zero page vars, dst &= src, 22 + 46 cycles
  • ORVL: Or two 32bit zero page vars, dst |= src, 22 + 46 cycles
  • XORVL: Xor two 32bit zero page vars, dst ^= src, 22 + 46 cycles
  • JCCL: Jump to address based on long CC, (address of long in vAC), 22 + (40 to 44) cycles
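As an illustration of what the 32bit zero page operations compute, here is a small Python sketch of ADDVL, SUBVL and ANDVL. It models semantics only, not the ROM code: the little-endian byte order, the wraparound modulo 2^32, and the variable addresses are assumptions.

```python
zp = bytearray(256)  # toy zero page

def read32(addr):
    """Read a 32bit little-endian value from the toy zero page."""
    return int.from_bytes(zp[addr:addr + 4], "little")

def write32(addr, value):
    """Write a 32bit little-endian value, wrapping modulo 2**32."""
    zp[addr:addr + 4] = (value & 0xFFFFFFFF).to_bytes(4, "little")

def addvl(dst, src):
    """dst += src on two 32bit zero page vars."""
    write32(dst, read32(dst) + read32(src))

def subvl(dst, src):
    """dst -= src on two 32bit zero page vars."""
    write32(dst, read32(dst) - read32(src))

def andvl(dst, src):
    """dst &= src on two 32bit zero page vars."""
    write32(dst, read32(dst) & read32(src))

A, B, C = 0x30, 0x34, 0x38   # hypothetical variable addresses
write32(A, 0xFFFFFFFF)
write32(B, 2)
addvl(A, B)                  # wraps around past 2**32
subvl(C, B)                  # C starts at 0, so this borrows through all 4 bytes
```

ORVL and XORVL follow the same pattern with | and ^ in place of &.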