New vCPU instructions? (poll)

marcelk · Post by **marcelk** » 10 Jul 2019, 13:00

We can squeeze in at least two new vCPU instructions without breaking compatibility of existing GT1 files. We can do this by diverting INC and ALLOC, after entry, to another page and continuing there. This makes them 6 cycles slower in total, but we can reuse the space gained as the landing spot for new instructions. Not every instruction is suitable as a space donor this way, because its cycle count must not exceed 28.

Candidates for new instructions are (in alphabetical order):

CMPW $DD
Full-range 16-bit comparison between vAC and the word at address $DD. With this, C programs gain proper <, >, <= and >= without overflowing at 15 bits (issue #64).

HOP $DD
Branch into next page to offset $DD(+2). This makes longer functions possible, and C programs faster. It eliminates (some) hassle dealing with 256/96-byte segments in vCPU code.

LSRW or LSR
Shift word right 1 bit without going through a SYS function. It's unclear if we can do this in 28 cycles. Otherwise just LSR (byte)? Mind that we already have SYS functions for shifting words right by 1..8 bits. And multiplication and division routines work well without right-shifting (see for example TinyBASIC).

SW02 $DD
SWEET02: Switch to v6502 mode immediately. This allows inlining 6502 code without setting vLR first and going through a SYS function. After BRK (#0), execution continues in vCPU mode at page offset $DD(+2). In GCL notation it looks like this:

Code: Select all

...vCPUcode... [SW02 ....v6502code... #0] ...vCPUcode ...

Perhaps this can also be repurposed as a somewhat less efficient page hopping mechanism than HOP $DD, but I haven't thought it through very well. Something like:

Code: Select all

 ...vCPUcode... [SW02 #0
 *=$300
 ] ...vCPUcode...

Feedback is welcome, hence the poll.

Cwiiis · Post by **Cwiiis** » 10 Jul 2019, 16:51

I've been a bit out of the Gigatron loop, but trying to keep track via GitHub. I've voted for CMPW and HOP, as these seem by far the most useful of all these instructions, especially considering what they allow for C programs. In the limited amount of Gigatron coding I've done, the biggest barriers have been dealing with small contiguous memory segments and copying large chunks of memory. Anything that could help with either of those would be great in my opinion.

cmpxchg · Post by **cmpxchg** » 13 Jul 2019, 18:24

I would like a bit more guidance to make an informed decision on which (two) items to vote for in the poll.

Knowing the expected instruction mix in existing vCPU programs and added value, speedups for example.

1.
The concept of a vCPU interface was known to me when I put my unit together, but I was unaware which instructions it *exactly* contained. Perhaps this could be in a manual form as well, instead of a tutorial. I did find an 'external' link, to github
https://raw.githubusercontent.com/kervi ... ummary.txt
and the more elaborate (but technically not defining a binary interface, but higher-level language interface)
https://raw.githubusercontent.com/kervi ... nguage.txt

Is this the complete document that defines 'vCPU binary compatibility' ?
- it is very terse, are all opcode/mnemonic arguments 16 bit or sometimes also 8 bit ?
- the width and signed/unsignedness of each register, and other state inside the vCPU
- method/library/call interface/ROM offsets for more complex functionality 'BIOS calls/software interrupts', framebuffer, character set
- layout for conditions, link register, leaf call concepts, other new internal state description

Perhaps better formatting in a table and releasing it as a PDF, like other CPU instruction sets, would make it easier to understand. At first glance it seems hardly more functional than the native hardware CPU instruction set, except for being 16-bit, but still having the same ROM page limitations everywhere.

2.
Then, the vCPU binary compatibility is technically not needed - the hardware is executing the 128 kbyte ROM code - but since it is currently not user-programmable (no ROM emulators in wide usage), I understand your wish of defining a new variant to having 'binary compatibility' for vCPU programs loaded and interpreted in the 32 kbyte RAM, using the vCPU in ROM. Perhaps also to maintain pluggyMcPlugface compatibility.

3.
Are there any suggestions for suitable off-the-shelf EPROM emulators for users that want to run their own ROMs and experiment at that level ?

One could also consider NOT standardising the vCPU interface, but instead providing a higher-level abstraction, like the BASIC syntax and ways to make 'hooks' (dummy calls to intercept program flow) for adding new tokenized BASIC instructions, written in a variant of the vCPU language matching the ROM inserted in the unit. Disk ROM BASIC did that back in the old micro days, with a set of hooks as an interface.

marcelk · Post by **marcelk** » 13 Jul 2019, 21:27

cmpxchg wrote: ↑13 Jul 2019, 18:24 I would like a bit more guidance to make an informed decision on which (two) items to vote for in the poll.

There's no hurry. The poll is open until the end of September.

Knowing the expected instruction mix in existing vCPU programs and added value, speedups for example.

The concept proposals above are as concrete as I can make them without actually implementing, testing and characterising them.

Is this the complete document that defines 'vCPU binary compatibility' ?

It looks like you found links to (some of) the documents in the Docs/ directory of the ROM repository. That's where we intend to maintain all ROM documentation.

- are all opcode/mnemonic arguments 16 bit or sometimes also 8 bit ?

Code: Select all

The vCPU interpreter has 34 core instructions. [..]
Most instructions take a single byte operand, but some have two and others none.

The formulas are pseudo-formal. Better take them with a grain of salt.

Mnem. Encoding  #C Description
----- --------- -- -----------
ST    $5E DD    16 Store byte in zero page ([D]=vAC&255)
STW   $2B DD    20 Store word in zero page ([D],[D+1]=vAC&255,vAC>>8)
LD    $1A DD    18 Load byte from zero page (vAC=[D])
LDI   $59 DD    16 Load immediate small positive constant (vAC=D)
LDWI  $11 LL HH 20 Load immediate word constant (vAC=$HHLL)

So 'DD' or 'D' are intended to represent a single byte. 'LL HH' is intended to represent a 16-bit word in little-endian order. '[X]' is intended to represent a RAM location, as in the native instruction set. 'vAC=D' is intended to mean that the (16-bits) vAC gets the (unsigned) 8-bit value D, and therefore clearing its high 8 bits in the process.
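
As a rough illustration of the operand encodings quoted above, here is a minimal Python sketch that decodes and executes just these five instructions. The names `step` and `run` and the flat `ram` bytearray are made up for this example. The real interpreter is native Gigatron code and treats vPC differently (page wrapping, tick counting), none of which is modelled here.

```python
def step(ram, vac, pc):
    """Execute one instruction at ram[pc]; return (new_vac, next_pc).

    Illustrative model only: covers the five opcodes from the quoted table.
    """
    op = ram[pc]
    d = ram[pc + 1]                       # first operand byte (DD or LL)
    if op == 0x5E:                        # ST: [D] = vAC & 255
        ram[d] = vac & 255
        return vac, pc + 2
    if op == 0x2B:                        # STW: [D],[D+1] = vAC&255, vAC>>8
        ram[d] = vac & 255
        ram[d + 1] = vac >> 8             # no wrap modelled for D = $FF
        return vac, pc + 2
    if op == 0x1A:                        # LD: vAC = [D], high byte cleared
        return ram[d], pc + 2
    if op == 0x59:                        # LDI: vAC = D, high byte cleared
        return d, pc + 2
    if op == 0x11:                        # LDWI: vAC = $HHLL (little-endian)
        return d | (ram[pc + 2] << 8), pc + 3
    raise ValueError(f"opcode ${op:02X} not covered by this sketch")


def run(program, origin=0x0200):
    """Load `program` at `origin`, run it to the end, return (vac, ram)."""
    ram = bytearray(65536)
    ram[origin:origin + len(program)] = program
    vac, pc = 0, origin
    while pc < origin + len(program):
        vac, pc = step(ram, vac, pc)
    return vac, ram


# LDWI $1234 ; STW $30 ; LD $31
vac, ram = run(bytes([0x11, 0x34, 0x12, 0x2B, 0x30, 0x1A, 0x31]))
```

After this run the word $1234 sits in $30/$31 with the low byte first, and the final LD leaves only the high byte $12 in vAC, with the upper 8 bits of vAC cleared.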

- the width and signed/unsignedness of each register,

From the same link:

Code: Select all

The virtual registers are 16-bits (except vSP) and reside in the zero page.

vAC  is the virtual accumulator [...]
vPC  is the virtual program counter [...]
vLR  is the link register [...]
vSP  is the stack pointer. The stack lives in the zero page top and grows down.

So vAC, vPC and vLR are 16-bits, and vSP is 8-bits.

Registers don't have a signedness by themselves. They're just untyped bits as in any other processor. Some instructions must interpret them in a certain way however. The only ones where this matters for us are BGT, BLT, BGE, BLE: they treat vAC's contents as signed as they compare vAC to zero (>0, <0, ≥0, ≤0). ADDxxx and SUBxxx instructions don't care about the sign because of the two's-complement interpretation by these four.
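
The point about untyped bits can be sketched in a few lines of Python. This is only a model under the assumptions stated in the text: the helper names `to_signed16`, `add16` and `branch_taken` are invented here and are not part of any Gigatron tool.

```python
def to_signed16(bits):
    """Interpret a 16-bit pattern as a two's-complement signed value."""
    bits &= 0xFFFF
    return bits - 0x10000 if bits & 0x8000 else bits


def add16(a, b):
    """16-bit addition; the same bit-level operation for signed and unsigned."""
    return (a + b) & 0xFFFF


def branch_taken(cond, vac):
    """Model of the four signed branches: each compares vAC against zero."""
    value = to_signed16(vac)
    return {"BGT": value > 0, "BLT": value < 0,
            "BGE": value >= 0, "BLE": value <= 0}[cond]


vac = add16(0x0001, 0xFFFE)       # 1 + 65534 unsigned, or 1 + (-2) signed
```

The addition produces the same bit pattern $FFFF either way. Only the branch decides to read it as -1, so BLT would be taken here and BGE would not.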

and other state inside the vCPU

From the README.md file in the top level of the repository (the one that GitHub displays) we have these:

Code: Select all

[Note: In the next section, names in parentheses refer to *internal*
variables that are subject to change between ROM versions. See for a
more detailed explanation on compatibility the file Docs/GT1-files.txt]

Address   Name          Description
--------  ------------- -----------
0015      (vTicks)      Remaining interpreter ticks (=units of 2 CPU cycles)
0016-0017 vPC           Interpreter program counter, points into RAM
0018-0019 vAC           Interpreter accumulator, 16-bits
001a-001b vLR           Return address, for returning after CALL
001c      vSP           Stack pointer
001d      (vTmp)        Scratch storage location for vCPU
001e      (vReturn)     Return address (L) from vCPU into the loop (H is fixed)

So there are more variables holding vCPU state, but they tend to change between ROM versions.
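
For anyone poking at an emulator memory dump, the table above can be turned into a small decoder. This is a sketch, assuming a 256-byte snapshot of the zero page and low-byte-first word order (consistent with how STW is described). The function name `read_vcpu_state` is made up, and the parenthesised internal variables may move in other ROM versions.

```python
def read_vcpu_state(zero_page):
    """Decode the vCPU variables from a 256-byte zero-page snapshot."""
    def word(addr):
        # 16-bit value stored low byte first (assumed, as with STW)
        return zero_page[addr] | (zero_page[addr + 1] << 8)

    return {
        "vTicks": zero_page[0x15],   # internal, subject to change
        "vPC": word(0x16),
        "vAC": word(0x18),
        "vLR": word(0x1A),
        "vSP": zero_page[0x1C],      # 8-bit stack pointer
    }


# Hypothetical snapshot: vPC=$0200, vAC=$1234, vLR=$0203, vSP=$FE
snapshot = bytearray(256)
snapshot[0x16:0x1D] = bytes([0x00, 0x02, 0x34, 0x12, 0x03, 0x02, 0xFE])
state = read_vcpu_state(snapshot)
```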

- method/library/call interface/ROM offsets for more complex functionality 'BIOS calls/software interrupts', framebuffer, character set

That's what SYS extensions do. Docs/SYS-functions.txt attempts to document how the mechanism works. It's a summary of several explanations given here in the forum and from comments in the source code.

There are numerous example programs in Apps/ that people use to figure out how to do specific things with them. The simpler ones (Blinky, HelloWorld) and the more recent ones (TinyBASIC) are much better commented than the older ones (Racer).

The online BASIC tutorial goes pretty far in explaining the video and sound system.

At first glance it seems hardly more functional than the native hardware CPU instructionset, except 16 bit - but still having the same-ROM page limitations everywhere.

From the first paragraph of the document you linked to:

Code: Select all

vCPU's advantages over native 8-bit Gigatron code are:
 1. you don't need to think about video timing with everything you do
 2. operations are 16-bits, and
 3. programs can run from RAM.

1. Is the reason we have applications at all: even Snake wouldn't exist without vCPU. The cycle counting to properly generate horizontal sync pulses every 200 cycles is tedious for regular application level code. In theory it can also be handled by a compiler, but nobody has walked that path yet.
2. Is not really essential, although it makes it easier to build typical applications quickly. It's also more efficient than handling just 8-bits worth of data per instruction (more "bang per buck").
3. Is the reason you can load programs into the Gigatron with Loader and without changing the EPROM.

perhaps better formatting in a table and releasing it as a PDF like other CPU instructionsets makes it easier to understand.

I believe everything is documented and every question gets a decent answer (and that often results in documentation updates as well). That has led to quite a few programs, emulators, and even a C compiler. Reconciling, in a PDF version, what we have written down already is an excellent suggestion for a next phase. Thank you!

2.
Then, the vCPU binary compatibility is technically not needed - the hardware is executing the 128 kbyte ROM code - but since it is currently not user-programmable (no ROM emulators in wide usage), I understand your wish of defining a new variant to having 'binary compatibility' for vCPU programs loaded and interpreted in the 32 kbyte RAM, using the vCPU in ROM. Perhaps also to maintain pluggyMcPlugface compatibility.

We don't want to render existing programs worthless by making arbitrary incompatibility changes. People put a lot of effort into making those programs. When they share them in Contrib/ as .gt1 files, we even make an effort to test them as well.

3.
Are there any suggestions for suitable off-the-shelf EPROM emulators for users that want to run their own ROMs and experiment at that level ?

Perhaps that's best discussed in a different thread. The FAQ has some ideas.

DavidHK · Post by **DavidHK** » 11 Sep 2019, 17:28

I have put my vote on CMPW and HOP as those appear to me to be the most useful based on the little experience I have had so far with writing GCL programs.

But maybe I missed something about the v6502 mode? Would that mode help when writing programs that are purely targeted at the Gigatron?

marcelk · Post by **marcelk** » 19 Sep 2019, 14:49

marcelk wrote: We can squeeze in at least two new vCPU instructions without breaking compatibility of existing GT1 files. We can do this by diverting INC and ALLOC, after entry, to another page and continue there. This makes them 6 cycles slower in total, but we can reuse the won space as the landing spot for new instructions.

I don't know why I wrote that ALLOC is suitable for creating an extra landing spot, because it isn't long enough for that. However, it appears I missed that LDWI, LDW and STW are each 9 words or more. That means that each can be converted into 3 landing spots, or make way for two new instructions. LD, ANDI (and INC) should be able to create one extra as well.

I'm hesitant to slow down LDW, STW, INC and ANDI, because intuitively they are used a lot (but that's easy to verify with some instrumentation). LDWI and LD look less important to me. So we can at least create 3 new instructions, maybe even up to 9 if benchmarks support it. Maybe I'm overlooking something again (such as the need for the Y register in some of these).
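
The "3, maybe up to 9" tally follows directly from the estimates above and can be checked with a few lines of Python. The dictionary merely restates the numbers from this post; the names `NEW_SLOTS` and `HOT` are illustrative.

```python
# New instructions each donor could make room for, per the estimate above.
NEW_SLOTS = {"LDWI": 2, "LDW": 2, "STW": 2, "LD": 1, "ANDI": 1, "INC": 1}

# Donors suspected of being used too often to slow down.
HOT = {"LDW", "STW", "INC", "ANDI"}

safe_slots = sum(n for op, n in NEW_SLOTS.items() if op not in HOT)
max_slots = sum(NEW_SLOTS.values())
```

Here `safe_slots` counts only LDWI and LD as donors, while `max_slots` assumes benchmarks show that slowing down the other four is acceptable.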

CMPW $DD

I identified two issues with this: first is that signed and unsigned comparison each need their own instruction. Second is that my best efforts on paper take 31 cycles (NEXT-to-NEXT). That's too long: new instructions must all strictly fit in 28 cycles, that's a hard limit. So the outlook for CMPW isn't clear yet.
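
Both points can be shown with a short Python sketch. It assumes the existing idiom is "subtract, then branch on the sign of the result", which is how the zero-comparing branches would be used. The function names are invented for this example.

```python
def to_signed16(bits):
    """Interpret a 16-bit pattern as a two's-complement signed value."""
    bits &= 0xFFFF
    return bits - 0x10000 if bits & 0x8000 else bits


def naive_less_than(a, b):
    """Subtract in 16 bits, then test the sign bit of the difference."""
    diff = (a - b) & 0xFFFF
    return bool(diff & 0x8000)


def signed_less_than(a, b):
    """Full-range signed comparison (what a signed CMPW would deliver)."""
    return to_signed16(a) < to_signed16(b)


def unsigned_less_than(a, b):
    """Full-range unsigned comparison (what an unsigned CMPW would deliver)."""
    return (a & 0xFFFF) < (b & 0xFFFF)


a, b = 0x7FFF, 0xFFFF             # 32767 and -1 (or 65535 when unsigned)
results = (naive_less_than(a, b),
           signed_less_than(a, b),
           unsigned_less_than(a, b))
```

For this pair the difference overflows: the naive test claims 32767 < -1. The signed and unsigned answers also differ from each other for the same bit patterns, which is why one instruction cannot serve both.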

HOP $DD

This will work. When I wrote it down on paper, I found that jumping to an arbitrary 16-bit address is just as easy. It makes the instruction 1 byte longer, and takes the same number of cycles. It's better called JMP in that case, or JMPI. I don't believe that the extra byte is a very compelling reason to discard such a JMP $HHLL: JMP is better than HOP.

But then I looked further, and JMP/JMPI could just as well store the old vPC in vLR. It then still functions as a JMP, but also as a CALL you can return with. So I tend to replace HOP with:

CALLI $HHLL

It occupies 1 byte more than the regular CALL $DD, but it reduces the pressure on the zero page. When your program extends into the 120 short segments behind every pixel row, you learn that you run out of zero page space before you run out of code space. CALLI is better than JMP...

It acts as a HOP as well, but it clobbers vLR. But functions that need to HOP are probably not leaf functions to begin with. So they already have their PUSH/POP in place.
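
To make the CALL versus CALLI trade-off concrete, here is a small Python model. It is a sketch only: `VCpu`, `call`, `calli` and `ret` are names invented here, `pc` is taken to be the address of the current instruction, and the interpreter's internal vPC bookkeeping (the -2 offset and low-byte wrapping within a page) is ignored.

```python
class VCpu:
    def __init__(self, pc):
        self.pc = pc          # address of the instruction being executed
        self.lr = 0           # link register (vLR)
        self.zero_page = {}   # zero-page words, keyed by address $DD


def call(cpu, dd):
    """CALL $DD: 2-byte instruction, target taken from a zero-page word."""
    cpu.lr = cpu.pc + 2
    cpu.pc = cpu.zero_page[dd]


def calli(cpu, target):
    """CALLI $HHLL: 3-byte instruction, target is an immediate operand."""
    cpu.lr = cpu.pc + 3
    cpu.pc = target


def ret(cpu):
    """RET: continue at the address saved in vLR."""
    cpu.pc = cpu.lr


cpu = VCpu(pc=0x0200)
calli(cpu, 0x0300)            # hop to the next page, no zero-page slot used
after_hop = (cpu.pc, cpu.lr)
ret(cpu)                      # ...and it can still be used as a call
```

With CALL, every callee needs its address parked in a zero-page word first. With CALLI the address travels in the instruction itself, at the cost of one extra code byte and an overwritten vLR.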

LSRW

So far nobody wants it. Good.

SW02

I believed it would be a nice way to get more compact code when dealing with 8-bit quantities. But it isn't popular. And admittedly it's a bit weird indeed.

marcelk · Post by **marcelk** » 05 Oct 2019, 15:43

The development ROM has just received 3 experimental new vCPU instructions. These new ones are still mostly untested/concepts. But as such "open heart surgeries" can introduce unexpected problems elsewhere, I pushed them anyway so we have more chance to find regressions. Caveat emptor if you live on dev.rom!

CALLI $HHLL

CALLI is as described in the previous post. It can be used as HOP in larger functions, but it also serves to reduce the zero-page pressure for larger programs.

CMPHS $DD
CMPHU $DD

CMPHS and CMPHU are completely new. They assist with full-range integer comparisons. They come in place of the originally proposed (and much voted for) CMPW instruction. As it turned out, monolithic CMPW[SU] instructions are too complex for vCPU. These new instructions are therefore intended to be used in conjunction with SUBW. For technical details, see the GitHub commit itself.

The C compiler will at some point benefit from all three instructions. We're not in a hurry there. I believe CALLI is a good general purpose addition to the family. It has the potential to see a lot of use.

A simple test with Mandelbrot showed less than 0.01% slowdown. We still have some space for more instructions, but using that will start to hurt, because it means patching the most popular instructions STW, LDW and LDWI. Patching any of these will cause a 3-5% overall slow down. For example, it turned out that LDWI is used a lot in time-critical code with SYS calls: LDWI address + STW sysFn + SYS. Still, this idiom is about to become obsolete as well. We've recently freed up a lot of space in ROM page 0 with the intent to use that as a big jump table for SYS instructions. That will reduce the need for LDWI in time-critical code fragments.

Post by **at67** » 23 Oct 2019, 15:48

I've been playing with the CALLI instruction (specifically, adding it to my compiler) to replace the following horrid sequence that I insert into the output assembly every time a page jump is required:

Code: Select all

        STW     0xe0
        LDWI    0x0300
        CALL    giga_vAC

0x0300  LDW     0xe0

And it doesn't quite work correctly. At first I thought it was my code trampling over memory, but I tried CALLI in a trivial sample and single-stepped through the native code with my debugger, and it seems the low byte of the CALLI address is being overwritten by the CALLI instruction itself. Looking at the code:

Code: Select all

calli:	0bde 8003  adda $03
        0bdf c21a  st   [$1a]
        0be0 0117  ld   [$17]
        0be1 d61b  st   [$1b],y
        0be2 0d00  ld   [y,x]
        0be3 a002  suba $02
        0be4 c216  st   [$16]
        0be5 de00  st   [y,x++]
        0be6 0d00  ld   [y,x]
        0be7 1403  ld   $03,y
        0be8 e0ca  jmp  y,$ca
        0be9 c217  st   [$17]

It seems the st [y,x++] is the culprit. I assume you are using it as an X increment, but it looks like it is writing back the address-2 each time it is called. A simple adda $02 after the st [$16] looks like it would fix it.

P.S. Writing back the lower byte of the address every time it is called should probably be documented somewhere, as anyone trying to write self-modifying code using the CALLI instruction and relying on its value might be in for a surprise.

P.P.S. Actually moving the st [y,x++] above the suba $02 would be simpler and more correct.

marcelk · Post by **marcelk** » 23 Oct 2019, 16:51

Thanks. I just pushed what should be the fix. Inserting an instruction isn't possible because this routine is already at 28 cycles. Self-modifying code still works: it writes back the value it just read. Our code does this a lot (to advance the memory address indeed).
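
For readers following along, the difference between the two orderings can be modelled in Python. This is a sketch under assumptions: it mirrors only the four native instructions around the operand's low byte from the listing above, and the "fixed" variant follows the reordering suggested in the P.P.S., which may not match the pushed fix instruction for instruction.

```python
def calli_low_byte_buggy(ram, operand_addr):
    """Order from the listing: the decremented value is written back."""
    ac = ram[operand_addr]        # ld   [y,x]
    ac = (ac - 2) & 0xFF          # suba $02
    vpc_low = ac                  # st   [$16]
    ram[operand_addr] = ac        # st   [y,x++]  (clobbers the operand)
    return vpc_low


def calli_low_byte_fixed(ram, operand_addr):
    """Reordered: the store writes back the value it just read."""
    ac = ram[operand_addr]        # ld   [y,x]
    ram[operand_addr] = ac        # st   [y,x++]  (harmless, used for X++)
    ac = (ac - 2) & 0xFF          # suba $02
    vpc_low = ac                  # st   [$16]
    return vpc_low


ram_buggy = bytearray([0x10])
buggy_calls = [calli_low_byte_buggy(ram_buggy, 0) for _ in range(2)]

ram_fixed = bytearray([0x10])
fixed_calls = [calli_low_byte_fixed(ram_fixed, 0) for _ in range(2)]
```

In the buggy order each execution moves the target two bytes lower, so a CALLI inside a loop drifts away from its destination. In the reordered version the operand in RAM stays intact and the instruction count, and therefore the 28-cycle budget, is unchanged.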

Post by **at67** » 23 Oct 2019, 23:39

No issues with CALLI on the new ROM so far, works correctly in emulation and on hardware with some pretty complex examples.

Cheers.
