New vCPU instructions 2.0

lb3361 · Post by **lb3361** » 26 Mar 2021, 15:19

I'd be glad if you put your experimental rom in a github clone of gigatron-rom. I have lost track of the changes at this point

.

lb3361 · Post by **lb3361** » 30 Mar 2021, 14:38

I know you have little love for CMPHI and CMPHU but I have two arguments in their favor:

1) Backward compatibility. Cpuv5 was a strict superset of cpuv4. It is annoying if cpuv6 is not a superset of cpuv5.

2) In C you'll find plenty of loops written as "for(i=0; i<n; i++) { .... }" Each of them involves comparing signed ints. Even when the loop variable is a char, the char will be promoted as int before any operation. Of course one could write "for(i=0;i!=n;i++){...}" or even "for(i=n-1;i>=0;i--) {..}" and get more efficient code, using a cheaper equality test, or a cheaper comparison with a small constant. But lots of existing code might suffer.

Post by **at67** » 01 Apr 2021, 08:12

lb3361 wrote: ↑30 Mar 2021, 14:38 I know you have little love for CMPHI and CMPHU but I have two arguments in their favor:

1) Backward compatibility. Cpuv5 was a strict superset of cpuv4. It is annoying if cpuv6 is not a superset of cpuv5.

I have a hard time being persuaded by 1), the CMPH instructions were never fully tested, let alone used by anyone and lets face it they were designed to be used by future compiler writers to try and get around vCPU's inherent 16bit signed format and the lack of hardware condition codes/flags. The problem I have with these 2 instructions is that they only solve your issue for 16bit int's, once you decide to implement 32bit int's then you will be faced with the exact same issue again. (Although you could probably use the CMPH instructions in a 32bit int solution as well, it would be another train-wreck of vCPU asm, similar to Marcel's TinyBASIC solution).

There has to be a better way...

lb3361 wrote: ↑30 Mar 2021, 14:38 2) In C you'll find plenty of loops written as "for(i=0; i<n; i++) { .... }" Each of them involves comparing signed int's. Even when the loop variable is a char, the char will be promoted as int before any operation. Of course one could write "for(i=0;i!=n;i++){...}" or even "for(i=n-1;i>=0;i--) {..}" and get more efficient code, using a cheaper equality test, or a cheaper comparison with a small constant. But lots of existing code might suffer.

Now this I am completely persuaded by, we have a number of possible options here:

1) Do nothing, (i.e. leave the new ROM as it currently is with the ejected CMPH instructions), and rely on the programmer to fully understand the architecture's inherent vCPU numerical system, it's limitations and consequences.

2) Re-enable the CMPH instructions, by either finding new free slots or ejecting 2 of the new instructions if no free slots are available.

3) Have a go at implementing Marcel's original idea of CMPWS/CMPWU in the new 30 max-ticks infrastructure, (he did mention in the original thread that when he first implemented the instruction, the best he could do was 31 cycles or ticks), NEXT-to-NEXT, here : https://forum.gigatron.io/viewtopic.php?p=935#p935

4) Implement the functionality for a proper CMPWS/CMPWU and CMPDS/CMPDU in Sys calls.

5) Create an unsigned version of the BCC instruction, (whether as one instruction with 2 operands, like the current BCC, or as multiple separate instructions with one operand).

1: This is what I was initially planning as I assumed the C compiler was mostly dead in the water.

2: This one is a hard sell for me, (if we can't find any new free instruction slots), I have already written a lot of software taking advantage of the new instructions and basically become dependent on every new instruction so far introduced, i.e. I have re-hand-coded every gtBASIC runtime sub, (and there is a lot of them), using the new instructions.

3: This one seems like a little better option than 2, as worst case if we do have to eject two of the new instructions, at least we will be replacing them with proper instructions rather than the CMPH SUBW band-aid.

4: This is the option I would personally choose, yes Sys calls can have a large prelude negating their advantages when replacing small sequences of vCPU instructions, but there are ways to get around that, (especially for the compiler writer). e.g:

the Sys call could specify two zero page starting registers, (of any width, i.e. 16bit, 32bit, etc), to compare as the low and high bytes in vAC, so the following code would compare 0x30 and 0x32 as two 16bit signed values taking overflow/underflow into account.
Code: Select all
```
MOVQW   sysFn, SYS_CompareInt16_vX_44
LDWI    0x3032
SYS     44
```

a good compiler could cache the Sys call setup invariant's in loops and blocks using the same sized types, i.e. for the example you gave above or a complex boolean if statement, the compiler could statically analyse the code and do something like this, (this obviously assumes that other Sys calls are not being used within the loop/block and the two registers to be compared do not change and that the same sized types are being compared throughout the entire loop/block):
Code: Select all
```
        MOVQW   sysFn, SYS_CompareInt16_vX_44
        LDWI    0x3032

_loop:
        SYS     44
             .
             .
        BRA     _loop
```

you can produce Sys calls that handle as many different types as you like, signed, unsigned, 8bit, 16bit, 24bit, 32bit, fixed point, etc

5: I am not sure this is possible outside of ROM page 3, even with max-ticks = 30 and finding 6 new instructions slots for the one operand versions is probably even less likely possible.

I was going to make an attempt at writing a CMPWS instruction and a Sys call version of it, so that I had more objective data to analyse, but I have too much on my plate as it is, would you be willing to have a go at these yourself?

P.S. If you do have a go at the CMPWS instruction, it does not have to be numerically accurate, it only needs to produce a result that is valid for the BCC instructions, i.e. the output, (vAC), sign bit must be valid and the output must contain all zero bits for any of the EQ branches. This may save you a couple of native instruction slots/cycles, (it did for me when I coded up the CMPI var, imm instruction, but still ended up too large for max-ticks = 30).

Post by **at67** » 03 Apr 2021, 15:08

I've spent a couple of nights trying to move ADDW from page3 as this will give us around 7 new instruction slots, (it's the last candidate left that I think can be moved, SUBW can't be moved because of the REENTER and REENTER_28 labels, BCC can't be moved because there is no possible way to implement it in under 30 cycles in an external page, so ADDW is really the only candidate left.

I've coded up multiple versions, but the two best were one that is 2 cycles too long, (for maxticks = 30), and another that has two paths that are 1 cycle too long; here's the code, truth table, K-map and C test code, if you or anyone else wants to have a go at getting this instruction to fit into 30 cycles.

Code: Select all

# ADDW implementation, *note* : cycle 12 in page 3 is ld(0,Y)
#label('addw#13')
#ld(AC,X)                        #13 Address of low byte to be added
#ld([vAC])                       #14 Low byte
#st([vTmp])                      #15 Store low result
#adda([X])                       #16
#st([vAC])                       #17
#bmi('.addw#20')                 #18 Now figure out if there was a carry
#ld([X])                         #19
#st([Y,Xpp])                     #20
#ora([vTmp])                     #21
#bmi('.addw#24_0')               #22 If Carry == 1
#ld([X])                         #23
#adda([vAC+1])                   #24
#st([vAC+1])                     #25 Store high result
#ld(hi('NEXTY'),Y)               #26
#jmp(Y,'NEXTY')                  #27
#ld(-30/2)                       #28
#label('.addw#24_0')
#adda(1)                         #24
#adda([vAC+1])                   #25
#st([vAC+1])                     #26 Store high result
#ld(hi('REENTER'),Y)             #27
#jmp(Y,'REENTER')                #28
#ld(-32/2)                       #29
#label('.addw#20')
#st([Y,Xpp])                     #20
#anda([vTmp])                    #21
#bmi('.addw#24_1')               #22 If Carry == 1
#ld([X])                         #23
#adda([vAC+1])                   #24
#st([vAC+1])                     #25 Store high result
#ld(hi('NEXTY'),Y)               #26
#jmp(Y,'NEXTY')                  #27
#ld(-30/2)                       #28
#label('.addw#24_1')
#adda(1)                         #24
#adda([vAC+1])                   #25
#st([vAC+1])                     #26 Store high result
#ld(hi('REENTER'),Y)             #27
#jmp(Y,'REENTER')                #28
#ld(-32/2)                       #29

# ADDW implementation
#label('addw#13')
#ld(AC,X)                        #13 Address of low byte to be added
#adda(1)                         #14
#st([vTmp])                      #15 Address of high byte to be added
#ld([vAC])                       #16 Add the low bytes
#adda([X])                       #17
#st([vAC])                       #18 Store low result
#bmi('.addw#21')                 #19 Now figure out if there was a carry
#suba([X])                       #20 Gets back the initial value of vAC
#ora([X])                        #21 Carry in bit 7
#anda(0x80,X)                    #22 Move carry to bit 0
#ld([X])                         #23
#adda([vAC+1])                   #24 Add the high bytes with carry
#ld([vTmp],X)                    #25
#adda([X])                       #26
#st([vAC+1])                     #27 Store high result
#ld(hi('NEXTY'),Y)               #28
#jmp(Y,'NEXTY')                  #29
#ld(-32/2)                       #30
#label('.addw#21')
#anda([X])                       #21 Carry in bit 7
#anda(0x80,X)                    #22 Move carry to bit 0
#ld([X])                         #23
#adda([vAC+1])                   #24 Add the high bytes with carry
#ld([vTmp],X)                    #25
#adda([X])                       #26
#st([vAC+1])                     #27 Store high result
#ld(hi('NEXTY'),Y)               #28
#jmp(Y,'NEXTY')                  #29
#ld(-32/2)                       #30

Code: Select all

namespace TestCarry
{
    // Calculate carry using signed 8bit representation for : r = a + b
    //
    // Sr = sgn(r), Sa = sgn(a), Sb = sgn(b), C = carry
    //
    //  Sr | Sa | Sb || C
    // ----+----+----++---
    //   0 |  0 |  0 || 0
    //   0 |  0 |  1 || 1
    //   0 |  1 |  0 || 1
    //   0 |  1 |  1 || 1
    //   1 |  0 |  0 || 0
    //   1 |  0 |  1 || 0
    //   1 |  1 |  0 || 0
    //   1 |  1 |  1 || 1
    //
    //     __ __ __             __
    //     Sa.Sb Sa.Sb Sa.Sb Sa.Sb 
    // __ +-----+-----+-----+-----+
    // Sr |  0  |  1  |  1  |  1  |
    //    +-----+-----+-----+-----+
    // Sr |  0  |  0  |  1  |  0  |
    //    +-----+-----+-----+-----+
    //
    //             __      __
    // C = Sa.Sb + Sr.Sb + Sr.Sa
    //             __
    // C = Sa.Sb + Sr.(Sb + Sa)

    void calcCarry(void)
    {
        int8_t a = 102, b = 154, r = 0, c0 = 0, c1 = 0, c2 = 0;

        r = a + b;

        c0 = (bool(a & 0x80) & bool(b & 0x80))  +  (!bool(r & 0x80) & (bool(b & 0x80) + bool(a & 0x80)));

        if((a < 0  &&  b < 0)  ||  (r >= 0  &&  (a < 0  ||  b < 0))) c1 = 1;

        if(r < 0)
        {
            c2 = bool((a & b) & 0x80);
        }
        else
        {
            c2 = bool((a | b) & 0x80);
        }
        fprintf(stderr, "%d %d %d %d %d %d\n", a, b, r, c0, c1, c2);
    }
}

lb3361 · Post by **lb3361** » 04 Apr 2021, 19:21

at67 wrote: ↑01 Apr 2021, 08:12 a good compiler could cache the Sys call setup invariant's in loops and blocks using the same sized types, i.e. for the example you gave above or a complex boolean if statement, the compiler could statically analyse the code and do something like this, (this obviously assumes that other Sys calls are not being used within the loop/block and the two registers to be compared do not change and that the same sized types are being compared throughout the entire loop/block):

I find it much harder to do than you say to be frank.

P.S. If you do have a go at the CMPWS instruction, it does not have to be numerically accurate, it only needs to produce a result that is valid for the BCC instructions, i.e. the output, (vAC), sign bit must be valid and the output must contain all zero bits for any of the EQ branches. This may save you a couple of native instruction slots/cycles, (it did for me when I coded up the CMPI var, imm instruction, but still ended up too large for max-ticks = 30).

I believe that Marcel's solution was that CMPWS = CMPHS+SUBW. He just split the CMPWS work in two instructions, one of them already existing.

Aside. I am making good progress with the compiler. It no longer loops forever. It can compile all of itself.
For instance

Code: Select all

void loop1(int *p, int n, int v)
{
        int i;
        for (i=0; i<n; i++)
                p[i] = v;
}

void loop2(int *p, int v)
{
        int i;
        for (i=0; i<12; i++)
                p[i] = v;
}

void loop3(int *p, int n, int v)
{
        int i;
        for (i=n-1; i>=0; i--)
                p[i] = v;
}

double fabs(double x)
{
        if (x >= 0)
                return x;
        else
                return -x;
}

gives

Code: Select all

	x.export('loop1');
	x.segment('CODE');
# begin function 'loop1'
	x.label('loop1');
	x.LDW('vLR');x.STW(LR);x._SP(-2);x.STW(SP);x._SP(0);x._DOKEA(R23);
####{
####	for (i=0; i<n; i++)
	x.LDI(0);x.STW(R23);
	x._BRA('.5');
	x.label('.2');
####		p[i] = v;
	x.LDW(R23);x.LSLW();x.ADDW(R8);x._DOKEA(R10);
	x.label('.3');
####	for (i=0; i<n; i++)
	x.LDI(1);x.ADDW(R23);x.STW(R23);
	x.label('.5');
	x._CMPS(R23,R9);x._BLT('.2');
####}
	x.label('.1');
	x._SP(0);x.DEEK();x.STW(R23);x._SP(2);x.STW(SP);x.LDW(LR);x.STW('vLR');x.RET();
# end function 'loop1'
	x.export('loop2');
# begin function 'loop2'
	x.label('loop2');
	x.LDW('vLR');x.STW(LR);x._SP(-2);x.STW(SP);x._SP(0);x._DOKEA(R23);
####{
####	for (i=0; i<12; i++)
	x.LDI(0);x.STW(R23);
	x.label('.7');
####		p[i] = v;
	x.LDW(R23);x.LSLW();x.ADDW(R8);x._DOKEA(R9);
	x.label('.8');
####	for (i=0; i<12; i++)
	x.LDI(1);x.ADDW(R23);x.STW(R23);
	x.LDW(R23);x.SUBI(12);x._BLT('.7');
####}
	x.label('.6');
	x._SP(0);x.DEEK();x.STW(R23);x._SP(2);x.STW(SP);x.LDW(LR);x.STW('vLR');x.RET();
# end function 'loop2'
	x.export('loop3');
# begin function 'loop3'
	x.label('loop3');
	x.LDW('vLR');x.STW(LR);x._SP(-2);x.STW(SP);x._SP(0);x._DOKEA(R23);
####{
####	for (i=n-1; i>=0; i--)
	x.LDW(R9);x.SUBI(1);x.STW(R23);
	x._BRA('.15');
	x.label('.12');
####		p[i] = v;
	x.LDW(R23);x.LSLW();x.ADDW(R8);x._DOKEA(R10);
	x.label('.13');
####	for (i=n-1; i>=0; i--)
	x.LDW(R23);x.SUBI(1);x.STW(R23);
	x.label('.15');
	x.LDW(R23);x._BGE('.12');
####}
	x.label('.11');
	x._SP(0);x.DEEK();x.STW(R23);x._SP(2);x.STW(SP);x.LDW(LR);x.STW('vLR');x.RET();
# end function 'loop3'
	x.export('fabs');
# begin function 'fabs'
	x.label('fabs');
	x.LDW('vLR');x.STW(LR);
####{
####	if (x >= 0)
	x._FMOV(F8,FAC);x.LDWI('.19');x._FPEEKA(FARG);x._FCMP();x._BLT('.17');
####		return x;
	x._FMOV(F8,FAC);
	x._BRA('.16');
	x.label('.17');
####		return -x;
	x._FMOV(F8,FAC);x._FNEG();
	x.label('.16');
	x.LDW(LR);x.STW('vLR');x.RET();
# end function 'fabs'
	x.segment('LIT');
	x.label('.19');
	x.bytes(0,0,0,0,0) # 0.000000

Instructions starting with an underscore are synthetized in the assembler (long jumps, long sequences that obscure the meaning). I use a bank of 32 registers (pairs for longs, triples for floats). When needed, R2 to R7 contain the long and float accumulator and arg registers (LAC/LARG/FAC/FARG). R8 to R13 can be used to pass arguments. R14 to R23 are for local variable (callee-saved). R24 to R29 are always available for temporaries. R30 is a copy of vLR, R31 is my stack pointer. Floats are 5 bytes using the MSBASIC FP format.

Post by **at67** » 05 Apr 2021, 09:28

lb3361 wrote: ↑04 Apr 2021, 19:21 I believe that Marcel's solution was that CMPWS = CMPHS+SUBW. He just split the CMPWS work in two instructions, one of them already existing.

Yes, but originally he tried to create the CMPWU and CMPWS instructions which are obviously more efficient than the CMPH : SUBW pair, thus I thought it might be worth re-visiting Marcel's original attempt now that we have 30 max-ticks to play with instead of 28.

lb3361 wrote: ↑04 Apr 2021, 19:21 Aside. I am making good progress with the compiler. It no longer loops forever. It can compile all of itself.

This looks great, are you working with the unfinished lcc project in gigatron-rom/Libs or have you rolled your own?

Post by **at67** » 05 Apr 2021, 09:45

Update:

I've managed to find another 8 instruction slots by re-writing the BCC instruction and moving it out of page3 into an external page.

Marcel's original BCC code is magnificent, the amount of work it does in total bytes used is astonishing, my version was far simpler to code as I was able to take advantage of the one resource that the Gigatron has an abundance of, ROM space.

Instructions modified:

BEQ, best case 24 cycles, worst case 26 cycles.
BNE, best case 24 cycles, worst case 26 cycles.
BGT, 26 cycles.
BLT, 24 cycles.
BGE, 24 cycles.
BLE, best case 24 cycles, worst case 26 cycles.

Instructions reinstated:

CMPHU, 28 cycles.
CMPHS, 28 cycles.

New instructions:

XORBI, var.lo ^= imm, 28 cycles.
ANDBA, vAC &= var.lo, 24 cycles.
ORBA, vAC |= var.lo, 22 cycles.
XORBA, vAC ^= var.lo, 22 cycles.
NOTB, var.lo = ~var.lo, 22 cycles.
DOKEX, doke word in vAC to address contained in var, var += 2, 30 cycles.

lb3361 · Post by **lb3361** » 05 Apr 2021, 12:25

at67 wrote: ↑05 Apr 2021, 09:28 [This looks great, are you working with the unfinished lcc project in gigatron-rom/Libs or have you rolled your own?

I rolled my own, with a very different code generation strategy in fact.

at67 wrote: ↑01 Apr 2021, 08:12 P.S. If you do have a go at the CMPWS instruction,...

Here is an amusing idea. Make CMPWS instruction be represented by a three byte sequence, XX B8 DD, where XX=lo('CMPWS') and B8=lo('SUBW'). Note that it is probably not work breaking backward compatiblity. After all 1F DD B8 DD already works.

Code: Select all

label('CMPWS')
ld(hi('cmpws#13'),Y)            #10
jmp(Y,'cmpws#13')               #11
st([Y,Xpp])                     #12   just x++
....
label('cmpws#13')
ld([Y,X])                       #13   fetch operand addr
ld(AC,X)                        #14   into X
ld([X])                         #15   operand
xor([vAC+1])                    #16
bpl('cmpws#19')                 #17
ld([vPC])                       #18
adda(1)                         #19
st([vPC])                       #20
ld([vAC+1])                     #21
ora(1)                          #22
st([vAC+1])                     #23
ld(hi('NEXTY'),Y)               #24
jmp(y,'NEXTY')                  #25
ld(-28/2)                       #26
....
label('cmpws#19')
subi(1)                         #19
st([vPC+1])                     #20
ld(hi('REENTER'),Y)             #21
jmp(y,'REENTER')                #22
ld(-26/2)                       #26

The same trick can be used to finally get MOVW(FF,TT) with a four byte sequence XX YY TT FF where XX=lo('MOVW') and YY=lo('MOVB')

Code: Select all

label('MOVW')
ld(hi('movw#13'),Y)            #10
jmp(Y,'movw#13')               #11
st([Y,Xpp])                    #12   just x++
....
label('movw#13')
ld([Y,Xpp])                    #13
adda(1)                        #14
st([vTmp])                     #15
ld([Y,X])                      #16
adda(1,X)                      #17
ld([X])                        #18
ld([vTmp],X)                   #19
st([X])                        #20
ld([vPC])                      #21
subi(1)                        #22
st([vPC])                      #23
ld(hi('NEXTY'),Y)              #24
jmp(y,'NEXTY')                 #25
ld(-28/2)                      #26

Not any faster, but amusing..

Post by **at67** » 06 Apr 2021, 07:42

lb3361 wrote: ↑05 Apr 2021, 12:25 Here is an amusing idea. Make CMPWS instruction be represented by a three byte sequence, XX B8 DD, where XX=lo('CMPWS') and B8=lo('SUBW'). Note that it is probably not work breaking backward compatiblity. After all 1F DD B8 DD already works.

That is actually a very cool way of producing extra instruction functionality at minimal increase in instruction byte size, I think this idea has potential for some edge cases.

lb3361 · Post by **lb3361** » 06 Apr 2021, 09:22

Here is an idea that you might like. I think this one is useful:

Imagine you reserve a new ROM page, say page X that starts with exactly the same dispatching code as page 3 (NEXTY,NEXT,EXIT,RESYNC) except maybe for minor changes to be discussed later. If one branches the the NEXTY of this page, the next instruction will be dispatched in page X instead of page 3. Now allocate one instruction in page3, named PREFIX, that does only one thing: it returns to NEXT in page X instead of page 3. PREFIX took 12 cycles. But the next instruction now dispatches in page X instead of page 3. For simplicity, let all those pageX instructions return to NEXYY or REENTER in page 3 so that we don't carry a state.

Here are the minor changes:
- the pc adjustment has to be +1 because PREFIX is that long.
- the EXIT code has to record that one must return to page X instead of page 3. One has to set vCpuSelect to page X.
- One needs to do something to reset vCpuSelect to page 4 when one returns from a page X instruction
- One needs to make sure that all interrupt handlers save and restore vCpuSelect (needed for v6502 anyway).

Gigatron Hackers

New vCPU instructions 2.0

Re: New vCPU instructions 2.0

Re: New vCPU instructions 2.0

Re: New vCPU instructions 2.0

Re: New vCPU instructions 2.0

Re: New vCPU instructions 2.0

Re: New vCPU instructions 2.0

Re: New vCPU instructions 2.0

Re: New vCPU instructions 2.0

Re: New vCPU instructions 2.0

Re: New vCPU instructions 2.0