Re: New vCPU instructions 2.0
Posted: 26 Mar 2021, 15:19
I'd be glad if you put your experimental ROM in a GitHub clone of gigatron-rom. I have lost track of the changes at this point.
A place for Gigatron builders and hackers
https://forum.gigatron.io/
I have a hard time being persuaded by 1). The CMPH instructions were never fully tested, let alone used by anyone, and let's face it, they were designed for future compiler writers to work around vCPU's inherent 16-bit signed format and its lack of hardware condition codes/flags. The problem I have with these two instructions is that they only solve your issue for 16-bit ints; once you decide to implement 32-bit ints you will face the exact same issue again. (Although you could probably use the CMPH instructions in a 32-bit int solution as well, it would be another train wreck of vCPU asm, similar to Marcel's TinyBASIC solution.)
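For readers who have not hit the underlying problem, here is a minimal Python sketch of why a plain subtract-and-test-sign comparison fails for 16-bit signed values. The helper names `to_s16` and `naive_less_than` are made up for illustration and do not correspond to anything in the ROM.

```python
def to_s16(x):
    """Wrap an integer to a signed 16-bit value, the way vAC would hold it."""
    x &= 0xFFFF
    return x - 0x10000 if x & 0x8000 else x

def naive_less_than(a, b):
    """Decide a < b from the sign of the wrapped 16-bit difference only."""
    return to_s16(a - b) < 0

# Same-sign operands: the difference cannot overflow, so the answer is right.
print(naive_less_than(5, 7))            # True
# Mixed signs: 30000 - (-30000) = 60000 wraps to a negative 16-bit value.
print(naive_less_than(30000, -30000))   # True, although 30000 > -30000
```

The same wrap-around happens one level up with 32-bit values, which is the point made above about the fix not scaling beyond 16 bits.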
lb3361 wrote: ↑30 Mar 2021, 14:38
2) In C you'll find plenty of loops written as "for(i=0; i<n; i++) { .... }". Each of them involves comparing signed ints. Even when the loop variable is a char, the char will be promoted to int before any operation. Of course one could write "for(i=0;i!=n;i++){...}" or even "for(i=n-1;i>=0;i--) {..}" and get more efficient code, using a cheaper equality test, or a cheaper comparison with a small constant. But lots of existing code might suffer.
Now this I am completely persuaded by. We have a number of possible options here:
Code:
MOVQW sysFn, SYS_CompareInt16_vX_44
LDWI 0x3032
SYS 44
Code:
MOVQW sysFn, SYS_CompareInt16_vX_44
LDWI 0x3032
_loop:
SYS 44
.
.
BRA _loop
Code:
# ADDW implementation, *note* : cycle 12 in page 3 is ld(0,Y)
#label('addw#13')
#ld(AC,X) #13 Address of low byte to be added
#ld([vAC]) #14 Low byte
#st([vTmp]) #15 Store low result
#adda([X]) #16
#st([vAC]) #17
#bmi('.addw#20') #18 Now figure out if there was a carry
#ld([X]) #19
#st([Y,Xpp]) #20
#ora([vTmp]) #21
#bmi('.addw#24_0') #22 If Carry == 1
#ld([X]) #23
#adda([vAC+1]) #24
#st([vAC+1]) #25 Store high result
#ld(hi('NEXTY'),Y) #26
#jmp(Y,'NEXTY') #27
#ld(-30/2) #28
#label('.addw#24_0')
#adda(1) #24
#adda([vAC+1]) #25
#st([vAC+1]) #26 Store high result
#ld(hi('REENTER'),Y) #27
#jmp(Y,'REENTER') #28
#ld(-32/2) #29
#label('.addw#20')
#st([Y,Xpp]) #20
#anda([vTmp]) #21
#bmi('.addw#24_1') #22 If Carry == 1
#ld([X]) #23
#adda([vAC+1]) #24
#st([vAC+1]) #25 Store high result
#ld(hi('NEXTY'),Y) #26
#jmp(Y,'NEXTY') #27
#ld(-30/2) #28
#label('.addw#24_1')
#adda(1) #24
#adda([vAC+1]) #25
#st([vAC+1]) #26 Store high result
#ld(hi('REENTER'),Y) #27
#jmp(Y,'REENTER') #28
#ld(-32/2) #29
# ADDW implementation
#label('addw#13')
#ld(AC,X) #13 Address of low byte to be added
#adda(1) #14
#st([vTmp]) #15 Address of high byte to be added
#ld([vAC]) #16 Add the low bytes
#adda([X]) #17
#st([vAC]) #18 Store low result
#bmi('.addw#21') #19 Now figure out if there was a carry
#suba([X]) #20 Gets back the initial value of vAC
#ora([X]) #21 Carry in bit 7
#anda(0x80,X) #22 Move carry to bit 0
#ld([X]) #23
#adda([vAC+1]) #24 Add the high bytes with carry
#ld([vTmp],X) #25
#adda([X]) #26
#st([vAC+1]) #27 Store high result
#ld(hi('NEXTY'),Y) #28
#jmp(Y,'NEXTY') #29
#ld(-32/2) #30
#label('.addw#21')
#anda([X]) #21 Carry in bit 7
#anda(0x80,X) #22 Move carry to bit 0
#ld([X]) #23
#adda([vAC+1]) #24 Add the high bytes with carry
#ld([vTmp],X) #25
#adda([X]) #26
#st([vAC+1]) #27 Store high result
#ld(hi('NEXTY'),Y) #28
#jmp(Y,'NEXTY') #29
#ld(-32/2) #30
Code:
#include <cstdint>
#include <cstdio>

namespace TestCarry
{
    // Calculate carry using signed 8bit representation for : r = a + b
    //
    // Sr = sgn(r), Sa = sgn(a), Sb = sgn(b), C = carry, ~ = NOT
    //
    //  Sr | Sa | Sb || C
    // ----+----+----++---
    //  0  | 0  | 0  || 0
    //  0  | 0  | 1  || 1
    //  0  | 1  | 0  || 1
    //  0  | 1  | 1  || 1
    //  1  | 0  | 0  || 0
    //  1  | 0  | 1  || 0
    //  1  | 1  | 0  || 0
    //  1  | 1  | 1  || 1
    //
    //       ~Sa.~Sb  ~Sa.Sb   Sa.Sb   Sa.~Sb
    //      +-------+-------+-------+-------+
    //  ~Sr |   0   |   1   |   1   |   1   |
    //      +-------+-------+-------+-------+
    //   Sr |   0   |   0   |   1   |   0   |
    //      +-------+-------+-------+-------+
    //
    // C = Sa.Sb + ~Sr.Sb + ~Sr.Sa
    // C = Sa.Sb + ~Sr.(Sb + Sa)
    void calcCarry(void)
    {
        // 154 does not fit in int8_t; the cast makes the wrap to -102 explicit
        int8_t a = 102, b = int8_t(154), r = 0, c0 = 0, c1 = 0, c2 = 0;
        r = int8_t(a + b);

        // Sign bits as booleans, so the formula reads like the comment above
        bool sa = (a & 0x80) != 0, sb = (b & 0x80) != 0, sr = (r & 0x80) != 0;

        // c0: boolean formula, c1: same test on signed values, c2: branch on Sr
        c0 = (sa && sb) || (!sr && (sb || sa));
        if((a < 0 && b < 0) || (r >= 0 && (a < 0 || b < 0))) c1 = 1;
        if(r < 0)
        {
            c2 = bool((a & b) & 0x80);
        }
        else
        {
            c2 = bool((a | b) & 0x80);
        }
        fprintf(stderr, "%d %d %d %d %d %d\n", a, b, r, c0, c1, c2);
    }
}
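The test above checks a single pair of operands. As a rough cross-check, the following Python sketch compares both forms of the carry expression against the real carry for every pair of 8-bit values. The function names are made up for this example; the formulas are the ones from the comment block.

```python
def carry_from_signs(a, b):
    """C = Sa.Sb + ~Sr.(Sb + Sa), using only bit 7 of a, b and r."""
    r = (a + b) & 0xFF
    sa, sb, sr = a >> 7, b >> 7, r >> 7
    return (sa & sb) | ((sr ^ 1) & (sa | sb))

def carry_by_branch(a, b):
    """Branch on the sign of r: AND the operands if Sr = 1, OR them otherwise."""
    r = (a + b) & 0xFF
    return ((a & b) if r & 0x80 else (a | b)) >> 7

# Compare both forms with the true carry out of bit 7 for all 65536 pairs.
mismatches = [
    (a, b)
    for a in range(256)
    for b in range(256)
    if not (carry_from_signs(a, b) == carry_by_branch(a, b) == (a + b) >> 8)
]
print(len(mismatches))  # 0
```

The branch form is the one both ADDW listings use: `bmi` on the low-byte result, then `anda` or `ora` with the operand.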
at67 wrote: ↑01 Apr 2021, 08:12
A good compiler could cache the Sys call setup invariants in loops and blocks using the same sized types. For the example you gave above, or for a complex boolean if statement, the compiler could statically analyse the code and do something like this (this obviously assumes that other Sys calls are not being used within the loop/block, that the two registers to be compared do not change, and that the same sized types are being compared throughout the entire loop/block):
To be frank, I find it much harder to do than you say.
I believe that Marcel's solution was that CMPWS = CMPHS+SUBW. He just split the CMPWS work into two instructions, one of them already existing.
P.S. If you do have a go at the CMPWS instruction, it does not have to be numerically accurate; it only needs to produce a result that is valid for the BCC instructions, i.e. the sign bit of the output (vAC) must be valid and the output must contain all zero bits for any of the EQ branches. This may save you a couple of native instruction slots/cycles (it did for me when I coded up the CMPI var, imm instruction, but it still ended up too large for max-ticks = 30).
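To make the relaxed requirement in the P.S. concrete, here is a small Python model of a comparison result that is only branch-valid: the sign is right and it is zero exactly when the operands are equal, but the value itself is not the true difference. `cmpws_model` is a hypothetical name and this is a sketch of the idea, not the ROM's CMPHS or any real CMPWS implementation.

```python
def to_s16(x):
    """Wrap an integer to a signed 16-bit value."""
    x &= 0xFFFF
    return x - 0x10000 if x & 0x8000 else x

def cmpws_model(a, b):
    """Return a 16-bit value whose sign and zero-ness match a compared to b."""
    if (a ^ b) & 0x8000:
        # Signs differ: a's own sign already gives the answer.
        # OR-ing in a low bit keeps the sign and guarantees a non-zero result.
        return to_s16(a | 1)
    # Same sign: the 16-bit difference cannot overflow.
    return to_s16(a - b)

# Check the branch-relevant properties on a handful of edge values.
edges = [-32768, -32767, -12345, -1, 0, 1, 12345, 32767]
for a in edges:
    for b in edges:
        r = cmpws_model(a, b)
        assert (r < 0) == (a < b)
        assert (r == 0) == (a == b)
print("ok")
```

Only the mixed-sign case needs special handling, which is why the work splits naturally into a sign fix-up step followed by an ordinary SUBW.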
Code:
void loop1(int *p, int n, int v)
{
    int i;
    for (i=0; i<n; i++)
        p[i] = v;
}

void loop2(int *p, int v)
{
    int i;
    for (i=0; i<12; i++)
        p[i] = v;
}

void loop3(int *p, int n, int v)
{
    int i;
    for (i=n-1; i>=0; i--)
        p[i] = v;
}

double fabs(double x)
{
    if (x >= 0)
        return x;
    else
        return -x;
}
Code:
x.export('loop1');
x.segment('CODE');
# begin function 'loop1'
x.label('loop1');
x.LDW('vLR');x.STW(LR);x._SP(-2);x.STW(SP);x._SP(0);x._DOKEA(R23);
####{
#### for (i=0; i<n; i++)
x.LDI(0);x.STW(R23);
x._BRA('.5');
x.label('.2');
#### p[i] = v;
x.LDW(R23);x.LSLW();x.ADDW(R8);x._DOKEA(R10);
x.label('.3');
#### for (i=0; i<n; i++)
x.LDI(1);x.ADDW(R23);x.STW(R23);
x.label('.5');
x._CMPS(R23,R9);x._BLT('.2');
####}
x.label('.1');
x._SP(0);x.DEEK();x.STW(R23);x._SP(2);x.STW(SP);x.LDW(LR);x.STW('vLR');x.RET();
# end function 'loop1'
x.export('loop2');
# begin function 'loop2'
x.label('loop2');
x.LDW('vLR');x.STW(LR);x._SP(-2);x.STW(SP);x._SP(0);x._DOKEA(R23);
####{
#### for (i=0; i<12; i++)
x.LDI(0);x.STW(R23);
x.label('.7');
#### p[i] = v;
x.LDW(R23);x.LSLW();x.ADDW(R8);x._DOKEA(R9);
x.label('.8');
#### for (i=0; i<12; i++)
x.LDI(1);x.ADDW(R23);x.STW(R23);
x.LDW(R23);x.SUBI(12);x._BLT('.7');
####}
x.label('.6');
x._SP(0);x.DEEK();x.STW(R23);x._SP(2);x.STW(SP);x.LDW(LR);x.STW('vLR');x.RET();
# end function 'loop2'
x.export('loop3');
# begin function 'loop3'
x.label('loop3');
x.LDW('vLR');x.STW(LR);x._SP(-2);x.STW(SP);x._SP(0);x._DOKEA(R23);
####{
#### for (i=n-1; i>=0; i--)
x.LDW(R9);x.SUBI(1);x.STW(R23);
x._BRA('.15');
x.label('.12');
#### p[i] = v;
x.LDW(R23);x.LSLW();x.ADDW(R8);x._DOKEA(R10);
x.label('.13');
#### for (i=n-1; i>=0; i--)
x.LDW(R23);x.SUBI(1);x.STW(R23);
x.label('.15');
x.LDW(R23);x._BGE('.12');
####}
x.label('.11');
x._SP(0);x.DEEK();x.STW(R23);x._SP(2);x.STW(SP);x.LDW(LR);x.STW('vLR');x.RET();
# end function 'loop3'
x.export('fabs');
# begin function 'fabs'
x.label('fabs');
x.LDW('vLR');x.STW(LR);
####{
#### if (x >= 0)
x._FMOV(F8,FAC);x.LDWI('.19');x._FPEEKA(FARG);x._FCMP();x._BLT('.17');
#### return x;
x._FMOV(F8,FAC);
x._BRA('.16');
x.label('.17');
#### return -x;
x._FMOV(F8,FAC);x._FNEG();
x.label('.16');
x.LDW(LR);x.STW('vLR');x.RET();
# end function 'fabs'
x.segment('LIT');
x.label('.19');
x.bytes(0,0,0,0,0) # 0.000000
Yes, but originally he tried to create the CMPWU and CMPWS instructions, which are obviously more efficient than the CMPH : SUBW pair. That is why I thought it might be worth revisiting Marcel's original attempt now that we have 30 max-ticks to play with instead of 28.
This looks great. Are you working with the unfinished lcc project in gigatron-rom/Libs, or have you rolled your own?
I rolled my own, with a very different code generation strategy in fact.
Here is an amusing idea. Make the CMPWS instruction be represented by a three-byte sequence, XX B8 DD, where XX=lo('CMPWS') and B8=lo('SUBW'). Note that it is probably not worth breaking backward compatibility. After all, 1F DD B8 DD already works.
Code:
label('CMPWS')
ld(hi('cmpws#13'),Y) #10
jmp(Y,'cmpws#13') #11
st([Y,Xpp]) #12 just x++
....
label('cmpws#13')
ld([Y,X]) #13 fetch operand addr
ld(AC,X) #14 into X
ld([X]) #15 operand
xor([vAC+1]) #16
bpl('cmpws#19') #17
ld([vPC]) #18
adda(1) #19
st([vPC]) #20
ld([vAC+1]) #21
ora(1) #22
st([vAC+1]) #23
ld(hi('NEXTY'),Y) #24
jmp(Y,'NEXTY') #25
ld(-28/2) #26
....
label('cmpws#19')
subi(1) #19
st([vPC+1]) #20
ld(hi('REENTER'),Y) #21
jmp(Y,'REENTER') #22
ld(-26/2) #23
Code:
label('MOVW')
ld(hi('movw#13'),Y) #10
jmp(Y,'movw#13') #11
st([Y,Xpp]) #12 just x++
....
label('movw#13')
ld([Y,Xpp]) #13
adda(1) #14
st([vTmp]) #15
ld([Y,X]) #16
adda(1,X) #17
ld([X]) #18
ld([vTmp],X) #19
st([X]) #20
ld([vPC]) #21
subi(1) #22
st([vPC]) #23
ld(hi('NEXTY'),Y) #24
jmp(Y,'NEXTY') #25
ld(-28/2) #26
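The three-byte encoding may be easier to follow with a toy byte-stream interpreter. The Python sketch below is only a model of the control flow: the CMPWS handler reads its operand from behind the embedded SUBW byte, then either falls into SUBW DD (same signs) or skips it (different signs). The opcode value used for CMPWS is a placeholder, since XX is not assigned in the post, and `run` is a made-up helper that has nothing to do with the real vCPU dispatch loop.

```python
SUBW = 0xB8    # from the post: B8 = lo('SUBW')
CMPWS = 0xF0   # placeholder for XX, NOT a real opcode assignment

def to_s16(x):
    """Wrap an integer to a signed 16-bit value."""
    x &= 0xFFFF
    return x - 0x10000 if x & 0x8000 else x

def run(code, vac, var):
    """Run a tiny byte stream; var maps a zero-page address to a signed word."""
    pc, trace = 0, []
    while pc < len(code):
        op = code[pc]
        if op == CMPWS:
            trace.append("CMPWS")
            addr = code[pc + 2]          # operand sits after the SUBW byte
            if (vac ^ var[addr]) & 0x8000:
                # Signs differ: keep vAC's sign, force it non-zero by setting
                # bit 0 of the high byte (like the ora(1) on vAC+1 above).
                vac = to_s16(vac | 0x0100)
                pc += 3                  # skip the embedded SUBW DD
            else:
                pc += 1                  # fall into SUBW DD
        elif op == SUBW:
            trace.append("SUBW")
            vac = to_s16(vac - var[code[pc + 1]])
            pc += 2
        else:
            raise ValueError(f"unknown opcode {op:#04x}")
    return vac, trace

program = [CMPWS, SUBW, 0x30]            # XX B8 DD with DD = 0x30
print(run(program, 5, {0x30: 7}))        # (-2, ['CMPWS', 'SUBW'])
print(run(program, 30000, {0x30: -30000}))  # (30000, ['CMPWS'])
```

In the model the embedded B8 DD is an ordinary SUBW in its own right, which is what allows the new opcode to cost only one extra byte.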
That is actually a very cool way of adding instruction functionality with a minimal increase in instruction byte size. I think this idea has potential for some edge cases.