Expanding ROM space

hnaves · Post by **hnaves** » 14 May 2021, 23:38

I am not sure if this is a new idea on this forum, but it might be possible extend the ROM storage space by 50% or more. That means, according to my estimates, more than 30k bytes of extra room for applications in ROM.

As far as I understand, data is stored in ROM and LUP is the vCPU instruction responsible for fetching the data from ROM. It uses a trampoline mechanism (see the trampoline() definition in asm.py), which roughly works like this:

Code: Select all

label("trampoline")
    bra(ac)
    bra("continue_trampoline")

label("continue_trampoline")

    ld(hi("lupReturn#19"), Y)
    jmp(Y, lo("lupReturn#19"))
    st(lo("vAC"))

The double branch (the second one in the branch delay slot of the first) allows the instruction on address [pch,ac] to be executed and then immediately return to "continue_trampoline". Typically, the instruction on target address [pch, ac] will be of the form

Code: Select all

    ld imm

where imm is an immediate constant. These instructions can encode 256 possible values (the immediate value is stored on the ac register). Note that we used 2 bytes in ROM (one for the instruction and one for the immediate) to store only 1 byte of information for the LUP.

However, it is possible to encode more information if one allows the use of different instructions. For example, one could use one of the following instructions instead

Code: Select all

    ld imm, x
    ld imm,y

To detect whether it is ld imm,ac, ld imm,x, and ld imm,y, one could preload ac, x, and y with some sentinel values and detect if any of these registers changed. If no register changed, try with different sentinel values for x and y (one cannot chose the value of ac because ac holds the address of the load instruction at this point). That way we encode 256 * 3 = 768 possible values for a single LUP. In number of bits, it is log(768) which is approximately 9.58 bits of information. Note that the penalty here is a slower LUP, which might take twice as long since we might need to use the trampoline twice.

Approximately 59k entries/addresses of the ROM are dedicated for this type of storage. In other words, 118k bytes of ROM are used for LUP storage. If we can get 1.58 extra bits per byte stored, we would have 11k bytes of extra storage already!

But we can do better than this. If we also allow the following instructions in our encoding:

Code: Select all

    ld [imm]
    ld [y, imm]
    ld [imm], x
    ld [imm], y

It is possible to encode 7 * 256 = 1792 possible values, which amounts to log(3840) ~ 10.8 bits of information. That way we can extend the storage space by 2.8 / 8 ~ 35% which amounts to 20k extra bytes.

One thing that I did not yet explain, is how to distinguish between these types of instructions (to completely recover the imm byte and the instruction itself). For that purpose, we need to be able to control the data on page 0 and on another page, say page 1, (in that case we need to preload the register y with this value).

So the algorithm to recover the instruction is as follows:

First write bytes 0, 1, 2, ... , 255 in that order on page 0;

Execute a trampoline, setting y = 0 and x = 0;
- If the value of ac changed, we know the instruction must be of one the following:
  Code: Select all
```
ld imm
ld [imm],
ld [y, imm]
```
  We also know the value of imm (it is the current value of ac). To distinguish between these 3 cases we first modify the byte at imm on page 0 to an arbitrary value imm' different than imm, and execute the trampoline again. If the value in ac changed to imm' we know the instruction is of the form ld imm'. Otherwise, we change the value at [1, imm] to some arbitrary value imm'' (different than imm), and set y = 1 and execute the trampoline again. Depending on the value of ac, we will be able to distinguish between ld imm, and ld [y, imm]. If the value of ac is imm, then we know that the instruction is of the form ld imm, otherwise ac will hold imm'' and the instruction is ld [y, imm]. In fact, instead of executing 2 more trampolines, we can do in only one by modifying the value at [imm] to imm' and the value at [1, imm] to imm'' simultaneously and executing the trampoline with y = 1 only once;
- If the value of x (or y) changed, we can do a simpler analysis: we know that the instruction is either:
  Code: Select all
```
ld imm, x
ld [imm], x
```
  We just set the value at imm on page 0 to any value imm' different than imm, and execute the trampoline again;
- If none of these values (ac, x, and y) have changed, then we know the instruction must be of the following forms:
  Code: Select all
```
ld addr
ld [addr]
ld [y, addr]
ld 0, x
ld [0], x
ld 0, y
ld [0], y
```
  where addr is the address of the LUP. We can then try to modify the value at [0], or at [addr], or at [1, addr], to rule out some cases, or set x = 2 and y = 1 before the trampoline, to rule out the other cases. In fact, we can do all of this at the same time: set byte at [0] to some value, byte [addr] to some other value, and byte [1, addr] to yet another value, set x = 2 and set y = 1 and do a trampoline.

I understand that there is a big penalty in having to write to page 0, which is a very special page, precious to many applications. For this to be worth the trouble, we need to copy the contents of page 0 to some other location, do a bunch of LUPs in sequence (there is not much to do in between LUPs, except for reverting the writes we did to figure out the instruction), and then copy back the contents to page 0;

If we push this idea to its extreme, we can still add some more instructions to our encoding scheme:

Code: Select all

    adda imm
    xora  imm
    adda [imm]
    suba [imm]
    xora  [imm]
    adda [y, imm]
    suba [y, imm]
    xora [y, imm]
    adda [imm], x
    suba [imm], x
    xora [imm], x
    adda [imm], y
    suba [imm], y
    xora [imm], y
    st [imm]
    st [y, imm]
    st [imm], x
    st [imm], y

In this case, we would have 25 * 256 = 6400 possible values, which amounts to log(6400) ~ 12.64 bits of information. That way we can extend the storage space by 4.64 / 8 ~ 58% which amounts to 34k extra bytes!

Note that I did not include suba imm, because it is impossible to distinguish it from adda imm (since for instance the effect of suba 2 is identical to the effect of adda 0xFE). In addition, I did not include the anda and ora operations, because depending on the value of ac (which we do not control), it might not be possible to extract the value of imm. Figuring out and recovering which instruction was stored in ROM might be a lot of work, since we have to analyze several possible cases and maybe execute the trampoline multiple times, but in theory, it is possible to do.

Maybe the sensible thing to do here is to to keep a fast version of LUP which returns only 8 bits of information per address, but also add a slow version LUP2, which might return more bits of information at the cost of being possibly much slower.

Post by **at67** » 15 May 2021, 00:43

I'm not really completely understanding your train of thought here, so I'll spell out the way I understand Trampolines, LUP, ROM space etc.

LUP is a vCPU instruction that is used by vCPU code to access look up tables in ROM, currently the only LUT's in ROM accessed by vCPU code are the Font tables at 0x0700/0x0800, (ROM space), and the Notes table at 0x0900.

There is a Shift table at 0x0500 that is used by the native interpreter and 6502 interpreter for right shift instructions, this table is accessed using a slightly different mechanism to LUP, (which explains why there are different types of trampolines).

There is an inversion table at 0x0a00 that is used by the inbuilt Racer app, this table is also accessed using a slightly different mechanism to LUP.

There are packed pictures stored in a 24:18 bit lossless compression format stored at 0x1300 that also use their own trampoline version, (though I haven't looked at this trampoline in detail as I removed the pictures app and it's data from ROMvX0 to make room for the new instructions and Sys routines).

So LUP represents a small percentage, (under 10%), of the actual ROM access from vCPU, the majority of, (pre ROMvX0), ROM space is used in storing the packed pictures and the vCPU applications. The packed pictures data and vCPU applications are stored at 50% efficiency as you noted at the start of your post, there is no way to increase this efficiency unless you actually compress the packed pictures data and vCPU applications streams using some sort of traditional lossless compressor/encoder, then the decoder would obviously have to live in ROM to be able to decompress these compressed streams on the fly before the data could be loaded from ROM into RAM.

If I understand your thought process correctly, then the potential efficiency increase, (assuming LUP accesses approximately 10% of the ROM space, I think it's actually much less than 10%), is then 0.1 * what you predicted.

LUP also currently takes 26 cycles/slots in ROMv5a/DEVROM and below, which leaves you with 2 slots, (in native code), to make whatever changes are required to implement your idea...that's not a lot.

ROMvX0 has increased the max cycles/slots per vCPU instruction from 28 to 30 and added the PREFX instruction, (thanks to lb3361), which potentially allows you to do a lot more with LUP or a LUP variant, but I still don't see the advantage in increasing LUP's efficiency. Personally I would write traditional lossless and lossy, (for image data), encoders on the host side and decoders in native code/Sys calls on the Gigatron side.

P.S. I'm actually removing images and vCPU apps from ROM space in ROMvX0, as I feel it's a massive waste of valuable ROM which can be used for new vCPU instructions and fully sick Sys calls, there is so many ways to get vCPU app's and data into Gigatron HW now, I personally think it's a travesty to waste the one resource the Gigatron has an abundance of, (ROM space), to pre-store vCPU applications and images in ROM.

hnaves · Post by **hnaves** » 15 May 2021, 03:27

Maybe I was not so clear, so I will try to explain a little further.

at67 wrote: ↑15 May 2021, 00:43 So LUP represents a small percentage, (under 10%), of the actual ROM access from vCPU, the majority of, (pre ROMvX0), ROM space is used in storing the packed pictures and the vCPU applications.

As far as I understand the code of the ROM, SYS_Exec_88 is implemented using LUPs. The following listing of sys_Exec in dev.lst shows this point:

Code: Select all

                                          5165  # SYS_Exec_88 implementation
                                          5166  label('sys_Exec')
sys_Exec:     124b d617  st   [$17],y     5167  st([vPC+1],Y)                   #18 Clear vPCH and Y
              124c 011c  ld   [$1c]       5168  ld([vSP])                       #19 Place ROM loader below current stack pointer
              124d a037  suba $37         5169  suba(53+2)                      #20 (AC -> *+0) One extra word for PUSH
              124e d21d  st   [$1d],x     5170  st([vTmp],X)                    #21
              124f 80fe  adda $fe         5171  adda(-2)                        #22 (AC -> *-2)
              1250 c216  st   [$16]       5172  st([vPC])                       #23
                                          5173  # Start of manually compiled vCPU section
              1251 dc75  st   $75,[y,x++] 5174  st('PUSH',    [Y,Xpp])          #24 *+0
              1252 dccf  st   $cf,[y,x++] 5175  st('CALL',    [Y,Xpp])          #25 *+26 Fetch first byte
              1253 8023  adda $23         5176  adda(33-(-2))                   #26 (AC -> *+33)
              1254 de00  st   [y,x++]     5177  st(           [Y,Xpp])          #27 *+27
              1255 dc5e  st   $5e,[y,x++] 5178  st('ST',      [Y,Xpp])          #28 *+3 Chunk copy loop
              1256 dc27  st   $27,[y,x++] 5179  st(sysArgs+3, [Y,Xpp])          #29 *+4 High-address comes first
              1257 dccf  st   $cf,[y,x++] 5180  st('CALL',    [Y,Xpp])          #30 *+5
              1258 de00  st   [y,x++]     5181  st(           [Y,Xpp])          #31 *+6
              
              ...
              
              1283 dc93  st   $93,[y,x++] 5226  st('INC',     [Y,Xpp])          #74 *+44
              1284 dc25  st   $25,[y,x++] 5227  st(sysArgs+1, [Y,Xpp])          #75 *+45
              1285 dc21  st   $21,[y,x++] 5228  st('LDW',     [Y,Xpp])          #76 *+46 Read next byte from ROM table
              1286 dc24  st   $24,[y,x++] 5229  st(sysArgs+0, [Y,Xpp])          #77 *+47
              1287 dc7f  st   $7f,[y,x++] 5230  st('LUP',     [Y,Xpp])          #78 *+48
              1288 dc00  st   $00,[y,x++] 5231  st(0,         [Y,Xpp])          #79 *+49
              1289 dc93  st   $93,[y,x++] 5232  st('INC',     [Y,Xpp])          #80 *+50 Increment read pointer
              128a dc24  st   $24,[y,x++] 5233  st(sysArgs+0, [Y,Xpp])          #81 *+51
              128b dcff  st   $ff,[y,x++] 5234  st('RET',     [Y,Xpp])          #82 *+52 Return
                                          5235  # Return to interpreter
              128c 1403  ld   $03,y       5236  ld(hi('REENTER'),Y)             #83
              128d e0cb  jmp  y,$cb       5237  jmp(Y,'REENTER')                #84
              128e 00d4  ld   $d4         5238  ld(-88/2)                       #85

Note that some vCPU code is first loaded in the stack space (page 0) and then executed by the vCPU to implement the system call. The vCPU code relies on the LUP instruction to read from ROM and use the trampoline (see address 0x1287). In addition, as I mentioned in the original post, the trampoline() function defined in asm.py (which is by far the most common trampoline) has the return address of "lupReturn#19". This further shows that LUP is indeed at the heart of most trampolines and data loading from ROM. So if SYS_Exec_88 depends on LUP, then not only 10% of the ROM is dedicated for LUPs, but more like 90% of the ROM.

Anyway, my point was not about the LUP instruction per se, it was more about how to waste less space when storing data in ROM. Regardless on whether or not we use LUP to get the data back from the ROM, there are other storage schemes that allow more information to be stored in ROM.

at67 wrote: ↑15 May 2021, 00:43 The packed pictures data and vCPU applications are stored at 50% efficiency as you noted at the start of your post, there is no way to increase this efficiency unless you actually compress the packed pictures data and vCPU applications streams using some sort of traditional lossless compressor/encoder, then the decoder would obviously have to live in ROM to be able to decompress these compressed streams on the fly before the data could be loaded from ROM into RAM.

I am not proposing using some (lossless) compression algorithm to pack more data into ROM. I am proposing a way to infer the contents of IR (instruction register) based on the effects of the instruction that is the target of the trampoline. Every instruction in ROM can be decomposed in the IR and D parts . Currently we only use the contents of the D register to store the data (and hence the 50% waste). But by analyzing the effects of the instruction, we can also recover part of the contents of the IR register as well. The whole point of my post was to show that it is indeed possible to increase the efficiency of the storage to close to 75% this way, and with no use of compression.

at67 wrote: ↑15 May 2021, 00:43 LUP also currently takes 26 cycles/slots in ROMv5a/DEVROM and below, which leaves you with 2 slots, (in native code), to make whatever changes are required to implement your idea...that's not a lot.

We do not need to perform all the operations involved in the new LUP instruction in one go. We can split the workload into "bursts", similar to what SYS_SetMemory_54 does. When you invoke SYS_SetMemory_54, if it cannot finish during the time it has available, it self-restarts later when there is more free time. LUP can also work this way. It will be slower, but at least we will get more data out of it.

qwertyface · Post by **qwertyface** » 15 May 2021, 03:51

It's an interesting idea, but I'm not sure that it works in practice. For one thing, you can't really get out of the page without setting y. You also can't practically write much native code without writing to page zero. Bear in mind that the packing density also depends on the amount of the page given over to the trampoline code, which is by necessity in each page - you need this to be simple.

It would be interesting to see an implementation. In my experience writing native code is always rewarding.

Post by **at67** » 15 May 2021, 06:22

hnaves wrote: ↑15 May 2021, 03:27 Maybe I was not so clear, so I will try to explain a little further.
As far as I understand the code of the ROM, SYS_Exec_88 is implemented using LUPs. The following listing of sys_Exec in dev.lst shows this point:

Yup, I wasn't clear myself, in ROMvX0, there is going to be much less vCPU code stored in ROM as it is a precious commodity that IMHO should be filled with new vCPU instructions and especially new native Sys calls. Sys calls can get rather large, some easily using half a page, (128 words).

I envision as ROMvX0 progresses, the need for more ROM space will only grow; I have only just scratched the surface with regards to Sys calls and so far I have written these:

Code: Select all

 "SYS_Multiply_s16_vX_66"       : "0x009e",
 "SYS_Divide_s16_vX_80"         : "0x00a1",
 "SYS_DrawLine_vX_86"           : "0x00a4",
 "SYS_WaitVBlank_vX_28"         : "0x00aa",
 "SYS_SpritePattern_vX_134"     : "0x00b6",
 "SYS_SortUint8Array_vX_52"     : "0x00b9",
 "SYS_SortSprites_vX_62"        : "0x00bc",
 "SYS_SortViaIndices_vX_44"     : "0x00bf",
 "SYS_MemCopyByte_vX_40"        : "0x00c2",
 "SYS_MemCopyWord_vX_48"        : "0x00c5",
 "SYS_MemCopyDWord_vX_58"       : "0x00c8",
 "SYS_DrawVLine_vX_66"          : "0x00cb",
 "SYS_DrawSprite_vX_132"        : "0x00ce",
 "SYS_DrawBullet_vX_140"        : "0x00d1",
 "SYS_CmpByteBounds_vX_54"      : "0x00d4",
 "SYS_Divide_u168_vX_56"        : "0x00d7",
 "SYS_ReadPixel_vX_32"          : "0x00da",
 "SYS_DrawPixel_vX_30"          : "0x00dd",
 "SYS_Multiply_u8_vX_48"        : "0x1300",
 "SYS_OffsetVTableY_vX_36"      : "0x1340",
 "SYS_FillByteSeq_vX_36"        : "0x1360",
 "SYS_AddInt8Array_vX_40"       : "0x1380",
 "SYS_ParityFill_vX_44"         : "0x13a0",
 "SYS_ConvertVTableX_66"        : "0x1fc0",
 "SYS_DrawSpriteH_vX_140"       : "0x2000",
 "SYS_ScrollVTableY_vX_38"      : "0x20c0",
 "SYS_RestoreSprite_vX_126"     : "0x2100",
 "SYS_ScrollRectVTableY_vX_44"  : "0x21a0",

hnaves wrote: ↑15 May 2021, 03:27 I am not proposing using some (lossless) compression algorithm to pack more data into ROM. I am proposing a way to infer the contents of IR (instruction register) based on the effects of the instruction that is the target of the trampoline. Every instruction in ROM can be decomposed in the

Yeah I got that bit, I was trying to suggest that I personally would come up with traditional compressors/decompressors if I was going to be storing large quantities of data in ROM. You would get better lossless compression and much better lossy compression.

hnaves wrote: ↑15 May 2021, 03:27 We do not need to perform all the operations involved in the new LUP instruction in one go. We can split the workload into "bursts", similar to what SYS_SetMemory_54 does. When you invoke SYS_SetMemory_54, if it cannot finish during the time it has available, it self-restarts later when there is more free time. LUP can also work this way. It will be slower, but at least we will get more data out of it.

You have 2 slots to restart vPC using LUP, restarting an instruction/Sys call requires 3, so yes you would need to use a Sys call rather than the LUP instruction itself or one of the new PREFX pages in ROMvX0.

As qwertyface suggested, writing the actual native code isn't just a rewarding experience, but also tends to prove/disprove the validity of your idea pretty quickly.

hnaves · Post by **hnaves** » 15 May 2021, 20:53

qwertyface wrote: ↑15 May 2021, 03:51 It's an interesting idea, but I'm not sure that it works in practice. For one thing, you can't really get out of the page without setting y. You also can't practically write much native code without writing to page zero. Bear in mind that the packing density also depends on the amount of the page given over to the trampoline code, which is by necessity in each page - you need this to be simple.

It would be interesting to see an implementation. In my experience writing native code is always rewarding.

I wrote a proof of concept to extend the result of the LUP instruction by 1 bit. That is, instead of returning a byte (8 bits), it returns 9 bits at once while using the same number of addresses within ROM. This LUP is compatible with the old one and all the other applications should work as expected.

I have attached the patch file with the source code of the experiment. To apply this patch please run the following command on the root of the gigatrom-rom repository:

Code: Select all

git am -3 0001-Extended-bit-LUP-experiment.patch.txt

In addition, I also wrote a vCPU application that makes use of this new LUP.
See attached file

Secret.gt1: (583 Bytes) Downloaded 149 times

The output of the vCPU application Secret.gt1 depends on the "include_hidden" flag defined on line 5394 of Core/dev.asm.py.

Code: Select all

#-----------------------------------------------------------------------
#
#  $1300 ROM: Test data
#
#-----------------------------------------------------------------------

align(0x100, size=0x100)
original_message = "Hello, everyone!!! This is a harmless message!!!"
hidden_message = "SECRET"
include_hidden = True

If the ROM is compiled with this flag set to True, the application is able to pick up the extra bit per byte, and shows the secret message.

: with.png (8.47 KiB) Viewed 2912 times

Otherwise, when include_hidden is set to False, it only displays the regular message.

: without.png (8.23 KiB) Viewed 2912 times

hnaves · Post by **hnaves** » 18 May 2021, 21:41

I realized that I forgot to also upload the source code of the Secret.gt1 file. Here it is.

PS: Do you know why the forum does not accept .py files?

Post by **walter** » 19 May 2021, 11:34

That sounds like a security precaution to prevent files that are being uploaded from being run by the webserver.

Gigatron Hackers

Expanding ROM space

Expanding ROM space

Re: Expanding ROM space

Re: Expanding ROM space

Re: Expanding ROM space

Re: Expanding ROM space

Re: Expanding ROM space

Re: Expanding ROM space

Re: Expanding ROM space