ROM adventures (dev7rom)
Re: ROM adventures (dev7rom)
Thanks for your work
Re: ROM adventures (dev7rom)
Shaving cycles.
Maybe inspired by Hans61, who solders for relaxation, I spend time here and there shaving cycles in the dev7 rom: sometimes taking advantage of more rom code to speed up instructions (e.g. Bcc/ADDI/SUBI), sometimes tweaking the contents of the vCPU page, page 3, to recover some of the cycles lost between ROMv4 and ROMv5a (e.g. LD/ANDI/INC), sometimes recoding the new dev7 instructions (e.g. MACX, ADDL, LSLVL, LSLXA, LSRXA). However, some instructions remain necessarily slower because they have added features (e.g. 16-bit stack POP, PUSH, LDLW, STLW) or because they have been moved out of page 3 (CALL/SUBW), making space for other instructions and also providing opportunities to speed up some old instructions. A complicated landscape.
Here are the results for ascbrot.gt1
This is a floating point heavy program that runs much faster when compiled for rom dev7. Yet the cycle shaving changes yield a 10% speedup which is not insignificant. I was also quite happy that dev7 rom now runs programs compiled for roms v5a/v6 about 3% faster than roms v5a/v6 themselves. The execution times of the program compiled for ROMv4 are very telling as well.
+-----------------------+---------+---------+---------+------------+-------------+
| ascbrot.gt1 (mode 3)  | ROMv4   | ROMv5a  | ROMv6   | DEV7(2/23) | DEV7(11/23) |
+-----------------------+---------+---------+---------+------------+-------------+
| compiled for ROMv4    | 104.6s  | 108.7s  | 108.7s  | 110.9s     | 106.4s      |
| compiled for ROMv5a   |         | 101.1s  | 101.1s  | 100.4s     | 97.9s       |
| compiled for ROMv6    |         |         | 101.1s  | 100.4s     | 97.9s       |
| compiled for DEV7(*)  |         |         |         | 25.5s      | 23.6s       |
+-----------------------+---------+---------+---------+------------+-------------+
Re: ROM adventures (dev7rom)
Very interesting results. Thanks for the job you have done.
Are the DEV7 ROM and the DEV ROM the same?
Re: ROM adventures (dev7rom)
They're different:
- DEVROM in https://github.com/kervinck/gigatron-rom is functionally identical to ROMv6.
- DEV7ROM is in https://github.com/lb3361/gigatron-rom as a dozen commits ahead of DEVROM.
Whether code or ideas from DEV7ROM will make it into the official repository is up in the air for lack of consensus.
Re: ROM adventures (dev7rom)
One of the key enablers for new vCPU instructions is the increase of MaxTicks pioneered by at67 (viewtopic.php?p=1995#p1995). Although this change has pervasive effects in the operation of the Gigatron, increasing MaxTicks from 14 to 15 ticks has been amazingly free of backward compatibility nightmares. Until last week, that is.
Background on MaxTicks --- The Gigatron ROM is essentially a loop that generates VGA and sound signals with precise timings. However, at many points of this loop, there is nothing to do for a known duration. These time slices are used to interpret vCPU opcodes. For instance, at the beginning of a blank scanline, there are about 148 cycles for vCPU opcodes, or 74 ticks, with each tick equal to 2 cycles. Of course, when the ROM branches to the native code that implements a vCPU opcode, it must be certain that this code will "return" before the end of the time slice. This is why vCPU opcodes are only dispatched when at least MaxTicks ticks remain available in the time slice. With MaxTicks=14 as in ROMv5a, all vCPU opcodes must return in at most 28 cycles. This is not much because 10 of these cycles are already taken by the dispatching code, and 3-4 more are necessary if the vCPU opcode implementation is outside ROM page 3. Increasing MaxTicks really helps because it provides the elbow room to move vCPU opcode implementations around and add new ones. But increasing MaxTicks also means that there are more unused cycles at the end of each time slice. At67 found that increasing MaxTicks to 15 had practically no impact on the vCPU speed, but increasing to 16 would slow it by about 10%. This is why both ROMvX0 and DEV7ROM use MaxTicks=15.
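The budget arithmetic above can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: the constant and function names are made up for illustration, and the 10-cycle dispatch cost and 3-4 cycle out-of-page penalty are the figures quoted in this post, not values read from the ROM source.

```python
CYCLES_PER_TICK = 2        # one tick is two native cycles
DISPATCH_CYCLES = 10       # dispatch overhead, figure quoted in the post
FAR_JUMP_CYCLES = (3, 4)   # extra cost range when the opcode lives outside ROM page 3

def opcode_budget(max_ticks, outside_page3=False):
    """Return (worst, best) cycles left for the body of a vCPU opcode."""
    total = max_ticks * CYCLES_PER_TICK - DISPATCH_CYCLES
    if not outside_page3:
        return (total, total)
    # Subtract the larger penalty for the worst case, the smaller for the best case.
    return (total - FAR_JUMP_CYCLES[1], total - FAR_JUMP_CYCLES[0])

for max_ticks in (14, 15):
    print(max_ticks,
          opcode_budget(max_ticks),
          opcode_budget(max_ticks, outside_page3=True))
```

Under these assumptions, going from MaxTicks=14 to MaxTicks=15 gives every opcode two more cycles, which is a noticeable share of what remains once the dispatch and page-crossing costs are paid.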
Background on the SYS opcode --- The vCPU SYS opcode provides a way to execute native code that requires more than MaxTicks*2 cycles. For instance, the routine SYS_VDrawBits_134, which is used to draw characters on the screen, must be called with vCPU instruction SYS(134) which checks whether there are 134 remaining cycles in the current time slice. If not, it arranges to be called again by tweaking the vCPU program counter and returns immediately. The result is that the SYS(134) instruction is called again and again, until finding a long enough time slice.
SYS with MaxTicks=15 --- The argument of SYS(134) is not encoded as a cycle count, but as excess ticks required beyond MaxTicks. This means that SYS(134) with MaxTicks=14 is encoded as B4 CB, and SYS(134) with MaxTicks=15 is encoded as B4 CC. So when a ROM with MaxTicks=15 executes a program compiled for a ROM with MaxTicks=14, these SYS(134) encoded as B4 CB are executed as SYS(136). This does not seem too problematic because ensuring that there are 136 remaining cycles is enough to run a routine that takes at most 134 cycles.
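To make the encoding concrete, here is a small Python sketch. It assumes the operand byte is the two's-complement negative of the excess ticks, which is consistent with the B4 CB and B4 CC bytes quoted above; the function names are hypothetical and this is not the code of the actual compiler or ROM.

```python
def encode_sys_operand(cycles, max_ticks):
    """Operand byte for SYS(cycles) when compiling for a ROM with this MaxTicks."""
    # Excess ticks beyond MaxTicks; cycle counts are assumed to be even.
    extra_ticks = max(0, cycles // 2 - max_ticks)
    return (256 - extra_ticks) & 0xFF

def decode_sys_cycles(operand, max_ticks):
    """Cycles that a ROM with this MaxTicks will wait for, given an operand byte."""
    extra_ticks = (256 - operand) & 0xFF
    return (max_ticks + extra_ticks) * 2

print(hex(encode_sys_operand(134, 14)))   # 0xcb
print(hex(encode_sys_operand(134, 15)))   # 0xcc
print(decode_sys_cycles(0xCB, 14))        # 134
print(decode_sys_cycles(0xCB, 15))        # 136
```

The last line is the case discussed here: a byte compiled for MaxTicks=14 asks for two more cycles than intended when it runs on a MaxTicks=15 ROM.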
Until last week.
When we run the Gigatron in video mode 0, the slowest mode that displays all scanlines, the only remaining time slices are those occurring during the video vertical blanking interval. It turns out that the longest of these time slices is 134 cycles. So these SYS(134) opcodes compiled for MaxTicks=14 and interpreted as SYS(136) never find a time slice long enough to run. The Gigatron just waits. For instance, TinyBasic_v4.gt1, which was compiled for ROMv5a, works slowly but correctly with ROMv5a in video mode 0. However, on a MaxTicks=15 ROM operating in mode 0, it will simply hang until one changes the video mode.
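The hang can be illustrated with a toy model of the SYS retry behaviour. This is a sketch, not the ROM logic: the function name and the list of slice lengths are invented, and only the 134-cycle maximum comes from this post.

```python
def first_fitting_slice(required_cycles, slice_lengths):
    """Index of the first time slice long enough for the SYS call, else None."""
    for index, available in enumerate(slice_lengths):
        if available >= required_cycles:
            return index
        # Otherwise the real SYS opcode rewinds the vCPU program counter
        # and is retried in a later time slice.
    return None

# Hypothetical vertical-blank slice lengths for mode 0; the longest is 134 cycles.
mode0_slices = [40, 134, 134, 60]

print(first_fitting_slice(134, mode0_slices))   # 1
print(first_fitting_slice(136, mode0_slices))   # None
```

Since the same slices come around again on every frame, a result of None in this model corresponds to the program waiting until the video mode changes.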
If you care about backward compatibility, this is a problem, and not one that is easy to fix. There is simply no cycle left in the code of the SYS instruction to correct its argument and normalize the way the instruction is encoded regardless of MaxTicks. After looking at this problem, I concluded that the only viable solution is to find a way to increase the length of at least some of the vertical blanking time slices. When your only option is to find two free cycles in Marcel's incredibly tight code, you know you're in trouble.
I got lucky. Just before these 134-cycle time slices, there is code that tests variable videoY and decides whether to read the input to store in variable SerialRaw (this happens when videoY=207) and whether one needs to collect the audio samples (this happens when videoY&6 is zero). Instead of testing both, we can notice that 207&6 is not zero. So if we need to read the input, we don't need to test the audio condition. With some code reorganization, this gives the two cycles we need. However, Marcel's code also used to write zero in memory location zero with instruction st(0,[0]) when not reading the input. Fortunately the previous bit of code contained a nop() and therefore gave another free cycle to do this as well. In the end, all seems to work.
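The observation can be checked exhaustively with a Python model of the decision logic alone. This is a sketch under assumptions: the action names are hypothetical labels, the real code is native Gigatron assembly, and nothing here models cycle counts.

```python
def actions_original(video_y):
    """Both conditions are tested on every pass."""
    actions = set()
    if video_y == 207:
        actions.add("read_input")       # store the input in SerialRaw
    else:
        actions.add("clear_zero")       # the st(0,[0]) mentioned in the post
    if video_y & 6 == 0:
        actions.add("collect_audio")
    return actions

def actions_reorganized(video_y):
    """The audio test is skipped on the input path, because 207 & 6 != 0."""
    if video_y == 207:
        return {"read_input"}
    actions = {"clear_zero"}
    if video_y & 6 == 0:
        actions.add("collect_audio")
    return actions

assert 207 & 6 != 0
assert all(actions_original(y) == actions_reorganized(y) for y in range(256))
```

Both versions select the same actions for every 8-bit value of videoY, so the reorganized flow only changes where the tests happen, not what gets done.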
With this patch (https://github.com/lb3361/gigatron-rom/ ... b18527ed8f), dev7rom offers slightly longer time slices during vertical blanking and runs old programs that print characters in mode zero without hanging. This was a close one...