Possible ways of speeding up the Gigatron
Posted: 29 Mar 2021, 14:53
1. Use a faster clock rate, lower-latency chips, and a more compact board.
2. Separating the video. This change could allow the core unit to be clocked faster than the video. In my own variation, I may use some type of DMA and make the controller aware of the indirection table. That should keep compatibility. There could be artifacts, though there are likely hardware and software methods to prevent them. One workaround is to add HW interrupts so the sound and keyboard still work. It may be better to do all the I/O in hardware, with the same timing the Gigatron uses, but through concurrent DMA. That way you keep the time relationships between the video, sound, and keyboard by doing all of them with hardware DMA.
3. Reducing colors. An 8-, 4-, or 2-color mode with supporting hardware would allow more processing time.
4. ISA changes. Adding a secondary control unit and ALU would sometimes allow two instructions to be done simultaneously. There could be a ROM block copy command where up to 512 bytes of ROM could be copied to RAM starting at Y:X. That would only affect initialization speed, but it would improve density. Additional instructions could make code denser and faster too. Proper shifts could make code faster, as would a proper carry flag. The secondary CU/ALU mentioned above could allow 16-bit additions in 2 cycles. Each ALU could add one half of the word in a single cycle, and a fixup instruction could then deal with any carry (or borrow) between the halves. Even a couple of 16-bit registers could be added. Redoing the opcode map to prevent chained decoding might reduce control unit latency.
5. A couple of 16-bit memory instructions could help. There are several ways to do this. Without a more sophisticated memory controller or wait states, one could code to avoid race conditions. So if the RAM is tied up in a 16-bit operation needing 2 cycles, you can use a non-RAM instruction or a NOP immediately after the word memory instruction. You could effectively do a 16-bit access in a single cycle by hiding the second transfer behind a concurrent instruction. Or, if designing for an FPGA, you could have 2 interleaved 32K channels that work simultaneously but still fit within an 8-bit wide memory map. This would also take unaligned accesses into account, though possibly with a slight skew (due to needing a pre-increment on one of the channels). Looking at some of the variable locations, I notice that the Gigatron uses a lot of unaligned memory accesses (though that is not a concern on a true 8-bit architecture). This latter way of getting 16-bit accesses would make the ROM block copy above work better. Mixed-mode instructions could work in this context, too, such as reading one address and writing to the next in sequence, or vice versa. Another way to achieve this would be to use a free DMA channel to do half of the transfer.
6. A faster ALU could help when increasing clock rates. One easy change is to replace the upper adder with two adders and a multiplexer. All three adders then work simultaneously, and the carry from the low adder selects which high nybble goes on the bus. That is faster than a carry triggering an additional addition, so there would be less latency when settling after a carry. Switching a multiplexer is faster than waiting for an adder to settle with its carry input asserted. A distributed ALU might also reduce latency. Logic gates could be used for the logic functions.
7. Not clobbering the accumulator on every operation could also help in that you wouldn't have to reload Ac if you need it a bit later.
8. Adding a coprocessor is another possible strategy. I considered creating a VN core to run out of RAM while the Gigatron core uses the ROM. However, I am not sure how to avoid possible software races that could cause stuttering sound, visual artifacts, etc. Perhaps some memory writes could trigger a halt until Vsync. You could use other means to prevent sound and video from getting out of sync.
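To make item 6 more concrete, here is a rough behavioral sketch of the two-adders-plus-multiplexer idea (usually called a carry-select adder). It assumes the 8-bit ALU is built from 4-bit adder stages, as the "high nybble" wording suggests. The function names are made up for illustration, and this models only the logic, not gate timing.

```python
def add4(a, b, carry_in):
    """Model one 4-bit adder stage: return (sum nybble, carry out)."""
    total = (a & 0xF) + (b & 0xF) + carry_in
    return total & 0xF, total >> 4


def carry_select_add8(a, b, carry_in=0):
    """8-bit add using one low adder, two high adders, and a mux."""
    # All three adders can settle at the same time in hardware.
    lo, lo_carry = add4(a, b, carry_in)
    hi_if_0, cout_if_0 = add4(a >> 4, b >> 4, 0)  # high nybble assuming no carry
    hi_if_1, cout_if_1 = add4(a >> 4, b >> 4, 1)  # high nybble assuming a carry

    # The multiplexer: the low adder's carry only picks a precomputed result.
    if lo_carry:
        hi, carry_out = hi_if_1, cout_if_1
    else:
        hi, carry_out = hi_if_0, cout_if_0
    return (hi << 4) | lo, carry_out
```

The point of the arrangement is that the low carry no longer has to ripple through a second adder. It only flips a select line, which is where the latency saving would come from.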
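The 2-cycle 16-bit addition from item 4 could behave roughly like the sketch below. This is only a model under my own assumptions: two 8-bit ALUs that each add one byte in the first cycle, a latched carry from the low byte, and a hypothetical fixup step in the second cycle. None of these names correspond to real Gigatron instructions.

```python
def add16_split(a, b):
    """Add two 16-bit values as two parallel 8-bit adds plus a fixup."""
    # Cycle 1: both ALUs add their own byte independently.
    lo_sum = (a & 0xFF) + (b & 0xFF)
    hi_sum = ((a >> 8) & 0xFF) + ((b >> 8) & 0xFF)
    lo = lo_sum & 0xFF
    hi = hi_sum & 0xFF
    pending_carry = lo_sum >> 8  # latched carry out of the low byte

    # Cycle 2: the fixup step folds the latched carry into the high byte.
    hi = (hi + pending_carry) & 0xFF
    return (hi << 8) | lo
```

A subtraction would need the same kind of fixup for the borrow. When the low byte produces no carry, the second cycle changes nothing, so hardware could in principle skip it.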
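For the interleaved 32K channels in item 5, a small software model shows where the pre-increment comes from. It assumes even addresses live in one bank and odd addresses in the other, and that words are stored low byte first. The class and method names are hypothetical.

```python
class InterleavedRam:
    """64K byte-addressed memory split into two 32K banks (even/odd)."""

    def __init__(self):
        self.banks = [bytearray(0x8000), bytearray(0x8000)]

    def write8(self, addr, value):
        addr &= 0xFFFF
        self.banks[addr & 1][addr >> 1] = value & 0xFF

    def read8(self, addr):
        addr &= 0xFFFF
        return self.banks[addr & 1][addr >> 1]

    def read16(self, addr):
        """Read a little-endian word; both banks are accessed at once."""
        addr &= 0xFFFF
        lo_bank = addr & 1       # bank holding the low byte
        hi_bank = lo_bank ^ 1    # the other bank holds the high byte
        lo_index = addr >> 1
        # Unaligned (odd) address: the even bank needs the next row,
        # which is the pre-increment mentioned in item 5.
        hi_index = (lo_index + lo_bank) & 0x7FFF
        return self.banks[lo_bank][lo_index] | (self.banks[hi_bank][hi_index] << 8)
```

Byte accesses still see an ordinary 8-bit wide memory map, so existing code would not notice the split. Only the word access uses both banks in the same cycle.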