Suggestions/ideas for my possible Gigatron-similar machine?

Sugarplum
Posts: 93
Joined: 30 Sep 2020, 22:19

Post by Sugarplum »

I still haven't started my Verilog CPU project. I'd like to make something that can run vCPU code, but with a different architecture. I'm looking for comments on various aspects or any considerations.

Separation of video/sound/keyboard using DMA -- The video DMA will use the redirection table. The sound, keyboard, and likely other I/O will also be done in hardware so everything can have direct access to the syncs and thus avoid hardware race conditions. Software race conditions could be mitigated in other ways, such as adding a CPU halt line. Line quadding will be done in hardware, possibly using a BRAM buffer, so the SRAM is read on only 1 of every 4 native scan lines.
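To make the line-quadding idea concrete, here is a rough Python model of it, not a hardware design. It assumes the usual 160x120 framebuffer shown on 480 native lines; the names (scan_out, line_buffer) are made up for illustration, and the redirection table is left out to keep it short.

```python
# Rough behavioural model of hardware line quadding (assumption: 160x120
# framebuffer, each row repeated on 4 native scan lines).
PIXELS_PER_LINE = 160
REPEAT = 4  # native scan lines per framebuffer row


def scan_out(framebuffer_rows):
    """Return (native_scanlines, sram_row_reads).

    The SRAM is touched only when a new framebuffer row starts; the other
    three native lines are replayed from the BRAM line buffer.
    """
    line_buffer = []          # stands in for the BRAM line buffer
    sram_row_reads = 0
    native_scanlines = []
    for native_line in range(len(framebuffer_rows) * REPEAT):
        if native_line % REPEAT == 0:
            # First of four lines: fetch the row from SRAM into the buffer.
            line_buffer = list(framebuffer_rows[native_line // REPEAT])
            sram_row_reads += 1
        # Every native line is driven from the buffer, never from SRAM.
        native_scanlines.append(list(line_buffer))
    return native_scanlines, sram_row_reads


rows = [[r % 64] * PIXELS_PER_LINE for r in range(120)]
lines, reads = scan_out(rows)
print(len(lines), reads)
```

In this model the CPU-side memory port is free on three of every four active lines, which is where the extra SRAM bandwidth would come from.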

6-bit sound -- Since Blinkenlights are not planned, 6-bit sound would be feasible. The custom hardware would work during the DMA time and would use the same memory locations as now. I'd like to try to mux this on the color lines. Demuxing might be done using 2 multiplexer sets to decide what gets selected and what gets blanked or muted. Hopefully, interference won't be a problem. I'm unsure about the best way to do this. I'd like to try this to save GPIO pins.

Watchdog/snooper unit and halt line -- To avoid software races caused by having all the I/O done in hardware, there likely needs to be a halt line to pause the CPU. In a naive compatibility mode, one could activate the halt line during active display time. That way, things will be slightly faster than a regular Gigatron in Mode 4 due to dedicated hardware doing things that currently use active processing time. So if there is a hardware RNG, cycles are not used to create random numbers. Not having Blinkenlights saves cycles. Hardware sound saves cycles.

However, being less naive, it could snoop the address lines to know when I/O is being updated and selectively halt for a set number of active scanlines. There is precedent for this in things such as Apple accelerator cards or the 100 MHz FPGA 6502 board that does everything internally at 100 MHz, shadows the entire ROM and RAM into BRAM, and does board traffic at bus speed. For I/O region writes, it writes to both BRAM and DRAM. I'm not sure about I/O region reads, but I guess it would write to the BRAM as it uses the data. So if sound registers, frame buffer, indirection table, or other I/O areas on the Giga-similar machine are written to, the watchdog/snooper can selectively halt the CPU.
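A minimal sketch of the snooper decision, in Python rather than Verilog. The region boundaries and halt lengths below are placeholders picked for illustration only, not a verified Gigatron memory map, and Region, WATCHED and snoop are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    name: str
    start: int       # first watched address
    end: int         # last watched address (inclusive)
    halt_lines: int  # active scanlines to hold the CPU after a write


# Placeholder ranges: assumptions for the sketch, not the real layout.
WATCHED = (
    Region("indirection_table", 0x0100, 0x01EF, 1),
    Region("frame_buffer", 0x0800, 0x7FFF, 2),
)


def snoop(addr, is_write, regions=WATCHED):
    """Return how many active scanlines the halt line should be asserted."""
    if not is_write:
        return 0  # reads never trigger a halt in this sketch
    for region in regions:
        if region.start <= addr <= region.end:
            return region.halt_lines
    return 0      # ordinary RAM write: CPU keeps running


print(snoop(0x0900, True), snoop(0x0030, True))
```

In hardware this would be a handful of address comparators feeding a down-counter that drives the halt line.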

Advanced memory unit/arbiter -- This unit should be able to use 10ns SRAM, make it synchronous, give it 2-3 "ports," and allow 16-bit SRAM transfers using the best available method (do the second access during the next cycle, do it during an unused video cycle or sync time, or halt the CPU).

"Microcode" store(s) -- This is to make it easier to do vCPU operations and allow the designer to arbitrarily assign them. It could have its own PC to not disturb the main native one, and jumps could be in relation to this PC. There could be an opcode to execute the vCPU opcode located at vPC with the microcode loading any operands.

In a way, this could count as microcode, since the control unit LUT I'd likely use in place of the CU would contain picocode. In technical parlance, microcode runs instructions, whereas picocode deals with the lowest level of controlling I/O and ALU functions to make instructions. I would use LUTs for both.

I don't know how to do this, but it would be nice to have an immediate version that uses the registers so native code can execute these. Maybe the CU can alias the registers as vPC[?] to be able to use a single microcode store for that and the main vCPU execution instruction.

One beauty of using a microcode store this way is that if I wanted to, I could access extra native instructions since BRAM is 9-bits wide. Things like a Return to main PC instruction or a Jump to different vCPU opcode instruction might help. Or, if there is a secondary unit, the extra bit could make things target the secondary instructions if a Secondary Execution Idea is used.

But I don't know how to handle syscalls. It makes little sense to jump from one microcode store to another when it can just go there, and that would take an extra cycle. Then again, I could make that a part of the instruction so if calling the syscall instruction, that code store can be called instead.

A question: how many native instructions would be good per vCPU/v6502 instruction? I mean, how much space should be reserved per vCPU/v6502 instruction (as a power of 2)? I ask since the best way to address it would be to use the low bits for the local address, the next bits up for the instruction number, and any bits above that to select the microcode store (such as 2 bits to have 2 CPUs and a system call store, with address space for one more).
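The address layout described above can be sketched as follows. This assumes an 8-bit opcode field and 2 store-select bits; slot_bits (log2 of the space reserved per instruction) is the open question, and all names here are hypothetical.

```python
OPCODE_BITS = 8  # assumption: one slot per possible opcode byte
STORE_BITS = 2   # assumption: up to 4 microcode stores


def microcode_address(store, opcode, step, slot_bits):
    """Concatenate {store, opcode, step} into one microcode address."""
    if not 0 <= store < (1 << STORE_BITS):
        raise ValueError("store out of range")
    if not 0 <= opcode < (1 << OPCODE_BITS):
        raise ValueError("opcode out of range")
    if not 0 <= step < (1 << slot_bits):
        raise ValueError("step does not fit in the reserved slot")
    return (store << (OPCODE_BITS + slot_bits)) | (opcode << slot_bits) | step


def store_size_words(slot_bits):
    """Words needed by one store for a given slot size."""
    return (1 << OPCODE_BITS) << slot_bits


# 16 native instructions reserved per vCPU opcode (slot_bits = 4):
print(hex(microcode_address(0, 0x1A, 3, 4)), store_size_words(4))
```

With slot_bits = 4 each store needs 4096 words, and every extra slot bit doubles that, so the slot size trades BRAM against the longest microcode routine.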

New native instructions -- These could include 16-bit memory instructions, additional registers, additional memory modes, hardware shifts, carry flag and instructions, and a single-cycle RNG instruction. It would be nice if the vCPU had actual registers. On the RNG, maybe it could work via register (AC?) and via RAM for compatibility. Having a dedicated hardware RNG would make up for having more time to run user code than during the porches.

What native instructions would you find most helpful in improving code density?

Secondary Execution Unit -- It would be nice to use the operand space for additional instructions when encountering instructions that use no operands. But what should those instructions be? What instruction pairs could be done at the same time that would speed up the vCPU? The secondary EU would need its own instruction set. I would use 0 for its NOP and it likely should not have Jump instructions. Instead of Jumps, it should have predicated instructions. It probably should have no port instructions except maybe for an optional separate port.
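As a rough illustration of the issue rule only (which primary opcodes take no operand is a design decision, so the set below is a made-up placeholder), the operand byte could be routed like this:

```python
SECONDARY_NOP = 0

# Hypothetical: primary opcodes that ignore their operand byte.
NO_OPERAND_OPCODES = frozenset({0x02})


def issue(opcode, operand, no_operand_opcodes=NO_OPERAND_OPCODES):
    """Return (primary_operand, secondary_instruction).

    If the primary instruction uses its operand, the secondary ALU is gated
    (None). Otherwise the operand byte is handed to the secondary unit,
    with 0 meaning NOP.
    """
    if opcode not in no_operand_opcodes:
        return operand, None
    if operand == SECONDARY_NOP:
        return None, None
    return None, operand


print(issue(0x02, 0x15), issue(0x00, 0x15))
```

The "gated" case corresponds to the decoder line that tells the secondary ALU to ignore what was decoded.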

In designing a secondary unit, I'd probably want to triple-port the registers to make it easier to allow both units to use the same registers. The Data registers likely should be triple-ported to help reduce the critical path. That way it can be available as an operand and an instruction at the same time and be decoded as an instruction regardless, with the ALU of the secondary unit being gated. Since I'd probably want to have a BRAM "ROM"-based decoder, it would likely save time to decode for it and have a line coming from the decoder to determine if the secondary ALU uses what is decoded or not.

A consideration here is to not remove all the unused constant functions in the main core. Using those instead of immediate constants will give more opportunities to use the slave core.

Possible ROM block copy opcode -- This could be used to increase data storage density. It could use Y:X as the starting destination address, and the operand could be the number of words to copy. I am not sure whether to add a compressed version as well, although that would be good for storing bitmaps in ROM. Then 680 bytes could be stored in 510 bytes in 255 addresses.
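The 680-to-510 figure is consistent with packing 6-bit pixels four to every three bytes (680 x 6 / 8 = 510), with two bytes per 16-bit ROM word. That reading is an assumption, and the bit order below is arbitrary; pack6 and unpack6 are illustrative names.

```python
def pack6(pixels):
    """Pack 6-bit values, four at a time, into three bytes each."""
    if len(pixels) % 4:
        raise ValueError("pixel count must be a multiple of 4")
    out = bytearray()
    for i in range(0, len(pixels), 4):
        group = pixels[i:i + 4]
        if any(not 0 <= p <= 0x3F for p in group):
            raise ValueError("pixels must fit in 6 bits")
        a, b, c, d = group
        word24 = a | (b << 6) | (c << 12) | (d << 18)
        out += word24.to_bytes(3, "little")
    return bytes(out)


def unpack6(data):
    """Inverse of pack6: three bytes back into four 6-bit values."""
    pixels = []
    for i in range(0, len(data), 3):
        word24 = int.from_bytes(data[i:i + 3], "little")
        pixels.extend((word24 >> shift) & 0x3F for shift in (0, 6, 12, 18))
    return pixels


pixels = [i % 64 for i in range(680)]
packed = pack6(pixels)
print(len(packed), len(packed) // 2)  # bytes, 16-bit ROM words
```

A block copy opcode with this kind of unpacking would write four destination bytes for every three bytes it reads from ROM.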
Last edited by Sugarplum on 03 Aug 2021, 08:18, edited 1 time in total.

Re: Suggestions/ideas for my possible Gigatron-similar machine

Post by Sugarplum »

Another idea comes to mind. What if there were a way to drive the ALU (and possibly the CU) at twice the main frequency? Then that could allow up to 4 native instructions (assuming multiple ALUs). That could allow hardware blitting and 16-bit ops (due to having 2 RAM cycles per ROM cycle). That would mean changing the ISA to have more complex instructions. In a sense, this could be time-based instruction compression. Of course, it would make the CU more complex, since it would need control signals for both "phases." Pipelining could work the same way: the IR/DR would be loaded during each cluster and run during the next, so it could do multiple micro-ops per PC update.

While FPGA seems to be the best way to play with the above, it could likely be done in a wired configuration and allow for both bit-banged video and computations together.