Interpreter speed versus video mode

Post by **marcelk** » 06 Jul 2019, 16:32

vCPU schedules its instructions in the dead time between horizontal sync pulses surrounding pixel-less scanlines. Depending on the scanline type, 100-148 cycles are available for this purpose (out of 200 cycles per scanline). The remaining cycles run the instructions that keep the video signal alive and the sound channels updated.

[BTW, yesterday Ben Eater published a nice video tutorial on VGA generation from TTL. It somewhat resembles one of our Gigatron pre-studies, except for the chip count.]

Our four video modes shift the balance between vCPU lines and pixel lines. Mode 0 has zero pixel-less scanlines for every 4 VGA lines and is the slowest, because the only vCPU time available is during the vertical blank interval. Mode 3 has three dark lines for every pixel line, and is therefore the fastest. There are diminishing returns: going completely black would win another 30% over mode 3, but we don't have a mode for that today.
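The 30% figure can be roughly reproduced by counting scanlines. The sketch below is a back-of-the-envelope model, not the ROM's actual scheduler. It assumes 521 scanlines per frame of which 480 are visible (numbers not stated in this post), treats every pixel-less line as equally useful to vCPU, and ignores the 100-148 cycle variation between scanline types. The function names are made up for illustration.

```python
# Rough scanline-count model of vCPU time per video mode.
# ASSUMPTIONS (not from the post): 521 total scanlines per frame, 480 visible.
TOTAL_LINES = 521
VISIBLE_LINES = 480
GROUP = 4  # from the post: modes are defined per group of 4 VGA lines


def vcpu_lines(dark_per_group):
    """Scanlines per frame on which vCPU can run.

    dark_per_group is the number of pixel-less lines in every group of 4:
    0..3 for modes 0..3, and 4 for the hypothetical all-black mode.
    """
    vertical_blank = TOTAL_LINES - VISIBLE_LINES
    return vertical_blank + (VISIBLE_LINES // GROUP) * dark_per_group


def relative_speed(dark_per_group, baseline=3):
    """Estimated speed relative to the baseline mode (mode 3 by default)."""
    return vcpu_lines(dark_per_group) / vcpu_lines(baseline)


if __name__ == "__main__":
    for dark in range(5):
        label = f"mode {dark}" if dark < 4 else "all black"
        print(f"{label:9s} {vcpu_lines(dark):3d} lines  "
              f"{relative_speed(dark):.2f}x mode 3")
```

Under these assumptions, all-black video gives 521 vCPU lines per frame against 401 in mode 3, a ratio of about 1.30, which is consistent with the estimate above.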

We can also ignore video completely. To test this I hacked in a mode where the vCPU time slices are back-to-back, without any sync pulse generation in between. For technical reasons we still need time slices, but I made them as long as possible: 268 cycles. It skips all video, sound, serial input, blinkenlight updates, etc. In other words, a total "zombie" mode, because there's no way to get out of it other than waiting until the program does it for you (and hoping that it doesn't crash first). In the meantime the computer looks like it was bricked.

Here are the results for Mandelbrot, with my measured results plotted as solid blue bars. The non-existent "mode 4" is estimated and plotted as a hatched bar labeled "All black video?".

[Attachment: Speedup.png]

This is still not the fastest that interpreted programs could possibly go, but it gets close. The vCPU instructions still perform time keeping that no longer serves a purpose. And there is a small inefficiency at the end of every time slice, where a full instruction doesn't fit in. The estimated combined effect of stripping those from the interpreter is plotted as "no time keeping".
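To illustrate the end-of-slice inefficiency, here is a minimal sketch of a time slice that runs instructions until the next one no longer fits. It is not how the interpreter is actually implemented. The function name is hypothetical, and the uniform 20-cycle instruction cost is an invented placeholder, not a measured vCPU timing. Only the 148- and 268-cycle slice lengths come from the post.

```python
def run_slice(budget, costs):
    """Run instructions in order until the next one would overrun the slice.

    budget: cycles available in this time slice
    costs:  per-instruction cycle costs (hypothetical values)
    Returns (instructions_executed, cycles_left_unused).
    """
    used = 0
    executed = 0
    for cost in costs:
        if used + cost > budget:
            break  # the instruction doesn't fit; the tail of the slice is wasted
        used += cost
        executed += 1
    return executed, budget - used


if __name__ == "__main__":
    program = [20] * 20  # placeholder: 20 instructions of 20 cycles each
    for budget in (148, 268):
        executed, wasted = run_slice(budget, program)
        print(f"{budget}-cycle slice: {executed} instructions, "
              f"{wasted} cycles wasted ({wasted / budget:.1%})")
```

With these placeholder costs both slices waste 8 cycles at the end, but that is a smaller share of the 268-cycle slice than of the 148-cycle one, which is one reason longer slices help.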

Native code will be much faster again, even when mostly manipulating 16-bit values. I don't dare to estimate, because it's not just the fetch and dispatch overhead that disappears: native code can optimise and cut many corners at the application level.