Implementing a faster multiplication routine

Post by **at67** » 04 Nov 2021, 11:00

Sugarplum wrote: ↑04 Nov 2021, 10:47 Speaking of the "weird instruction," [if (A != 0) PC = hi(PC)|A], that is $hEE. I found that on this site: https://www.iwriteiam.nl/PGigatron.html

That's just one of the BCC ac instructions, i.e. BNE ac, branches if ac !=0 in the current page to offset ac.

None of the ROM's, (including mine), use anything other than BRA ac.

Scroll to the bottom: https://hackaday.io/project/20781/logs? ... est&page=2

: 8607281496033313458.png (129.86 KiB) Viewed 2369 times

qwertyface · Post by **qwertyface** » 04 Nov 2021, 15:24

Sugarplum wrote: ↑04 Nov 2021, 10:47 Speaking of the "weird instruction," [if (A != 0) PC = hi(PC)|A], that is $hEE. I found that on this site: https://www.iwriteiam.nl/PGigatron.html

That's actually the site that I refer to all the time when writing native code, but for obvious reasons I tend to look at the smaller list of instructions that have actually proven useful in the ROM, rather than the big table at the top with every possibility. That's why I didn't see it, because it hadn't been used!

qwertyface · Post by **qwertyface** » 04 Nov 2021, 18:43

at67 wrote: ↑04 Nov 2021, 10:11 This sounds like Amdhal's Law paying a practical visit, optimising only one part of an algorithm can often have surprising results such as this. But when you sit down and think about what is going on, it starts to make sense.

I think it's very generous of you all to assume that my code didn't just have a catastrophic performance bug.

When you talk of Amdhal's Law, do you mean that a faster multiplication routine can only speed up the proportion of the program that is doing multiplication (and progressive improvements will make that proportion smaller, showing diminishing returns)?

This is something I knew intellectually, but as I said in reply to lb3361, at this stage of the story, I hadn't really measured or run the numbers.

My impression was that multiplication (rather than the CalcPixel loop, updating the clock, or clearing the screen) vastly dominates the runtime. I simply figured that a big improvement in the speed of multiplication should equate to a big difference in the total speed. For what it's worth, I think the first part is correct. I'll talk more about benchmarking and profiling in the next instalment, but at one point in Marcel's version I observed that CalcPixel took 90.24% of the total runtime. I'm pretty sure that must mostly be in MulShift7.

at67 wrote: ↑04 Nov 2021, 10:11 One of the things I learnt very early on in my career as a 3D rendering engineer was that batching of draw-calls based on common resources, (vertex buffers, textures, expensive 3d pipeline state switches, etc), was crucial in unburdening the CPU and keeping the GPU pegged as close as possible to 100% utilisation, (keeping the CPU as free as possible for everything else); and even though the Gigatron is a far cry from a modern CPU-GPU 3d pipeline, the basic concepts are still the same. You can think of vCPU as a manager and native code as the worker, it's just like a CPU-GPU symbiosis, i.e. if you try and do everything on the CPU, (aka in vCPU land). you will suffer from an unbalanced workload that could potentially be much better spread across both vCPU and native lands.
<snip>
One possible architecture, (which I have already designed and partially implemented in ROMvX0), is to vectorise/batch data and have self restarting native code SYS calls and instructions that process whole vectors, (arrays), of data completely in native land. So far I have implemented the basic arithmetic operators, (+, -, &, |, ^), and a generic boundary tester. What this means is that vCPU manages input and output arrays, whilst native code processes those arrays anywhere from 10x to 20x faster than vCPU could. An example of the speed possible is my bullet and sprites example that uses the vectored add and vectored boundary tester with vCPU callbacks to perform velocity processing, collision checks and reflections for 24 bullets and 8 sprites extremely efficiently at 30Hz in mode 2, (two scanline mode with 20 extra lines of vBlank): https://www.youtube.com/watch?v=mT-jkaNwsbM

Interesting reflection. I suppose the big difference is that you're not freeing up vCPU to do something else, rather just letting code run where it's fastest, more like using SIMD instructions in a CPU, than offloading things to the GPU. But I think I can see how you can change your perspective to see them as a two processors (and I wonder if a separate VM would make sense...), and either way the vectorised operations sound really interesting! The demo is really impressive too. I really cannot wait to see the code, maybe I should take the day off work when you publish it! I occasionally check your Github activity to see if it's online somewhere!

at67 wrote: ↑04 Nov 2021, 10:11 e.g. The number of vCPU instruction slots for ROMv5a and ROMvX0 available per 60Hz frame is approximately:
Code: Select all
            Scanlines:      4       3       2       1
vCPU slots:
  ROMv5a                   190     815     1425    2040     (*) MaxTicks=28
  ROMvX0                   175     755     1340    1920     (*) MaxTicks=30
This really hits home how little vCPU time per scanline and per 60Hz frame there is, for anything in any sort of loop, (i.e. blitting, memory copying, complex arithmetic, etc), you are usually going to be spending a large portion, (if not all), of a 60Hz frame just in that one loop.

Yes! Very eyeopening. I think I'd like to have a better understanding of how the time slices work, because I'm not completely clear on how common they all are.

Post by **at67** » 05 Nov 2021, 21:59

qwertyface wrote: ↑04 Nov 2021, 18:43 My impression was that multiplication (rather than the CalcPixel loop, updating the clock, or clearing the screen) vastly dominates the runtime. I simply figured that a big improvement in the speed of multiplication should equate to a big difference in the total speed. For what it's worth, I think the first part is correct. I'll talk more about benchmarking and profiling in the next instalment, but at one point in Marcel's version I observed that CalcPixel took 90.24% of the total runtime. I'm pretty sure that must mostly be in MulShift7.

This is a fantastic example to do some back of the envelope experiments with, lets start off with the best case scenario, (with slightly adjusted figures just to make it slightly simpler to deal with):
1) 90% CalcPixel, 10% other, CalcPixel is effectively an inline function whose body is purely the functionality you are replacing, i.e. CalcPixel resolves to just a multiply instruction and you effectively make it execute 10x faster.
Result: 10% + 90% x0.1 = 19%, (effective new total execution time), i.e. 10x partial becomes 5.263x total.

2) 90% CalcPixel, 10% other, CalcPixel is a function in which the multiply takes 80% of the execution time of CalcPixel, (a more realistic example).
Result: 90%x0.8 = 72%, 10% + (90%-72%) + 72%x0.1 = 35.2%, (effective new total execution time), i.e. 10x partial becomes 2.84x total.

3) 90% CalcPixel, 10% other, CalcPixel is a function in which the multiply takes 65% of the execution time of CalcPixel, (an even more realistic example).
Result: 90%x0.65 = 58.5%, 10% + (90%-58.5%) + 58.5%x0.1 = 47.35%, (effective new total execution time), i.e. 10x partial becomes 2.11x total.

You can see how quickly our expectations are shattered by the seemingly non linear change, (mostly explained by Amdahl's law), this used to catch me out constantly when I was optimising code back in the 80's and 90's, (specifically SW based 3D rendering engines). No matter how many times I thought my prediction of potential gains would be close, I would always be dismayed, (and sometimes horrified, based on time spent), to realise how far off my predictions had been. Once I started regularly measuring and applying a more rigorous approach to my expectations, (i.e. similar to the above approach), was the day my blood pressure halved.

The bottom line is, (or another way to think about it is): if you replace a single component of a complex whole with a more efficient version and obtain a total speedup, (of the complex whole), approaching 100%, you have basically performed a miracle! (or the original code was low hanging fruit

).

qwertyface wrote: ↑04 Nov 2021, 18:43 Interesting reflection. I suppose the big difference is that you're not freeing up vCPU to do something else, rather just letting code run where it's fastest, more like using SIMD instructions in a CPU, than offloading things to the GPU. But I think I can see how you can change your perspective to see them as a two processors (and I wonder if a separate VM would make sense...)

It makes perfect sense to treat them as separate processors to me, even though, (as you pointed out), they are merely two different modes of the same execution stream. It allows me to visualise the spread of workload in a typical application that is going to produce the most efficient results, (not necessarily efficient in most vCPU instruction slots used, but in total execution time, vCPU + native, spent per 60Hz frame).

Another way to visualise vCPU and native is as say an interpreted scripting language vs a compiled language as part of a greater whole, (e.g. say a games engine), typically you would have the scripting engine, (lets say Lua), be code that does initialisation, setup, coordination, (i.e. lightweight tasks), of your compiled language, (lets say C++), that usually does the heavy grunt work, (i.e. heavy data manipulation and processing output to a display).

The Lua code, (vCPU), manages and controls the C++, (native code), which produces an excellent balance/spread of work done across both execution models. In the gaming engine world this could potentially save oodles of time and also allow non skilled programmers to produce content at the programming level. In Gigatron land it allows us to potentially use available cycles much more efficiently, (in terms of total work done per time), by pushing the things that vCPU is bad at, to generic SYS calls and instructions in native land.

qwertyface wrote: ↑04 Nov 2021, 18:43 , and either way the vectorised operations sound really interesting! The demo is really impressive too. I really cannot wait to see the code, maybe I should take the day off work when you publish it! I occasionally check your Github activity to see if it's online somewhere!

Haaaa! It's getting close, (especially for the ROM).

qwertyface wrote: ↑04 Nov 2021, 18:43 Yes! Very eyeopening. I think I'd like to have a better understanding of how the time slices work, because I'm not completely clear on how common they all are.

I spent a lot of time gathering vCPU and native execution statistics across my own apps and everyone else's apps I had access to, using simple tools/functions I created as part of the emulator. That was when I realised just how bad vCPU was at looping and that it was typically, (much), better to have vCPU managing and native doing the looping.

qwertyface · Post by **qwertyface** » 06 Nov 2021, 18:04

Part 3: A benchmark

Thanks everyone for your interesting responses above. As I'm writing these retrospectively, some of the steps I'm going to describe taking in future will seem odd in light of the excellent points you've already made - and that's just because I didn't have the benefit of your wisdom at the time. I think you'll still find some of them quite interesting.

Actually though, I wasn't totally naïve about Ahmdal's law, or the point that Marcel's multiplication routine is computing fewer bits, and so might be faster than you'd think. I certainly did have an attitude of "why is it so slow?", but really the task I set myself was to see if it could be made any faster. It wasn't obvious (to me) that there wasn't any further low-hanging fruit, and I had two questions in particular:

Is my code uniformly faster across the board, or am I making big gains on some "expensive" pixels, and doing less well on other, "cheaper" pixels. After all, Marcel's code would be very fast when one value is zero! Does that happen? Does that happen often?
SYS_MultiplyBytes_120 is an unusually long SYS function, and one of the downsides of long SYS functions is that they might "stall" [not sure if we have a preferred term for this] if there isn't time for them to run. What happens when you call the same long SYS function four times with a few instructions between? Does stall add up to become a problem?

What do you think? Are the answers to these obvious?

I've never done a lot of optimisation, generally in my experience slow code has proven "fast enough", but I know enough to know that I ought to try to measure before making any changes. Unfortunately (as far as I know, please tell me if I'm wrong!) we don't yet have any tools for understanding the performance of vCPU code, so I would have to make whatever I needed, and I decided that what I needed to start with was a repeatable benchmark. There were two reasons for this:

As we've seen, further improvements in the performance of multiplication will lead to much smaller changes in the total runtime. Going forward, watching the onscreen clock isn't just going to be annoying, but also inaccurate unless the gains are absolutely huge.
Writing this would be a lot easier than writing the proper profiler I thought I'd need later, but would give me a lot of the building blocks I would need for that.

I didn't have great tools for running small parts of a vCPU program, especially not individual functions in a GCL program. So here's what I arrived at:

I'm assembling a ROM image in memory with my SYS function included, and
I'm including the v5 Reset, CardBoot and as the Main entry point I'm including the Mandelbrot program (the original or modified)
I'm compiling the Mandelbrot program a second time, to give myself programmatic access to the symbol table.
I'm running the ROM from a cold boot, right up to the entry point of the Mandelbrot program. This is unnecessary work, but ensures that vCPU is initialised properly, and represents less than a second of runtime.
Once there, I set a breakpoint in the CALL instruction. The first call is to the CalcSet routine, which is what I really want to measure. After allowing the instruction to complete, I capture vLR, and then step vCPU until I'm about to return to that address.

In other words, I count how many cycles it takes from first turning the (emulated) Gigatron on run up to after the first call to CalcSet. This includes quite a lot of stuff that isn't multiplication, such as clearing the screen at the end, updating the clock, and all of the bootup. I know there's benchmark settings in the code, but I wanted to closely replicate what would happen with the default settings (and I don't fully understand them).

Here's an example of the output

Code: Select all

    Running Original
    ================
    Boot reached SYS_RESET_88 in 1006781 cycles (0.16108496s)
    Boot reached SYS_Exec_88 (Reset) in 1006825 cycles (0.161092s)
    Boot reached SYS_Exec_88 (Main) in 2037968 cycles (0.32607488s)
    Program was ready to begin exectution after 3961067 cycles (0.63377072s)
    Program initialisation was complete after 3965895 cycles (0.6345432s)
    CalcSet was complete after 7439242573 cycles (1190.27881168s)

    Running Modified
    ================
    Boot reached SYS_RESET_88 in 1006781 cycles (0.16108496s)
    Boot reached SYS_Exec_88 (Reset) in 1006825 cycles (0.161092s)
    Boot reached SYS_Exec_88 (Main) in 2037968 cycles (0.32607488s)
    Program was ready to begin exectution after 3750173 cycles (0.6000276800000001s)
    Program initialisation was complete after 3755015 cycles (0.6008024s)
    CalcSet was complete after 3261656173 cycles (521.86498768s)

So there's my first data point - On this benchmark the runtime of my modified version is 43.84% of the original (or, I guess, 228% as fast, or faster by 128%, depending on your preferred terminology).

A note on implementation
I wasn't going to talk about it, but since you raised it...

at67 wrote: ↑05 Nov 2021, 21:59 Another way to visualise vCPU and native is as say an interpreted scripting language vs a compiled language as part of a greater whole, (e.g. say a games engine), typically you would have the scripting engine, (lets say Lua), be code that does initialisation, setup, coordination, (i.e. lightweight tasks), of your compiled language, (lets say C++), that usually does the heavy grunt work, (i.e. heavy data manipulation and processing output to a display).

The Lua code, (vCPU), manages and controls the C++, (native code), which produces an excellent balance/spread of work done across both execution models. [...]

Now this is an analogy I can appreciate. In fact this is an analogy I can feel. This (and all of the subsequent analysis) are done using my py-gtemu library, which as I've said before is basically a Python wrapper around the cpuCycle function from gtemu.c, with functions that loop to run a certain number of instructions or run up to a given address or label, including single-stepping vCPU, and a primitive implementation of breakpoints.

I wrote it to help me to unit test GIgatron code, where I'm running at most a few hundred instructions at a time, and for this it works great. But here, where I'm running billions, let me tell you it is slow. The first run of that benchmark, representing half an hour of execution? It took an hour and 38 minutes. Admittedly my machine is a bit old and was probably heavily loaded at the time, but unlike the Gigatron it does use components from this century. It discourages doing sensible measurements before making changes.

I made some changes later that I think improve things somewhat. I'm also using PyPy, which is hopefully JIT compiling my inner loops to remove some of the overhead of being Python (I never had the patience to see how long it takes in cpython). And there's more I can do (first of all switch to David Kolf's libgtemu, which I'd have used from the start, but I'd somehow missed out on it's existence). But all of this is basically making slow faster - the issue is exactly what you're describing: I have the "fast world" of C, and the "slow world" of Python, and the balance is wrong. Too much of the inner loops are in Python, but more importantly, those routines have the wrong structure; they step then check, step then check. It's too much "thinking", when they should be all breakneck "doing". I have a plan to fix it, although it's quite a lot of work, and I'm not certain it will succeed. Exactly as you're describing it involves having the slow outer layer manage the work of the fast inner layer. You could potentially even pass work through a queue and run them them in parallel, like two processors (although I'm not sure that would help in practice).

lb3361 · Post by **lb3361** » 08 Nov 2021, 00:46

Turns out I had a version of GLCC's gtsim with profiling support. I added a HALT after the first screen of Mandelbrot_v1. At that point, the total number of cycles spent executing vCPU instructions in this file is 1451210751. The total number of cycles spent in MulShift7 is 1300670415, that is 89% of the total. Far more than I expected.

In the following file, the number following the # on disassembled instructions is the number of cycles spent executing vCPU instructions whose address is less or equal to that of the current instruction. Therefore, to know how many cycles are spent in a routine, one has just to compute the difference between the # number of the last instruction of the routine and the # number of the executed instruction that immediately precedes the first instruction of the routine. No # number means that the instruction did not execute.

Mandelbrot_v1.txt: (42.72 KiB) Downloaded 130 times

Update: This means of course that one should really optimize this function, even in vCPU form. For instance, the following GCL/GT1 files use an unrolled vCPU multiplication that is almost twice as fast as the original one (654208990 cycles instead of 1300670415), making the whole code about 40% faster.

Mandelbrot_DEVROM.gt1: (1.19 KiB) Downloaded 125 times

Mandelbrot_DEVROM.gcl: (8.65 KiB) Downloaded 136 times

Your code might suffer from alignment issues with the long SYS call. Assume the first 120 cycles SYS call starts at the beginning of a usable scanline. This leaves just enough time for one vCPU instruction on that scanline. Chances are that you need more than two additional instructions to setup the next SYS call. But this also means that the SYS call cannot run in the next scanline. So, for the unsigned multiplication part, you need at least 7 scanlines (four for the sys calls, plus three between the sys calls that may be underused.)

Post by **at67** » 08 Nov 2021, 15:42

qwertyface wrote: ↑06 Nov 2021, 18:04 Is my code uniformly faster across the board, or am I making big gains on some "expensive" pixels, and doing less well on other, "cheaper" pixels. After all, Marcel's code would be very fast when one value is zero! Does that happen? Does that happen often?

Marcel's code will only be fast some of the time when one of the inputs is zero, (it depends on which one), Marcel's algorithm, (and any of this type that loop until no bits are set), can usually be improved by doing a highest set bit detection on both inputs and swapping inputs to make sure that input with lowest high bit set is used as the no bits set check.

Whether this kind of change would have a significant improvement in the current context is open to speculation: You potentially save a lot of "loop vCPU cycles" by adding a fixed startup vCPU cycle cost, the only way to know is to measure

qwertyface wrote: ↑06 Nov 2021, 18:04 [*]SYS_MultiplyBytes_120 is an unusually long SYS function, and one of the downsides of long SYS functions is that they might "stall" [not sure if we have a preferred term for this] if there isn't time for them to run. What happens when you call the same long SYS function four times with a few instructions between? Does stall add up to become a problem?

lb3361 wrote: ↑08 Nov 2021, 00:46 Your code might suffer from alignment issues with the long SYS call. Assume the first 120 cycles SYS call starts at the beginning of a usable scanline. This leaves just enough time for one vCPU instruction on that scanline. Chances are that you need more than two additional instructions to setup the next SYS call. But this also means that the SYS call cannot run in the next scanline. So, for the unsigned multiplication part, you need at least 7 scanlines (four for the sys calls, plus three between the sys calls that may be underused.)

As lb3361 rightly pointed out, it's the intermingling of vCPU with your SYS calls that is probably holding your code back by close to another factor of 2x, this is why self-restarting SYS calls are so powerful, especially ones that land on or below the efficiency buckets of 28, 36, 48 and 74 cycles. e.g. I have a self-restarting 16bit divide routine that is 82 cycles, if I can get that to 74 or lower cycles, then it will execute approximately twice as fast, (as long as it's run in a screen mode with multiple free 148 cycle scan-lines and those 148 cycle free scanlines are the dominant scan-lines).

qwertyface · Post by **qwertyface** » 08 Nov 2021, 16:44

Part 4: An invisible bug
For practical reasons it seemed like investigating whether I was uniformly faster was the first thing to investigate. It seemed easier than trying to measure the impact of stall, and depending on what I found it could have a big impact on what I did next.

My plan was this - record the time taken for each call to CalcPixel, much as I was recording the time for CalcSet in the benchmark, both for the modified and unmodified code, but also note which pixel is being drawn. Then plot a heatmap of the time for each pixel. If I see a Mandelbrot Set, then the runtime was directly proportional to the number of iterations, but if I see lots of "noise", e.g. darker and lighter points, then there's something else going on. I could subtract the two datasets to find out which pixels I'm slower on if any.

Here's the output from Marcel's code.

: marcel1.png (36.96 KiB) Viewed 2240 times

There is some variation, but I don't think it's very significant. Here it is with the colours on a logarithmic scale, to make it more obvious.

: marcel-log.png (70.88 KiB) Viewed 2240 times

And here is my modified versi... OH FOR HEAVENS SAKE.

: mine.png (47.05 KiB) Viewed 2240 times

lb3361 · Post by **lb3361** » 08 Nov 2021, 17:09

The invisible bug would make a difference. There is code in CalcSet that bails out when one is inside the first lobe (the buggy one) or the cardioid. If that code fails, you're good for 64 iterations of the Mandelbrot function on each pixel...

Meanwhile I looked into the profiling results for the unrolled vCPU multiplication and found that only 10% of the multiplications involve numbers with nonzero high bytes. This means that the 90% remaining multiplications are just 8bits x 8bits unsigned, well within reach of your fast multiplication trick. Even a vCPU implementation can be very effective (four to five times faster than the original.)

qwertyface · Post by **qwertyface** » 10 Nov 2021, 13:19

lb3361 wrote: ↑08 Nov 2021, 17:09 The invisible bug would make a difference. There is code in CalcSet that bails out when one is inside the first lobe (the buggy one) or the cardioid. If that code fails, you're good for 64 iterations of the Mandelbrot function on each pixel...

It certainly did! Fixing the constant (I'd just forgotten to update one value when I moved from a MulShift7 to a MulShift8 routine) took me from 8:42 to 7:13, a difference of 89 seconds - I was now at 36.4% of the unmodified runtime - within touching distance of a factor of three improvement!

It wasn't what I was expecting to see when I did this test, but I guess that's the point of visualising things, if you only saw the expected there would be no point. I certainly can't see another way I'd have spotted the bug.

For completeness here's the fixed code, with the log normalised colour scale

: fixed-log.png (79.68 KiB) Viewed 2216 times

Interestingly I think it's clearer that there's variability in this one.

Here's a picture of time saved per pixel (log normalised)

: time-saved-log.png (82.35 KiB) Viewed 2216 times

And here's the pixels where I am slower

: slower-pixels.png (6.27 KiB) Viewed 2216 times

I assume this is simply that because I'm more accurate (8 bits of fraction), the bulb optimisation is kicking in later on my version.

lb3361 wrote: ↑08 Nov 2021, 17:09 Meanwhile I looked into the profiling results for the unrolled vCPU multiplication and found that only 10% of the multiplications involve numbers with nonzero high bytes. This means that the 90% remaining multiplications are just 8bits x 8bits unsigned, well within reach of your fast multiplication trick. Even a vCPU implementation can be very effective (four to five times faster than the original.)

Exciting times ahead!

Gigatron Hackers

Implementing a faster multiplication routine

Re: Implementing a faster multiplication routine

Re: Implementing a faster multiplication routine

Re: Implementing a faster multiplication routine

Re: Implementing a faster multiplication routine

Re: Implementing a faster multiplication routine

Re: Implementing a faster multiplication routine

Re: Implementing a faster multiplication routine

Re: Implementing a faster multiplication routine

Re: Implementing a faster multiplication routine

Re: Implementing a faster multiplication routine