## Implementing a faster multiplication routine

qwertyface
Posts: 41
Joined: 16 Jul 2019, 09:19

### Implementing a faster multiplication routine

Part 0: It was known to the ancient Babylonians

Hi All,

For some reason I found myself reading the Wikipedia article Multiplication Algorithm earlier in the year, and came across Quarter Square Multiplication. The idea is that if you have a table of floored quarter-squares, ⌊n²/4⌋ (0, 0, 1, 2, 4, 6, 9 and so on), you can multiply any two numbers with a few additions, subtractions and lookups. Of particular significance, it takes a constant number of these operations: there's no looping required, and there's no shifting involved either. There are a number of 6502 implementations, and it's described as the fastest way to implement multiplication on that processor, if you can bear the cost of the large lookup tables.
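The identity behind the trick is ab = ⌊(a+b)²/4⌋ − ⌊(a−b)²/4⌋; the floors do no harm because (a+b) and (a−b) have the same parity, so both squares leave the same remainder mod 4. Here's a minimal sketch of the table-driven version in plain C, just to show the algorithm (nothing Gigatron-specific):

```c
#include <assert.h>

/* q[n] = floor(n*n/4) for n = 0..510: enough to look up a+b for any two bytes */
static unsigned q[511];

static void init_quarter_squares(void)
{
    for (unsigned n = 0; n < 511; n++)
        q[n] = n * n / 4;
}

/* a*b via two lookups, one addition and two subtractions: no loops, no shifts */
static unsigned qmul(unsigned a, unsigned b)
{
    if (a < b) { unsigned t = a; a = b; b = t; }  /* need a >= b so a-b is valid */
    return q[a + b] - q[a - b];
}
```

An exhaustive loop over all byte pairs confirms qmul(a, b) == a * b for every 0 ≤ a, b ≤ 255.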

I thought that this sounded like a perfect fit for a Gigatron SYS function, where constant time algorithms are very desirable, and cheating with lookup tables is all part of the game. I thought it would be pretty easy to implement, and lightning fast to run.

It actually turned out to be a lot of fun, but there's quite a lot to describe. So I'm going to periodically add posts to this topic showing what I have done. You'll have to read along to see if I ended up with something fast or not.
qwertyface
Posts: 41
Joined: 16 Jul 2019, 09:19

### Re: Implementing a faster multiplication routine

Part 1: bne ac

When I looked at 6502 implementations of the quarter-square algorithm, what I found was that they tended to use four pages for their lookup tables (1 KiB). This allows them to multiply any two bytes: two pages hold the high and low bytes of the quarter-squares for 0-255, and two hold the high and low bytes of the quarter-squares for 256-511. You need to be able to go up to 511 if you want to multiply two bytes, because the first step is to look up the value for the sum of the two operands, which is a 9-bit number.

Almost immediately I realised that there's a problem with implementing a quarter-square lookup table in native Gigatron code. It's not easy to have the table for values 0-255, and I decided that I couldn't do 256-511 at all¹!

Lots of native code uses lookup tables, the right-shift table being a good example. Usually the approach is to have each entry be a ld instruction with an immediate operand. Lookups then compute the desired index in the accumulator, and use the double-jump trick (a jump/branch instruction with another jump/branch in the branch delay slot) to execute one ld instruction in the table. In the case of the right-shift table, where there are 255 entries, lookups use jmp y,ac; bra $ff, with y holding the page of the table. The first jump causes the shifted value to be loaded into the accumulator, and the second causes us to then go to the end of the page (from where you roll into the next, and carry on with whatever you were doing). But significantly, we always end up running two instructions in the page of the lookup table. We can use the jmp instruction with the y register to jump in from a different page, or to jump out to a different page, but we can't use it twice and execute only one instruction in the page, because that would require y to change between the two jumps, and they must be immediately after each other for the trick to work.

The problem is that there are 256 different values that I might need to lookup. On the face of it, there's no way that the usual approach can be used since every instruction in the page is potentially used, and so there's no landing place for the second jump to go to.

What I realised was that I could exploit the nature of the data, and do a different variation of the double-jump trick. The first two values in the low-byte table are zero. If the jump into the table is the last instruction on the previous page, you can use one of the conditional branch instructions (bne, branch if the accumulator is not zero) to get in (branches are special on the last instruction of a page: they go to the next page, as the high byte of the program counter has already incremented when the branch happens). Then I can use jmp as the first instruction of the following page. If the first branch isn't taken, we end up executing the second instruction in the page, ld $0, in the branch delay slot of the second jump. So if the accumulator is zero, we load zero; if the accumulator is 1, we load zero; and if it's any other value, we load the right thing too.

So I wrote that. But then I thought about it, and realised "branch to the value in the accumulator, if the accumulator isn't zero" is a pretty weird instruction, does it definitely exist? And I checked the list of instructions I normally refer to, and it's not listed, and I checked the ROM listings, and it's not there, but my code had assembled, and it does make sense in terms of the instruction encoding, and gtemu.c seems happy to execute it. So I guess it's good, right? It still isn't tested on real hardware, but I expect it to work.

I suspect I am the first person ever in the world to have written a program that makes use of that instruction. How often can someone say that?

For the high-byte table I used basically the same trick, only this time the first 32 values are zero, so what I did was shift the table down by 32, subtract 32 from the value, and use the bge ac instruction to jump when the value is positive. Again, I can't see any sign that this instruction has been used before. By shifting the table down, I was able to store the code to enter both tables above the high-byte table, and have the two back to back.
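The reason the first 32 high bytes are zero: ⌊n²/4⌋ only reaches 256 at n = 32 (32²/4 = 256), so every earlier entry has a zero high byte. A quick C check of the threshold:

```c
#include <assert.h>

/* high byte of the quarter-square table entry for n */
static unsigned qs_high(unsigned n)
{
    return (n * n / 4) >> 8;
}

/* entries 0..31 have a zero high byte; entry 32 is the first nonzero one */
```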

With this I was able to write a routine that multiplies 7-bit numbers together in 64 cycles. It wasn't a SYS function; it was developed in its own .asm.py file, but it proved the concept, and 64 cycles feels pretty fast! Of course, multiplying 7-bit numbers and getting a 14-bit result isn't an amazingly useful routine. Multiplying two bytes and getting a 16-bit result would be much better.

¹. Writing this just now, I've realised that it's actually perfectly easy - the maximum sum of two bytes is 510, so the final instruction of each page is free as it is in the right-shift table. Perhaps I should have written this down earlier! As it is, I'm not sure that this would have been useful.
at67
Posts: 433
Joined: 14 May 2018, 08:29

### Re: Implementing a faster multiplication routine

qwertyface
Posts: 41
Joined: 16 Jul 2019, 09:19

### Re: Implementing a faster multiplication routine

Thanks, it gets more interesting.
Sugarplum
Posts: 44
Joined: 30 Sep 2020, 22:19

### Re: Implementing a faster multiplication routine

This sounds fascinating and impressive considering the hardware.

If/when I do the FPGA build, I think I'd want to try a hardware multiplier. I'd probably opt for four nibble LUTs, with the whole process taking no more than 3 cycles. Using this approach, two 8x4 LUTs would be better, but I'd likely go with four 4x4 LUTs instead: 1K is less space than 8K, at the cost of a cycle. But all four lookups will be in parallel. If memory were not an issue, then one huge lookup would be best, or even two 8x4 lookups with one 12-bit addition.

My method would be similar to the algebraic FOIL method, or, Q = ((A1 * B1) + (A2 * B2)) + ((A1 * B2) + (A2 * B1)). I'd use some refinements such as not actually adding at all for the first half. The highest and lowest intermediate results can share the same intermediate bus without collision and without actually having to add those. That uses a hardwired shift rather than adding. The 2 middle intermediate results would need to be added together into at least a 9-bit register. Then to tie the 2 intermediate results together into a 16-bit result would take a 12-bit adder. The lowest nibble of the result is already correct from the lookups.
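If I've followed the scheme, the combining step looks something like the C sketch below (the LUT layout and shift amounts are my reconstruction, not Sugarplum's actual design). The key observation survives intact: the highest and lowest partial products occupy disjoint byte lanes, so only the two middle products need a real adder.

```c
#include <stdint.h>

/* 4x4-bit product LUT: 256 entries of 8 bits each; four of these is the
   "four 4x4" option (1K total), and in hardware all four lookups run in parallel */
static uint8_t lut[16][16];

static void init_lut(void)
{
    for (int i = 0; i < 16; i++)
        for (int j = 0; j < 16; j++)
            lut[i][j] = (uint8_t)(i * j);
}

/* 8x8 -> 16 multiply from four nibble products (FOIL) */
static uint16_t mul8_foil(uint8_t a, uint8_t b)
{
    uint8_t a1 = a >> 4, a2 = a & 15;   /* high and low nibbles of a */
    uint8_t b1 = b >> 4, b2 = b & 15;   /* high and low nibbles of b */

    uint16_t hi  = lut[a1][b1];                 /* weight 256: bits 8..15 */
    uint16_t lo  = lut[a2][b2];                 /* weight 1:   bits 0..7  */
    uint16_t mid = lut[a1][b2] + lut[a2][b1];   /* weight 16: needs 9 bits */

    /* hi and lo never overlap, so combining them is just wiring (a hardwired
       shift); only mid needs an adder before the final 12-bit combine */
    return (uint16_t)((hi << 8) + (mid << 4) + lo);
}
```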
qwertyface
Posts: 41
Joined: 16 Jul 2019, 09:19

### Re: Implementing a faster multiplication routine

Part 2: 7-bit to 8-bit and a Slow Fast Mandelbrot

I had written a multiplication routine that could multiply two 7-bit numbers, giving a 14-bit result, but it seemed to me like really what I should expose to vCPU is a routine that multiplies two bytes, and returns a 16-bit result. This would be a much more useful routine to build something on top of.

I implemented this in what seemed like the obvious way. If we call my two operands a and b, then we can split them into the high bit (a₇ and b₇) and the low seven bits (I'll call them A and B, but really A = a₆2⁶ + a₅2⁵ + a₄2⁴ + a₃2³ + a₂2² + a₁2¹ + a₀2⁰ and B = b₆2⁶ + b₅2⁵ + b₄2⁴ + b₃2³ + b₂2² + b₁2¹ + b₀2⁰). By multiplying out (a₇2⁷ + A)(b₇2⁷ + B) I get:
• 2¹⁴ + 2⁷(B + A) + AB if a₇ is 1 and b₇ is 1
• 2⁷B + AB if a₇ is 1 and b₇ is 0
• 2⁷A + AB if a₇ is 0 and b₇ is 1
• AB if a₇ is 0 and b₇ is 0
In other words, it's not too bad. I just need to test the high bits and depending on what is set, I need to do an addition or two, and a single right shift.
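As a sanity check of the case analysis, the same decomposition can be written in C; here mul7 stands in for the 7-bit quarter-square routine (in the real thing it's the table-lookup code, and the 2⁷ multiples come from a shift):

```c
#include <stdint.h>

/* stand-in for the 7-bit x 7-bit -> 14-bit quarter-square routine */
static uint16_t mul7(uint8_t A, uint8_t B)
{
    return (uint16_t)A * B;
}

/* full 8x8 -> 16 multiply built on the 7-bit core, following the four
   cases from multiplying out (a7*2^7 + A)(b7*2^7 + B) */
static uint16_t mul8(uint8_t a, uint8_t b)
{
    uint8_t A = a & 0x7f, B = b & 0x7f;
    uint16_t r = mul7(A, B);                       /* AB, in every case     */
    if (a & 0x80) r += (uint16_t)B << 7;           /* + 2^7 B when a7 is 1  */
    if (b & 0x80) r += (uint16_t)A << 7;           /* + 2^7 A when b7 is 1  */
    if ((a & 0x80) && (b & 0x80)) r += 1u << 14;   /* + 2^14 when both are  */
    return r;
}
```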

While it was conceptually a simple matter, it did end up being quite a bit of code. Once I was through with this, I could multiply any two bytes, but the cost had gone from 64 cycles to 102 cycles in the worst case. Is 102 cycles fast? I guess, probably? So I made a copy of the ROM and rewrote it as a SYS function (where I shaved a few cycles, but SYS functions have their own overhead). The final result was that it was up to 120 cycles for the case where either or both bytes had the high bit set, and 96 cycles if they don't. And it's the major part of three pages of code with those big lookup tables. This is without using one of the page zero slots, which would make the routine slower still.

So at this point I had a modified ROM.asm.py file, with a new SYS function, SYS_MultiplyBytes_120. The next thing to do was to use it.

I wanted to see how much faster the Mandelbrot program was using my new routine. I don't really know how fast I was expecting it to be, but definitely lots faster; some big multiple, maybe 5 or 10 times faster or something? Marcel's version uses a shift-and-add approach to multiplication in vCPU code, and my assumption was that this must be slow. I replaced his MulShift7 routine (which multiplies two signed 16-bit numbers and right-shifts by seven places; the program is based on fixed-point arithmetic) with a MulShift8 routine which splits the two values into four bytes, finds the four partial products, and adds them together (effectively I'm calculating the 32-bit product of two 16-bit numbers, and throwing the high and low bytes away).
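The four-partial-product scheme amounts to the following C sketch (unsigned only; the sign handling the real routine needs is omitted, and mul_bytes stands in for the SYS_MultiplyBytes_120 call):

```c
#include <stdint.h>

/* stand-in for SYS_MultiplyBytes_120: 8x8 -> 16 unsigned multiply */
static uint16_t mul_bytes(uint8_t a, uint8_t b)
{
    return (uint16_t)a * b;
}

/* middle 16 bits of the 32-bit product of x and y, i.e. (x*y) >> 8 with
   the top byte dropped: four byte products, shifted and summed */
static uint16_t mulshift8(uint16_t x, uint16_t y)
{
    uint8_t xl = x & 0xff, xh = x >> 8;
    uint8_t yl = y & 0xff, yh = y >> 8;

    uint32_t p = (uint32_t)mul_bytes(xl, yl)
               + (((uint32_t)mul_bytes(xh, yl) + mul_bytes(xl, yh)) << 8)
               + ((uint32_t)mul_bytes(xh, yh) << 16);

    return (uint16_t)(p >> 8);   /* keep bytes 1 and 2 of the 32-bit result */
}
```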

In the default video mode (one blank line in four), the unmodified code draws the whole Mandelbrot set in about 17 minutes (that's what the clock on the screen shows). With my modifications it completed in under 8. So faster by about a factor of two.

So it is faster, but it hardly justifies the amount of ROM space involved. What is going on? If I've implemented a fast algorithm for multiplication in native code, and Marcel implemented a slow algorithm in interpreted code, surely the difference should be huge! Two is not a huge number!
lb3361
Posts: 123
Joined: 17 Feb 2021, 23:07

### Re: Implementing a faster multiplication routine

Marcel's code does not compute all the bits of A * B. It seems to assume that the absolute value of A has at most 9 bits (because the initial value of variable 'bit' is 0x200) and that the absolute value of B has at most 13 bits. Then it only computes the 15 most significant bits, that is (A * B) >> 7, ignoring carries that could propagate from the seven low-order bits it discards.

In fact this could be done a little bit more efficiently with something like this.

Code: Select all

``````/* In C first because I cannot think in GCL */
unsigned int mult7(unsigned int a, unsigned int b)
{
    register unsigned int bit = 1;
    register unsigned int c = 0;
    b <<= 3;
    for (;;) {
        if (a & bit)
            c += b;
        bit <<= 1;
        if (bit & 0x800)
            break;
        c >>= 1;
    }
    return c;
}
``````
or

Code: Select all

``````  {Multiply}
\SYS_LSRW1_48 \sysFn:
0 C=
B 3<< B=
1 bit=
[ do
A bit& [if<>0 C B+ C=]
bit 1<< bit=
>bit, $8&
if=0 C 48!! C=
loop]
``````
This seems to run a tad faster, but not much (maybe 15% faster). The code below says DEVROM but works on v5a. To avoid overflows in the "wide" screen, it was necessary to make a version that handles 10 bits for A and 13 for B instead of just 9 and 13. Marcel's code seems to avoid the problem because he compares bit and A with a subtraction instead of a binary and. But that cannot be right.

Thinking more about it, your multiplication is 120 cycles long, therefore it runs once per usable scanline. The loop above runs 9 times and needs at least one scanline per iteration. Even if it were written in native code, one would not do more than two iterations per scanline. So your multiplication is really faster, probably ten times faster. The problem is the rest of the Mandelbrot code, which seems to be as expensive as the multiplications themselves. Therefore a factor of two is probably as good as it gets!
Attachments
Mandelbrot_DEVROM.gt1
Mandelbrot_DEVROM.gcl
qwertyface
Posts: 41
Joined: 16 Jul 2019, 09:19

### Re: Implementing a faster multiplication routine

lb3361 wrote: 03 Nov 2021, 22:52 Thinking more about it, your multiplication is 120 cycles long, therefore it runs once per usable scanline. The loop above runs 9 times and needs at least one scanline per iteration. Even if it were written in native code, one would not do more than two iterations per scanline. So your multiplication is really faster, probably ten times faster.
To my shame, I hadn't at this point in the story done any of this sort of analysis (I do a bit of it later, but after a lot of misdirected effort). I think you're right that the key thing is thinking about performance in "per scanline" terms. Don't forget though that I'm making four calls to MultiplyBytes, for each multiplication.
lb3361 wrote: 03 Nov 2021, 22:52 Therefore a factor of two is probably as good as it gets!
It turns out that better than a factor of two is very easily achievable (although I still haven't got a result as dramatic as I'd like).
at67
Posts: 433
Joined: 14 May 2018, 08:29

### Re: Implementing a faster multiplication routine

This sounds like Amdahl's Law paying a practical visit: optimising only one part of an algorithm can often have surprising results such as this. But when you sit down and think about what is going on, it starts to make sense.

One of the things I learnt very early on in my career as a 3D rendering engineer was that batching of draw-calls based on common resources, (vertex buffers, textures, expensive 3D pipeline state switches, etc), was crucial in unburdening the CPU and keeping the GPU pegged as close as possible to 100% utilisation, keeping the CPU as free as possible for everything else. Even though the Gigatron is a far cry from a modern CPU-GPU 3D pipeline, the basic concepts are still the same. You can think of vCPU as a manager and native code as the worker; it's just like a CPU-GPU symbiosis, i.e. if you try to do everything on the CPU, (aka in vCPU land), you will suffer from an unbalanced workload that could potentially be much better spread across both vCPU and native lands.

e.g. The number of vCPU instruction slots for ROMv5a and ROMvX0 available per 60Hz frame is approximately:

Code: Select all

``````            Scanlines:      4       3       2       1
vCPU slots:
ROMv5a                   190     815     1425    2040     (*) MaxTicks=28
ROMvX0                   175     755     1340    1920     (*) MaxTicks=30
``````
This really hits home how little vCPU time per scanline and per 60Hz frame there is. For anything in any sort of loop, (i.e. blitting, memory copying, complex arithmetic, etc), you are usually going to be spending a large portion, (if not all), of a 60Hz frame just in that one loop. The solution to this is to follow exactly what a CPU-GPU pipeline does: have vCPU manage high throughput, (burst, tight loop), tasks performed in native code. Obviously this requires ROM modifications, but it also requires an architecture that allows for flexibility and generality, rather than a one-to-one ROM modification for each new algorithm converted to native code and burnt to ROM.

One possible architecture, (which I have already designed and partially implemented in ROMvX0), is to vectorise/batch data and have self restarting native code SYS calls and instructions that process whole vectors, (arrays), of data completely in native land. So far I have implemented the basic arithmetic operators, (+, -, &, |, ^), and a generic boundary tester. What this means is that vCPU manages input and output arrays, whilst native code processes those arrays anywhere from 10x to 20x faster than vCPU could. An example of the speed possible is my bullet and sprites example that uses the vectored add and vectored boundary tester with vCPU callbacks to perform velocity processing, collision checks and reflections for 24 bullets and 8 sprites extremely efficiently at 30Hz in mode 2, (two scanline mode with 20 extra lines of vBlank): https://www.youtube.com/watch?v=mT-jkaNwsbM

My own 16bit signed multiply native routine, (which is fully compatible with Marcel's vCPU implementation), requires 4 instructions of setup, (including the SYS call), and then 66 cycles per bit; so it can finish a full multiplication best case in nine scanlines, (including the setup). This is about half the speed of yours, which is on the surface extremely surprising, (but not when you analyse carefully as noted above). Where your algorithm would be extremely beneficial is in a vectored/batched approach, (also described above), for say a 3d engine that only required 7 or 8bit multiplies, in this example with fully vectored data, I would expect your algorithm to be at least 10x faster than mine.
Sugarplum
Posts: 44
Joined: 30 Sep 2020, 22:19

### Re: Implementing a faster multiplication routine

Speaking of the "weird instruction," [if (A != 0) PC = hi(PC)|A], that is $hEE. I found that on this site: https://www.iwriteiam.nl/PGigatron.html