Implementing a faster multiplication routine

lb3361 · Post by **lb3361** » 10 Nov 2021, 20:04

qwertyface wrote: ↑10 Nov 2021, 13:19 And here's the pixels where I am slower.
I assume this is simply that because I'm more accurate (8 bits of fraction), the bulb optimisation is kicking in later on my version.

If that were the case, we would only see the slower points on one side of the bulb, not both sides. My bets are on the higher accuracy. Although Marcel's version works with 3+7 fixed point numbers, there seems to be cases when an intermediate calculation needs bigger numbers. It works because the code is biased to still give a large result and cause the escape time loop to bail out at the next step. Your code has higher accuracy and might not benefit from this (dubious) optimization.

qwertyface wrote: ↑10 Nov 2021, 13:19
lb3361 wrote: ↑08 Nov 2021, 17:09 Meanwhile I looked into the profiling results for the unrolled vCPU multiplication and found that only 10% of the multiplications involve numbers with nonzero high bytes. This means that the 90% remaining multiplications are just 8bits x 8bits unsigned, well within reach of your fast multiplication trick. Even a vCPU implementation can be very effective (four to five times faster than the original.)
Exciting times ahead!

I now have a pure vCPU code that uses a couple tricks that you might find useful because they could work in native code as well.

When both numbers are 2+7 fixed point, that is -4<x<4, the code uses a half variant of your quarter square trick: ab = (a^2 + b^2 + |a-b|^2)/2. Since a and b are in range [0,511], the absolute value of |a-b| is also in range [0,511], unlike a+b which could be anywhere in [-511,+511]. So with this variant, I only need 1KB of table (512 times two bytes).
When either number is larger, the code reverts to an ordinary loop. This loop represents 1.5% of the computing time.
I also notice that many computations are not products but plain squares. So if we have a table of squares, we can read the result directly.

In order to find 1KB of RAM, I steal the first four lines of the screen memory. So the resolution is 160x116 instead of 160x120. But the speedup is significant and totally vindicates your approach.

Mandelbrot_DEVROM.gcl: Fast multiplication for 2+7 fixed point. Otherwise reverts to slow multiplication. Works on ROMv5a and ROMv4.but only hides the stolen scanlines on v5a and up.; (9.33 KiB) Downloaded 149 times

Mandelbrot_DEVROM.gt1: Fast multiplication for 2+7 fixed point. Otherwise reverts to slow multiplication. Works on ROMv5a and ROMv4.but only hides the stolen scanlines on v5a and up.; (1.23 KiB) Downloaded 156 times

The same half-square idea could work in native code and give either a 8x8->16 multiplication or a (1+7)x(1+7)->(2+7) multiplication. That could be more convenient than 7x7->14. In addition, in Mandelbrot, very few multiplications have a multiplicand with a non zero high byte. Therefore you could shortcut all but the first SYS call. That would go even faster.

qwertyface · Post by **qwertyface** » 18 Feb 2022, 22:39

Hi everyone, sorry for going cold on this. I hadn't even gotten to writing up the interesting part of my experimentation!

I'm massively impressed by the speedup you achieved with pure vcpu code. My first impression was that rather than vindicating my approach, it made it look pointless!

But...

lb3361 wrote: ↑10 Nov 2021, 20:04 [*] When both numbers are 2+7 fixed point, that is -4<x<4, the code uses a half variant of your quarter square trick: ab = (a^2 + b^2 + |a-b|^2)/2. Since a and b are in range [0,511], the absolute value of |a-b| is also in range [0,511], unlike a+b which could be anywhere in [-511,+511]. So with this variant, I only need 1KB of table (512 times two bytes).

It's only just occurred to me how useful this might be in a native implementation. I don't know why I didn't see it before, I knew about this formula. I haven't written it yet, but I think it might be possible to do 8-bit by 8-bit unsigned multiplication, all in one page, and faster than my previous attempts. I might be wrong, but if you do it nibble-wise, and have a separate table for when you're multiplying by a high nibble...

veekoo · Post by **veekoo** » 19 Feb 2022, 06:08

I like this approach. The Mandelbrot_DEVROM.gt1 is really fast. In the past there was FractInt for 8-bit machines, so this trick and integer calcuation is way to go. My floating point fractals are only the basic code and most accurate AND very slow.

lb3361 · Post by **lb3361** » 19 Feb 2022, 16:19

I was also surprised that vCPU alone could get so much more speed. But it remains that this was your idea, and that your estimate of how much one can speedup mandelbrot was precisely right. Mine was pessimistic and totally wrong...

Note also that the multiplication it contains is very narrowly targeted to the application at hand. Whenever the argument does not fit in 3+7 bits, the multiplication routine gives an incorrect result, yet large enough to escape the Mandelbrot loop. Also, because the Mandelbrot code computes a lot of squares and a single true multiplication per loop, the vCPU code can speed up the computation of the squares, reducing it to a fast table lookup after taking an absolute value (not negligible).

A difficult part in making a native multiplication is the amount of overhead one has to pay. There is the overhead of setting up a SYS call, but this could be worked around by defining a new vCPU instruction (at67 has the framework for this). What should that instruction be? A 8x8 multiplication? The cost of the 8x8 multiply is probably not the three table lookups, but adding/subtracting them with the proper carry between the two bytes of the result. At some point the overhead of instruction dispatch becomes small enough that it might be just as good to let the vCPU do the additions. So, maybe the right native instruction is a square function with its absolute value? Can we use such an operation to make a competitive 16x16 multiplication? None of this is simple...

- L.

qwertyface · Post by **qwertyface** » 20 Feb 2022, 00:19

I think all the points you've made are exactly spot on.

lb3361 wrote: ↑19 Feb 2022, 16:19 What should that instruction be? A 8x8 multiplication?

Yes - I think so. Anything else would seem overly specific to an intended application. vCPU has a sixteen bit accumulator, so that seems like the result size it can best handle, and it's easy to build what you want with it. I'm not specifically interested in making Mandelbrot fast; a generally useful multiplication routine is my goal.

I've spent a little time today playing with writing a new SYS_MultiplyBytes routine, and I now expect it to have almost exactly the same runtime cost as the previous one (give or take). It might be a bit better in some respects (it will certainly use less ROM space, and it might be a bit more amenable to becoming self-restarting).

This code is quite different to my previous attempt. This time I'm calculating the four partial products of the four nibbles of the two bytes. So on the face of it, I'm doing a lot more work. It does eight lookups (not six, as it would if I were looking up the both bytes of the three term to do it in one shot; The nice thing is that the a² and b² terms each get reused once, only (a-b)² is unique to each partial product, so it's not twelve), and there's a lot more adding up. On the upside, because the lookup tables, are small, and thus my code is in the same page as the table, I don't have to always jump through the same code to do a lookup. This means I don't have to spend a lot of effort managing continuation addresses. Because the partial products all fit in bytes, a lot of the extra additions don't have to worry about carries. But more code is more code, and I do have to do some shifting, which ends up being quite a large proportion of the runtime.

lb3361 wrote: ↑19 Feb 2022, 16:19 The cost of the 8x8 multiply is probably not the three table lookups, but adding/subtracting them with the proper carry between the two bytes of the result. At some point the overhead of instruction dispatch becomes small enough that it might be just as good to let the vCPU do the additions.

Actually both. Working with big ROM tables isn't cheap, but my experience today is that avoiding them by doing smaller operations might multiply up to be about the same. As you said: overheads. But you're right about addition. It definitely adds up, and can't be avoided. I have some ideas about writing code so that it can be reusable, so you could use existing addition routines (equivalent to letting vCPU do it) but ultimately it can't make things faster.

lb3361 · Post by **lb3361** » 22 Feb 2022, 14:30

My suggestion is to not only create a fast multiplication, but to also think in the ways it is going to be used from vCPU code with minimal overhead. For instance you could consider how to program a 16x16->32 vCPU multiplication using four 8x8->16 calls with minimal data movement. That in turns suggests little changes in how you routine is called but they make a difference. If you want to create a vCPU opcode in the devrom, you can experiment by displacing the PUSH/POP instructions into a distant page (keeping them under 28 cycles). This opens ten jump slots for new instructions. If the instruction takes more than 28 cycles, you have to check the remaining time and restart if there is not enough. At67 has been perfecting this technique to a great extent, but the initial steps are relatively easy. Look like this:

Code: Select all

.... jump slot
ld(hi(MYINSTRUCTION#13),Y)
jmp(Y, MYINSTRUCTION#12)
st([vTmp])          #12 (delay slot after jump)
.....
label('MYINSTRUCTION#13')
ld(min(0,maxTicks-ACTUALCYCLES/2))   #13
adda([vTicks])            #14
bge('ACTUALCODE#17')       #15
ld([vPC])          #16
suba(2)           #17
st([vPC])         #18
ld(hi(REENTER),Y)     #19
jmp(Y, REENTER)      #20
ld(-24/2)         #21

Gigatron Hackers

Implementing a faster multiplication routine

Re: Implementing a faster multiplication routine

Re: Implementing a faster multiplication routine

Re: Implementing a faster multiplication routine

Re: Implementing a faster multiplication routine

Re: Implementing a faster multiplication routine

Re: Implementing a faster multiplication routine