Implementing a faster multiplication routine

Re: Implementing a faster multiplication routine

Post by lb3361 »

qwertyface wrote: 10 Nov 2021, 13:19 And here's the pixels where I am slower.
I assume this is simply that because I'm more accurate (8 bits of fraction), the bulb optimisation is kicking in later on my version.
If that were the case, we would only see the slower points on one side of the bulb, not on both sides. My bet is on the higher accuracy. Although Marcel's version works with 3+7 fixed-point numbers, there seem to be cases where an intermediate calculation needs bigger numbers. It works because the code is biased to still give a large result, causing the escape-time loop to bail out at the next step. Your code has higher accuracy and might not benefit from this (dubious) optimization.
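For the curious, here is roughly what that escape-time loop does, written as a C sketch with my own names and a generic .7 fixed-point scale rather than the actual routine: as long as an overflowing intermediate is biased to remain large, the xx + yy test still fires at the next iteration, so the picture comes out the same.

static int escape_time(int32_t cx, int32_t cy, int maxit)  /* coordinates scaled by 128 (.7 fixed point) */
{
    int32_t x = 0, y = 0;
    for (int i = 0; i < maxit; i++) {
        int32_t xx = (x * x) >> 7;   /* these products are where the multiplication routine matters */
        int32_t yy = (y * y) >> 7;
        int32_t xy = (x * y) >> 7;   /* assumes arithmetic right shift for the signed term */
        if (xx + yy >= (4 << 7))     /* |z|^2 >= 4: the point escapes */
            return i;
        x = xx - yy + cx;            /* z = z^2 + c */
        y = 2 * xy + cy;
    }
    return maxit;
}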
qwertyface wrote: 10 Nov 2021, 13:19
lb3361 wrote: 08 Nov 2021, 17:09 Meanwhile I looked into the profiling results for the unrolled vCPU multiplication and found that only 10% of the multiplications involve numbers with nonzero high bytes. This means that the 90% remaining multiplications are just 8bits x 8bits unsigned, well within reach of your fast multiplication trick. Even a vCPU implementation can be very effective (four to five times faster than the original.)
Exciting times ahead!
I now have pure vCPU code that uses a couple of tricks you might find useful, because they could work in native code as well.
  • When both numbers are 2+7 fixed point, that is -4<x<4, the code uses a half variant of your quarter-square trick: ab = (a^2 + b^2 - (a-b)^2)/2. Since a and b are in range [0,511], |a-b| is also in range [0,511], unlike a+b, which could be anywhere in [0,1022]. So with this variant, I only need 1KB of table (512 entries of two bytes each). A C sketch of this appears below.
  • When either number is larger, the code reverts to an ordinary loop. This loop represents 1.5% of the computing time.
  • I also notice that many computations are not products but plain squares. So if we have a table of squares, we can read the result directly.
In order to find 1KB of RAM, I steal the first four lines of the screen memory. So the resolution is 160x116 instead of 160x120. But the speedup is significant and totally vindicates your approach.
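Here is a minimal sketch of the fast path, in C rather than GCL; the table scaling is my own choice for the sketch, and the attached .gcl may lay its table out differently.

#include <stdint.h>

static uint16_t sq[512];                     /* 1KB table: sq[i] approximates (i*i) >> 7 */

static void init_sq(void)
{
    for (uint32_t i = 0; i < 512; i++)
        sq[i] = (uint16_t)((i * i) >> 7);
}

/* Multiply two 2+7 fixed-point magnitudes in [0,511]; the sign is handled
   separately by the caller. Returns (a*b) >> 7, i.e. the product with the
   same 7-bit fraction. Three lookups, all with indices in [0,511]. */
static uint16_t mul27(uint16_t a, uint16_t b)
{
    uint16_t d = (a > b) ? a - b : b - a;          /* |a-b| stays in [0,511]   */
    int32_t  s = (int32_t)sq[a] + sq[b] - sq[d];   /* ~ 2ab >> 7               */
    return (s > 0) ? (uint16_t)(s >> 1) : 0;       /* truncated entries can push s just below 0 */
}

Because each table entry is truncated, the result can be off by a low bit or two; an exact variant would need wider table entries. And as noted above, a plain square x*x is then a single lookup, sq[|x|].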
Mandelbrot_DEVROM.gcl
Fast multiplication for 2+7 fixed point. Otherwise reverts to slow multiplication. Works on ROMv5a and ROMv4, but only hides the stolen scanlines on ROMv5a and up.
(9.33 KiB) Downloaded 10 times
Mandelbrot_DEVROM.gt1
Fast multiplication for 2+7 fixed point. Otherwise reverts to slow multiplication. Works on ROMv5a and ROMv4, but only hides the stolen scanlines on ROMv5a and up.
(1.23 KiB) Downloaded 10 times
The same half-square idea could work in native code and give either an 8x8->16 multiplication or a (1+7)x(1+7)->(2+7) multiplication. That could be more convenient than 7x7->14. In addition, in Mandelbrot, very few multiplications have a multiplicand with a nonzero high byte. Therefore you could shortcut all but the first SYS call. That would go even faster.
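Again in C rather than native code, the dispatch could look like the sketch below: an exact 8x8->16 product by the same half-square identity when both high bytes are zero (a 256-entry table of exact squares, 512 bytes), and otherwise a plain shift-and-add loop standing in for the existing slow path. The names and the fall-back are mine, just to show the shape of the idea.

#include <stdint.h>

static uint16_t sq8[256];                    /* sq8[i] = i*i exactly, 512 bytes */

static void init_sq8(void)
{
    for (uint32_t i = 0; i < 256; i++)
        sq8[i] = (uint16_t)(i * i);
}

static uint16_t mulu16(uint16_t a, uint16_t b)
{
    if (((a | b) & 0xFF00) == 0) {           /* both high bytes zero: ~90% of the time here */
        uint16_t d = (a > b) ? a - b : b - a;
        return (uint16_t)(((uint32_t)sq8[a] + sq8[b] - sq8[d]) >> 1);  /* exact a*b */
    }
    uint16_t p = 0;                          /* slow path: ordinary shift-and-add */
    while (b) {
        if (b & 1) p += a;
        a <<= 1;
        b >>= 1;
    }
    return p;
}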