If that were the case, we would only see the slower points on one side of the bulb, not both sides. My bets are on the higher accuracy. Although Marcel's version works with 3+7 fixed point numbers, there seems to be cases when an intermediate calculation needs bigger numbers. It works because the code is biased to still give a large result and cause the escape time loop to bail out at the next step. Your code has higher accuracy and might not benefit from this (dubious) optimization.qwertyface wrote: ↑10 Nov 2021, 13:19 And here's the pixels where I am slower.

I assume this is simply that because I'm more accurate (8 bits of fraction), the bulb optimisation is kicking in later on my version.

I now have a pure vCPU code that uses a couple tricks that you might find useful because they could work in native code as well.qwertyface wrote: ↑10 Nov 2021, 13:19Exciting times ahead!lb3361 wrote: ↑08 Nov 2021, 17:09 Meanwhile I looked into the profiling results for the unrolled vCPU multiplication and found that only 10% of the multiplications involve numbers with nonzero high bytes. This means that the 90% remaining multiplications are just 8bits x 8bits unsigned, well within reach of your fast multiplication trick. Even a vCPU implementation can be very effective (four to five times faster than the original.)

- When both numbers are 2+7 fixed point, that is -4<x<4, the code uses a half variant of your quarter square trick: ab = (a^2 + b^2 + |a-b|^2)/2. Since a and b are in range [0,511], the absolute value of |a-b| is also in range [0,511], unlike a+b which could be anywhere in [-511,+511]. So with this variant, I only need 1KB of table (512 times two bytes).
- When either number is larger, the code reverts to an ordinary loop. This loop represents 1.5% of the computing time.
- I also notice that many computations are not products but plain squares. So if we have a table of squares, we can read the result directly.

The same half-square idea could work in native code and give either a 8x8->16 multiplication or a (1+7)x(1+7)->(2+7) multiplication. That could be more convenient than 7x7->14. In addition, in Mandelbrot, very few multiplications have a multiplicand with a non zero high byte. Therefore you could shortcut all but the first SYS call. That would go even faster.