qwertyface wrote: ↑04 Nov 2021, 18:43
My impression was that multiplication (rather than the CalcPixel loop, updating the clock, or clearing the screen) vastly dominates the runtime. I simply figured that a big improvement in the speed of multiplication should equate to a big difference in the total speed. For what it's worth, I think the first part is correct. I'll talk more about benchmarking and profiling in the next instalment, but at one point in Marcel's version I observed that CalcPixel took 90.24% of the total runtime. I'm pretty sure that must mostly be in MulShift7.
This is a fantastic example to do some back-of-the-envelope experiments with. Let's start off with the best-case scenario, (with slightly adjusted figures just to make them simpler to deal with):
1) 90% CalcPixel, 10% other, CalcPixel is effectively an inline function whose body is purely the functionality you are replacing, i.e. CalcPixel resolves to just a multiply instruction and you effectively make it execute 10x faster.
Result: 10% + 90% × 0.1 = 19%, (effective new total execution time), i.e. a 10x partial speedup becomes a 5.263x total speedup.
2) 90% CalcPixel, 10% other, CalcPixel is a function in which the multiply takes 80% of the execution time of CalcPixel, (a more realistic example).
Result: 90% × 0.8 = 72%, then 10% + (90% − 72%) + 72% × 0.1 = 35.2%, (effective new total execution time), i.e. a 10x partial speedup becomes a 2.84x total speedup.
3) 90% CalcPixel, 10% other, CalcPixel is a function in which the multiply takes 65% of the execution time of CalcPixel, (an even more realistic example).
Result: 90% × 0.65 = 58.5%, then 10% + (90% − 58.5%) + 58.5% × 0.1 = 47.35%, (effective new total execution time), i.e. a 10x partial speedup becomes a 2.11x total speedup.
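The three scenarios above can be checked with a few lines of Python, (a minimal sketch; the function name and parameters are just mine for illustration). The idea is simply: only the fraction of CalcPixel that is multiplies gets divided by 10, everything else is carried over unchanged:

```python
# Back-of-envelope check of the three scenarios above.
# Fractions of original runtime: 'other' = non-CalcPixel work,
# 'hot' = CalcPixel, 'mul_share' = portion of CalcPixel that is
# the multiply, 'speedup' = how much faster the multiply becomes.
def total_after(other, hot, mul_share, speedup):
    sped_up = hot * mul_share      # fraction actually accelerated
    untouched = hot - sped_up      # rest of CalcPixel, unchanged
    return other + untouched + sped_up / speedup

for mul_share in (1.0, 0.8, 0.65):
    new_total = total_after(0.10, 0.90, mul_share, 10)
    print(f"mul = {mul_share:.0%} of CalcPixel -> "
          f"new total {new_total:.2%}, overall {1/new_total:.3f}x")
# mul = 100% of CalcPixel -> new total 19.00%, overall 5.263x
# mul = 80% of CalcPixel -> new total 35.20%, overall 2.841x
# mul = 65% of CalcPixel -> new total 47.35%, overall 2.112x
```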
You can see how quickly our expectations are shattered by the seemingly non-linear change, (mostly explained by Amdahl's law). This used to catch me out constantly when I was optimising code back in the '80s and '90s, (specifically SW-based 3D rendering engines). No matter how many times I thought my prediction of potential gains would be close, I would always be dismayed, (and sometimes horrified, given the time spent), to realise how far off my predictions had been. The day I started regularly measuring and applying a more rigorous approach to my expectations, (i.e. similar to the above), was the day my blood pressure halved.
The bottom line, (or another way to think about it), is: if you replace a single component of a complex whole with a more efficient version and obtain a total speedup, (of the complex whole), approaching 100% of the component's speedup, you have basically performed a miracle, (or the original code was low-hanging fruit).
qwertyface wrote: ↑04 Nov 2021, 18:43
Interesting reflection. I suppose the big difference is that you're not freeing up vCPU to do something else, rather just letting code run where it's fastest, more like using SIMD instructions in a CPU, than offloading things to the GPU. But I think I can see how you can change your perspective to see them as a two processors (and I wonder if a separate VM would make sense...)
It makes perfect sense to me to treat them as separate processors, even though, (as you pointed out), they are merely two different modes of the same execution stream. It allows me to visualise the spread of workload in a typical application that will produce the most efficient results, (not necessarily efficient in vCPU instruction slots used, but in total execution time, vCPU + native, spent per 60Hz frame).
Another way to visualise vCPU vs native is as an interpreted scripting language vs a compiled language within a greater whole, (e.g. a games engine). Typically you would have the scripting language, (let's say Lua), be the code that does the initialisation, setup and coordination, (i.e. lightweight tasks), of your compiled language, (let's say C++), which usually does the heavy grunt work, (i.e. heavy data manipulation and processing output to a display).
The Lua code, (vCPU), manages and controls the C++, (native code), which produces an excellent balance/spread of work across both execution models. In the games-engine world this can potentially save oodles of time and also allow less-skilled programmers to produce content at the programming level. In Gigatron land it allows us to use the available cycles much more efficiently, (in terms of total work done per unit time), by pushing the things that vCPU is bad at to generic SYS calls and instructions in native land.
qwertyface wrote: ↑04 Nov 2021, 18:43
, and either way the vectorised operations sound really interesting! The demo is really impressive too. I really cannot wait to see the code, maybe I should take the day off work when you publish it! I occasionally check your Github activity to see if it's online somewhere!
Haaaa! It's getting close, (especially for the ROM).
qwertyface wrote: ↑04 Nov 2021, 18:43
Yes! Very eyeopening. I think I'd like to have a better understanding of how the time slices work, because I'm not completely clear on how common they all are.
I spent a lot of time gathering vCPU and native execution statistics across my own apps and everyone else's apps I had access to, using simple tools/functions I created as part of the emulator. That was when I realised just how bad vCPU was at looping, and that it was typically, (much), better to have vCPU do the managing and native do the looping.