Zoom rotator [test]

lb3361
Posts: 394
Joined: 17 Feb 2021, 23:07

Re: Zoom rotator [test]

Post by lb3361 »

It's all about the bandwidth one can get!
Using SYS calls (when possible) is just a way to read/write more pixels in the same time.
But then you have to account for the setup time, that is, deciding and specifying which pixels to copy.

There should be an opcode NCOPY in ROMvX0 that is slower than SYS_CopyMemory but might have smaller setup times. I wrote it. Alas I do not remember its peak performance. Looking at the code it seems to achieve 8 pixels per scanline, two thirds of the peak speed of SYS_CopyMemory. But when the chunk sizes are small, the setup times and the details matter a lot, and NCOPY might become beneficial.

Code:

# pc = 0x23cd, Opcode = 0xcd
# Instruction NCOPY (lb3361): copy n bytes from [vAC] to [vDST]. vAC+=n. vDST+=n
label('NCOPY')
Later I wrote another variant of this for dev7rom, with a lot of work on both the peak speed and the setup times. But dev7rom has no backward copy variant, so that might not be good enough for you. Maybe I should try to use it in my version of the rotator. However I recognize that my version is a cheat since I precompute everything to minimize the setup times (this is why my gt1 is so big and yours so small.)

Code:

# Instruction COPYN (35 cf nn)
# * Copy nn bytes from [T3] to [T2].
# * Handles page crossings. Peak rate 10 bytes/scanline.
# * On return, T3 and T2 contain the next addresses.
# * N:Cycles 1:58 2:84 3:110 4:84 5:112 6:138 7:164 8:138
# * Origin: this is an improved version of the copy
#   opcode I wrote for ROMvX0.
oplabel('COPYN_v7')
Incidentally I do not understand the details of your zoom/rotate algorithm (how it breaks it into small blits), but I am curious to understand how often you're using the backward copy. It seems only necessary when blitting an overlapping source block with the same Y coordinate and smaller X coordinate.
Phibrizzo
Posts: 97
Joined: 09 Nov 2022, 22:46

Re: Zoom rotator [test]

Post by Phibrizzo »

lb3361 wrote: There should be an opcode NCOPY in ROMvX0 that is slower than SYS_CopyMemory but might have smaller setup times. I wrote it. Alas I do not remember its peak performance. Looking at the code it seems to achieve 8 pixels per scanline, two thirds of the peak speed of SYS_CopyMemory. But when the chunk sizes are small, the setup times and the details matter a lot, and NCOPY might become beneficial.
Yes, you're right. But there is one small problem with NCOPY. I downloaded the latest version of ROMvX0 from
https://github.com/at67/ROMvX0/blob/main/ROMvX0.rom
and none of the Prefix2 and Prefix3 instructions work. Maybe it is too old?

The MD5 of this ROM is: 1d2dd9b21ad7d34432223b731633f3da (nobody uses version extensions)
lb3361 wrote: However I recognize that my version is a cheat since I precompute everything to minimize the setup times (this is why my gt1 is so big and yours so small.)
Who cares. Cheat, please. All demoscene coders do it like that. vCPU is too slow; you must do everything possible for greater efficiency.
lb3361 wrote: Incidentally I do not understand the details of your zoom/rotate algorithm (how it breaks it into small blits), but I am curious to understand how often you're using the backward copy. It seems only necessary when blitting an overlapping source block with the same Y coordinate and smaller X coordinate.
Yes, it has been optimized so much that probably no one will understand it ;)
I must use a backward copy when the Y coordinate of the source area is smaller than that of the destination.
On average this happens about half the time in each frame.

Source of latest version:
http://changeit.ppa.pl/ftp/r_6x0.asm
lb3361
Posts: 394
Joined: 17 Feb 2021, 23:07

Re: Zoom rotator [test]

Post by lb3361 »

Phibrizzo wrote: 21 Dec 2024, 16:53 I must use a backward copy when the Y coordinate of the source area is smaller than that of the destination.
Then I believe you can go even faster.

What matters is to maximize forward blitting on each row (because one can use the Gigatron auto-incrementing store opcode).
But even if you always go forward in X, you can process Y in either ascending or descending order at little cost.


When blitting overlapping rectangles, you have the following cases:
  • If Ydst < Ysrc, then you can process Y forwards and X forwards.
  • If Ydst > Ysrc, then you can process Y backwards and X forwards.
  • If Ydst = Ysrc and Xdst < Xsrc, then you can process X forwards and Y however you prefer.
  • Otherwise (Ydst = Ysrc and Xdst > Xsrc), you have to process X backwards. This should happen much less than half the time.
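
The case analysis can be sketched in Python. This is a minimal illustration only, not the vCPU code of either rotator; `blit` and the list-of-rows framebuffer are hypothetical names. It encodes the usual rule for overlapping copies, namely that a source pixel must be read before it is overwritten: rows run top-down when the destination is above the source, bottom-up when it is below, and only a same-row shift to the right needs a backward pass in X.

```python
def blit(fb, xsrc, ysrc, xdst, ydst, w, h):
    """Copy a w*h rectangle inside fb (a list of row lists), safe for overlap.

    Returns True when the slow backward-X path had to be used.
    """
    # Pick the row order so that no source row is overwritten before it is read.
    if ydst <= ysrc:
        rows = range(h)                 # Y forwards (ascending)
    else:
        rows = range(h - 1, -1, -1)     # Y backwards (descending)

    # Source and destination share rows only when ydst == ysrc; a shift to
    # the right within the same row is the single case that needs X backwards.
    x_backward = (ydst == ysrc and xdst > xsrc)
    cols = range(w - 1, -1, -1) if x_backward else range(w)

    for r in rows:
        src_row = fb[ysrc + r]
        dst_row = fb[ydst + r]
        for c in cols:
            dst_row[xdst + c] = src_row[xsrc + c]
    return x_backward


# Same-row shift to the right by two pixels: the backward path is required.
row = [[0, 1, 2, 3, 4, 5, 6, 7]]
used_backward_x = blit(row, 0, 0, 2, 0, 4, 1)
print(row[0], used_backward_x)  # [0, 1, 0, 1, 2, 3, 6, 7] True
```

With uniformly distributed offsets, the backward-X case only arises when the two Y coordinates coincide, which is why it should be rare.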
lb3361
Posts: 394
Joined: 17 Feb 2021, 23:07

Re: Zoom rotator [test]

Post by lb3361 »

Another variant of my rotator.
Using the dev7rom COPYN opcode, the size can be increased from 51x51 to 85x85 while keeping ten frames per second.
It runs on the online emulator.
Attachments
rotator_dev7_64k.gt1
(20.43 KiB) Downloaded 191 times
rotator.tar.gz
(2.69 KiB) Downloaded 204 times
lb3361
Posts: 394
Joined: 17 Feb 2021, 23:07

Re: Zoom rotator [test]

Post by lb3361 »

In fact, extending the effect to 119x119 gives 6fps.
Same source code, just different defines in tablegen.c.
That's a bit above half the pixels on the screen.

I cannot make it full screen because the program would become too big.
However the speed would be about 3-4 fps, not far from Phibrizzo's effect.
Attachments
rotator_119_dev7_64k.gt1
(30.75 KiB) Downloaded 193 times
Phibrizzo
Posts: 97
Joined: 09 Nov 2022, 22:46

Re: Zoom rotator [test]

Post by Phibrizzo »

lb3361 wrote: 21 Dec 2024, 22:14 Then I believe you can go even faster.
Yes, you're right. I did it.
Now it is very faaast :)
The backward copy now executes only a few times.

Source:
http://www.changeit.ppa.pl/ftp/r_7x0.asm
Attachments
rotator7_vX0.gt1
(379 Bytes) Downloaded 191 times
lb3361
Posts: 394
Joined: 17 Feb 2021, 23:07

Re: Zoom rotator [test]

Post by lb3361 »

Phibrizzo wrote: 23 Dec 2024, 17:25 Now it is very faaast :)
I ported it to dev7 (because of my own instrumentation) and it works very well.
I measured 4.6 frames per second, which is better than what my code would achieve on a full frame.
The idea to use 32x32 blocks with randomized phase is very neat.
r_dev7.gt1
(395 Bytes) Downloaded 168 times
I think you won handsomely.

Source code inside the zip file as a single .s file using the glcc builtin assembler (not the most user friendly.)
I tried both SYS_CopyMemory and COPYN. The latter seems a bit faster (4.6 fps vs 4.4 fps), probably because it processes the 32 pixels as eight four-pixel bursts as opposed to five six-pixel bursts and two singles, and also because it packs better into successive scanlines. It is not clear that this would work the same way on ROMvX0 (the SYS_CopyMemory code is the same, but not the NCOPY code). The few backward copies are implemented by hand but are quite compact because they use the long instructions LDLAC/STLAC.
r_dev7.zip
(1.56 KiB) Downloaded 165 times

Incidentally, 4.6 full frames per second in mode 3 is equivalent to about 3 pixels per scanline. The COPYN(32) instruction that processes each row takes 462 cycles split into 17 small chunks -- 28+7*(26+28)+26+30 -- that pack 5 per scanline in about 3.5 scanlines, amounting to a bit above 9 pixels per scanline. The code that loops over 32 rows in a 32x32 blit takes 140 cycles (20+30+30+18+18+24). So each 32x32 blit takes 147 scanlines (accounting for packing), which brings us down to 7 pixels per scanline. So half of the time is still spent in overhead elsewhere.
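
The arithmetic in that paragraph can be checked with a few lines of Python. This is a sketch under stated assumptions: the 6.25 MHz clock, the 200 cycles per scanline and the 160x120 frame are standard Gigatron figures that the post does not state, and `copyn_cycles` generalizes the quoted 28+7*(26+28)+26+30 breakdown to other multiples of four, which happens to match the N=4 and N=8 entries of the COPYN cycle table (84 and 138) but is not confirmed by the ROM source here.

```python
CLOCK_HZ = 6_250_000          # assumption: Gigatron system clock
CYCLES_PER_SCANLINE = 200     # assumption: native cycles per video scanline
SCREEN_PIXELS = 160 * 120     # assumption: full-frame pixel count


def copyn_cycles(n_bytes):
    """Cycles for COPYN on a positive multiple of 4 bytes (assumed formula)."""
    if n_bytes <= 0 or n_bytes % 4:
        raise ValueError("only positive multiples of 4 are modelled")
    bursts = n_bytes // 4
    return 28 + (bursts - 1) * (26 + 28) + 26 + 30


def copyn_chunks(n_bytes):
    """Number of small chunks the same (assumed) breakdown is split into."""
    bursts = n_bytes // 4
    return 1 + 2 * (bursts - 1) + 2


# Whole-program rate: 4.6 full frames per second, spread over all scanlines.
scanlines_per_second = CLOCK_HZ / CYCLES_PER_SCANLINE
frame_rate_pixels = 4.6 * SCREEN_PIXELS / scanlines_per_second

# One 32-pixel row: chunks pack 5 per scanline (figure quoted in the post).
row_cycles = copyn_cycles(32)
row_chunks = copyn_chunks(32)
row_scanlines = row_chunks / 5
copy_rate = 32 / row_scanlines

# Loop overhead and the 147-scanline figure are taken from the post as-is.
loop_cycles = sum((20, 30, 30, 18, 18, 24))
blit_rate = 32 * 32 / 147

print(row_cycles, row_chunks, loop_cycles)
print(round(frame_rate_pixels, 2), round(copy_rate, 2), round(blit_rate, 2))
```

Under these assumptions the three rates come out near 3, 9 and 7 pixels per scanline, in line with the figures quoted above.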

Update: fixed buglet, gained a little perf.

Update: below is the result of profiling the vcpu code. The number following the # represents the number of cycles spent executing this instruction (including lost cycles following the instruction because of packing issues). One can sort by decreasing cost with "sort -k 1.59nr r_dev7.txt" and see where the most gains could be made (possibly with well conceived opcodes or sys calls.)
r_dev7.txt
(11.34 KiB) Downloaded 163 times
lb3361
Posts: 394
Joined: 17 Feb 2021, 23:07

Re: Zoom rotator [test]

Post by lb3361 »

Runs at 5.26 fps with a few little changes.
Attachments
r_dev7.txt
(11.63 KiB) Downloaded 164 times
r_dev7.zip
(1.57 KiB) Downloaded 160 times
r_dev7.gt1
(400 Bytes) Downloaded 157 times
lb3361
Posts: 394
Joined: 17 Feb 2021, 23:07

Re: Zoom rotator [test]

Post by lb3361 »

Somehow Phibrizzo's program keeps me excited, maybe because I do not yet fully understand how it works. This is magic.

Anyway, this prompted me to make drastic changes to an experimental version of dev7rom, with two new opcodes named FILL and BLIT that are optimized for pixel bandwidth and little else. Alas these changes break compatibility with earlier versions of dev7rom. I'll make another post to that effect. Use the following GT1 in http://www.gigatron128k.com. I changed the emulator to use the experimental ROM, and I also made it possible to load any ROM using the file dialog (just give a file whose name ends in ".rom").
    r_dev7.gt1
    (222 Bytes) Downloaded 10 times
      Here is the disassembly of the main section. The rest is the same small table as other versions of Phibrizzo's rotator.

      Code:

      # set video table
      0200  b1 42 01 00       [vCPU] MOVIW  $0100,$42          #30          |1B..|
      0204  dd 42                    PEEKV  $42                #3780        |]B|
      0206  e3 78                    ADDI   $78                #3900        |cx|
      0208  f0 42                    POKE   $42                #3120        |pB|
      020a  93 42                    INC    $42                #2948        |.B|
      020c  93 42                    INC    $42                #2160        |.B|
      020e  1a 42                    LD     $42                #3000        |.B|
      0210  8c f0                    XORI   $f0                #1680        |.p|
      0212  35 72 02                 BNE    $0204              #3420        |5r.|
      # set video x shift
      0215  11 01 01                 LDWI   $0101              #20          |...|
      0218  46 28                    POKEQ  $28                #20          |F(|
      # set mode 3
      021a  b1 22 0b 00              MOVIW  $0b00,sysFn        #30          |1"..|
      021e  59 03                    LDI    3                  #52          |Y.|
      0220  b4 e7                    SYS    78                 #50          |4g|
      # clear screen
      0222  b1 88 80 20              MOVIW  $8020,vT2          #30          |1.. |
      0226  48 8a 00                 MOVQB  $00,vT3            #26          |H..|
      0229  11 dc 80                 LDWI   $80dc              #54          |.\.|
      022c  35 4a                    FILL                      #83128       |5J|
      # cache table address
      022e  b1 4a 08 a0              MOVIW  $08a0,$4a          #30          |1J. |
      # loop with phibrizzo's magic
      0232  4a 42 0a                 MOVQW  $0a,$42            #136832      |JB.|
      0235  1a 48                    LD     $48                #115286      |.H|
      0237  82 3e                    ANDI   $3e                #64574       |.>|
      0239  99 4a                    ADDW   $4a                #100842      |.J|
      023b  3d 46                    DEEKA  $46                #119578      |=F|
      023d  90 40                    BRA    $0242              #49906       |.@|
      023f  c6 46 20                 ADDSV  $20,$46            #534180      |FF |
      0242  4a 40 05                 MOVQW  $05,$40            #429774      |J@.|
      0245  21 46                    LDW    $46                #359430      |!F|
      0247  99 42                    ADDW   $42                #501068      |.B|
      0249  5e 54                    ST     $54                #252174      |^T|
      024b  1a 47                    LD     $47                #311006      |.G|
      024d  5e 44                    ST     $44                #301956      |^D|
      024f  90 52                    BRA    $0254              #199360      |.R|
      0251  c6 44 20                 ADDSV  $20,$44            #3188452     |FD |
      0254  1a 44                    LD     $44                #2380944     |.D|
      0256  5e 88                    ST     vT2                #2139190     |^.|
      0258  99 42                    ADDW   $42                #3165358     |.B|
      025a  b8 40                    SUBW   $40                #3679056     |8@|
      025c  5e 8a                    ST     vT3                #1751208     |^.|
      025e  1a 54                    LD     $54                #2248312     |.T|
      0260  99 40                    ADDW   $40                #3715188     |.@|
      0262  5e 8b                    ST     vT3+1              #1660428     |^.|
      0264  59 10                    LDI    $10                #2033606     |Y.|
      0266  99 46                    ADDW   $46                #3768302     |.F|
      0268  5e 89                    ST     vT2+1              #1762562     |^.|
      # blit a 32x32 block
      026a  11 20 20                 LDWI   $2020              #2440852     |.  |
      026d  35 48                    BLIT                      #1456465210  |5H|
      # more of phibrizzo's magic
      026f  93 40                    INC    $40                #2443116     |.@|
      0271  1a 40                    LD     $40                #2061384     |.@|
      0273  8c 0c                    XORI   $0c                #1762742     |..|
      0275  35 72 4f                 BNE    $0251              #3077314     |5rO|
      0278  1a 42                    LD     $42                #255752      |.B|
      027a  e6 01                    SUBI   1                  #351148      |f.|
      027c  5e 42                    ST     $42                #232532      |^B|
      027e  8c 06                    XORI   6                  #389428      |..|
      0280  35 72 3d                 BNE    $023f              #399952      |5r=|
      0283  c6 48 02                 ADDSV  $02,$48            #132590      |FH.|
      # fill the 4x4 seed with a random color
      0286  1a 06                    LD     entropy            #62374       |..|
      0288  5e 8a                    ST     vT3                #57896       |^.|
      028a  b1 88 bf 80              MOVIW  $bf80,vT2          #117282      |1.?.|
      028e  11 04 04                 LDWI   $0404              #99410       |...|
      0291  35 4a                    FILL                      #923936      |5J|
      #loop
      0293  90 30                    BRA    $0232              #63388       |.0|
      
      The # numbers are the cycle counts for each instruction, augmented by whatever cycles are lost in scheduling it.
      One can use these to compute the fps rate, in this case 14 fps.
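
For readers without a Unix `sort` at hand, the ranking by cost can be sketched in Python. This is only an illustration: the regular expression assumes the column layout of the listing above (address, opcode bytes, mnemonic, `#cycles`, ASCII dump), `parse_profile` is a hypothetical helper, and the sample below contains just a few lines copied from the listing, so the printed share applies to that sample and not to the whole program.

```python
import re

# Assumed line shape: "026d  35 48   BLIT   #1456465210  |5H|"
PROFILE_LINE = re.compile(
    r"^\s*([0-9a-f]{4})\s+(?:[0-9a-f]{2} )+\s*(?:\[vCPU\]\s+)?([A-Z]+)\b.*?#(\d+)\s+\|"
)


def parse_profile(text):
    """Return (address, mnemonic, cycles) tuples, most expensive first."""
    rows = []
    for line in text.splitlines():
        m = PROFILE_LINE.match(line)
        if m:  # comment lines such as "# blit a 32x32 block" are skipped
            rows.append((m.group(1), m.group(2), int(m.group(3))))
    return sorted(rows, key=lambda row: row[2], reverse=True)


SAMPLE = """
0260  99 40                    ADDW   $40                #3715188     |.@|
0266  99 46                    ADDW   $46                #3768302     |.F|
# blit a 32x32 block
026a  11 20 20                 LDWI   $2020              #2440852     |.  |
026d  35 48                    BLIT                      #1456465210  |5H|
0291  35 4a                    FILL                      #923936      |5J|
"""

ranked = parse_profile(SAMPLE)
total = sum(cycles for _, _, cycles in ranked)
for addr, mnemonic, cycles in ranked:
    print(f"{addr} {mnemonic:6s} {cycles:>12d} {100 * cycles / total:6.2f}%")
```

In this sample BLIT dominates by a wide margin, which fits the stated goal of optimizing the new opcodes for pixel bandwidth.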