SYS_CopyMemory and SYS_CopyMemoryExt

lb3361
Posts: 52
Joined: 17 Feb 2021, 23:07

SYS_CopyMemory and SYS_CopyMemoryExt

Post by lb3361 »

I wrote two useful native routines inspired by SYS_SetMemory_v2_54.

SYS_CopyMemory copies a block of memory from one address to another. The main caveat is that it cannot cross page boundaries, which means that the size of the block must be at most 0x100-lo(srcAddr) and at most 0x100-lo(dstAddr). It can read and write 12 bytes per scanline in two bursts of 6 bytes. For comparison, SYS_SetMemory writes 24 bytes per scanline in three bursts of 8 bytes. The implementation has a neat trick: instead of restarting the syscall in the usual way, it checks that there is enough time left, patches vTicks, and starts another burst of 6 bytes. That saves a dozen cycles per burst. When it cannot burst 6 bytes, it tries to burst 3, then 1.


#-----------------------------------------------------------------------
# Extension SYS_CopyMemory_DEVROM_80
#-----------------------------------------------------------------------

# SYS function for copying 1..256 bytes
#
# sysArgs[0:1]    Destination address
# sysArgs[2:3]    Source address
# vAC[0]          Count (0 means 256)
#
# Doesn't cross page boundaries
# Overwrites sysArgs[4:7] and vLR
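
To make the page-boundary caveat and the burst fallback concrete, here is a rough Python model. It is not the ROM code: the helper names `max_count` and `burst_plan` are made up for illustration, and the greedy 6/3/1 order is simply a reading of the description above.

```python
def lo(addr):
    """Low byte of a 16-bit address, i.e. the offset within its page."""
    return addr & 0xFF


def max_count(dst, src):
    """Largest byte count one call can copy without leaving either page."""
    return min(0x100 - lo(dst), 0x100 - lo(src))


def burst_plan(count):
    """Split a count (1..256) into bursts of 6, then 3, then 1 bytes.

    Illustrative model only: it mirrors the fallback order described in
    the post, not the actual native instruction sequence.
    """
    if not 1 <= count <= 256:
        raise ValueError("count must be in 1..256")
    plan = []
    for size in (6, 3, 1):
        while count >= size:
            plan.append(size)
            count -= size
    return plan


if __name__ == "__main__":
    # The destination offset 0xA0 is the limiting factor here.
    print(max_count(0x08A0, 0x2010))
    print(burst_plan(11))
```

With page-aligned addresses `max_count` returns 256, which is the case the routine encodes as a count of 0.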

SYS_CopyMemoryExt only works when a RAM expansion is detected. It copies a block of memory in a manner similar to SYS_CopyMemory but writes it into another bank. It does this by reading a burst, using ctrl() to switch banks, writing the burst, then using ctrl() again to restore the original bank. To make both extensions fit in a single page, I cut the 3-byte burst for this one.


#-----------------------------------------------------------------------
# Extension SYS_CopyMemoryExt_DEVROM_94
#-----------------------------------------------------------------------

# SYS function for copying 1..256 bytes to a different bank
#
# sysArgs[0:1]    Destination address
# sysArgs[2:3]    Source address
# vAC[0]          Count (0 means 256)
# vAC[1]          Bits 7 and 6 contain the bank number
#
# Doesn't cross page boundaries.
# Overwrites sysArgs[4:7], vLR, and vTmp.
# Returns -1 in vAC if no expansion card is present.
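
As a small sketch of the vAC layout listed above, the following Python helper packs the count and the bank number into one 16-bit value. The name `pack_vac` is hypothetical, and leaving the remaining bits of vAC[1] at zero is an assumption: the header comment only says where the bank bits live.

```python
def pack_vac(count, bank):
    """Build the 16-bit vAC value for SYS_CopyMemoryExt.

    Low byte (vAC[0]): count, with 256 encoded as 0.
    High byte (vAC[1]): bank number in bits 7 and 6.
    The other high-byte bits are left at 0 here, which is an assumption.
    """
    if not 1 <= count <= 256:
        raise ValueError("count must be in 1..256")
    if not 0 <= bank <= 3:
        raise ValueError("bank must be in 0..3")
    return ((bank << 6) << 8) | (count & 0xFF)


if __name__ == "__main__":
    print(hex(pack_vac(16, 1)))
    print(hex(pack_vac(256, 2)))
```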

Question: can I send a pull request? I am quite eager to have this in devrom because copying across memory banks is painful without an intermediate buffer in the first 32KB of memory, which is already very busy. In addition, having such a native routine allows us to load a GT1 into the memory of a 32KB or 64KB Gigatron without also having to find a spot for the loader: the loader can hide in another bank and write into banks 0 and 1 with this call.

I took the SYS spots previously used by the unfinished SYS_LoadBytes and SYS_StoreBytes. This can be changed.

You can check the code at https://github.com/lb3361/gigatron-rom/ ... fbbf1798eb.

I also have two vCPU routines, named memcpy(...) and _memcpyext(...), that split a copy into pieces that don't cross page boundaries. These are used by my testing code, which can be found at https://github.com/lb3361/gigatron-lcc/ ... emcpyext.c. Screenshot of the test in progress below. Yes, this is a working C compiler :-)
[Attachment: Screenshot from 2021-05-30 00-30-57.png (24.97 KiB)]
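
The splitting done by the vCPU wrappers can be sketched as follows. This is a Python model of the idea only, not the code in the repository, and `split_by_page` is a made-up name: each chunk is cut at whichever page boundary comes first, source or destination, so every chunk is a legal argument for a single SYS call.

```python
def split_by_page(dst, src, count):
    """Yield (dst, src, n) chunks that never cross a 256-byte page.

    Every n is in 1..256, so each chunk fits one call
    (with 256 passed as a count of 0).
    """
    while count > 0:
        n = min(count, 0x100 - (dst & 0xFF), 0x100 - (src & 0xFF))
        yield dst, src, n
        dst = (dst + n) & 0xFFFF
        src = (src + n) & 0xFFFF
        count -= n


if __name__ == "__main__":
    for d, s, n in split_by_page(0x08F0, 0x20FC, 40):
        print(f"dst={d:04x} src={s:04x} n={n}")
```

When source and destination have different page offsets, a copy can need up to two chunks per page, as in the example, where the source page ends after 4 bytes and the destination page 12 bytes later.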
at67
Posts: 332
Joined: 14 May 2018, 08:29

Re: SYS_CopyMemory and SYS_CopyMemoryExt

Post by at67 »

lb3361 wrote: 30 May 2021, 04:38 I wrote two useful native routines inspired by SYS_SetMemory_v2_54.
You can create a pull request at any time. I already have 3 SYS memcpy routines for bytes/words/dwords, but I will replace them with your new, more efficient routines in the new ROM.
lb3361
Posts: 52
Joined: 17 Feb 2021, 23:07

Re: SYS_CopyMemory and SYS_CopyMemoryExt

Post by lb3361 »

Done.

Your dword routine might have comparable speed and lower startup costs. The maximal speed depends a lot on how many invocations fit in the 148 cycles of the typical runVCpu call. This is why I eventually went with bursts of 6 bytes. I could not fit two bursts of 8 bytes in 148 cycles. But two bursts of 6 followed by a restart take 130 cycles, with 18 cycles to spare. So maybe three bursts of 4 can go equally fast.

This takes us to the startup times. Copies of fewer than six bytes use the normal restart approach to save code size. One could shave cycles there.

Here are the times I see for aligned copies:


1 byte:   48
2 bytes:  48+48       <--- a pure word routine could be faster here
3 bytes:  58
4 bytes:  58+48       <--- a pure dword routine could be faster here (26.5 cycles/byte)
5 bytes:  58+48+48
6 bytes:  68
7 bytes:  100
8 bytes:  100+48      <--- a pure dword routine could be faster here (19 cycles/byte)
9 bytes:  110
10 bytes: 110+48
11 bytes: 110+48+48
12 bytes: 120
13 bytes: 130+48
...
n bytes:  130 * (n // 12) + (-10, 48, 96, 58, 106, 154, 68, 100, 148, 110, 158, 206)[n % 12]

It asymptotes a bit below 11 cycles per byte copied (130/12) but can be around 19 cycles/byte for n=8 and n=11. By avoiding the normal restart approach for short copies, I could shave 16 cycles for n=2, 4, 8, or 10, and 32 cycles for n=5 or 11, at the expense of code size. SYS_CopyMemoryExt is slower because it does not have the three-byte bursts, but the vCPU code to do the same is so horrible that it matters less.
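
For anyone who wants to play with these numbers, the formula from the table can be transcribed directly into Python. The function names are made up, and the figures are only the measured cycle counts for aligned copies quoted above, not a general timing model.

```python
# Cycle offsets indexed by n % 12, copied from the table above.
REMAINDER_CYCLES = (-10, 48, 96, 58, 106, 154, 68, 100, 148, 110, 158, 206)


def copy_cycles(n):
    """Cycles for an aligned SYS_CopyMemory copy of n bytes (1..256)."""
    if not 1 <= n <= 256:
        raise ValueError("n must be in 1..256")
    return 130 * (n // 12) + REMAINDER_CYCLES[n % 12]


def cycles_per_byte(n):
    """Average cost per byte for an aligned copy of n bytes."""
    return copy_cycles(n) / n


if __name__ == "__main__":
    for n in (1, 4, 8, 11, 12, 240):
        print(n, copy_cycles(n), round(cycles_per_byte(n), 2))
```

Multiples of 12 are the sweet spot because the whole copy runs as pairs of 6-byte bursts. The lengths just below a multiple of 6, such as 5 and 11, pay for two extra single-byte restarts.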

By the way, your native code debugger in gtemuAT67 is cool.

- L.