I don't have 64K hardware at home, but I used dkholf's gtrun to test on my MacBook. It uses the loader/babelfish protocol on the IN port, and Loader itself, to pump files into memory (quite similar to Phil's JavaScript emulator). I believe yours does it in a somewhat different way. If so, that could explain the differences.
My theory is that if Boing incorrectly starts with just 1 channel, that channel will be updated 4 times as often and sound 2 octaves higher. But the XOUT sample frequency doesn't change, because a sample is still output every 4 scanlines. This might cause the aliasing effects that I hear. So for some reason it starts with 4 channels in ROM v4, but with 1 channel in ROM v5a. That's the hypothesis at least...
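
To make the arithmetic behind that hypothesis concrete, here is a rough Python sketch. It is not taken from the ROM source. The function names are made up for illustration, and the 7812.5 Hz sample rate is an assumption (a 31.25 kHz scanline rate divided by 4). It only shows why a 4x update rate means 2 octaves, and how a tone pushed above half the sample rate would fold back down.

```python
import math

def octave_shift(expected_channels: int, actual_channels: int) -> float:
    """Octaves gained when the oscillators step expected/actual times as often."""
    return math.log2(expected_channels / actual_channels)

def aliased_frequency(freq_hz: float, sample_rate_hz: float) -> float:
    """Frequency heard after sampling freq_hz at sample_rate_hz (simple folding)."""
    folded = freq_hz % sample_rate_hz
    if folded > sample_rate_hz / 2:
        folded = sample_rate_hz - folded
    return folded

# Assumed numbers, for illustration only.
SAMPLE_RATE_HZ = 7812.5   # assumption: one XOUT sample every 4 scanlines
intended_hz = 3000.0      # hypothetical note played with 4 channels active

shift = octave_shift(4, 1)               # 1 channel instead of 4
played_hz = intended_hz * 2 ** shift     # same note with the 4x update rate
heard_hz = aliased_frequency(played_hz, SAMPLE_RATE_HZ)

print(shift, played_hz, heard_hz)
```

With these assumed numbers the shifted tone lands above half the sample rate, so it folds back to a lower, unrelated pitch, which would fit the aliasing I think I hear.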