Wednesday, December 13, 2017

Another patch to speed up the Plus/4 ROM.

There is a group of functions that the Plus/4 ROM copies to RAM on startup.  Each of these allows access to all of RAM, including the RAM under ROM, via a different pointer stored on page 0 (the direct page in Motorola terms).

Each of these functions disables interrupts, pages out the ROM, loads A via a pointer on page zero, pages in ROM, enables interrupts, and returns to the caller.  It's a lot of clock cycles to access a single byte.

For programs and data that are small enough to fit in memory without using RAM under ROM, we can patch the ROM to skip the costly sequence of instructions so that it directly accesses memory.

The patch must copy ROM to RAM, set the highest address available to BASIC so that it is before the start address of the ROM, and then install the code listed below.

Each of these functions is replaced with a stub that loads A with the page zero address that function uses and then jumps to a common patch routine.  The patch routine overwrites the calling JSR in the ROM code we copied to RAM so that it loads A directly, without the intermediate call.  Each JSR in ROM occupies 3 bytes: the opcode for JSR and 2 bytes for the address to call.  The LDA (),Y instruction takes only 2 bytes, so we must overwrite the 3rd byte with a NOP.  With this approach, every call to the load routines is patched the first time it is executed.

The resulting patched code requires only 7 clock cycles (5 for the LDA (),Y, or 6 if the indexed access crosses a page boundary, plus 2 for the NOP) instead of the 29 clock cycles the regular ROM requires.  The beauty of this approach is that it patches every call to these functions without us having to find them all.
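
The cycle figures can be tallied from the standard NMOS 6502 timing tables.  This is a rough sketch in Python, assuming the RAM routine is exactly the sequence described above and that the ROM is paged with absolute STA instructions, as in the CHRGOT listing further down.  The list names are made up for illustration.

```python
# Cycle counts from the standard NMOS 6502 instruction tables.
# Assumption: paging is done with STA absolute (4 cycles each).
ORIGINAL_PATH = [
    ("JSR abs", 6),      # call into the RAM-resident routine
    ("SEI", 2),          # disable interrupts
    ("STA abs", 4),      # page out the ROM
    ("LDA (zp),Y", 5),   # the actual read (+1 if indexing crosses a page)
    ("STA abs", 4),      # page the ROM back in
    ("CLI", 2),          # enable interrupts
    ("RTS", 6),          # return to the caller
]

PATCHED_PATH = [
    ("LDA (zp),Y", 5),   # the read, now inline at the call site
    ("NOP", 2),          # filler for the 3rd byte of the old JSR
]


def total_cycles(path):
    """Sum the cycle column of a list of (mnemonic, cycles) pairs."""
    return sum(cycles for _, cycles in path)


original_cycles = total_cycles(ORIGINAL_PATH)
patched_cycles = total_cycles(PATCHED_PATH)
print(original_cycles, patched_cycles)
```

By this count the read itself costs 5 cycles and the NOP adds 2; a page crossing adds 1 to either path.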

Warning: This assumes that the ROM never reaches these functions with a JMP instead of a JSR, an optimization that lets the function's RTS return straight to the original caller and saves a second RTS.  If we discover that technique was used somewhere, we must identify it and place an RTS after the LDA (),Y instead of a NOP.


; stub routine replacing each piece of code ROM calls
LDA #address ; load A with address normally used by LDA (address),Y
JMP PATCH ; jump to the patcher (the caller's return address stays on the stack)

; code to patch the ROM where it calls functions to access RAM under ROM
; TEMP, YSAVE and RETURNADDRESS are spare locations; RETURNADDRESS must be on page 0
PATCH:
STA TEMP ; save page zero address to use
STY YSAVE ; save the caller's Y, the patched LDA (),Y still needs it

; point to code we want to patch
;  JSR pushes the address of its own 3rd byte, so subtracting 3 gives
;  the address of the JSR minus 1, which is also the value RTS expects
PLA ; get LSB of return address
SEC ; set carry so SBC subtracts exactly 3
SBC #3 ; subtract 3 (point to address of JSR - 1)
STA RETURNADDRESS ; store it in our own page 0 pointer

;  Get MSB of return address from stack and adjust it if there was a borrow
PLA ; get MSB of return address
SBC #0 ; subtract the borrow from the LSB (PLA and STA leave carry alone)
STA RETURNADDRESS+1 ; save it in our pointer

; Push MSB of JSR address - 1 onto the stack
PHA

; patch 1st byte with LDA (),Y opcode
LDY #1 ; the pointer is JSR - 1, so the JSR opcode is at offset 1
LDA #$B1 ; load the opcode we want to patch with
STA (RETURNADDRESS),Y ; patch BASIC

; patch 2nd byte with address passed from stub routine
INY ; next address
LDA TEMP ; get the address that was passed to us
STA (RETURNADDRESS),Y ; patch BASIC

; patch 3rd byte with NOP
INY ; next address
LDA #$EA ; NOP to finish the patch
STA (RETURNADDRESS),Y ; patch it

; Push LSB of JSR address - 1 to stack and call the patched code with RTS
LDA RETURNADDRESS ; get LSB of JSR - 1
PHA ; push it to the stack
LDY YSAVE ; restore the caller's Y

RTS ; RTS adds 1, so execution resumes at the patched LDA (),Y
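
The stack arithmetic is easy to get wrong by one, so here is a small simulation of what the patcher has to do.  It is a sketch in Python rather than 6502 code, and it relies only on two facts about the CPU: JSR pushes the address of its own 3rd byte, and RTS adds 1 to the address it pulls.  The call site at $8FFE, the target $0494, and the function name are made up for illustration.

```python
JSR, LDA_IND_Y, NOP = 0x20, 0xB1, 0xEA


def patch_call_site(memory, pushed_return, zp_pointer):
    """Overwrite the JSR that pushed `pushed_return` with LDA (zp),Y / NOP.

    Returns the 16-bit value that must be pushed back on the stack so
    that RTS resumes at the first patched byte.
    """
    jsr_addr = (pushed_return - 2) & 0xFFFF  # JSR pushes the address of its 3rd byte
    if memory[jsr_addr] != JSR:
        raise ValueError("call site is not a JSR")
    memory[jsr_addr] = LDA_IND_Y             # 1st byte: LDA (zp),Y opcode
    memory[jsr_addr + 1] = zp_pointer        # 2nd byte: page zero pointer address
    memory[jsr_addr + 2] = NOP               # 3rd byte: filler
    return (jsr_addr - 1) & 0xFFFF           # RTS adds 1 to whatever it pulls


memory = bytearray(0x10000)
# Hypothetical call site that straddles a page boundary: JSR $0494 at $8FFE.
memory[0x8FFE:0x9001] = bytes([JSR, 0x94, 0x04])
resume = patch_call_site(memory, 0x9000, 0x3B)
print(hex(resume), memory[0x8FFE:0x9001].hex())
```

The value handed back to RTS is the pushed return address minus 3, and the high byte changes whenever that subtraction borrows, which is the case the MSB adjustment in the patcher has to handle.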

Thursday, December 7, 2017

Status update of a few projects

The USB serial adapter finally showed up so I can start working on the 68HC11 port of BASIC.  The IDE port is going to be on hold until I get that up and running.

The comments for the VZ disassembly are ready to go; I just need to extract the ones that match the VZ ROM and put them in the sed file.  Then it's just a matter of commenting a few pieces of code VTEC added to the ROM.  One of the things I ran across when benchmarking these old machines is how horribly slow the VZ BASIC is.  Even though it's clocked faster than the TRS-80 Model III and they both share a lot of code, the VZ takes almost twice as long on benchmarks.  The patches VTEC made clearly didn't help it.  Once I convert the disassembly back to a source file, I should be able to run a code profiler on it to see where the biggest bottleneck is and fix it.  There's plenty of unused space for fixes.


Kicking a dead horse... Commodore Plus/4 style

Here's a little patch to speed up the Commodore Plus/4.  This modifies the RAM based CHRGOT function that is used to scan through the BASIC code.  The standard code disables interrupts, pages out the ROM, reads a byte, pages in the ROM, and enables interrupts for every byte of a program it reads.   It does this so that it can provide up to 60K for BASIC.  This is certainly a nice feature if you need that much RAM, but if you don't, it slows down programs significantly for no reason.

This simple piece of code speeds up one benchmark by about 4%.  It is a pretty significant gain for a few hours' work and requires no changes to the ROM.  Getting this much extra speed out of the MC-10 was a lot harder and required a new ROM.  Actual performance increases will vary by program.  Still, I had hoped for better results.

Only programs that fit in memory below the start address of ROM will work with this as it eliminates the code that pages ROM in and out.  It makes no attempt to modify system variables to restrict code to that area, and it does not restrict the use of upper RAM for data.  Additional patches that restrict data to the same area of RAM would provide additional speed.


Here is the original code used by the Plus/4 BASIC interpreter from a ROM disassembly.  We are most interested in the code starting at $8129 in the ROM.  This is copied to RAM on startup:

        ; CHRGET/CHRGOT - This chunk of code is copied to RAM 
        ; and run from there. It is used to get data UNDER the 
        ; system ROM's for basic.
        ;
; CHRGET ($0473)
L8123   INC   LastBasicLineNo         ; $3b (goes to $0473 ) CHRGET
        BNE   L8129
        INC   LastBasicLineNo+1       ; $3c
;
; CHRGOT ($0479)
;
L8129   SEI    
        STA   RAM_ON
        LDY   #$00
        LDA   (LastBasicLineNo),y     ; $3b
        STA   ROM_ON
        CLI    
        CMP   #$3A   ; ":" (colon)
        BCS   L8143   ; if colon, exit
        CMP   #$20   ; " " (space)
        BEQ   L8123   ; if space, get NEXT byte from basic
        SEC    
        SBC   #$30
        SEC    
        SBC   #$D0
L8143   RTS    



This listing contains the new CHRGOT function.  Its code is embedded in the .BYTE section of the patch that follows this listing.  Note that the code is designed to exit without taking any branches for the most commonly found type of byte.  This saves a clock cycle for every such byte, as a taken branch requires one more clock cycle than a branch that is not taken.

00000r 1                .ORG $0473
000473  1               
000473  1                ;
000473  1                ; CHRGET/CHRGOT - This chunk of code is copied to RAM
000473  1                ; and run from there. It is used to get data UNDER the
000473  1                ; system ROM's for basic.
000473  1                ;
000473  1                ; CHRGET ($0473)
000473  1               L8123:
000473  1  E6 3B         INC LastBasicLineNo ; $3b (goes to $0473 ) CHRGET
000475  1  D0 02         BNE L8129
000477  1  E6 3C         INC LastBasicLineNo+1 ; $3c
000479  1                ;
000479  1                ; CHRGOT ($0479)
000479  1                ;
000479  1               L8129:
000479  1  A0 00         LDY #$00
00047B  1  B1 3B         LDA (LastBasicLineNo),y ; $3b
00047D  1  C9 3A         CMP #$3A ; Is it $3A (":") or larger?
00047F  1  90 01         BCC NEXT ; if not, skip to NEXT
000481  1  60            RTS ; return if so
000482  1               NEXT:
000482  1  E9 2F         SBC #$2F ; A=A-$30 (carry is clear here, so SBC subtracts 1 extra)
000484  1  C9 F0         CMP #$F0 ; Is it a " "? (space, $20-$30=$F0)
000486  1  F0 EB         BEQ L8123 ; if space, get NEXT byte from basic
000488  1  38            SEC ; A=A-$D0
000489  1  E9 D0         SBC #$D0 ; restores A, clear carry if digit, set otherwise
00048B  1  60            RTS
00048C  1               
00048C  1                .end
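
The rearranged compare logic should leave A and the flags exactly as the original did.  Here is a sketch that checks this for every byte value, written in Python with hand-rolled models of CMP and SBC (binary mode only).  The function names are made up, and only A, carry and zero are modelled.

```python
def cmp_(a, m):
    """6502 CMP: returns (carry, zero)."""
    return (a >= m, a == m)


def sbc(a, m, carry):
    """6502 binary-mode SBC: returns (result, carry). Carry clear means borrow."""
    diff = a - m - (0 if carry else 1)
    return (diff & 0xFF, diff >= 0)


def original_tail(a):
    """Original ROM logic, starting at the CMP #$3A."""
    carry, zero = cmp_(a, 0x3A)
    if carry:                        # BCS L8143
        return ("rts", a, carry, zero)
    _, zero = cmp_(a, 0x20)
    if zero:                         # BEQ L8123
        return ("next", None, None, None)
    a, _ = sbc(a, 0x30, True)        # SEC / SBC #$30
    a, carry = sbc(a, 0xD0, True)    # SEC / SBC #$D0
    return ("rts", a, carry, a == 0)


def patched_tail(a):
    """New logic, starting at the CMP #$3A."""
    carry, zero = cmp_(a, 0x3A)
    if carry:                        # BCC not taken, fall into RTS
        return ("rts", a, carry, zero)
    t, _ = sbc(a, 0x2F, False)       # carry is clear, so this subtracts $30
    _, zero = cmp_(t, 0xF0)
    if zero:                         # BEQ L8123 (space)
        return ("next", None, None, None)
    a, carry = sbc(t, 0xD0, True)    # SEC / SBC #$D0
    return ("rts", a, carry, a == 0)


mismatches = [a for a in range(256) if original_tail(a) != patched_tail(a)]
print(mismatches)
```

The saving comes from reusing the carry left clear by the first CMP, which removes one SEC and one compare from the path taken by digits and other bytes below $3A.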



This is the source code for the program that patches the CHRGOT function.  It is designed to be embedded in a REM statement in the first line of a BASIC program.  Note that the 2nd byte of the actual CHRGOT code has been changed from $00 to $01; it is patched back to $00 once the code has been copied to its final destination.  Microsoft BASICs don't advance to the next line by using the pointer stored at the start of the line once they start to parse a line.  They scan for the end of line marker, which is $00, and assume anything that follows is a line of BASIC code.  Storing the byte as something other than $00 is required so BASIC can skip to the next line every time the program runs.

000000r 1               
000000r 1                .org $1006 ; The address of ML$ in our BASIC program
001006  1               
001006  1  A0 12         LDY #18 ; Starts at CHRGOT+18 and works down...
001008  1               NEXT:
001008  1  B9 17 10      LDA CHRGOT,Y ; Get byte of new CHRGOT routine
00100B  1  99 79 04      STA $0479,Y ; Save it over the old routine
00100E  1  88            DEY ; decrement our loop counter/index register
00100F  1  10 F7         BPL NEXT
001011  1  C8            INY
001012  1  98            TYA
001013  1  99 7A 04      STA $047A,Y
001016  1  60            RTS
001017  1               CHRGOT:
001017  1  A0 01 B1 3B   .BYTE $A0,$01,$B1,$3B,$C9,$3A,$90,$01,$60,$E9,$2F,$C9,$F0,$F0,$EB,$38,$E9,$D0,$60
00101B  1  C9 3A 90 01  
00101F  1  60 E9 2F C9  
00102A  1                .end


This is the final BASIC code containing the patch.  It can be added to smaller programs to speed them up.  After the program has been run once, line 1 and the lines containing the DATA statements can be deleted.  The resulting program can be saved with the patch permanently embedded in the REM statement.

0 REM012345678901234567890123456789012345
1 FORI=0 TO 35:READ T:POKE 4102+I,T :NEXT I
2 SYS 4102

10000 DATA 160,18,185,23,16,153,121,4,136,16,247,200,152,153,122,4,96
10010 DATA 160,01,177,59,201,58,144,1,96,233,47,201
10020 DATA 240,240,235,56,233,208,96
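
A few properties of the DATA bytes are worth checking before trusting them: they must fill the REM payload exactly, contain no $00 byte, and the loader's copy loop must pick up the whole routine.  This is a quick sketch in Python using the values from the listing above; the variable names are made up.

```python
# The DATA values from lines 10000-10020, in order.
DATA = [
    160, 18, 185, 23, 16, 153, 121, 4, 136, 16, 247, 200, 152, 153, 122, 4, 96,
    160, 1, 177, 59, 201, 58, 144, 1, 96, 233, 47, 201,
    240, 240, 235, 56, 233, 208, 96,
]
REM_PAYLOAD = "012345678901234567890123456789012345"  # placeholder text in line 0
LOAD_ADDRESS = 4102                                   # POKE target, $1006

byte_count = len(DATA)
has_zero = 0 in DATA

# Operand of the LDA CHRGOT,Y instruction (bytes 3 and 4, little endian).
chrgot_address = DATA[3] | (DATA[4] << 8)
chrgot_offset = chrgot_address - LOAD_ADDRESS

# LDY #18 followed by a DEY/BPL loop copies indexes 18 down to 0.
copy_length = DATA[1] + 1
copied = DATA[chrgot_offset:chrgot_offset + copy_length]

print(byte_count, has_zero, hex(chrgot_address), len(copied))
```

The copied block starts with 160, 1 (LDY #$01), which is the placeholder byte the loader changes back to $00 after the copy.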



Here is the benchmark that prompted me to write this.

10 K=0:I=0:T=0:P=0
30 SCNCLR
100  PRINT "Prime Number Generator"
110  INPUT "Upper Limit";N

120  eTime=TIME
130  T=(N-3)/2
140  DIMA(T+1)

160 FORI=0TOT:A(I)=0:NEXT
200 FORI=0TOT:IFA(I)THENPRINT"..";:NEXT:GOTO330
210P=I+I+3:PRINTP;".";:K=I+P:IFK<=TTHENFORK=KTOTSTEPP:A(K)=1:NEXT:NEXT:GOTO330
260 NEXT

330  eTime=(TIME-eTime)/60
340  PRINT
350  PRINT "Total: ";eTime
360 END


Adding the following two lines will speed up the benchmark by over 30%.  They disable the screen refresh while the benchmark is running and turn it back on afterwards.  The screen refresh normally steals that many clock cycles away from the CPU.
115 POKE65286,PEEK(65286)AND239
335 POKE65286,PEEK(65286)OR16
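
The two masks are complements of each other within a byte, so the pair changes only one bit of the register.  This is a small sketch of the arithmetic in Python; the starting value $1B is an arbitrary example, not a value read from a real machine.

```python
REGISTER = 65286          # the address used in the POKE statements ($FF06)
CLEAR_MASK = 239          # AND mask from line 115
SET_MASK = 16             # OR mask from line 335

value = 0x1B              # arbitrary example register contents
blanked = value & CLEAR_MASK
restored = blanked | SET_MASK

print(hex(REGISTER), bin(CLEAR_MASK), bin(SET_MASK), hex(blanked), hex(restored))
```

AND 239 clears bit 4 and leaves the other seven bits alone; OR 16 sets bit 4 again.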