Sunday, February 25, 2018

An updated version.  Instead of a hard-coded ROM address, the patch now jumps to the address the ROM places on the direct page, plus 4 bytes, which skips two instructions.
The last 2 bytes could be dropped because they are now taken from direct page RAM.  But I included them so you can compare it to the assembler output.


0 REM01234567890123456789012345678901234567890123456
1 DATA60,54,55,150,246,129,126,38,25,220,247,195,1,5,131,1,1,221,252,206,67,113,236,1
2 DATA221,246,236,3,221,248,236,5,221,250,51,50,56,57,240,129,58,37,1,57,126
4 FORI=0TO44:READ A:POKE 17227+I,A:NEXT:STOP
5 EXEC17227


I had been converting the hex bytes from the assembly output to decimal by hand, but that is time-consuming.
This last time I used CoCo emulation on MESS (MAME) to do the conversions for me.
I removed the first two columns from the output and the original code (to the right of the actual code output), added &H to each hex number, separated them by commas, and used a program to output the decimal values.
Sorry again for the line wrapping.  There isn't a simple code embedding option in the Blogger editor.


1 DATA &H3C,&H36,&H37,&H96,&HF6,&H81,&H7E,&H26,&H19,&HDC,&HF7,&HC3,&H01,&H05,&H83,&H01,&H01
2 DATA &HDD,&HFC,&HCE,&H43,&H71,&HEC,&H01,&HDD,&HF6,&HEC,&H03,&HDD,&HF8,&HEC,&H05,&HDD,&HFA
3 DATA &H33,&H32,&H38,&H39,&HF0,&H81,&H3A,&H25,&H01,&H39,&H7E,&HE1,&HCC

4 FORI=0TO44:READ A:PRINTA;:NEXT
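For anyone without an emulator handy, the same conversion can be scripted on a modern machine.  This is only a minimal Python sketch, not the method used above.  The function names (`listing_to_bytes`, `data_lines`) are made up for illustration, and it assumes the listing layout used in this post: the hex bytes follow the address separated by single spaces, and the source text starts after a wider gap.

```python
import re

# line number, 4-digit address, then one or more hex bytes separated by
# single spaces; the source text must start after a gap of 2+ spaces.
LISTING_LINE = re.compile(
    r"^\s*\d+\s+[0-9A-Fa-f]{4}((?: [0-9A-Fa-f]{2})+)(?=\s{2,}|$)"
)


def listing_to_bytes(listing):
    """Return the assembled bytes of a TASM-style listing as integers."""
    values = []
    for line in listing.splitlines():
        match = LISTING_LINE.match(line)
        if match:  # lines without assembled bytes are skipped
            values.extend(int(tok, 16) for tok in match.group(1).split())
    return values


def data_lines(values, first_line=1, per_line=24):
    """Format the byte values as BASIC DATA statements."""
    lines = []
    for n, start in enumerate(range(0, len(values), per_line)):
        chunk = values[start:start + per_line]
        lines.append(f"{first_line + n} DATA" + ",".join(str(v) for v in chunk))
    return lines


SAMPLE = """\
0034   434B
0035   434B 3C          pshx ;preserve register contents
0038   434E 96 F6        ldaa $F6
0054   436D             exit:
0068   4377 7E E1 CC    AA jmp $E1CC
"""

if __name__ == "__main__":
    for line in data_lines(listing_to_bytes(SAMPLE)):
        print(line)
```

The wide-gap rule matters for lines like the `AA jmp $E1CC` one, where the label itself looks like a hex byte.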


And the updated assembler output:
0001   0000             ; Simple speed up for MC-10 Microcolor BASIC
0002   0000             ; (C) 2018  James Diffendaffer
0003   0000             ; May be freely redistributed
0004   0000             ; Date:  2/25/2018
0005   0000             ; code to patch the CHRGET function on the direct page
0006   0000             ; chrget is used to parse through the BASIC code and gets called a lot.
0007   0000             ; by checking the most frequent case in RAM, it saves a 3 clock cycle jmp
0008   0000             ; to the remainder of the function in ROM.  
0009   0000             
0010   0000             ; definitions for the TASM cross assembler needed for 6803 syntax
0011   0000             .MSFIRST        ; Most Significant byte first
0012   0000             
0013   0000             #define EQU     .EQU
0014   0000             #define ORG     .ORG
0015   0000             #define RMB     .BLOCK
0016   0000             #define FCB     .BYTE
0017   0000             #define FCC     .TEXT
0018   0000             #define FDB     .WORD
0019   0000             #define END .END
0020   0000             #define FCS .TEXT
0021   0000             
0022   0000             #define equ     .EQU
0023   0000             #define org     .ORG
0024   0000             #define rmb     .BLOCK
0025   0000             #define fcb     .BYTE
0026   0000             #define fcc     .TEXT
0027   0000             #define fdb     .WORD
0028   0000             #define end .END
0029   0000             #define fcs .TEXT
0030   0000             
0031   0000             ;start of code
0032   434B              org $434B ;1st address after a REM on the first line of code of the program.
0033   434B              ;this is constant on startup in Microcolor BASIC
0034   434B             
0035   434B 3C          pshx ;preserve register contents
0036   434C 36          psha
0037   434D 37          pshb
0038   434E 96 F6        ldaa $F6
0039   4350 81 7E        cmpa #$7E ; is it a jmp instruction?
0040   4352 26 19        bne exit ; if not we exit
0041   4354             
0042   4354 DC F7        ldd $F7 ; grab the current address JMP calls
0043   4356 C3 01 05    addd #$0105 ; hopefully this will skip the compare & branch now on the direct page
0044   4359 83 01 01    subd #$0101 ; - to avoid putting a zero in the code. 
0045   435C DD FC        std $FC ; save it at the end of the new code
0046   435E             
0047   435E CE 43 71    ldx #PATCH ; get the address of our patch code minus 1 (to avoid using LDD 0,X)
0048   4361 EC 01        ldd 1,X ; get the first two bytes
0049   4363 DD F6        std $F6 ; save them
0050   4365 EC 03        ldd 3,x ; get the next two
0051   4367 DD F8        std $F8 ; etc...
0052   4369 EC 05        ldd 5,x
0053   436B DD FA        std $FA
0054   436D             exit:
0055   436D 33          pulb ; restore registers
0056   436E 32          pula
0057   436F 38          pulx
0058   4370             
0059   4370 39          rts ; return to BASIC
0060   4371             
0061   4371             ; contains the patch
0062   4371             PATCH:
0063   4371 F0          FCB $F0 ; dummy byte so LDD doesn't have to use LDD 0,X
0064   4372             ; org $00F6 ; the address the patch is meant to run at.  Not really needed due to relative branch.
0065   4372 81 3A        cmpa      #':' ; set Z flag if statement separator
0066   4374 25 01        bcs       AA ; perform more tests if not
0067   4376 39          rts ; return if >= ':' 
0068   4377 7E E1 CC    AA jmp $E1CC ; jump to the parser back end.  The address can be dropped 
0069   437A              ; - because we copy the current one plus 4 now
0070   437A             
0071   437A              end
0072   437A              tasm: Number of errors = 0
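
The ADDD/SUBD pair in the listing above is worth a second look.  A quick Python sketch (hypothetical helper names, not part of the patch) shows that adding $0105 and then subtracting $0101 gives the same result as adding 4, while keeping $00 out of the assembled bytes.  The listing only says this avoids putting a zero in the code; the likely reason, stated here as an assumption, is that the code lives inside a REM and tokenized BASIC ends each line with a zero byte.

```python
def add16(value, imm):
    """16-bit add with wraparound, like ADDD on the D register."""
    return (value + imm) & 0xFFFF


def sub16(value, imm):
    """16-bit subtract with wraparound, like SUBD on the D register."""
    return (value - imm) & 0xFFFF


# Assembled bytes taken from the two listings in these posts.
ADD_4_DIRECT = bytes([0xC3, 0x00, 0x04])                    # addd #$0004
ADD_4_SPLIT = bytes([0xC3, 0x01, 0x05, 0x83, 0x01, 0x01])   # addd #$0105 / subd #$0101


def contains_zero(code):
    """True if any byte of the code is $00."""
    return 0 in code


# $E1C8 is the hard-coded JMP target $E1CC minus 4.
rom_target = 0xE1C8
patched = sub16(add16(rom_target, 0x0105), 0x0101)

if __name__ == "__main__":
    print(hex(patched), contains_zero(ADD_4_DIRECT), contains_zero(ADD_4_SPLIT))
```

The cost of the workaround is 3 extra bytes in the REM.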

Thursday, February 22, 2018

Code update

A small code update.  This version checks to see if the MC-10 patch is already installed.  That way, any BASIC that installs the patch itself will keep the proper JMP to its ROM intact.
This is the output from the assembler.
From left to right, line #, address, hex representation of the assembled code, source code.
You want to trim off the line number column, address column and original code.
Then convert the hexadecimal values to decimals.  That's what goes in the DATA statements.

*edit* this has been updated again in an attempt to improve compatibility.

0001   0000             ; Simple speed up for MC-10 Microcolor BASIC
0002   0000             ; (C) 2018  James Diffendaffer
0003   0000             ; May be freely redistributed
0004   0000             ; Date:  2/25/2018
0005   0000             ; code to patch the CHRGET function on the direct page
0006   0000             ; chrget is used to parse through the BASIC code and gets called a lot.
0007   0000             ; by checking the most frequent case in RAM, it saves a 3 clock cycle jmp
0008   0000             ; to the remainder of the function in ROM.  
0009   0000             
0010   0000             ; definitions for the TASM cross assembler needed for 6803 syntax
0011   0000             .MSFIRST        ; Most Significant byte first
0012   0000             
0013   0000             #define EQU     .EQU
0014   0000             #define ORG     .ORG
0015   0000             #define RMB     .BLOCK
0016   0000             #define FCB     .BYTE
0017   0000             #define FCC     .TEXT
0018   0000             #define FDB     .WORD
0019   0000             #define END .END
0020   0000             #define FCS .TEXT
0021   0000             
0022   0000             #define equ     .EQU
0023   0000             #define org     .ORG
0024   0000             #define rmb     .BLOCK
0025   0000             #define fcb     .BYTE
0026   0000             #define fcc     .TEXT
0027   0000             #define fdb     .WORD
0028   0000             #define end .END
0029   0000             #define fcs .TEXT
0030   0000             
0031   0000             ;start of code
0032   434B              org $434B ;1st address after a REM on the first line of code of the program.
0033   434B              ;this is constant on startup in Microcolor BASIC
0034   434B             
0035   434B 3C          pshx ;preserve register contents
0036   434C 36          psha
0037   434D 37          pshb
0038   434E 96 F6        ldaa $F6
0039   4350 81 7E        cmpa #$7E ; is it a jmp instruction?
0040   4352 26 16        bne exit ; if not we exit
0041   4354             
0042   4354 DC F7        ldd $F7 ; grab the current address JMP calls
0043   4356 C3 00 04    addd #$4 ; hopefully this will skip the compare & branch now on the direct page
0044   4359 DD FC        std $FC ; save it at the end of the new code
0045   435B             
0046   435B CE 43 6E    ldx #PATCH ; get the address of our patch code minus 1 (to avoid using LDD 0,X)
0047   435E EC 01        ldd 1,X ; get the first two bytes
0048   4360 DD F6        std $F6 ; save them
0049   4362 EC 03        ldd 3,x ; get the next two
0050   4364 DD F8        std $F8 ; etc...
0051   4366 EC 05        ldd 5,x
0052   4368 DD FA        std $FA
0053   436A             exit:
0054   436A 33          pulb ; restore registers
0055   436B 32          pula
0056   436C 38          pulx
0057   436D             
0058   436D 39          rts ; return to BASIC
0059   436E             
0060   436E             ; contains the patch
0061   436E             PATCH:
0062   436E F0          FCB $F0 ; dummy byte so LDD doesn't have to use LDD 0,X
0063   436F             ; org $00F6 ; the address the patch is meant to run at.  Not really needed due to relative branch.
0064   436F 81 3A        cmpa      #':' ; set Z flag if statement separator
0065   4371 25 01        bcs       AA ; perform more tests if not
0066   4373 39          rts ; return if >= ':' 
0067   4374 7E E1 CC    AA jmp $E1CC ; jump to the parser back end.  The address can be dropped 
0068   4377              ; - because we copy the current one plus 4 now
0069   4377             
0070   4377              end
0071   4377              tasm: Number of errors = 0

Speed up regular Microcolor BASIC on the MC-10

No ROM replacement needed for this... but it's not a lot faster.
It uses the same patch to the CHRGET function on the direct page that is used in the latest ROM.
The patch will be embedded in the REM in line 0 after the first run.
Then delete lines 9998-10000 and move the EXEC to line 1 in place of the GOSUB and save it.
It will already be embedded in the REM and there's no need to read/poke the data again.
The patch is actually resident in RAM on the direct page until you turn off the machine or it crashes.


BASIC code:

0 REM012345678901234567890123456789012345
1 GOSUB 10000
9998 DATA 60,54,55,206,67,101,236,1,221,246,236,3,221,248,236,5,221,250,236,7,221,252
9999 DATA 51,50,56,57,240,129,58,37,1,57,126,225,204
10000 FORI=0TO34:READ A:POKE 17227+I,A:NEXT:EXEC17227:RETURN
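
Before typing the program in, it is easy to sanity-check the numbers against each other.  The Python below is just an illustrative check (the variable names are made up): the DATA count has to match the FOR loop, the REM placeholder has to be at least as long as the code, and 17227 has to be the $434B the assembly is ORGed at.

```python
# Values copied from the BASIC program above.
REM_LINE = "0 REM012345678901234567890123456789012345"
DATA = [
    60, 54, 55, 206, 67, 101, 236, 1, 221, 246, 236, 3, 221, 248, 236, 5,
    221, 250, 236, 7, 221, 252,
    51, 50, 56, 57, 240, 129, 58, 37, 1, 57, 126, 225, 204,
]
LOOP_LAST_INDEX = 34      # FOR I=0 TO 34
EXEC_ADDRESS = 17227      # POKE 17227+I / EXEC17227

placeholder = REM_LINE.split("REM", 1)[1]   # characters the code overwrites
bytes_poked = LOOP_LAST_INDEX + 1

if __name__ == "__main__":
    print("DATA values:", len(DATA), "bytes poked:", bytes_poked)
    print("REM placeholder length:", len(placeholder))
    print("EXEC address in hex:", hex(EXEC_ADDRESS))
    print("all values fit in a byte:", all(0 <= v <= 255 for v in DATA))
```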


And the assembly for the TASM cross assembler:

; Simple speed up for MC-10 Microcolor BASIC
; (C)2018  James Diffendaffer
; May be freely redistributed

.MSFIRST        ; Most Significant byte first

#define EQU     .EQU
#define ORG     .ORG
#define RMB     .BLOCK
#define FCB     .BYTE
#define FCC     .TEXT
#define FDB     .WORD
#define END .END
#define FCS .TEXT

#define equ     .EQU
#define org     .ORG
#define rmb     .BLOCK
#define fcb     .BYTE
#define fcc     .TEXT
#define fdb     .WORD
#define end .END
#define fcs .TEXT

org $434B

pshx
psha
pshb
ldx #PATCH
ldd 1,X
std $F6
ldd 3,x
std $F8
ldd 5,x
std $FA
ldd 7,x
std $FC
pulb
pula
pulx

rts

PATCH
FCB $F0
org $00F6

cmpa      #':' ; set Z flag if statement separator
bcs       AA ; perform more tests if not
rts ; return if >= ':' 
AA jmp $E1CC ; jump to the parser back end

end

Wednesday, February 7, 2018

Finalizing a new ROM release

It's time for a new MC-10 ROM release.  I've squeezed about as much into 8K as possible within a reasonable amount of time.  As of now there are around 10 bytes free in the ROM, but it's not enough to implement storing the pointer to the next line... which was the only other optimization I thought might fit in 8K.


What's going to be in this release?

1. A faster divide.  Some cycles were removed from the inner loop.

2. Faster screen scroll.  INX INX is replaced with LDAB # ABX, and the loop is unrolled once.  This saves over 1000 clock cycles per scroll.

3. Faster end of line handling.

4. Faster array handling thanks to a 16x16 bit multiply using the hardware multiply.

5. Some minor optimizations here and there to save space and/or clock cycles.

6. The original parsing routine, CHRGET (common to all Microsoft BASICs), was split between direct page RAM built into the 6803, and ROM.  The original code would increment the memory pointer, load a byte, and then JMP to the 2nd half of the code in ROM.   The code has been updated so the full code is copied to RAM.  Since it extends past the end of the direct page ($00FF), it tests address $0100 to see whether the code copied there actually exists.  If not, it patches the code to JMP to the ROM.  Even if there isn't any expansion RAM at $0100, there was still room on the direct page to handle the most common case before jumping to ROM.  It's always faster than the factory ROM, but if you have RAM in that address range, it's even faster.
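
For item 4, the general idea of building a 16x16 multiply out of the 6803's 8x8 MUL instruction (A times B, unsigned, 16-bit result in D) can be sketched in a few lines of Python.  This is the textbook partial-product method with made-up function names, shown only as an illustration.  It is not the routine from the ROM, which may well keep only the low 16 bits and skip partial products it doesn't need.

```python
def mul8(a, b):
    """Stand-in for the 6803 MUL instruction: unsigned 8x8 -> 16 bits."""
    assert 0 <= a <= 0xFF and 0 <= b <= 0xFF
    return a * b


def mul16(x, y):
    """Unsigned 16x16 -> 32-bit multiply built from four 8x8 multiplies."""
    xh, xl = x >> 8, x & 0xFF
    yh, yl = y >> 8, y & 0xFF
    return (
        (mul8(xh, yh) << 16)                    # high * high
        + ((mul8(xh, yl) + mul8(xl, yh)) << 8)  # the two cross products
        + mul8(xl, yl)                          # low * low
    )


def mul16_low(x, y):
    """Low 16 bits only; the high * high product never reaches them."""
    xh, xl = x >> 8, x & 0xFF
    yh, yl = y >> 8, y & 0xFF
    return (((mul8(xh, yl) + mul8(xl, yh)) << 8) + mul8(xl, yl)) & 0xFFFF


if __name__ == "__main__":
    print(hex(mul16(0x1234, 0x5678)), hex(mul16_low(0x1234, 0x5678)))
```

When only a 16-bit result is needed, as with an array offset, three multiplies are enough.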


How does the performance compare to the original ROM?

My original goal was a minimum of a 5% speedup across the board, and beating 1 MHz 6502 machines like the Apple II and C64 at Ahl's Benchmark.  Testing has shown the latest version to be about 8% faster on the slowest code, 10% faster on code using arrays but little math, and math intensive things like the 3D "Fedora" plot which uses a lot of multiplication take about 33% less time.  Ahl's Benchmark still sits at about 67 seconds which is where it was after the first use of the hardware multiply, though it's tough to tell with hand timing.  The MC-10 consistently benchmarks faster than 1 MHz 6502 machines running Microsoft BASIC.  Not just at Ahl's Benchmark, but everything.  The speed difference vs the original ROM is now obvious when drawing, printing, etc...

The only thing left to do on this release is to track down a bug in the error message handling.  The error printing isn't receiving the right error codes.  This is probably due to an optimization that treats the status bits wrong.

Thursday, February 1, 2018

More 6803 optimization

One little-used speed optimization in 6803 code seems to be replacing multiple INX instructions with LDAB #xx ABX.
Part of this is due to needing to preserve the contents of B, but it is also because the INX instructions may be separated by LDAA ,X or similar instructions.

There are several places in Microcolor BASIC where INX INX, or an even longer run of INX instructions, is used.  The INX instruction takes 3 clock cycles and is 1 byte.  LDAB #2 requires 2 clock cycles and is 2 bytes.  ABX is 3 clock cycles and 1 byte.  So replacing INX INX with LDAB #2 ABX saves 1 clock cycle but takes 1 additional byte.  If the INX INX takes place in a little-used function, or one that does not impact the normal speed of execution of a program, it makes little sense to use this speed optimization.  But if the INX INX pair is inside a loop, it can save a lot of clock cycles.

Scrolling the text screen is one example.  This is not the actual interpreter's code, but it's close.

; X points to the destination address... $20 is 32, or the length of one line
LOOP:
  LDD $20,X
  STD ,X
  INX
  INX

  CPX  #ENDOFSCREEN-32
  BLT LOOP



The screen contains 32 characters / line * 16 lines, and the loop copies 2 bytes at a time.
Replacing INX INX with LDAB #2 ABX saves 1 clock cycle on each of roughly 16 * 16 passes, or about 256 clock cycles over the entire screen.
However, if you unroll the loop just once, it halves the number of LDAB, ABX, CPX, and BLT clock cycles, which saves an additional 6 clock cycles for every 2 bytes copied!  So the savings go from 256 to well over 1000 at the cost of 5 bytes over the original code.  That's at least enough clock cycles to execute another 250 instructions somewhere else.

LOOP:
  LDD $20,X
  STD ,X
  LDD $22,X
  STD 2,X
  LDAB #4
  ABX
  CPX #ENDOFSCREEN-32
  BLT LOOP
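
The cycle arithmetic for the three versions of the loop can be tallied with a short Python sketch.  The INX, LDAB # and ABX counts are the ones quoted in this post; the counts for LDD/STD indexed, CPX immediate and BLT are assumptions taken from memory of the 6801/6803 instruction tables, so treat the totals as approximate.  The pass count uses the same 512-byte estimate as the text rather than the exact number of bytes the ROM moves.

```python
# Clock cycles per instruction.  INX, LDAB # and ABX are quoted in the post;
# the rest are assumed values for the 6803.
CYCLES = {
    "LDD idx": 5,
    "STD idx": 5,
    "INX": 3,
    "LDAB #": 2,
    "ABX": 3,
    "CPX #": 4,
    "BLT": 3,
}

ORIGINAL = ["LDD idx", "STD idx", "INX", "INX", "CPX #", "BLT"]
ABX_ONLY = ["LDD idx", "STD idx", "LDAB #", "ABX", "CPX #", "BLT"]
UNROLLED = ["LDD idx", "STD idx", "LDD idx", "STD idx",
            "LDAB #", "ABX", "CPX #", "BLT"]


def cycles(body):
    """Clock cycles for one pass through the loop body."""
    return sum(CYCLES[op] for op in body)


def scroll_cycles(body, bytes_per_pass, bytes_moved=16 * 32):
    """Total cycles to move bytes_moved bytes with the given loop body."""
    return (bytes_moved // bytes_per_pass) * cycles(body)


if __name__ == "__main__":
    base = scroll_cycles(ORIGINAL, 2)
    for name, body, step in (("INX INX", ORIGINAL, 2),
                             ("LDAB/ABX", ABX_ONLY, 2),
                             ("unrolled", UNROLLED, 4)):
        total = scroll_cycles(body, step)
        print(f"{name:9} {cycles(body):2} cycles/pass, "
              f"{total} total, saves {base - total}")
```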



That is an obvious case.  It is less obvious when the INX instructions are spread over several lines of code.  This is from Microcolor BASIC.

;* End of command or program line
LE52A
          inx                           ; advance past the end-of-line terminator 3
          ldaa      ,X                  ; get MSB of 'next line' link 4
          inx                           ; advance to LSB 3
          oraa      ,X                  ; OR in the LSB of the 'next line' link 4
          staa      ENDFLG              ; clear ENDFLG if end of program 3
          beq       LE589               ; goto END if no more program lines 3
;* Start next program line
          inx                           ; point X to new line number 3
          ldd       ,X                  ; get new line number..
          std       CURLIN              ; ..and store in CURLIN
          inx                           ; advance to LSB of line number 3
          stx       CHRPTR              ; set parser position to start of line -1

This can be replaced with a shorter and faster version.  I had to verify the contents of X and B were not required in LE589 or the code this falls through to, but 11 lines have been replaced with 9, and only 7 are executed most of the time.  4 INX instructions were replaced here.  Savings aren't quite so significant within the scope of the code, but this gets executed at the end of every line of BASIC code, so it adds up over time.

;* End of command or program line
LE52A
          ldd       1,X                 ; get 'next line' link 5
          bne       LE52B               ; zero = no more program lines (the LDD sets flags this tests) 3
          staa      ENDFLG              ; clear ENDFLG, we are at the end of the program 3
          bra       LE589               ; goto END if no more program lines 3
LE52B
;* Start next program line
          ldd       3,X                 ; get new line number..
          std       CURLIN              ; ..and store in CURLIN
          ldab      #4                  ; size of line terminator + next line link + 1 2
          abx                           ; point X to LSB of new line number 3
          stx       CHRPTR              ; set parser position to start of line -1
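
To put a rough number on it, the cycle counts can be added up for the common path where another program line follows.  This Python tally uses the counts written at the end of the comments above.  The 4-cycle figures for STD and STX to a direct page address are assumptions, but they appear in both versions and cancel out of the difference.

```python
# (instruction, clock cycles) for the path where another line follows.
ORIGINAL = [
    ("inx", 3), ("ldaa ,X", 4), ("inx", 3), ("oraa ,X", 4),
    ("staa ENDFLG", 3), ("beq LE589", 3), ("inx", 3), ("ldd ,X", 5),
    ("std CURLIN", 4),   # assumed: direct page store
    ("inx", 3),
    ("stx CHRPTR", 4),   # assumed: direct page store
]

REPLACEMENT = [
    ("ldd 1,X", 5), ("bne LE52B", 3), ("ldd 3,X", 5),
    ("std CURLIN", 4),   # assumed: direct page store
    ("ldab #4", 2), ("abx", 3),
    ("stx CHRPTR", 4),   # assumed: direct page store
]


def total_cycles(path):
    """Sum the clock cycles of every instruction on the path."""
    return sum(count for _, count in path)


if __name__ == "__main__":
    before, after = total_cycles(ORIGINAL), total_cycles(REPLACEMENT)
    print(f"original: {len(ORIGINAL)} instructions, {before} cycles")
    print(f"replacement: {len(REPLACEMENT)} instructions, {after} cycles")
    print(f"saved per BASIC line: {before - after} cycles")
```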