Sunday, April 23, 2017

Status update.

I really haven't spent much time on anything new since the last post.
There is a bug that seems to be part of the original conversion of the disassembly back to assembly code that did not appear in initial tests, but now it has become a show stopper.  This has forced me to go back to my original goal of creating a version of the source code that generates an identical binary to the original ROM.  Then I can test individual changes against that code instead of everything all at once.  Most of that work is complete but I probably have another 25-50 changes to go before it's ready.   It's a couple hours work at the most.

One of the easiest optimizations that came out of this effort was simply using an assembler that properly generates code for direct page references.  I'm actually had to force the assembler to generate the original opcodes using FCB and FDB.  This alone probably saved enough space vs the Easter eggs to implement the ELSE statement.  That's the first change I'll make to the code.
Then I'll insert the ELSE code, followed by a few small optimizations that use D where it's definitely safe.

Two of the optimizations that I want to use involve the CHARGET and storing the pointer to the next line.  They will certainly offer a speedup.  But to fully support the CHARGET changes requires looking at dozens of subroutines to find out what the X register contains for each one.  This is going to take a notebook or whiteboard and several days work.  Given the time constraints of the contest, it will have to be in a later release.  Storing the pointer to the next line is almost complete, but there is still one condition where it is failing and I need to see what is still messing up the FOR NEXT stack.  This optimization may also have to wait.

Everything previously mentioned will certainly offer a small performance increase..  I've also written faster versions of the code that moves strings and made a lot of other modifications that can save clock cycles.  When all those are rolled out, Microcolor BASIC will certainly be more responsive.   Running two copies of the VMC10 emulator side by side running the same program show a small but noticeable difference.

I have not worked on code to speedup the search for line numbers.  I looked at a simple design on paper and I probably need another 50 to 75 bytes for that.  It's just not going to fit in the limited ROM space.  Maybe some future ah-ha moments will free up enough space, but I have my doubts.

This brings us to the real bottleneck which is the math library.  I've made several optimizations but they all have to be retested from scratch thanks to the bug I mentioned.  So that's another thing that may not make the release.  Sadly, the fastest approach on paper to many of these routines is probably going to exceed available space.  It's an 8K ROM, not a Tardis.

Ultimately, I should have a release that supports ELSE and is faster soon.  It won't be as fast as I'd like, but there will certainly be follow up releases as I test the other optimizations.  Hopefully some of the better ones will be ready before the end of the contest.  The fixes to the bad Microsoft math constants can already be used as an option, so that should be in the release as well.

Monday, April 10, 2017

And now I messed up the FOR NEXT stack frame when I moved some stuff related to the next line pointer around.  This should be fun to track down.  <sigh>

Another speed optimization I'm working on involves parsing every character via the CHRGET subroutine.  CHRGET is a piece of self modifying code that is copied from ROM to the direct page at startup.  The function looks like this:

;* Byte parser subroutine utilizing self-modifying code.
;* This routine is copied to RAM at CHRGET ($00EB) during cold start.
          fcb       INIDAT-PARSER       ; number of bytes to copy
PARSER    inc       CHRPTR+1            ; increment LSB of parse location
          bne       LF7D8               ; branch if no carry
          inc       CHRPTR              ; increment MSB of parse location
;     ldaa      $0000              ; load byte from parse location into ACCA
fcb $B6,$00,$00
          jmp       BPARSE              ; call back-end of parser routine in ROM

If we can place CHRPTR in X, we can move this to ROM right in front of BPARSE and reduce it from 22(?) clock cycles to 7(?) clock cycles.  If we don't have to update CHRPTR until we exit BPARSE, we save even more clock cycles since BPARSE loops back to CHARGET.
This also only resulted in a 1 byte increase in code size thanks to removal of a couple STX CHPTR instructions elsewhere.  Sadly, only 2 out of 20+ calls can use this optimization without major changes, and neither will speed up program execution.

ldaa ,X ; get the next character

If someone builds an MC-10 clone using a 68HC11 based microcontroller (as has been suggested in the MC-10 yahoo group), the code could easily be optimized my placing CHRPTR in the Y register.  Then every update to CHRPTR just involves loading Y or incrementing Y, and CHRGET becomes INY LDAA ,Y.
This should speed up the interpreter significantly.
I asked the MC-10 yahoo group what other changes to add if there is sufficient ROM space after ELSE support.  The only requested change so far is for correction of the SIN coefficients.  Microsoft apparently used the wrong values in many (all?) versions of BASIC.

Here are the changes.  The commented out data ontains the original values, and it is followed by the replacement table thanks to Neil.
SINCOF    fcb       6-1                 ; six coefficients in the table
;          fcb       $84,$E6,$1A,$2D,$1B ; -14.38139067 -(2*Pi)^11/11!
;          fcb       $86,$28,$07,$FB,$F8 ;  42.00779712  (2*Pi)^9 / 9!
;          fcb       $87,$99,$68,$89,$01 ; -76.70417026 -(2*Pi)^7 / 7!
;          fcb       $87,$23,$35,$DF,$E1 ;  81.60522369  (2*Pi)^5 / 5!
;          fcb       $86,$A5,$5D,$E7,$28 ; -41.34170210 -(2*Pi)^3 / 3!
;          fcb       $83,$49,$0F,$DA,$A2 ;   6.28318531  (2*Pi)^1 / 1!
fcb $84,$F1,$83,$A7,$EF ; -15.094642578 -(2*Pi)^11/11
fcb $86,$28,$3C,$1A,$44 ;  42.058693944  (2*Pi)^9 / 9
fcb $87,$99,$69,$66,$73 ; -76.705859753 -(2*Pi)^7 / 7
fcb $87,$23,$35,$E3,$3C ;  81.605249276  (2*Pi)^5 / 5
fcb $86,$A5,$5D,$E7,$31 ; -41.341702240 -(2*Pi)^3 / 3
fcb $83,$49,$0F,$DA,$A2 ;   6.283185307  (2*Pi)^1 / 1

Friday, April 7, 2017

Most of the math library depends on maintaining the value of A or B registers.
This makes it difficult to rewrite most of the functions to use D without rewriting a significant portion of the math library.  Since the Retro Challenge only runs for a month and I'm only doing this in my spare time, it is unlikely I will be able to implement all changes required for a significant speed improvement, but it is definitely a little faster already.

Tuesday, April 4, 2017

Another thing I've noticed about the way the interpreter was written, is that it doesn't take advantage of Motorola's indexed addressing to optimize some code.
It is common for the interpreter to do something like this:

* End of command or program line
           inx                           ; advance past the end-of-line terminator
           ldaa      ,X                  ; get MSB of 'next line' link
           inx                           ; advacne to LSB
           oraa      ,X                  ; OR in the LSB of the 'next line' link
           staa      ENDFLG              ; clear ENDFLG if end of program
           beq       LE589               ; goto END if no more program lines
 * Start next program line
           inx                           ; advance to LSB of line number
           inx                           ; point X to new line number
           ldd       ,X                  ; get new line number..
           std       CURLIN              ; ..and store in CURLIN
           stx       CHRPTR              ; set parser position to start of line -1

The author does not seem to realize that
    LDAA   ,X
is the same as
    LDAA   0,X

If there are several INX instructions used together like this, then the following code is faster and smaller.  Note that the code at LE589 doesn't care that X has not been updated..
;* End of command or program line
         ldaa 1,X
          oraa      2,X                  ; OR in the LSB of the 'next line' link
          staa      ENDFLG              ; clear ENDFLG if end of program
          beq       LE589               ; goto END if no more program lines
;* Start next program line
          ldd       3,X                 ; get new line number..
          std       CURLIN              ; ..and store in CURLIN
          ldab #4                   ; advance to LSB of line number
          stx       CHRPTR              ; set parser position to start of line -1

The 4 INX instructions in the original code require 12 clock cycles.  LDAB ABX requires 5 clock cycles.

The final code also saves the pointer to the next line since it only takes 5 more clock cycles to load the extra byte and save the pointer.  The end result is that the new code actually takes a fewer number of clock cycles than the old code even though it saves the pointer to the next line, and the new code only requires one additional byte. 

April 4, 1997 Update

Microcolor BASIC does some things slowly to conserve space in the ROM and memory.
A perfect example of this has to do with how the program skips over the remainder of a line.
The interpreter reads a byte at a time until if finds the 0 (zero) line terminator, then it increments the current position by 1 to point to the next line.  
This happens for REM, ON GOTO, ON GOSUB, IF THEN, etc... so obviously it will also happen on lines that use IF THEN ELSE.
So how do we speed this up?

The interpreter stores program lines in the following format:
16 bit word pointing to next line
16 bit word holding the current line number
tokenized BASIC code for the line
0 line terminator

Even thought the interpreter has access to a pointer to the next line, it does not use it except when searching for a line number.

A faster way to deal with this is to save the pointer to the next line when BASIC starts interpreting a new line, and then load that pointer instead of searching for the end of the line.
This change adds an extra 4 clock cycles at the start of each line, but calling the search for an end of line marker might take a hundred clock cycles, and removal of a related opcode in the search function reduces each call by 2 clock cycles.  If the interpreter has to search for ":" multiple times, it's even faster.  The longer the lines and more complex the code, the greater the improvement in speed.

There is a drawback to this.  It requires two more bytes on the direct page to store the pointer to the next line.  But BASIC doesn't directly allow access to that, and modern RAM expansions provide more direct page space anyway, so it shouldn't prove to be too much of an inconvenience.

Saturday, April 1, 2017

The new ROM is crashing do to incorrect code being created for the CHARGET code in ROM.
I forced it to create the correct LDAA instruction using FCB.

Some testing will be required before I know everything works, but I can enter and run a simple hello world program.

The assembler is is optimizing all calls to CHARGET as direct addressing, where the ROM uses extended addressing calls.  This saves a byte and clock cycle for every JSR call to the routine, so rather than force extended addressing with FCB, if tests don't reveal any issues, I'll call the first milestone met.  There are a lot of calls to this routine and my time would be better spent on something other than removing an easy optimization.

There are already 71 bytes free with the existing IF handler in place.  There should already be enough space for the new code to handle ELSE, and hopefully, to implement some more radical speed optimizations to the code.