A Bitbanger's Blog: 2017

Wednesday, December 13, 2017

Another patch to speed up the Plus/4 ROM.

There is a group of functions that the Plus/4 ROM copies to RAM on startup. Each of these allows access to all of RAM, including the RAM under ROM, via a different pointer stored on page 0 (the direct page in Motorola terms).

Each of these functions disables interrupts, pages out the ROM, loads A via a pointer on page zero, pages in ROM, enables interrupts, and returns to the caller. It's a lot of clock cycles to access a single byte.

For programs + data that are small enough to fit in memory without using RAM under ROM, we can patch the ROM to skip the costly sequence of instructions so that it directly accesses memory.

The patch must copy ROM to RAM, set the highest address available to BASIC so that it is before the start address of the ROM, and then install the code listed below.

Each function that is called is replaced with a stub that loads A with the page zero address that function uses and then jumps to a common patch routine. The patch routine overwrites the calling JSR in the ROM code we copied to RAM so that it loads A directly, without the intermediate call. Each JSR in ROM occupies 3 bytes: the opcode for JSR and 2 bytes for the address to call. The LDA (),Y instruction takes only 2 bytes, so we must overwrite the 3rd byte with a NOP. With this approach, every call to the load routines is patched the first time it is executed.

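The resulting patched code only requires 7 clock cycles (5 for the LDA (),Y and 2 for the NOP, plus 1 more if the indexed access crosses a page boundary) instead of the 29 clock cycles the regular ROM requires for the JSR, the routine, and the RTS. The beauty of this approach is that it patches every call to these functions without us having to find them all.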

Warning: This assumes that the ROM does not use any optimizations where a JMP was used to call the code so that it would eliminate the need for two RTS instructions. If we discover that technique was used somewhere, we must identify it and place an RTS after the LDA (),Y instead of a NOP.

; stub routine replacing each piece of code ROM calls
LDA #address ; load A with address normally used by LDA (address),Y
JMP PATCH ; call the patcher

; code to patch the ROM where it calls functions to access RAM under ROM
; TEMP, YSAVE and RETURNADDRESS (2 bytes) are assumed to be spare page 0 locations
PATCH:
STA TEMP ; save page zero address to use
STY YSAVE ; preserve the caller's Y, the patched LDA (),Y still needs it
LDY #1 ; our pointer will hold JSR address - 1, so index from 1

; point to code we want to patch
; JSR pushed the address of its own 3rd byte, so subtracting 3 gives
; JSR address - 1, which is also the value RTS needs to land on the JSR
PLA ; get LSB of return address
SEC ; no borrow going in
SBC #3 ; subtract 3
STA RETURNADDRESS ; store it in our own page 0 pointer

; Get MSB of return address from stack and adjust it if the LSB borrowed
PLA ; get MSB of return address
SBC #0 ; carry is clear here only if the subtraction above borrowed

; Push adjusted MSB back onto the stack
PHA

; patch 1st byte with LDA (),Y opcode
STA RETURNADDRESS+1 ; save it in our pointer
LDA #$B1 ; load the opcode we want to patch with
STA (RETURNADDRESS),Y ; patch BASIC

; patch 2nd byte with address passed from stub routine
INY ; next address
LDA TEMP ; get the address that was passed to us
STA (RETURNADDRESS),Y ; patch BASIC

; patch 3rd byte with NOP
INY ; next address
LDA #$EA ; NOP to finish the patch
STA (RETURNADDRESS),Y ; patch it

; Push adjusted LSB to stack, restore Y, and call the patched code with RTS
LDA RETURNADDRESS ; get LSB of JSR address - 1
PHA ; push it to the stack
LDY YSAVE ; restore the caller's Y

RTS ; call the patched code
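
The return address arithmetic is the easy part to get wrong, so here is a rough Python model of what the patcher does to one call site. It is only a sketch: the function name, the JSR target $0494 and the call site at $8100 are made up for illustration, and the only 6502 facts it relies on are that JSR pushes the address of its own last byte (MSB first) and that RTS adds 1 to whatever it pulls.

```python
def patch_call_site(memory, stack, zp_address):
    """Model what the patcher does to one JSR call site.

    memory: bytearray standing in for the RAM copy of the ROM
    stack: list used as the 6502 stack, last item is the top
    zp_address: page zero pointer the original routine loaded through
    """
    # JSR pushed MSB then LSB of (JSR address + 2), so the LSB comes off first.
    lsb = stack.pop()
    msb = stack.pop()
    pushed = (msb << 8) | lsb

    rts_target = (pushed - 3) & 0xFFFF       # JSR address - 1, what RTS expects
    jsr_address = (rts_target + 1) & 0xFFFF  # first byte of the JSR itself

    # Overwrite the 3-byte JSR with LDA (zp),Y followed by NOP.
    memory[jsr_address:jsr_address + 3] = bytes([0xB1, zp_address, 0xEA])

    # Push the adjusted address back so RTS re-enters at the patched bytes.
    stack.append(rts_target >> 8)
    stack.append(rts_target & 0xFF)
    return jsr_address


# Hypothetical call site at $8100, chosen so the subtraction has to borrow.
memory = bytearray(0x10000)
memory[0x8100:0x8103] = bytes([0x20, 0x94, 0x04])  # JSR $0494 (made-up target)
stack = [0x81, 0x02]                               # MSB, then LSB of $8102
site = patch_call_site(memory, stack, 0x3B)
print(hex(site), memory[site:site + 3].hex(), [hex(b) for b in stack])
```

The page boundary case is the one worth testing on real hardware too, since that is where the MSB has to be decremented.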

Thursday, December 7, 2017

Status update of a few projects

The USB serial adapter finally showed up so I can start working on the 68HC11 port of BASIC. The IDE port is going to be on hold until I get that up and running.

The comments for the VZ disassembly are ready to go. I just need to extract the ones that match the VZ ROM and put them in the sed file. Then it's just a matter of commenting a few pieces of code VTEC added to the ROM. One of the things I ran across when benchmarking these old machines is how horribly slow the VZ BASIC is. Even though it's clocked faster than the TRS-80 Model III and they both share a lot of code, the VZ takes almost twice as long on benchmarks. The patches VTEC made clearly didn't help it. Once I convert the disassembly back to a source file, I should be able to run a code profiler on it to see where the biggest bottleneck is and fix it. There's plenty of unused space for fixes.

Kicking a dead horse... Commodore Plus/4 style

Here's a little patch to speed up the Commodore Plus/4. This modifies the RAM based CHRGOT function that is used to scan through the BASIC code. The standard code disables interrupts, pages out the ROM, reads a byte, pages in the ROM, and enables interrupts for every byte of a program it reads. It does this so that it can provide up to 60K for BASIC. This is certainly a nice feature if you need that much RAM, but if you don't, it slows down programs significantly for no reason.

This simple piece of code speeds up one benchmark by about 4%. That is a pretty significant gain for a few hours' work, and it requires no changes to the ROM. Getting this much extra speed out of the MC-10 was a lot harder and requires a new ROM. Actual performance increases will vary by program. Still, I had hoped for better results.

Only programs that fit in memory below the start address of ROM will work with this as it eliminates the code that pages ROM in and out. It makes no attempt to modify system variables to restrict code to that area, and it does not restrict the use of upper RAM for data. Additional patches that restrict data to the same area of RAM would provide additional speed.

Here is the original code used by the Plus/4 BASIC interpreter from a ROM disassembly. We are most interested in the code starting at $8129 in the ROM. This is copied to RAM on startup:

; CHRGET/CHRGOT - This chunk of code is copied to RAM
; and run from there. It is used to get data UNDER the
; system ROM's for basic.
;
; CHRGET ($0473)
L8123 INC LastBasicLineNo ; $3b (goes to $0473 ) CHRGET
BNE L8129
INC LastBasicLineNo+1 ; $3c
;
; CHRGOT ($0479)
;
L8129 SEI
STA RAM_ON
LDY #$00
LDA (LastBasicLineNo),y ; $3b
STA ROM_ON
CLI
CMP #$3A ; ":" (colon)
BCS L8143 ; if colon, exit
CMP #$20 ; " " (space)
BEQ L8123 ; if space, get NEXT byte from basic
SEC
SBC #$30
SEC
SBC #$D0
L8143 RTS

This listing contains the new CHRGOT function. Its code is embedded in the .BYTE section of the patch that follows it. Note that the code is designed to exit without taking any branches for the most commonly found type of byte. This saves a clock cycle for every such byte, as a branch taken requires one more clock cycle than a branch not taken.

00000r 1 .ORG $0473
000473 1
000473 1 ;
000473 1 ; CHRGET/CHRGOT - This chunk of code is copied to RAM
000473 1 ; and run from there. It is used to get data UNDER the
000473 1 ; system ROM's for basic.
000473 1 ;
000473 1 ; CHRGET ($0473)
000473 1 L8123:
000473 1 E6 3B    INC LastBasicLineNo ; $3b (goes to $0473 ) CHRGET
000475 1 D0 02    BNE L8129
000477 1 E6 3C    INC LastBasicLineNo+1 ; $3c
000479 1 ;
000479 1 ; CHRGOT ($0479)
000479 1 ;
000479 1 L8129:
000479 1 A0 00    LDY #$00
00047B 1 B1 3B    LDA (LastBasicLineNo),y ; $3b
00047D 1 C9 3A    CMP #$3A ; Larger than $3A?
00047F 1 90 01    BCC NEXT ; if not, skip to NEXT
000481 1 60 RTS ; return if so
000482 1 NEXT:
000482 1 E9 2F    SBC #$2F ; A=A-$30
000484 1 C9 F0    CMP #$F0 ; Is it a " "? (space)
000486 1 F0 EB    BEQ L8123 ; if space, get NEXT byte from basic
000488 1 38 SEC ; A=A-$D0
000489 1 E9 D0    SBC #$D0 ; clear carry if digit, set otherwise
00048B 1 60 RTS
00048C 1
00048C 1 .end
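
As a sanity check on the rewritten tail of CHRGOT, here is a small Python model that runs both versions over every possible byte value and compares what the caller sees: whether the routine loops back to skip a space, and otherwise the value in A plus the carry and zero flags on return. It is a sketch only. The helper names are made up, and it models just binary-mode SBC and CMP, not the memory access or the paging.

```python
def sbc(a, m, carry):
    """6502 binary-mode SBC: returns (result, carry_out)."""
    total = a - m - (1 - carry)
    return total & 0xFF, 1 if total >= 0 else 0


def cmp_(a, m):
    """6502 CMP: returns (carry, zero) without changing A."""
    total = a - m
    return (1 if total >= 0 else 0), (1 if total == 0 else 0)


def original_tail(a):
    """The compare/subtract sequence from the ROM listing."""
    c, z = cmp_(a, 0x3A)
    if c:                       # BCS L8143
        return ("exit", a, c, z)
    c, z = cmp_(a, 0x20)
    if z:                       # BEQ L8123, skip the space
        return ("skip", None, None, None)
    a, c = sbc(a, 0x30, 1)      # SEC / SBC #$30
    a, c = sbc(a, 0xD0, 1)      # SEC / SBC #$D0
    return ("exit", a, c, 1 if a == 0 else 0)


def patched_tail(a):
    """The sequence from the new CHRGOT listing."""
    c, z = cmp_(a, 0x3A)
    if c:                       # BCC not taken, RTS
        return ("exit", a, c, z)
    a, c = sbc(a, 0x2F, 0)      # carry is clear here, so this subtracts $30
    c2, z = cmp_(a, 0xF0)       # $20 - $30 wraps to $F0
    if z:                       # BEQ L8123, skip the space
        return ("skip", None, None, None)
    a, c = sbc(a, 0xD0, 1)      # SEC / SBC #$D0
    return ("exit", a, c, 1 if a == 0 else 0)


mismatches = [b for b in range(256) if original_tail(b) != patched_tail(b)]
print("bytes that behave differently:", mismatches)
```

Agreement here only says the flag logic matches. It says nothing about the missing paging, which is the whole point of the patch.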

This is the source code for the program that patches the CHRGOT function. It is designed to be embedded in a REM statement in the first line of a BASIC program. Note that the 2nd byte of the actual CHRGOT code has been changed from $00 to $01; it is patched back to $00 once the code has been copied to its final destination. Once it starts to parse a line, Microsoft BASIC doesn't advance to the next line by using the pointer stored at the start of the line. It scans for the end of line marker, which is $00, and assumes anything that follows is a line of BASIC code. Storing the byte as something other than $00 is required so BASIC can skip to the next line every time the program runs.

000000r 1
000000r 1 .org $1006 ; The address of ML$ in our BASIC program
001006 1
001006 1 A0 12    LDY #18 ; Starts at CHRGOT+18 and works down...
001008 1 NEXT:
001008 1 B9 17 10 LDA CHRGOT,Y ; Get byte of new CHRGOT routine
00100B 1 99 79 04 STA $0479,Y ; Save it over the old routine
00100E 1 88 DEY ; decrement our loop counter/index register
00100F 1 10 F7    BPL NEXT
001011 1 C8 INY
001012 1 98 TYA
001013 1 99 7A 04 STA $047A,Y
001016 1 60 RTS
001017 1 CHRGOT:
001017 1 A0 01 B1 3B   .BYTE $A0,$01,$B1,$3B,$C9,$3A,$90,$01,$60,$E9,$2F,$C9,$F0,$F0,$EB,$38,$E9,$D0,$60
00101B 1 C9 3A 90 01
00101F 1 60 E9 2F C9
00102A 1 .end

This is the final BASIC code containing the patch. It can be added to smaller programs to speed them up. After the program has been run once, line 1 and the lines containing the DATA statements can be deleted. The resulting program can be saved with the patch permanently embedded in the REM statement.

0 REM012345678901234567890123456789012345
1 FORI=0 TO 35:READ T:POKE 4102+I,T :NEXT I
2 SYS 4102

10000 DATA 160,18,185,23,16,153,121,4,136,16,247,200,152,153,122,4,96
10010 DATA 160,01,177,59,201,58,144,1,96,233,47,201
10020 DATA 240,240,235,56,233,208,96
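
If you retype the DATA statements, a quick cross-check against the assembler listing can save some head scratching. The Python sketch below just compares the decimal values from the DATA lines with the hex bytes from the listing, and confirms that none of them is $00 (which would end the REM line early). The variable names are made up for illustration.

```python
LOAD_ADDRESS = 4102  # start of the REM text, $1006

# Bytes from the assembler listing: the 17 byte loader, then the CHRGOT image.
listing_bytes = bytes.fromhex(
    "A0 12 B9 17 10 99 79 04 88 10 F7 C8 98 99 7A 04 60 "
    "A0 01 B1 3B C9 3A 90 01 60 E9 2F C9 F0 F0 EB 38 E9 D0 60"
)

# Values from BASIC lines 10000, 10010 and 10020.
data_lines = [
    [160, 18, 185, 23, 16, 153, 121, 4, 136, 16, 247, 200, 152, 153, 122, 4, 96],
    [160, 1, 177, 59, 201, 58, 144, 1, 96, 233, 47, 201],
    [240, 240, 235, 56, 233, 208, 96],
]
poked = bytes(value for line in data_lines for value in line)

print("byte count:", len(poked))
print("matches listing:", poked == listing_bytes)
print("contains a zero byte:", 0 in poked)
print("CHRGOT image starts at:", hex(LOAD_ADDRESS + 17))
```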

Here is the benchmark that prompted me to write this.

10 K=0:I=0:T=0:P=0
30 SCNCLR
100 PRINT "Prime Number Generator"
110 INPUT "Upper Limit";N

120 eTime=TIME
130 T=(N-3)/2
140 DIMA(T+1)

160 FORI=0TOT:A(I)=0:NEXT
200 FORI=0TOT:IFA(I)THENPRINT"..";:NEXT:GOTO330
210P=I+I+3:PRINTP;".";:K=I+P:IFK<=TTHENFORK=KTOTSTEPP:A(K)=1:NEXT:NEXT:GOTO330
260 NEXT

330 eTime=(TIME-eTime)/60
340 PRINT
350 PRINT "Total: ";eTime
360 END

Adding the two lines below will speed up the benchmark by over 30%. They disable the screen refresh while the benchmark is running and turn it back on afterwards. The screen refresh normally steals that many clock cycles away from the CPU.

115 POKE65286,PEEK(65286)AND239

335 POKE65286,PEEK(65286)OR16
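
For anyone not used to reading decimal POKE masks, this is all the two lines do, written out in Python. The starting register value 0x1B is just an example I picked, not something read from a real machine.

```python
REGISTER = 65286       # address used by both POKEs
CLEAR_MASK = 239       # AND mask used in line 115
SET_BITS = 16          # OR value used in line 335


def blank_screen(value):
    """Line 115: clear bit 4 and leave the other bits alone."""
    return value & CLEAR_MASK


def restore_screen(value):
    """Line 335: set bit 4 again."""
    return value | SET_BITS


example = 0x1B  # hypothetical register contents with bit 4 set
blanked = blank_screen(example)
print(hex(REGISTER), bin(CLEAR_MASK), bin(SET_BITS))
print(hex(example), "->", hex(blanked), "->", hex(restore_screen(blanked)))
```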

Thursday, November 2, 2017

Just posting a link to a simple music player I wrote for the Apple II Mockingboard a few years ago.
It's written for the 6502 AS65 assembler that comes with the CC65 C compiler.
The Mockingboard has two 6522 VIA chips and two General Instrument AY sound chips. The VIAs are used to interface the sound chips, and provide additional features such as programmable interrupts.

The player uses a repeating timer interrupt from the VIA chip to play music with minimal impact on the main program that uses it. It's a port of a music player I wrote for the Oric a few years earlier. The Oric has similar hardware out of the box, but only one VIA and one AY chip. That version used a one shot timer, though, because I didn't have VIA docs at the time.

The interrupt handler is pretty low impact. It decrements the timer and exits quickly if no sound chip register needs to be set. It just dumps raw data to the sound chip when required. The data format is documented in the source code, and the code is pretty well commented.
It's not very practical for large pieces of music due to the lack of repeats and resulting size of the data, but for simple background music that repeats endlessly, or one shot sound effects it works really well. The code still needs some work in order to play sound samples and music at the same time, but that's just a matter of adding additional counters for each sound.
The code includes sample data that plays chords, and background music for several screens from Donkey Kong. The emulator disk image should run a demo that lets you select which song to play.

To build the code you will need the AS65 assembler. I used a simple DOS batch file to build it instead of a makefile. If you use that, it requires a2tools to automatically transfer the file to a disk image. a2tools can be found on Asimov, but it's only a 32 bit exe. Let me know if you need a 64 bit version and I'll post my custom build. Or you can delete that line from the batch file, and use CiderPress to transfer it to a disk image.

FWIW, this is a good though simple example of how you can drive a piece of hardware in "real time". You could output and/or sample data at specific intervals. Just adjust the timer for the rate you need and remember to calculate the delay from the trigger of the interrupt until you actually read or write data so you don't drop any data or set a piece of hardware too late.

Here's the link. There is a newer version of the source code in my project directory, but this is a version I shared previously so it should be working. If I get a chance to test the newer code and build a new archive I'll post that.
https://www.dropbox.com/s/xibf8xu9zh3n5bu/DonkeyKongApple.zip?dl=0

Wednesday, October 4, 2017

When all is sed and done...

Yeah it's a corny title for the post but nobody's been reading my blog anyway so who cares?

To generate the raw disassembly I simply had to define all the ROM entry points from the command line. There are a lot of them but it's pretty straightforward. The disassembler follows links (except for RST calls) and it's just sort of slogging through the code finding what's real and what's garbage. Using the --xref option on the disassembler helps because we can see if it finds anything calling that piece of code. Due to having to define so many things as entry points, a lot of the data is useless, but it still helps.

Adding labels and comments is a bit different. Since labels (variables or functions) can occur in multiple places, we need to use pattern matching to save having to repeat the same operation over and over again. Comments are only added to a single line each, but you need to find the address they are associated with and then append the comment to the end of the line. So we have to generate the disassembly with the address at the start of each line which is done using the --lst option. One side effect of this is it generates a huge call graph at the end of the disassembly which slows down our pattern searching, and it really isn't needed due to the known design of the BASIC interpreter. We can get rid of it manually, but why not just have a program do it. We could write a custom program to do this, but there is no need as the Unix sed command can already do all of this! I haven't released the sed file I'm using yet, but I'll give a quick explanation of what commands I'm using to perform these actions. This is no substitute for the sed manual or other references on the web!

sed can take commands from the command line but since we are performing so many operations, they have all been placed in a single file that is read by using the -f directive on the command line followed by the name of the file containing our list of sed operations.

Since the raw disassembly needs the call graph removed from the end of the file to speed up the rest of the pattern matching (the call graph is huge), that's the first thing the sed script does. Thankfully, the disassembler places an identifier at the head of the call graph that says "Call Graph:".
We simply tell sed to find the line with that string and delete everything from there to the end of the file with the d command. Notice that the string is placed inside forward slash (divide) symbols. The $ inside the slashes anchors the match to the end of a line, so only a line that ends with "Call Graph:" is matched. The ,$ that follows turns this into a range that runs from the matched line through the last line of the file, and the d tells sed to delete every line in that range.

/Call Graph:$/,$d
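
If sed isn't your thing, the same range delete can be sketched in a few lines of Python. This is only meant to show what the sed address range does (keep everything before the first line that ends with "Call Graph:", drop that line and everything after it). The function name and the sample lines are made up.

```python
import re


def strip_call_graph(lines):
    """Return the lines that come before the first line ending in 'Call Graph:'."""
    kept = []
    for line in lines:
        if re.search(r"Call Graph:$", line):
            break  # same effect as deleting from this line to the last line
        kept.append(line)
    return kept


sample = [
    "0C70: 21 2D 79 L0C70: LD HL,792Dh",
    "0C73: 7E LD A,(HL)",
    "; Call Graph:",
    "L0C70 - called from somewhere",
]
for line in strip_call_graph(sample):
    print(line)
```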

Now that the Call Graph is gone, sed needs to relabel functions and system variables by global pattern search and replace. If we want to replace a function label, we search for L followed by the address the function resides at. The s tells sed to search for the string between the 1st and 2nd slashes, and to replace it with the string between the 2nd and 3rd slashes. The g says to perform this for every match on a line, not just the first. So everywhere a function is called, it will be given a meaningful label instead.

s/L1DAE/END/g
s/L1CA1/FOR/g
s/L0138/RESET/g
s/L0135/SET/g
s/L01C9/CLS/g

The same works for system variables, but those are just raw addresses and not function calls, so the disassembler labels them as a 4 digit hex number followed by h, signifying that it is a hexadecimal number. In the case below, we must identify functions that reside in ROM and that are copied to RAM where the program actually calls them. Notice the L prefix on the ROM code and the h postfix on the RAM routines. Variables would have h as well.

s/L06D2/COMPAREROM/g
s/7800h/COMPARE/g
s/L06D5/CHARGETROM/g
s/7803h/CHARGET/g
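
The relabeling step amounts to a table of plain text substitutions applied to every line. Here is a rough Python equivalent using the four substitutions above. The function name and the sample lines are invented for the example. The last sample shows why this approach needs some care: a plain text replace can't tell an address used as a variable from the same number used as an ordinary constant.

```python
RELABELS = {
    "L06D2": "COMPAREROM",
    "7800h": "COMPARE",
    "L06D5": "CHARGETROM",
    "7803h": "CHARGET",
}


def relabel(line, table=RELABELS):
    """Apply every substitution to the line, like a list of s/old/new/g commands."""
    for old, new in table.items():
        line = line.replace(old, new)
    return line


print(relabel("L06D2: JP 7800h"))   # a label and a reference on one line
print(relabel("CALL 7803h"))        # a call to the RAM copy
print(relabel("LD BC,7800h"))       # gets relabeled even if it was only a constant
```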

The last (so far) function we need to perform is to append comments to the end of lines based on address. Most of the work here was extracting the comments from the pdf of Level II BASIC Decoded and Other Mysteries. Once the code that is borrowed from Level II BASIC is identified, we can just append the comments to the proper lines using the address. Again, we need to perform a pattern search, but then we must append the comment to the end of the line. The pattern between the first pair of slashes selects the line by its hexadecimal address. The s command that follows then substitutes the end of the line, which is what $ matches, with the comment string, and that has the effect of appending the comment. The --lst option gives us the hex addresses followed by the colon for each line and this provides a unique pattern we can match with. Again, strings are contained between / slash symbols. In case you haven't already noticed, this also means our strings cannot contain the slash symbol unless it is escaped. There are other special characters but I won't go into that. Just watch the use of special characters that sed looks for.

/0013 /s/$/ ;--- Save BC - Keyboard routine/
/0014 /s/$/ ;--- B = Entry code/
/0016 /s/$/ ;--- Go to driver entry routine (3C2)/
/0018 /s/$/ ;--- RST 18 (JP 1C90H) Compare DE:HL/
/001B /s/$/ ;--- Save BC - Display routine, printer routine/
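
Here is the same idea as a small Python sketch: look up each line's address in a table and append the matching comment. It assumes the listing format shown in the sample output, a 4 digit hex address followed by a colon at the very start of the line, and it only checks the start of the line so an address that shows up as an operand is left alone. The function name and the sample lines are made up for illustration.

```python
COMMENTS = {
    "0013": ";--- Save BC - Keyboard routine",
    "0014": ";--- B = Entry code",
    "0016": ";--- Go to driver entry routine (3C2)",
}


def add_comments(lines, comments=COMMENTS):
    """Append a comment to each line whose leading 'XXXX:' address is in the table."""
    result = []
    for line in lines:
        address = line[:4]
        if line[4:5] == ":" and address in comments:
            line = f"{line} {comments[address]}"
        result.append(line)
    return result


sample = [
    "0013: C5 PUSH BC",          # hypothetical listing lines
    "0C70: 21 2D 79 LD HL,792Dh",
    "0100: CD 13 00 CALL 0013h",
]
for line in add_comments(sample):
    print(line)
```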

The end result looks something like this. The formatting is messed up partially due to Blogger, but the spacing for the comments will need to be aligned. I think the Entry Point comments will also be deleted since they aren't very useful. *Update* Entry Point labels deleted

; --- START PROC DBL_SUB ---
0C70: 21 2D 79 DBL_SUB: LD HL,792Dh ;--- Double precision subtraction routine. ** cont--> *
0C73: 7E LD A,(HL) ;--- Load MSB of saved value
0C74: EE 80 XOR 80h ;--- Invert sign
0C76: 77 LD (HL),A ;--- And restore
; Referenced from 12B4, 0F84, 0E5C
; --- START PROC DBL_ADD ---
0C77: 21 2E 79 DBL_ADD: LD HL,EXP2 ;--- HL=addr of exponent in WRA2 ************ cont--> *
0C7A: 7E LD A,(HL) ;--- Load exponent from WRA2
0C7B: B7 OR A ;--- Set status flags for exponent
0C7C: C8 RET Z ;--- Exit if WRA2 value zero

No comment

I'm still waiting on the USB to serial adapter for testing the 68hc11 BASIC. I should have ordered from Newegg instead of ebay. But that leaves more time to work on the VZ ROM disassembly.

The goal is a fully commented VZ ROM disassembly, but up until now I have only thrown in a few comments for things like RST calls.

Below is a little sample of what the disassembly looks like once comments are added.
All function names and comments were added by sed. The comments were extracted from the OCR text of a book for the TRS-80. The formatting could use some work, but it's significant progress. This will result in a fully commented disassembly but with actual labels for system functions and variables that were never in any books. VZ specific code will still need comments, but this will take care of a significant portion of the ROM.

This has involved a lot of data entry, and editing, but the speed at which changes can be made and applied to the entire file is well worth the effort. I should also be able to generate disassemblies of other Z80 machines using Microsoft BASIC much faster if they share system variables and code. The TRS-80 itself should be trivial other than fixing the ROM entry points for the disassembler and moving system variables.

07C3: 21 25 79 L07C3: LD HL,SIGN ;--- Reset sign flag so that ************ see note--> *
07C6: 7E LD A,(HL) ;--- mantissa will have a negative sign
07C7: 2F CPL ;--- Invert the sign flag
07C8: 77 LD (HL),A ;--- Store sign flag
07C9: AF XOR A ;--- Zero A
07CA: 6F LD L,A ;--- then save it
07CB: 90 SUB B ;--- Complement B (0 - B)
07CC: 47 LD B,A ;--- Save new value of B
07CD: 7D LD A,L ;--- Reload zero into A
07CE: 9B SBC A,E ;--- Complement E (0 - E)
07CF: 5F LD E,A ;--- Save new value for E
07D0: 7D LD A,L ;--- Reload A with zero
07D1: 9A SBC A,D ;--- Complement D (0 - D)
07D2: 57 LD D,A ;--- Save new D value
07D3: 7D LD A,L ;--- Reload A with zero
07D4: 99 SBC A,C ;--- Complement C (0 - C)
07D5: 4F LD C,A ;--- Save new C value
07D6: C9 RET ;---Rtn to caller *********** Unpack a SP number ******

Tuesday, October 3, 2017

First Phase of VZ ROM Disassembly Mostly Completed

The VZ ROM disassembly is mostly complete from just a raw code standpoint.
It still needs comments, more labels, etc... but it's to the point where you can follow a lot of the code.

Quite a few things are labeled with the sed script now including:
Math constant tables
Key math library functions (add, subtract, multiply, divide)
Commands in the token table
Code & data copied to RAM on startup
RST calls are tagged so you can see what is being called
Interrupt handler
Many system variables
Many text strings for prompts or errors
etc...

There are still a couple of holes that haven't been disassembled. I have to determine if they are used or not and, if they are, where the entry points are. There is some dead space in the ROM that was filled with garbage, and that isn't tagged in the data yet.

A header needs to be created so it can be reassembled. At that point I can start removing dead code and filler that the VZ doesn't need. Then the empty space can be used for new commands or to speed up the math library by unrolling some loops. That's still a ways off though.

Thursday, September 21, 2017

VZ Disassembly continued...

I'm waiting on a USB to serial adapter to come from China, so the 68hc11 BASIC is pretty much on hold, and I haven't even looked at the IDE interface since I last mentioned it. The IDE port doesn't really need to be finished until I can use the 68hc11 board, and that was going to be used by BASIC... so it's not an immediate priority. I'm working on the VZ ROM disassembly in the time being.

As for the disassembly... there are now enough system variables being labeled by the sed script to where it's possible to identify what many parts of the ROM are doing.

One thing I discovered is that YAZD may not properly handle the RST instruction. It disassembles it, but it may not treat it like a JP. This left some commonly called functions as data. Microsoft used RST to conserve ROM space. While it certainly works, it also slows down the interpreter. But regardless of why, it left me having to manually define all the RST calls as entry points. I have also blindly labeled many other entry points to see what the code looks like.

Anyway... here is another excerpt from the disassembly to give you an idea of how things are shaping up. There are still system variables left to be defined, some of the entry points I manually created will need to be removed, and comments still need to be added. But since the process is automated, it only takes seconds to regenerate the disassembly and add or change labels. Once I create a block of define statements for system variables and label addresses containing string data, I should be able to reassemble the code. I'm debating on whether to fix the disassembler so it can do a lot of this or to just script it in sed to be done with it.

; Referenced from 3747
L3752: INC HL
DEC BC
LD A,C
OR B
JR NZ,L3743
LD HL,7839h
RES 3,(HL)
LD HL,VERIFY_MSG
CALL PUTSTRING
LD HL,OK_MSG
CALL PUTSTRING
JP L36CF

VERIFY_MSG: DB 0Dh
DB 56h ; 'V'
DB 45h ; 'E'
DB 52h ; 'R'
DB 49h ; 'I'
DB 46h ; 'F'
DB 59h ; 'Y'
DB 20h ; ' '
DB 00h

Tuesday, September 19, 2017

VZ ROM Disassembly

Where the combination of YAZD and sed works, you get code that looks like this. With a pass through sed adding comments, this could look really good without major code manipulation on my part.

; Entry Point
; --- START PROC RUN ---
RUN: JP Z,L1B5D
CALL 79C7h
CALL L1B61
LD BC,1D1Eh
JR L1EC1
; Entry Point
; --- START PROC GOSUB ---
GOSUB: LD C,03h
CALL L1963
POP BC
PUSH HL
PUSH HL
LD HL,(78A2h)
EX (SP),HL
LD A,91h
PUSH AF
INC SP
; Referenced from 1EAF
; --- START PROC L1EC1 ---
L1EC1: PUSH BC
; Entry Point
; --- START PROC GOTO ---
GOTO: CALL L1E5A
; Referenced from 1FC7
; --- START PROC L1EC5 ---
L1EC5: CALL REM
PUSH HL
LD HL,(78A2h)
RST 0x18

However, YAZD needs to take configuration input from a file. You should be able to manually identify blocks of data including size/type/length (byte, 16 bit word, string), entry/data points should have configurable labels instead of manually generated ones, and data labels should automatically be generated as an option. Using sed works, but it would be nice to complete this in a single step, especially for other people that aren't familiar with Unix/Linux.

One problem with sed is that it can be a bit indiscriminate in how it performs its search and replace. Relabeling addresses that are formatted like "L1EC1:" works well, but adding a custom label to actual references to that label (CALL, JP, etc.) can potentially change unrelated values, where YAZD would be able to know the difference.

YAZD is open source, but it's poorly commented. Figuring out the code takes time (though most of it doesn't look bad), and that's before I make changes. It just adds more time to an already time consuming process, and I still have to create a file with comments for the ROM based on existing comments from the TRS-80. At least I don't have to create everything just from the disassembly.

Friday, September 15, 2017

Fedora Plot MC-10

Another comparison of the factory MC-10 ROM vs the modified version.
This is a 3D plot of the "Fedora" hat.
I still have to fix a speed issue with the divide, but it seems stable and all but a handful of programs seem to work. The ones that don't have some address dependencies that would need to be patched.
Once I fix the divide it should be ready for a final release for 8K. Many of the changes I have planned for the 68hc11 version will directly work with the 6803. I plan on saving some of the 68hc11 specific code for last so both versions can be developed in parallel.

Wednesday, September 13, 2017

I ran across a reverse engineered copy of Extended Color BASIC (mostly just Color BASIC actually) that had been ported to other 6809 systems. It's been sitting on my hard drive for some time but I had forgotten about it. At first glance, the math library appears to be very similar to Microcolor BASIC. I could probably squeeze several changes from my MC-10 code in, but I'll have to locate a more intact version of the original ROMs if they exist. There are also a couple things I could borrow for the 68hc11 and 6803 versions.

Saturday, September 9, 2017

IDE/ATA interface details

The IDE/ATA interface I posted is a simple 8 bit design that buffers the high byte of the 16 bit word the interface uses. It's more what you'd use for a Z80 or 6502 machine than an 8 bit chip with 16 bit support like the 6803, 68hc11, 6809, etc... Many of the IDE/ATA interfaces I've looked at treat the high byte as a zero and only use half the storage capacity of a device. It works but it won't be readable on a PC.

The control lines from the CPLD to the transceiver are not connected because I want to work out the logic required for a 16 bit interface first. The two bi-directional transceivers required for a 16 bit interface each have 6 control lines, so the control logic will require 12 outputs, multiplexed outputs, or several signals being duplicated between both chips. The CPLD I selected only has 8 configurable I/O lines. I know a few signals are duplicates, but the device may be a bit too small. It's cheap though, so if I can get it to work, someone should be able to build an interface for under $30 using perf board, and that's probably a high estimate.

Wednesday, August 30, 2017

IDE/ATA Port

This is the start of an IDE/ATA interface for the HC11 development board. It's mostly a distraction to take a break from the HC11 BASIC and VZ200 ROM disassembly.

It only represents a few hours' work, and I spent more time fighting with the tool for designing the parts than on the design itself.

Once the schematic and board layout are complete, I still have to write the code for the CPLD.

My EPROM/device programmer probably doesn't work with Windows 10 so I may have to buy a new one. I haven't even purchased a USB to serial interface for the development board yet.

The driver software won't be that difficult, but I have to decide what I'll use for a file system.

Tuesday, August 29, 2017

ADCD, ADCD, wherefore art thou ADCD? (or... What is missing on the 6803?)

The 6803 isn't bad, but here is a list of what I think are the biggest oversights or shortcomings in the design.

No prefetch
A prefetch makes a huge difference in performance without having to increase the clock speed. It typically reduces the number of clock cycles per instruction by 1. One of the key reasons why the 6502 is so fast is its prefetch. Perhaps the biggest speed improvement the Hitachi 6309 has over the 6809 is a prefetch. The Hitachi HD6303 (compatible with the 6803) has a prefetch. That makes the 6303 at least 20% faster with ZERO changes to the code. An MC-10 with a 6303 should benchmark similar to 2MHz 6502s or 4MHz Z80s.
No direct addressing form of INC or DEC.
When you have so few registers, directly incrementing or decrementing your loop counters in memory saves a lot of register shuffling. A direct addressing mode would save a clock cycle for every INC or DEC executed in a loop. Execute a loop 100 times, save over 101 clock cycles if you include initializing the counter. And that's if you aren't using 16 bits where you might need to DEC or INC twice per loop. Where you don't use indexing, you can use X as a counter, but that's usually for counting down to zero where you don't have to perform a 16 bit test, just BNE until the loop is complete. Direct addressing also requires 1 less byte everywhere it's used.
No ADCD (Add Carry D) or SBCD (subtract carry D). The missing ADCD really impacts the math library. You have to use two instructions, ADCB ADCA, which slows down the code and makes it larger. The BASIC math library could have used 20 or more ADCD and many of them are in loops. Really, 16 bit support on the 6803 and 6809 should have been better. One of the key advancements the 6309 has over the 6809 is finishing out the 16 bit opcodes.
No XGDX (exchange D & X) instruction.
Without XGDX, moving data between D & X requires multiple instructions and an intermediate RAM location. You have to perform math on pointers, but other than ABX, you can only do that with the accumulator. This could have sped up my math library optimizations quite a bit. It would speed up my line drawing code as well. The HD6303 and 68hc11 support XGDX. The 68hc12 and 6809 have LEA for performing some math on index registers, and you have the ability to transfer between registers. The same pretty much goes for XGDS. It would make a significant difference when adjusting the stack from compiled code. It's not as efficient as LEAS, but turning 8 or more instructions into 3 and eliminating multiple RAM accesses is a pretty significant improvement.
No Y index register like the 6809, 68hc11, etc... The single index register presents a few problems. You can use the stack pointer for some operations, but that may involve disabling interrupts and you can't use offsets from it like X.
No stack relative addressing. Stack relative addressing is important for compilers. Most high level language compilers pass parameters on the stack, and dynamically allocate variables on the stack. When you only have one index register, accessing variables passed on the stack can become a register swapping mess. Just adding stack relative addressing suddenly makes the 6803 a lot more efficient for compiled code. The 8080, Z80, and 6502 lack stack relative addressing as well, but the 6803 suffers from a bit more index register swapping. It also makes using the stack pointer as an index register much more like using X.
No divide instruction. Hardware math is faster than software math. A hardware divide is going to speed up a lot of math intensive applications. When combined with the MUL instruction, you have a machine that's pretty good for Mandelbrots, calculating primes, fractals, 3D plots, etc... All fun stuff that 8 bits take hours to do. The instruction would only be used a few times in an 8K ROM, but the difference in speed where it would be used is significant.
The built-in hardware is addressed on the direct page and cannot be relocated. It interferes with existing software applications from 6800 systems like FLEX. It's not a big deal if you have the source code and don't need to use all of the direct page, but patching binaries to use different index registers isn't simple. The 68hc11 uses $1000, so it doesn't interfere with the direct page, but then it interferes with the code. The HD64180 (Z180) allows you to select where in the Z80 I/O region the hardware is located. Being able to set the high nibble or byte would let the hardware be addressed in a region FLEX normally uses for hardware.
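
To make the cost of the missing ADCD from the list above concrete, here is a minimal C sketch of a 32 bit mantissa add built from 8 bit add-with-carry steps, the way ADCB followed by ADCA has to be used. The function names (adc8, adc16, add_mantissa) are made up for illustration and are not from the ROM source.

```c
#include <stdint.h>

/* Model of one 8 bit add with carry (what ADCA or ADCB does). */
static uint8_t adc8(uint8_t a, uint8_t b, unsigned *carry)
{
    unsigned sum = (unsigned)a + b + *carry;
    *carry = sum >> 8;              /* carry out of bit 7 */
    return (uint8_t)sum;
}

/* The missing ADCD, emulated as two 8 bit steps: low byte (B) first,
 * then high byte (A), with the carry flowing between them. */
static uint16_t adc16(uint16_t d, uint16_t m, unsigned *carry)
{
    uint8_t lo = adc8((uint8_t)(d & 0xFF), (uint8_t)(m & 0xFF), carry);
    uint8_t hi = adc8((uint8_t)(d >> 8), (uint8_t)(m >> 8), carry);
    return (uint16_t)((hi << 8) | lo);
}

/* Hypothetical 32 bit mantissa add: two ADCD sized steps, which on the
 * 6803 really means four 8 bit add-with-carry instructions. */
uint32_t add_mantissa(uint32_t x, uint32_t y, unsigned *carry_out)
{
    unsigned carry = 0;
    uint16_t lo = adc16((uint16_t)(x & 0xFFFF), (uint16_t)(y & 0xFFFF), &carry);
    uint16_t hi = adc16((uint16_t)(x >> 16), (uint16_t)(y >> 16), &carry);
    *carry_out = carry;
    return ((uint32_t)hi << 16) | lo;
}
```

Every adc16 call in the sketch stands for one instruction with ADCD and two without it, which is where the extra size and time in the math loops comes from.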

Sunday, August 20, 2017

68hc11 BASIC

This morning I made a preliminary pass through the Microcolor BASIC derived source code to come up with an abridged list of changes from the 6803 version. This skips over the sub tasks for each line item.

Remove cassette support.
Remove sound support.
Remove graphics functions
Remove MC-10 hardware from the memory map
Remove 6803 hardware definitions from the memory map
Add 68hc11 hardware definitions and reserve space in the memory map
Add the setup and serial I/O functions to interface with a terminal.
Change CLS to work with serial I/O for a terminal.
Add a BELL command to beep the terminal
Change all printed output to be via terminal
Change all input to be via terminal
Move ELSE command up in the keyword and token tables, remove patches related to ELSE
Change memory moves to use Y instead of the stack pointer
Add code that saves the pointer to the next line
Update divide function to use hardware instructions
Replace SQR function with optimized version
Update multiply with 16 bit multiply instructions

The first five are almost complete. Some pieces of code will be needed once new hardware is added, but I can cut and paste from the original 6803 code when the time comes.

The terminal I/O will require the most work unless I can locate a disassembly of the 6800 version. I suppose I could do it myself but it's probably faster to just implement from scratch.

There are several things I may work on before finishing the last few items on that list.
The cruncher could use a little work. It could automatically remove extra spaces, and the fix of inserting a colon before the else doesn't check if there is already a colon in the source code.
PRINT USING would make a nice addition. There are also a few bugs in the ROM that have never been addressed. And this BASIC could use a line editor at some point.
That may be secondary to adding a file system for an IDE interface. I/O is actually much simpler than for cassette, but cassette doesn't have to maintain any kind of directory structure. It may make more sense for support to be more like IDE interfaces on the TRS-80 Model I than the FAT file system from the PC.
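
The cruncher changes mentioned above (dropping extra spaces, and only inserting a colon before ELSE when one isn't already there) could look something like this in C. This is only a sketch of the idea working on plain text, not the tokenizing code in the ROM. The name crunch_line and the buffer sizing rule are assumptions made for the example.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical cruncher pre-pass working on plain text, not on tokens.
 * Outside string literals it collapses runs of spaces to one space and
 * inserts ':' before ELSE only if the last non-space character written
 * is not already ':'. dst must have room for 2 * strlen(src) + 1 bytes. */
void crunch_line(const char *src, char *dst)
{
    size_t out = 0;
    int in_string = 0;      /* nonzero while inside "..." */
    char last = '\0';       /* last non-space character written */

    while (*src != '\0') {
        if (!in_string && strncmp(src, "ELSE", 4) == 0) {
            if (last != ':')
                dst[out++] = ':';       /* add the colon only when missing */
            memcpy(dst + out, "ELSE", 4);
            out += 4;
            src += 4;
            last = 'E';
            continue;
        }
        if (*src == '"')
            in_string = !in_string;     /* toggle on every quote */
        if (!in_string && *src == ' ' && out > 0 && dst[out - 1] == ' ') {
            src++;                      /* drop the extra space */
            continue;
        }
        if (*src != ' ')
            last = *src;
        dst[out++] = *src++;
    }
    dst[out] = '\0';
}
```

Text inside quotes is copied untouched, so a string that happens to contain ELSE or double spaces is left alone.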

A 68hc12 build should be possible once the first 11 steps are finished. Code optimizations that take advantage of its new features could be added later.

Friday, August 18, 2017

Google is building new OS from scratch

Just a few ramblings related to Google's new OS, Fuchsia. This isn't related to my current projects, but it could prove to be important in the future.

Why a new OS? Companies have been basing more and more systems on Linux. It has almost become the default for building new systems. But should it be that way, and if not, what should they use?

I've personally spent years developing for various flavors of Unix (Unix, Solaris, Linux, etc...) and several years working on embedded systems. I really like Unix/Linux as a development environment, but it seems inappropriate as the basis for many systems, especially embedded systems. It can also present a rather complex environment from a programming standpoint depending on what parts of the OS your code has to interact with.

Linux is based on Unix. There's no getting around it, Linux is pretty much a free implementation of Unix. Unix was designed during the computing world of the late 60s and early 70s. Most programs ran in batch mode, they didn't require much if any networking, they all ran from the console, and they weren't written in an object oriented programming language. The flexibility provided for console based applications is nothing short of amazing, but then it was a console oriented world, and Unix was designed as a development environment. Don't get me wrong, Unix has seen a lot of development since then, but those are its roots. GUIs didn't exist, the internet didn't exist, object oriented programming didn't really catch on for over a decade after Unix was first released, and personal computers didn't even exist yet. Unix was designed for mini-computers, not for something that fits in the palm of your hand. The fact that it does work for such applications is a testimony to Unix's flexibility and robustness of the design. It's also testimony to the power of modern processors and how small large amounts of RAM and mass storage have become.

Unix is neither lightweight, nor simple. The number of separate modules that must be loaded just to boot is a bit mind numbing. Anyone that's watched the stream of messages when Linux boots should recognize that there must be a faster and simpler way, especially for embedded systems. And from a programming standpoint, the Unix internals can require you to write more code, which requires more development time, which requires more testing and debugging, etc... than a lightweight OS design would require. Programmers have addressed this through using libraries of code, templates, isolating the programmer from it with languages like Java, etc... but ultimately, you are building more and more code which requires more and more memory and more and more CPU time. What if you could eliminate a lot of that by dealing with it at the OS level in the first place, and by simplifying the programming interface?

I pose a few simple questions. First, why should a 25MHz Amiga from 1990 boot from an old, slow hard drive in the same or less time than a 1+ GHz router booting from FLASH or a high speed SD interface? And second, does it need to be this way? The answer to the first question is simply, it shouldn't, and the answer to the second question is obviously, it doesn't. It is simply that way because companies don't want to invest the time and money for the development of an alternative OS, and there really aren't any existing options that compete with it capability wise for free. One embedded project I worked on did develop its own real time OS. It was written from scratch to duplicate the APIs of a commercial real time OS. The license fees for the commercial OS were more expensive than paying a programmer to write it from scratch. But then, Linux didn't exist yet, so there wasn't a free alternative. One of the takeaways from that project was that the box booted in a matter of seconds even with the self test during startup. Linux would have taken over a minute and would have required more memory. How many products would work better if they were based on something else? Should it really take that long to boot a router? How about a TV? Seriously, why does my smart TV take 30 seconds to boot?

Google's new OS may or may not address these issues, but additional options are certainly welcome. It certainly won't replace Linux for everything, but hopefully, it will provide an alternative for systems where Linux doesn't seem appropriate. If it can reduce development times, memory footprint, boot times, etc... it will be a very attractive alternative. More importantly, it will stir things up a bit in a world that has become entrenched around Linux. And while Fuchsia may not be the answer to these problems, maybe an offshoot of it will be. It is open source after all. It will be interesting to see what this project leads to.

Article Link

Tuesday, August 15, 2017

68HC11

My new to me 68hc11 board is waiting at the post office! It's a much wanted (I don't NEED it mind you) replacement for the one that was damaged in a fire. For under $15 off of ebay shipped, that's a bargain if it works! I'm guessing this is based off of the Buffalo monitor ROM.

Expect a 68hc11 version of Microsoft like BASIC in the future (yet another project to divide my time further). There aren't going to be a lot of changes to the main interpreter. Memory moves won't need to use the stack pointer, I can dedicate the Y register to pointing to the next character for large sections of the interpreter. I/O would have to completely change though. The bit banger cassette I/O will probably be replaced with an IDE port and simple DOS. Video out and keyboard will be through a terminal (Buffalo calls?). I may decide to stick on a V9958 and PC keyboard interface.

The hc11 has a few added instructions that may come in handy. The hc11 supports XGDX like the 6303, plus XGDY, which might allow the use of X and Y for temporary storage without having to use direct page RAM in a few places. It's only a space saving optimization if I'm storing 8 bit registers in X as it takes 3 clock cycles and using Y requires 4. It would be faster or the same speed to save D in X though. Y requires 2 byte opcodes, so it's slower than X and slower than using the stack pointer, but you don't have to disable interrupts like with the stack pointer. It's definitely faster than constantly changing X.

There is a version of GCC for the 68hc11, so that's a definite improvement over the 6803.

Monday, August 14, 2017

You can find a good comparison of different square root algorithms at this link.

A couple tweaks later...

After a couple small tweaks to the normalization code and one long test later... there is a much greater speed increase than with the previous test. I'm not sure the small change that was made can account for this. Maybe I had the wrong ROM loaded before. Perhaps a bad build and I missed it?

The Life comparison shows the new version to be at least 3/4 of a generation ahead after 100 generations. I'll take another 3/4 of 1% for a single change if this is consistent across the board.
However, tests with the circle drawing code indicate there are far too many instances where minimal normalization is required, and the code is obviously slower. This leaves my previous version as the better choice overall.

The previous post reflects the changes I just tested.

Some normalization code

Here is the original MC-10 ROM routine for normalizing a floating point number 8 bits at a time. After this it rotates 1 bit at a time.

;* Normalize FPA0
LEFD6 clrb ; exponent modifier = 0
LEFD7 ldaa FPA0 ; get hi-order byte of mantissa
bne LF00F ; branch if <> 0 (shift one bit at a time)

;* Shift FPA0 mantissa left by 8 bits (whole byte at a time)
ldaa FPA0+1 ; byte 2 into..
staa FPA0 ; ..byte 3
ldaa FPA0+2 ; byte 1 into..
staa FPA0+1 ; ..byte 2
ldaa FPA0+3 ; byte 0 into..
staa FPA0+2 ; ..byte 1
ldaa FPSBYT ; sub-precision byte..
staa FPA0+3 ; ..into byte 0
clr FPSBYT ; 0 into sub-precision byte
addb #8 ; add 8 to the exponent modifier
cmpb #5*8 ; has the mantissa been cleared (shifted 40 bits)?
blt LEFD7 ; loop if less than 40 bits shifted
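
For anyone who doesn't read 6803 assembly, here is a rough C model of what that loop does. It is a sketch for illustration only. The array layout (m[0] standing in for FPA0 through m[3] for FPA0+3, and m[4] for FPSBYT) and the name normalize_bytes are assumptions of the example, not names from the ROM.

```c
#include <stdint.h>

/* Hypothetical model: m[0] = FPA0 (high-order mantissa byte),
 * m[1..3] = FPA0+1..FPA0+3, m[4] = FPSBYT (sub-precision byte).
 * Returns the exponent modifier that the ROM keeps in B. */
int normalize_bytes(uint8_t m[5])
{
    int modifier = 0;               /* clrb */

    do {
        if (m[0] != 0)              /* bne: high byte in use, stop here */
            break;
        m[0] = m[1];                /* shift the mantissa left one byte */
        m[1] = m[2];
        m[2] = m[3];
        m[3] = m[4];
        m[4] = 0;                   /* clr FPSBYT */
        modifier += 8;              /* addb #8 */
    } while (modifier < 5 * 8);     /* cmpb #5*8 / blt */

    return modifier;
}
```

The exponent modifier is updated on every pass, which is the part the test code below moves out of the loop.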

Here is the current test code.
FPSBYT only needs to be cleared on the first pass, but I don't see a way of getting rid of it without unrolling the loop once. Positioning FPSBYT after FPA0 would let us speed this up, but it would be at the cost of a few clock cycles in the multiply. But normalization should take place more often than multiplication, so I may test that at some point to see if it's an improvement. The ldab at the end is commented out because it isn't needed. If we get that far the mantissa is zero.

;* Normalize FPA0
LEFD6
ldx #5 ; loop a maximum of 5 times (cleared mantissa)
LEFD7
ldaa FPA0 ; get hi-order byte of mantissa
bne LEFFFa ; branch if <> 0 (shift one bit at a time)
;* Shift FPA0 mantissa left by 8 bits (whole byte at a time)
ldd FPA0+1
std FPA0
ldaa FPA0+3
ldab FPSBYT
std FPA0+2
clr FPSBYT ; 0 into sub-precision byte
dex ; has the mantissa been cleared (shifted 40 bits)?
bne LEFD7 ; loop if less than 40 bits shifted

; ldab #8*5 ; set the exponent modifier where x = 0

Then we calculate the exponent modifier right before the single bit shift code. If A is negative, we skip the next code; if not, we fall directly into shifting bits.

LEFFFa ;calculate exponent modifier
stx TEMPM
ldab #5 ; max # of times we shifted the mantissa
subb TEMPM+1 ; actual number of times we shifted the mantissa
; no need to test result, always >= 0 (so carry is clear for the rolb)
rolb ;* 2 ; multiply by 8 for 8 bits per byte
rolb ;* 4
rolb ;* 8
tsta ; Is A positive or negative?
bmi LF00Fa
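
The same idea as a C sketch, assuming the same made-up array layout as before (m[0] for FPA0, m[4] for FPSBYT). The loop only counts down, and the exponent modifier is worked out once afterwards from the counter. The name normalize_bytes_counted is hypothetical, and returning 40 for an all zero mantissa is a choice of the sketch; the assembly doesn't bother loading B in that case.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical model of the counted loop: x plays the role of the X
 * register and counts the passes that are still allowed. */
int normalize_bytes_counted(uint8_t m[5])
{
    int x = 5;                      /* ldx #5 */

    while (m[0] == 0) {             /* ldaa FPA0 / bne */
        memmove(m, m + 1, 4);       /* shift mantissa left one byte */
        m[4] = 0;                   /* clr FPSBYT */
        if (--x == 0)               /* dex / bne */
            break;                  /* mantissa is all zero */
    }

    /* Calculated once, outside the loop: passes taken times 8. */
    return (5 - x) << 3;
}
```

When no byte shift is needed the subtract and shifts are pure overhead, which matches the benchmark results where minimal normalization came out slower.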

If we use the 6303 we can do this. The xgdx instruction exchanges the contents of D and X. So in one XGDX instruction, we load the address of the table-1 into X, and the table offset into B.
The table address minus 1 is because this is never called when x=0. The table itself is only 5 bytes with pre-calculated values for the exponent modifier, and the table can reside anywhere in ROM. Technically, 8*5 is never loaded because this isn't called when X goes to zero, so we just leave it off and skip subtracting from X or B by loading the xpmtable pointer -1.

LEFFFa ;calculate exponent modifier, 6303 version
ldd #xpmtable-1 ; get the address of the exponent modifier table
xgdx ; exchange D and X
abx ; point X to modifier in table
ldab ,x ; load it
ldaa FPA0 ; reload hi-order byte (ldd/xgdx overwrote A), sets N if negative
bmi LF00Fa

xpmtable
fcb 8*4, 8*3, 8*2, 8, 0
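
A quick way to convince yourself the table matches the subtract and shift version is to model both in C. This is just a sketch; the function names are made up, and x stands for the value left in the X register (1 to 5) when the loop exits.

```c
#include <stdint.h>

/* Same five values as the fcb line above. */
static const uint8_t xpmtable[5] = { 8 * 4, 8 * 3, 8 * 2, 8, 0 };

/* Table version: indexing with x - 1 is the C equivalent of loading
 * the table address minus 1 and adding x with abx. Valid for 1..5. */
unsigned modifier_from_table(unsigned x)
{
    return xpmtable[x - 1];
}

/* Subtract and shift version: (5 - x) passes, 8 bits per pass. */
unsigned modifier_computed(unsigned x)
{
    return (5u - x) << 3;
}
```

Both give 0 when x is still 5 (no byte shifts) and 32 when x is 1 (four byte shifts).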

Sunday, August 13, 2017

Faster Normalization

I tested a new floating point normalization routine last night.
The loop that performs the normalization is definitely faster, but the adjustment to the exponent is calculated outside the loop. This makes the routines slightly slower if no byte oriented normalization is needed, only slightly faster if one pass is needed, and definitely faster if 2-5 passes are needed.

The problem with an optimization like this is that it's difficult to tell which is faster in real world use. You can't just count clock cycles. The only sure test is benchmarks. I ran the Life program I've been using side by side with the previous ROM version. After running overnight... the new version is definitely faster. But the Life status bar is only about 2 blocks different after over 200 generations. More testing and a size comparison will be needed to see if it stays. If it's always faster and within a few bytes of the previous version, I'll keep it in. It's definitely faster and smaller than the Microsoft version which only shifted the mantissa a byte at a time with the A register. It's very obvious the original 6800 code was used here.

Friday, August 11, 2017

Fitting a SQR peg in a Microsoft hole

One of the biggest challenges of optimizing the MC-10's BASIC has been in improving the performance of the floating point library. Here I'll specifically discuss the replacement of the SQR() (square root) function.

There are two known fast algorithms I've looked at using. Both depend on the IEEE single precision floating point format which is as follows:

| 1 sign bit | 8 bit exponent | 23 bit mantissa |

But Microcolor BASIC uses a larger mantissa, and uses two slightly different formats internally. One is packed to save memory, and the other unpacked to simplify calculations.
Here is the packed format:

| 8 bit exponent | 1 bit sign | 31 bit mantissa |

And here is the unpacked format which is used during floating point calculations:

| 1 byte exponent | 32 bit mantissa | 1 byte sign |

The larger mantissa is simply an extra byte which provides additional accuracy without a huge loss in speed required for double precision, and the mantissa appears to be largely treated as 31 bit.
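
Going between the two layouts can be sketched in C like this. The struct and function names are hypothetical. The sketch also assumes that the sign bit in the packed form takes the place of a leading mantissa bit that is always 1 for non-zero numbers, and that the unpacked sign byte only uses bit 7. Treat both as assumptions of the example rather than a description of the ROM.

```c
#include <stdint.h>

/* Hypothetical C view of the unpacked format. */
typedef struct {
    uint8_t  exponent;
    uint32_t mantissa;
    uint8_t  sign;      /* assumption: only bit 7 is meaningful */
} unpacked_fp;

/* p[0] = exponent, p[1] = sign bit + top 7 mantissa bits, p[2..4] = rest. */
unpacked_fp unpack_fp(const uint8_t p[5])
{
    unpacked_fp u;
    u.exponent = p[0];
    u.sign = p[1] & 0x80;
    u.mantissa = ((uint32_t)(p[1] & 0x7F) << 24) | ((uint32_t)p[2] << 16)
               | ((uint32_t)p[3] << 8) | p[4];
    if (u.exponent != 0)
        u.mantissa |= 0x80000000u;  /* assumption: restore the leading 1 */
    return u;
}

void pack_fp(const unpacked_fp *u, uint8_t p[5])
{
    p[0] = u->exponent;
    /* The sign bit overwrites the top mantissa bit. */
    p[1] = (uint8_t)(((u->mantissa >> 24) & 0x7F) | (u->sign & 0x80));
    p[2] = (uint8_t)(u->mantissa >> 16);
    p[3] = (uint8_t)(u->mantissa >> 8);
    p[4] = (uint8_t)u->mantissa;
}
```

Packing drops one byte per stored number, while the unpacked form lets the math routines work on whole bytes without masking the sign out first.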

The two formulas I've looked at for performing a fast SQR are presented below in C source code, and they were taken from this page.

/* Assumes that float is in the IEEE 754 single precision floating point format
 * and that int is 32 bits. */
float sqrt_approx(float z)
{
    int val_int = *(int*)&z; /* Same bits, but as an int */
    /*
     * To justify the following code, prove that
     *
     * ((((val_int / 2^m) - b) / 2) + b) * 2^m = ((val_int - 2^m) / 2) + ((b + 1) / 2) * 2^m)
     *
     * where
     *
     * b = exponent bias
     * m = number of mantissa bits
     *
     * .
     */

    val_int -= 1 << 23; /* Subtract 2^m. */
    val_int >>= 1; /* Divide by 2. */
    val_int += 1 << 29; /* Add ((b + 1) / 2) * 2^m. */

    return *(float*)&val_int; /* Interpret again as float */
}

val_int -= 1 << 23 is easy enough. In our case the bit is shifted 31 times due to the larger mantissa, and an additional time for the sign bit... so 1 << 31. But a simpler way to put it is to subtract 1 from the exponent.

val_int >>= 1 is a bit of a problem since in IEEE format, the bit shifts into the mantissa, and in Microcolor BASIC format, it shifts into the sign bit. Losing the sign bit is not a problem due to the fact that we should have already generated an FC error (illegal function) if we are trying to use SQR on a negative number. But it's not actually in the proper bit in the mantissa.

The solution is actually quite simple. Here is the fix using register FPA5 in packed format:

ldd FPA5 ; Get exponent and first mantissa byte
suba #1 ; subtract 1 from the exponent
rolb ; get rid of the space for the sign bit
rora ; shift mantissa to match IEEE ...
rorb ; ... grab the carry and finish the IEEE match

Now we just >>= 1

rora
rorb
std FPA5 ; save the exponent & 1st mantissa byte
ror FPA5+2
ror FPA5+3
ror FPA5+4

Now we just need to add 1 << 29. So in our case, 1 << 37. That is in the leftmost byte holding the exponent. We could just add it, but we need to return our number to Microcolor BASIC's format, so we will combine the two for speed, and we end up with adding 1 << 38. But that only requires addition with a single byte, so we add $10 before shifting or $20 after.

ldd FPA5
rolb
rola
adda #$20 ; this should also clear the carry for the next instruction
rorb
std FPA5

Other than the initial checks for SQR(-num), normalization, etc... that's about it. It's small, fast, and simple. The problem with this approach is that it's just an approximation and the error adds up.
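
To get a feel for how rough the approximation is, here is a small C sketch of the IEEE version with a helper that compares the square of the result against the input. It uses memcpy and an unsigned integer instead of the pointer cast in the listing above, which is a substitution made for this example, and the names sqrt_approx_bits and squared_ratio are made up.

```c
#include <stdint.h>
#include <string.h>

/* Same three steps as sqrt_approx above, for positive IEEE 754 single
 * precision inputs. memcpy copies the bits without the pointer cast. */
float sqrt_approx_bits(float z)
{
    uint32_t i;
    memcpy(&i, &z, sizeof i);
    i -= UINT32_C(1) << 23;         /* subtract 2^m */
    i >>= 1;                        /* divide by 2 */
    i += UINT32_C(1) << 29;         /* add ((b + 1) / 2) * 2^m */
    memcpy(&z, &i, sizeof z);
    return z;
}

/* approx^2 / z: 1.0 means the approximation was exact. */
double squared_ratio(float z)
{
    double a = sqrt_approx_bits(z);
    return (a * a) / z;
}
```

For example sqrt_approx_bits(4.0f) gives exactly 2.0, but sqrt_approx_bits(2.0f) gives 1.5 where the real answer is about 1.414, an overestimate of roughly 6%.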

Here is the other approach I've looked at which was written by Greg Walsh.

float invSqrt(float x)
{
    float xhalf = 0.5f*x;
    union
    {
        float x;
        int i;
    } u;
    u.x = x;
    u.i = 0x5f375a86 - (u.i >> 1);
    /* The next line can be repeated any number of times to increase accuracy */
    u.x = u.x * (1.5f - xhalf * u.x * u.x);
    return u.x;
}

The first thing you should notice is that there are 4 multiplies.
Calling the ROM floating point multiply 4 times adds up to a lot of clock cycles. There are more clock cycles used by the multiply instructions alone than the entire first approach, and that is a small fraction of the total. It's still much faster than Microsoft's algorithm. If speed is more important than accuracy, it's not the way to go, but since BASIC programs may depend on accuracy, it is the better choice here. Since the multiply has already been altered to use the hardware multiply instruction, there will be less penalty than if we were using a CPU like the 6502 or Z80.

This algorithm also calculates an approximation similar to before, but it uses a mathematically derived "magic" constant which appears to generate a more accurate result than the first approach. You could use it without the multiplication if an approximation were accurate enough. (You can read about the "magic" constant on this page.) The multiplication is to improve the accuracy of the estimate through one iteration of Newton's method. The constant should probably be extended a byte to match Microcolor BASIC's larger mantissa. This might improve the accuracy of the estimate which is already within about 4% of actual.
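
Here is a C sketch that separates the raw estimate from the Newton step so the effect of each iteration can be checked, and that turns the inverse square root into a square root with one more multiply (sqrt(x) = x * 1/sqrt(x)). It assumes IEEE 754 single precision and positive inputs. The function names and the iterations parameter are made up for the example, and this is not the 6803 code.

```c
#include <stdint.h>
#include <string.h>

/* Raw estimate only: the "magic" constant minus half the bit pattern. */
float inv_sqrt_estimate(float x)
{
    uint32_t i;
    memcpy(&i, &x, sizeof i);
    i = UINT32_C(0x5f375a86) - (i >> 1);
    memcpy(&x, &i, sizeof x);
    return x;
}

/* Estimate followed by a chosen number of Newton iterations. */
float inv_sqrt_refined(float x, int iterations)
{
    float xhalf = 0.5f * x;
    float y = inv_sqrt_estimate(x);
    for (int n = 0; n < iterations; n++)
        y = y * (1.5f - xhalf * y * y);     /* one Newton step */
    return y;
}

/* Square root from the inverse square root: one extra multiply. */
float sqrt_via_inverse(float x, int iterations)
{
    return x * inv_sqrt_refined(x, iterations);
}
```

With 4.0 as input, one iteration gives a little under 2. Each added iteration costs three more multiplies, which is the trade-off against the ROM's multiply time discussed above.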

I'm not going to post the code for this approach, it starts out manipulating the number into IEEE format in the same manner as before, it subtracts the appropriate bytes from the "magic" constant, and then it restores Microcolor numeric format for a series of calls that load floating point registers, and perform multiplication, subtraction, etc... It's a pretty straightforward use of the ROM's floating point library. The code would also be almost identical for the 6809 or 68hc11 depending on the floating point format you are using.

There are multiple algorithms for performing a faster square root than the original Microsoft approach. If the error resulting from approximation were acceptable, they would seem much faster than the Walsh approach I'm using. But for a general purpose interpreter, it's better to maintain accuracy.