Wednesday, August 30, 2017

IDE/ATA Port

This is the start of an IDE/ATA interface for the HC11 development board.  It's mostly a distraction to take a break from the HC11 BASIC and VZ200 ROM disassembly.

It only represents a few hours' work, and I spent more time fighting with the tool for designing the parts than on the design itself.

Once the schematic and board layout are complete, I still have to write the code for the CPLD.

My EPROM/device programmer probably doesn't work with Windows 10 so I may have to buy a new one.  I haven't even purchased a USB to serial interface for the development board yet.

The driver software won't be that difficult, but I have to decide what I'll use for a file system.

Tuesday, August 29, 2017

ADCD, ADCD, wherefore art thou ADCD? (or... What is missing on the 6803?)

The 6803 isn't bad, but here is a list of what I think are the biggest oversights or shortcomings in the design.
  1. No prefetch
    A prefetch makes a huge difference on performance without having to increase the clock speed. It typically reduces the number of clock cycles per instruction by 1. One of the key reasons why the 6502 is so fast is its prefetch. Perhaps the biggest speed improvement the Hitachi 6309 has over the 6809 is a prefetch. The Hitachi HD6303 (compatible with the 6803) has a prefetch. That makes the 6303 at least 20% faster with ZERO changes to the code. An MC-10 with a 6303 should benchmark similar to 2MHz 6502s or 4MHz Z80s.
  2. No direct addressing form of INC or DEC.
    When you have so few registers, directly incrementing or decrementing your loop counters in memory saves a lot of register shuffling. A direct addressing mode would save a clock cycle for every INC or DEC executed in a loop. Execute a loop 100 times, save over 101 clock cycles if you include initializing the counter. And that's if you aren't using 16 bits where you might need to DEC or INC twice per loop. Where you don't use indexing, you can use X as a counter, but that's usually for counting down to zero where you don't have to perform a 16 bit test, just BNE until the loop is complete. Direct addressing also requires 1 less byte everywhere it's used.
  3. No ADCD (Add Carry D) or SBCD (subtract carry D). The missing ADCD really impacts the math library. You have to use two instructions, ADCB ADCA, which slows down the code and makes it larger. The BASIC math library could have used 20 or more ADCD and many of them are in loops. Really, 16 bit support on the 6803 and 6809 should have been better. One of the key advancements the 6309 has over the 6809 is finishing out the 16 bit opcodes.
  4. No XGDX (exchange D & X) instruction.
    Without XGDX, moving data between D & X requires multiple instructions and an intermediate RAM location. You have to perform math on pointers, but other than ABX, you can only do that with the accumulator. This could have sped up my math library optimizations quite a bit. It would speed up my line drawing code as well. The HD6303 and 68hc11 support XGDX. The 68hc12 and 6809 have LEA for performing some math on index registers, and you have the ability to transfer between registers. The same pretty much goes for XGDS. It would make a significant difference when adjusting the stack from compiled code. It's not as efficient as LEAS, but turning 8 or more instructions into 3 and eliminating multiple RAM accesses is a pretty significant improvement.
  5. No Y index register like the 6809, 68hc11, etc... The single index register presents a few problems. You can use the stack pointer for some operations, but that may involve disabling interrupts and you can't use offsets from it like X.
  6. No stack relative addressing. Stack relative addressing is important for compilers. Most high level language compilers pass parameters on the stack, and dynamically allocate variables on the stack. When you only have one index register, accessing variables passed on the stack can become a register swapping mess. Just adding stack relative addressing suddenly makes the 6803 a lot more efficient for compiled code. The 8080, Z80, and 6502 lack stack relative addressing as well, but the 6803 suffers from a bit more index register swapping. It also makes using the stack pointer as an index register much more like using X.
  7. No divide instruction. Hardware math is faster than software math. A hardware divide is going to speed up a lot of math intensive applications. When combined with the MUL instruction, you have a machine that's pretty good for Mandelbrots, calculating primes, fractals, 3D plots, etc... All fun stuff that 8 bits take hours to do. The instruction would only be used a few times in an 8K ROM, but the difference in speed where it would be used is significant.
  8. The built-in hardware is addressed on the direct page and cannot be relocated. It interferes with existing software applications from 6800 systems like FLEX. It's not a big deal if you have the source code and don't need to use all of the direct page, but patching binaries to use different index registers isn't simple. The 68hc11 uses $1000, so it doesn't interfere with the direct page, but then it interferes with the code. The HD64180 (Z180) allows you to select where in the Z80 I/O region the hardware is located. Being able to set the high nibble or byte would let the hardware be addressed in a region FLEX normally uses for hardware.
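
To make item 3 a little more concrete, here is a rough C model of a 32 bit add built from 16 bit pieces. It only models how the carry flows between steps, not cycle counts, and the function names (adc16, adc16_bytewise, add32) are made up for this sketch. adc16 is what an ADCD would do in one instruction; adc16_bytewise is the ADCB, ADCA pair the 6803 needs instead.

```c
#include <stdint.h>

/* 16-bit add with carry in and carry out: what a single ADCD would do. */
static uint16_t adc16(uint16_t a, uint16_t b, unsigned *carry)
{
    uint32_t sum = (uint32_t)a + b + *carry;
    *carry = (sum >> 16) & 1u;
    return (uint16_t)sum;
}

/* The same step the way the 6803 has to do it: ADCB, then ADCA. */
static uint16_t adc16_bytewise(uint16_t a, uint16_t b, unsigned *carry)
{
    unsigned lo = (a & 0xFFu) + (b & 0xFFu) + *carry;   /* ADCB */
    unsigned hi = (a >> 8) + (b >> 8) + (lo >> 8);      /* ADCA */
    *carry = (hi >> 8) & 1u;
    return (uint16_t)(((hi & 0xFFu) << 8) | (lo & 0xFFu));
}

/* 32-bit add: low word with a plain add (ADDD), high word with carry. */
static uint32_t add32(uint32_t x, uint32_t y)
{
    unsigned c = 0;
    uint16_t lo = adc16((uint16_t)x, (uint16_t)y, &c);
    uint16_t hi = adc16_bytewise((uint16_t)(x >> 16), (uint16_t)(y >> 16), &c);
    return ((uint32_t)hi << 16) | lo;
}
```

Both helpers give the same answer; the point is only that the high word of every multi-word add costs two instructions instead of one.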

Sunday, August 20, 2017

68hc11 BASIC

This morning I made a preliminary pass through the Microcolor BASIC derived source code to come up with an abridged list of changes from the 6803 version.  This skips over the sub tasks for each line item.
  1. Remove cassette support. 
  2. Remove sound support.
  3. Remove graphics functions
  4. Remove MC-10 hardware from the memory map
  5. Remove 6803 hardware definitions from the memory map
  6. Add 68hc11 hardware definitions and reserve space in the memory map
  7. Add the setup and serial I/O functions to interface with a terminal.
  8. Change CLS to work with serial I/O for a terminal.
  9. Add a BELL command to beep the terminal
  10. Change all printed output to be via terminal
  11. Change all input to be via terminal
  12. Move ELSE command up in the keyword and token tables, remove patches related to ELSE
  13. Change memory moves to use Y instead of the stack pointer
  14. Add code that saves the pointer to the next line
  15. Update divide function to use hardware instructions
  16. Replace SQR function with optimized version
  17. Update multiply with 16 bit multiply instructions

The first five are almost complete.  Some pieces of code will be needed once new hardware is added, but I can cut and paste from the original 6803 code when the time comes.

The terminal I/O will require the most work unless I can locate a disassembly of the 6800 version.  I suppose I could do it myself but it's probably faster to just implement from scratch.

There are several things I may work on before finishing the last few items on that list.
The cruncher could use a little work.  It could automatically remove extra spaces, and the fix of inserting a colon before the else doesn't check if there is already a colon in the source code.
PRINT USING would make a nice addition.  There are also a few bugs in the ROM that have never been addressed.  And this BASIC could use a line editor at some point.
That may be secondary to adding a file system for an IDE interface.  I/O is actually much simpler than for cassette, but cassette doesn't have to maintain any kind of directory structure.  It may make more sense for support to be more like IDE interfaces on the TRS-80 Model I than the FAT file system from the PC.

A 68hc12 build should be possible once the first 11 steps are finished.  Code optimizations that take advantage of its new features could be added later.

Friday, August 18, 2017

Google is building new OS from scratch

Just a few ramblings related to Google's new OS, Fuchsia.  This isn't related to my current projects, but it could prove to be important in the future.


Why a new OS?  Companies have been basing more and more systems on Linux.  It has almost become the default for building new systems.  But should it be that way, and if not, what should they use?

I've personally spent years developing for various flavors of Unix (Unix, Solaris, Linux, etc...) and several years working on embedded systems.  I really like Unix/Linux as a development environment, but it seems inappropriate as the basis for many systems, especially embedded systems.  It can also present a rather complex environment from a programming standpoint depending on what parts of the OS your code has to interact with.

Linux is based on Unix.  There's no getting around it, Linux is pretty much a free implementation of Unix.  Unix was designed during the computing world of the late 60s and early 70s.  Most programs ran in batch mode, they didn't require much if any networking, they all ran from the console, and they weren't written in an object oriented programming language.  The flexibility provided for console based applications is nothing short of amazing, but then it was a console oriented world, and Unix was designed as a development environment.  Don't get me wrong, Unix has seen a lot of development since then, but those are its roots.  GUIs didn't exist, the internet didn't exist, object oriented programming didn't really catch on for over a decade after Unix was first released, and personal computers didn't even exist yet.  Unix was designed for mini-computers, not for something that fits in the palm of your hand.  The fact that it does work for such applications is a testimony to Unix's flexibility and robustness of the design.  It's also testimony to the power of modern processors and how small large amounts of RAM and mass storage have become.

Unix is neither lightweight, nor simple.  The number of separate modules that must be loaded just to boot is a bit mind numbing.  Anyone that's watched the stream of messages when Linux boots should recognize that there must be a faster and simpler way, especially for embedded systems.  And from a programming standpoint, the Unix internals can require you to write more code, which requires more development time, which requires more testing and debugging, etc... than a lightweight OS design would require.  Programmers have addressed this through using libraries of code, templates, isolating the programmer from it with languages like Java, etc... but ultimately, you are building more and more code which requires more and more memory and more and more CPU time.  What if you could eliminate a lot of that by dealing with it at the OS level in the first place, and by simplifying the programming interface?

I pose a few simple questions.  First, why should a 25MHz Amiga from 1990 boot from an old, slow hard drive in the same or less time than a 1+ GHz router booting from FLASH or a high speed SD interface?  And second, does it need to be this way?  The answer to the first question is simply, it shouldn't, and the answer to the second question is obviously, it doesn't.  It is simply that way because companies don't want to invest the time and money for the development of an alternative OS, and there really aren't any existing options that compete with it capability wise for free.  One embedded project I worked on did develop its own real time OS.  It was written from scratch to duplicate the APIs of a commercial real time OS.  The license fees for the commercial OS were more expensive than paying a programmer to write it from scratch.  But then, Linux didn't exist yet, so there wasn't a free alternative.  One of the takeaways from that project was that the box booted in a matter of seconds even with the self test during startup.  Linux would have taken over a minute and would have required more memory.  How many products would work better if they were based on something else?  Should it really take that long to boot a router?  How about a TV?  Seriously, why does my smart TV take 30 seconds to boot?

Google's new OS may or may not address these issues, but additional options are certainly welcome.  It certainly won't replace Linux for everything, but hopefully, it will provide an alternative for systems where Linux doesn't seem appropriate.  If it can reduce development times, memory footprint, boot times, etc... it will be a very attractive alternative.  More importantly, it will stir things up a bit in a world that has become entrenched around Linux.  And while Fuchsia may not be the answer to these problems, maybe an offshoot of it will be.  It is open source after all.  It will be interesting to see what this project leads to.


Article Link

Tuesday, August 15, 2017

68HC11

My new to me 68hc11 board is waiting at the post office!  It's a much wanted (I don't NEED it mind you) replacement for the one that was damaged in a fire.  For under $15 off of ebay shipped, that's a bargain if it works!  I'm guessing this is based off of the Buffalo monitor ROM.

Expect a 68hc11 version of Microsoft like BASIC in the future (yet another project to divide my time further).  There aren't going to be a lot of changes to the main interpreter.  Memory moves won't need to use the stack pointer, and I can dedicate the Y register to pointing to the next character for large sections of the interpreter.  I/O would have to completely change though.  The bit banger cassette I/O will probably be replaced with an IDE port and simple DOS.  Video out and keyboard will be through a terminal (Buffalo calls?).  I may decide to stick on a V9958 and PC keyboard interface.

The hc11 has a few added instructions that may come in handy.  The hc11 supports XGDX like the 6303 and XGDY which might allow the use of X and Y for temporary storage without having to use direct page RAM in a few places.  It's only a space saving optimization if I'm storing 8 bit registers in X as it takes 3 clock cycles and using Y requires 4.  It would be faster or the same speed to save D in X though.  Y requires 2 byte opcodes, so it's slower than X and slower than using the stack pointer, but you don't have to disable interrupts like with the stack pointer.  It's definitely faster than constantly changing X.

There is a version of GCC for the 68hc11, so that's a definite improvement over the 6803.

Monday, August 14, 2017

You can find a good comparison of different square root algorithms at this link.

A couple tweaks later...

After a couple small tweaks to the normalization code and one long test later... there is a much greater speed increase than with the previous test.  I'm not sure the small change that was made can account for this.  Maybe I had the wrong ROM loaded before.  Perhaps a bad build and I missed it?

The Life comparison shows the new version to be at least 3/4 of a generation ahead after 100 generations.  I'll take another 3/4 of 1% for a single change if this is consistent across the board.
However, tests with the circle drawing code indicate there are far too many instances where minimal normalization is required, and the code is obviously slower.  This leaves my previous version as the better choice overall.

The previous post reflects the changes I just tested.

Some normalization code


Here is the original MC-10 ROM routine for normalizing a floating point number 8 bits at a time.  After this it rotates 1 bit at a time.


;* Normalize FPA0
LEFD6     clrb                          ; exponent modifier = 0
LEFD7     ldaa      FPA0                ; get hi-order byte of mantissa
          bne       LF00F               ; branch if <> 0  (shift one bit at a time)

;* Shift FPA0 mantissa left by 8 bits (whole byte at a time)
          ldaa      FPA0+1              ; byte 2 into..
          staa      FPA0                ; ..byte 3
          ldaa      FPA0+2              ; byte 1 into..
          staa      FPA0+1              ; ..byte 2
          ldaa      FPA0+3              ; byte 0 into..
          staa      FPA0+2              ; ..byte 1
          ldaa      FPSBYT              ; sub-precision byte..
          staa      FPA0+3              ; ..into byte 0
          clr       FPSBYT              ; 0 into sub-precision byte
          addb      #8                  ; add 8 to the exponent modifier
          cmpb      #5*8                ; has the mantissa been cleared (shifted 40 bits)?
          blt       LEFD7               ; loop if less than 40 bits shifted
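
For anyone who doesn't read 6803 assembly, here is roughly what that loop does, sketched in C. The fpa_t struct and the normalize_bytes name are just stand-ins for FPA0 and FPSBYT, not anything from the ROM, and the return value plays the part of the exponent modifier kept in B.

```c
#include <stdint.h>

/* Stand-in for the accumulator: 4 mantissa bytes, most significant
 * first (FPA0 .. FPA0+3), plus the sub-precision byte (FPSBYT). */
typedef struct {
    uint8_t mant[4];
    uint8_t sub;
} fpa_t;

/* Shift the mantissa left a whole byte at a time while the top byte is 0.
 * Returns the number of bits shifted: 0, 8, 16, 24, 32 or 40.
 * 40 means every byte was zero, so the mantissa has been cleared. */
static int normalize_bytes(fpa_t *f)
{
    int shifted = 0;                    /* clrb */
    while (f->mant[0] == 0) {           /* ldaa FPA0 / bne */
        f->mant[0] = f->mant[1];
        f->mant[1] = f->mant[2];
        f->mant[2] = f->mant[3];
        f->mant[3] = f->sub;
        f->sub = 0;                     /* clr FPSBYT */
        shifted += 8;                   /* addb #8 */
        if (shifted >= 5 * 8)           /* cmpb #5*8 / blt */
            break;
    }
    return shifted;
}
```

Whatever is left after this still has to be shifted one bit at a time until the top bit of the mantissa is set.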

Here is the current test code.
FPSBYT only needs to be cleared on the first pass, but I don't see a way of getting rid of it without unrolling the loop once.  Positioning FPSBYT after FPA0 would let us speed this up, but it would be at the cost of a few clock cycles in the multiply.  But normalization should take place more often than multiplication, so I may test that at some point to see if it's an improvement.  The ldab at the end is commented out because it isn't needed.  If we get that far the mantissa is zero.

;* Normalize FPA0
LEFD6     ldx       #5                  ; loop a maximum of 5 times (cleared mantissa)
LEFD7     ldaa      FPA0                ; get hi-order byte of mantissa
          bne       LEFFFa              ; branch if <> 0  (shift one bit at a time)

;* Shift FPA0 mantissa left by 8 bits (whole byte at a time)
          ldd       FPA0+1
          std       FPA0
          ldaa      FPA0+3
          ldab      FPSBYT
          std       FPA0+2
          clr       FPSBYT              ; 0 into sub-precision byte
          dex                           ; has the mantissa been cleared (shifted 40 bits)?
          bne       LEFD7               ; loop if less than 40 bits shifted
;         ldab      #8*5                ; set the exponent modifier where x = 0

Then we calculate the exponent modifier right before the single bit shift code.  If A is negative, we skip the next code; if not, we fall directly into shifting bits.

LEFFFa                                  ; calculate exponent modifier
          stx       TEMPM
          ldab      #5                  ; max # of times we shifted the mantissa
          subb      TEMPM+1             ; actual number of times we shifted the mantissa
                                        ; no need to test result, always >= 0
          rolb                          ; * 2  multiply by 8 for 8 bits per byte
          rolb                          ; * 4
          rolb                          ; * 8
          tsta                          ; Is A positive or negative?
          bmi       LF00Fa


If we use the 6303 we can do this.  The xgdx instruction exchanges the contents of D and X.  So in one XGDX  instruction, we load the address of the table-1 into X, and the table offset into B.
The table address minus 1 is because this is never called when x=0.  The table itself is only 5 bytes with pre-calculated values for the exponent modifier, and the table can reside anywhere in ROM.  Technically, 8*5 is never loaded because this isn't called when X goes to zero, so we just leave it off and skip subtracting from X or B by loading the xpmtable pointer -1.


LEFFFa                                  ; calculate exponent modifier, 6303 version
          ldd       #xpmtable-1         ; get the address of the exponent modifier table
          xgdx                          ; exchange D and X
          abx                           ; point X to modifier in table
          ldab      ,x                  ; load it
          ldaa      FPA0                ; ldd/xgdx clobbered A, so reload it; is it positive or negative?
          bmi       LF00Fa
xpmtable
          fcb       8*4, 8*3, 8*2, 8, 0
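
As a quick sanity check on the two versions, here is a small C sketch showing that the table lookup returns the same exponent modifier as the subtract and shift. The C function names are made up; x is the loop counter left in X when we branch out, so it is always 1 through 5.

```c
#include <stdint.h>

/* Same 5 entries as the fcb line, indexed by (x - 1). */
static const uint8_t xpmtable[5] = { 8*4, 8*3, 8*2, 8, 0 };

/* 6803 version: 5 minus the remaining count, times 8. */
static uint8_t xpm_computed(uint8_t x)
{
    return (uint8_t)((5 - x) << 3);
}

/* 6303 version: load the pre-calculated value. */
static uint8_t xpm_table(uint8_t x)
{
    return xpmtable[x - 1];
}
```

With x = 5 (no byte shifts were needed) both give 0, and with x = 1 (four byte shifts) both give 32.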

Sunday, August 13, 2017

Faster Normalization

I tested a new floating point normalization routine last night.
The loop that performs the normalization is definitely faster, but the adjustment to the exponent is calculated outside the loop.  This makes the routines slightly slower if no byte oriented normalization is needed, only slightly faster if one pass is needed, and definitely faster if 2-5 passes are needed.

The problem with an optimization like this is that it's difficult to tell which is faster in real world use.  You can't just count clock cycles.   The only sure test is benchmarks.  I ran the Life program I've been using side by side with the previous ROM version.  After running overnight... the new version is definitely faster.  But the Life status bar is only about 2 blocks different after over 200 generations.  More testing and a size comparison will be needed to see if it stays.  If it's always faster and within a few bytes of the previous version, I'll keep it in.  It's definitely faster and smaller than the Microsoft version which only shifted the mantissa a byte at a time with the A register.  It's very obvious the original 6800 code was used here.

Friday, August 11, 2017

Fitting a SQR peg in a Microsoft hole

One of the biggest challenges of optimizing the MC-10's BASIC has been in improving the performance of the floating point library.  Here I'll specifically discuss the replacement of the SQR() (square root) function.

There are two known fast algorithms I've looked at using.  Both depend on the IEEE single precision floating point format which is as follows:
| 1 sign bit | 8 bit exponent | 23 bit mantissa |
But Microcolor BASIC uses a larger mantissa, and uses two slightly different formats internally.  One is packed to save memory, and the other unpacked to simplify calculations.
Here is the packed format:
| 8 bit exponent | 1 bit sign | 31 bit mantissa |
And here is the unpacked format which is used during floating point calculations:
| 1 byte exponent | 32 bit mantissa | 1 byte sign | 
The larger mantissa is simply an extra byte which provides additional accuracy without a huge loss in speed required for double precision, and the mantissa appears to be largely treated as 31 bit.
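
Here is a rough C sketch of moving between the two layouts. It assumes the usual Microsoft convention that the top mantissa bit of a normalized number is implied in the packed form (which is where the sign bit gets stored) and is put back when unpacking. The struct and function names are made up, and special handling of zero (exponent 0) is left out.

```c
#include <stdint.h>

/* Unpacked form used during calculations. */
typedef struct {
    uint8_t  exp;    /* exponent byte */
    uint32_t mant;   /* 32 bit mantissa, top bit set when normalized */
    uint8_t  sign;   /* bit 7 set = negative */
} unpacked_t;

/* 5 packed bytes -> unpacked: pull out the sign, restore the implied bit. */
static void unpack(const uint8_t p[5], unpacked_t *u)
{
    u->exp  = p[0];
    u->sign = (uint8_t)(p[1] & 0x80u);
    u->mant = ((uint32_t)(p[1] | 0x80u) << 24) | ((uint32_t)p[2] << 16)
            | ((uint32_t)p[3] << 8) | (uint32_t)p[4];
}

/* Unpacked -> 5 packed bytes: the sign replaces the top mantissa bit. */
static void pack(const unpacked_t *u, uint8_t p[5])
{
    p[0] = u->exp;
    p[1] = (uint8_t)(((u->mant >> 24) & 0x7Fu) | (u->sign & 0x80u));
    p[2] = (uint8_t)(u->mant >> 16);
    p[3] = (uint8_t)(u->mant >> 8);
    p[4] = (uint8_t)u->mant;
}
```

The packed form saves a byte per variable; the unpacked form means the math routines never have to mask the sign out of the mantissa.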

The two formulas I've looked at for performing a fast SQR are presented below in C source code, and they were taken from this page.

/* Assumes that float is in the IEEE 754 single precision floating point format
 * and that int is 32 bits. */
float sqrt_approx(float z)
{
    int val_int = *(int*)&z; /* Same bits, but as an int */
    /*
     * To justify the following code, prove that
     *
     * ((((val_int / 2^m) - b) / 2) + b) * 2^m = ((val_int - 2^m) / 2) + ((b + 1) / 2) * 2^m)
     *
     * where
     *
     * b = exponent bias
     * m = number of mantissa bits
     *
     * .
     */

    val_int -= 1 << 23; /* Subtract 2^m. */
    val_int >>= 1; /* Divide by 2. */
    val_int += 1 << 29; /* Add ((b + 1) / 2) * 2^m. */

    return *(float*)&val_int; /* Interpret again as float */
}

val_int -= 1 << 23 is easy enough.  In our case the bit is shifted 31 times due to the larger mantissa, and an additional time for the sign bit... so 1 << 31.  But a simpler way to put it is to subtract 1 from the exponent.

val_int >>= 1 is a bit of a problem since on IEEE format, the bit shifts into the mantissa, and in Microcolor BASIC format, it shifts into the sign bit.  Losing the sign bit is not a problem due to the fact that we should have already generated an FC error (illegal function) if we are trying to use SQR on a negative number.  But it's not actually in the proper bit in the mantissa.

The solution is actually quite simple.  Here is the fix using register FPA5 in packed format:
ldd FPA5          ; Get exponent and first mantissa byte
suba  #1               ; subtract 1 from the exponent
rolb                      ; get rid of the space for the sign bit
rora                      ; shift mantissa to match IEEE ...
rorb                      ; ... grab the carry and finish the IEEE match

Now we just >>= 1
rora
rorb
std  FPA5            ; save the exponent & 1st mantissa byte
ror  FPA5+2
ror  FPA5+3
ror  FPA5+4
Now we just need to add 1 << 29.  So in our case, 1 << 37.  That is in the leftmost byte holding the exponent.   We could just add it, but we need to return our number to Microcolor BASIC's format, so we will combine the two for speed, and we end up with adding 1 << 38.  But that only requires addition with a single byte, so we add $10 before shifting or $20 after.
ldd      FPA5
rolb
rola
adda    #$20 ; this should also clear the carry for the next instruction
rorb
std      FPA5

Other than the initial checks for SQR(-num), normalization, etc... that's about it.  It's small, fast, and simple.  The problem with this approach is that it's just an approximation and the error adds up.
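
To get a feel for the size of that error, here is the same bit trick in C on ordinary IEEE single precision floats. This is just the formula from above rewritten with memcpy and an unsigned integer (valid for positive inputs); the function name is made up and it is not the ROM code.

```c
#include <stdint.h>
#include <string.h>

/* Approximate square root of a positive IEEE 754 single precision float. */
static float sqrt_approx_bits(float z)
{
    uint32_t i;
    memcpy(&i, &z, sizeof i);   /* same bits, viewed as an integer */
    i -= 1u << 23;              /* subtract 2^m */
    i >>= 1;                    /* divide by 2 */
    i += 1u << 29;              /* add ((b + 1) / 2) * 2^m */
    memcpy(&z, &i, sizeof z);   /* view the bits as a float again */
    return z;
}
```

Inputs like 1.0 and 4.0 come out exact (1.0 and 2.0), but 2.0 comes out as 1.5 instead of 1.41421..., which is about 6% high.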

Here is the other approach I've looked at which was written by Greg Walsh.
float invSqrt(float x)
{
    float xhalf = 0.5f*x;
    union
    {
        float x;
        int i;
    } u;
    u.x = x;
    u.i = 0x5f375a86 - (u.i >> 1);
    /* The next line can be repeated any number of times to increase accuracy */
    u.x = u.x * (1.5f - xhalf * u.x * u.x);
    return u.x;
}
The first thing you should notice is that there are 4 multiplies.
Calling the ROM floating point multiply 4 times adds up to a lot of clock cycles.  There are more clock cycles used by the multiply instructions alone than the entire first approach, and that is a small fraction of the total.  It's still much faster than Microsoft's algorithm.  If speed is more important than accuracy, it's not the way to go, but since BASIC programs may depend on accuracy, it is the better choice here.  Since the multiply has already been altered to use the hardware multiply instruction, there will be less penalty than if we were using a CPU like the 6502 or Z80.

This algorithm also calculates an approximation similar to before, but it uses a mathematically derived "magic" constant which appears to generate a more accurate result than the first approach.  You could use it without the multiplication if an approximation were accurate enough.  (You can read about the "magic" constant on this page.)  The multiplication is to improve the accuracy of the estimate through one iteration of Newton's method.  The constant should probably be extended a byte to match Microcolor BASIC's larger mantissa.  This might improve the accuracy of the estimate which is already within about 4% of actual.
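
Here is a small C sketch, on IEEE floats, of how the estimate and the Newton step fit together and how an inverse square root can be turned into a square root. The last step assumes the identity sqrt(x) = x * (1/sqrt(x)), which costs one more multiply; how the ROM version gets from one to the other isn't shown in this post, and the function names are made up.

```c
#include <stdint.h>
#include <string.h>

/* Magic constant estimate of 1/sqrt(x), no multiplies at all. */
static float inv_sqrt_estimate(float x)
{
    uint32_t i;
    memcpy(&i, &x, sizeof i);
    i = 0x5f375a86u - (i >> 1);
    memcpy(&x, &i, sizeof x);
    return x;
}

/* One Newton iteration on an estimate y of 1/sqrt(x). */
static float newton_step(float x, float y)
{
    return y * (1.5f - 0.5f * x * y * y);
}

/* sqrt(x) = x * (1/sqrt(x)), for x > 0. */
static float fast_sqrt(float x)
{
    return x * newton_step(x, inv_sqrt_estimate(x));
}
```

For x = 2.0 the raw estimate is about 0.716 against a true 0.7071, and one Newton step brings the final square root to roughly 1.4139 against 1.41421.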

I'm not going to post the code for this approach, it starts out manipulating the number into IEEE format in the same manner as before, it subtracts the appropriate bytes from the "magic" constant, and then it restores Microcolor numeric format for a series of calls that load floating point registers, and perform multiplication, subtraction, etc...  It's a pretty straightforward use of the ROM's floating point library.  The code would also be almost identical for the 6809 or 68hc11 depending on the floating point format you are using.


There are multiple algorithms for performing a faster square root than the original Microsoft approach.  If the error resulting from approximation were acceptable, they would seem much faster than the Walsh approach I'm using.  But for a general purpose interpreter, it's better to maintain accuracy.

Tuesday, August 8, 2017

The VZ ROM disassembly is shaping up nicely.  It seems you have to add an entry point for every command from the jump table since most aren't called any other way.  I was actually expecting that, there's no way for the disassembler to reach that code via the main loop.
The one little problem is that the disassembler doesn't accept input from a text file, so they all have to be on the command line, and the token table can hold 100+ pointers for commands.
I'm running from a batch file now.
Now I just have to finish that list and start slogging through all the math library calls.

Into the great divide

I thought of another optimization to try in a couple places for the MC-10 ROM.
It *might* save 90 or more clock cycles per divide which would be a very good thing.
That's what?  20+ instructions worth?  It won't help with Ahl's benchmark, but it will speed up some other things a little.  It's not huge like the hardware multiply but every little bit helps.

Monday, August 7, 2017

VZ200 ROM Disassembly

Last time I worked on a commented disassembly of the VZ200 ROM, I used a disassembler called dZ80.  If it has a trace file output from an emulator, it can do a pretty good job of identifying blocks of code or data, but it doesn't really generate code that can be reassembled without considerable work.

After a quick search I ran across YAZD - Yet Another Z80 Disassembler 
It does an excellent job of turning a program back into source code with labels.
It can even do things like include where a label was accessed from in a comment in the disassembly.
Sadly, it detected ZERO data in the VZ ROM file, so that's a problem unless I can figure out how to resolve that in the settings.

By passing the disassembler output through sed (unix) with a list of substitutions for known commands or ROM routines, we get something that actually looks a little like real source code for the ROM.  And by using sed, I can try as many options with YAZD, or other disassemblers, as I like and still recreate a labeled ROM file in seconds.  SWEET!   Now I need to find the book on Level II BASIC I was using for some of the comments.  If I can create a sed file to append comments automatically, this won't be so bad.




The Mandelbrot generator I posted before runs significantly faster with the new ROM.  It uses a lot of multiplication.  It also depends on SQR, if there were a way I could squeeze a faster version in, the program would probably run twice as fast on the new ROM as on the old one.

I started porting the Fedora 3D plot program last night to see how much faster it is.  Converting the parameters for different screen heights is a pain, and the oblong pixels on the normal display don't help.

Factorial Generator

This provides a pretty good comparison of different machines.
It's not a perfect benchmark since different machines have different screen width and some scroll faster, but then the MC-10 is using higher precision math than many other machines.

The MC-10 is quite competitive here.  It won't keep up with a CoCo in high speed mode, but it beats most 1 MHz systems.

5 REM FACTORIAL GENERATOR
10 FOR Z=1 TO 100
20 FOR X=0 TO 33
30 GOSUB 80
40 PRINT Z;X;A
50 NEXT X
60 NEXT Z
70 END
80 A=1
90 IF X=0 THEN RETURN
100 FOR C=1 TO X
110 A=A*C
120 NEXT C
130 RETURN

Beta 2.1c online

Here's the new Beta version of my MC-10 ROM.  It's not quite as fast as some of the test versions but it doesn't seem to have any issues, the BREAK key is still fairly responsive, and math results should be identical to the original ROM.

The break key check interval is set to 32.  Even 40 felt too long.  Let me know if it seems responsive enough.


MC10_ROM_21c.bin Beta

Raid!!!!

I had a little time to fix the last issues in the floating point code.

The random number seed is based on the sub precision byte of the multiply (basically the 5th byte of a 4 byte floating point number).  It was never random even though the number it is based on should be changing with every multiply.
After changing the code, it works, and the accuracy reported by Ahl's Benchmark matches the original ROM.  But the change shouldn't have done anything different.
EDIT: The sub precision byte got moved to the wrong floating point accumulator at some point when I was dealing with another issue.

The other issue is actually finding 6 memory locations that are safe to use as temporary storage in the multiply.  Some trial and error may be involved because tracking the execution of an interpreter is a nightmare.

I think it's time for a Beta release.

Sunday, August 6, 2017

Around and round we go

This is the circles program after the modification to the MC-10 ROM BREAK key test.
The counter is set to 64 here, which is a bit long, but the new version is clearly faster than the old one on the left. I tried 32 and that seemed okay.  I'll probably have to keep it at 40 or less.

Arm the Lasers!

Now that I've just about exhausted the remaining free space in the MC-10's 8K ROM I may take some time to look at the VTEC VZ200 ROM. The message title comes from the fact that the machines were also released as the VTEC Laser if you hadn't figured that out already.

The VZ200 used a blatant ripoff of the LEVEL II BASIC ROM on the TRS-80 Model I.  They just made patches to the ROM to disable some of the commands and to add support of the different hardware.  They even filled in some unused space with sequential numbers.  Something like $FF, $FE, $FD, $FC, etc...  When you are ripping someone off, you don't want to make it too obvious right?

One thing they didn't do, is they didn't actually remove the code for the commands, they just removed them from the keyword table used by the tokenizer.  I released a ROM image back in 2007(?) with the startup message BASIC+ that I had patched with a hex editor to add the keywords back in. (Someone had patched this before but the ROM image was apparently lost to time)  Any program written under BASIC+ that uses those commands will still run on the original ROM because the commands still have pointers in the token jump table!  This makes it much easier to port programs from LEVEL II BASIC.  There's just one little problem.  The screen editor and line buffer are shorter than the one on the TRS-80.  So you still can't just type in a lot of programs.

I spent a couple of months' worth of evenings creating a commented disassembly of the ROM that could be reassembled just like the MC-10 ROM I've been working on.  I had the idea of fixing a few things in the ROM, including the short line length, and reassembling the code with the unused space in a single block where I could put something else.  Sadly, that work was lost in a fire, but I may give it another go soon.  This time I'll probably automate the process a bit using comments from a disassembly of the TRS-80 LEVEL II BASIC ROM and a program to grab the comments from a scan and append them to the end of the lines from the disassembly.
The process shouldn't take long once I have the automation code worked out.  The disassembly will have to be recreated first.  FWIW, I don't plan on modifying the line buffer and editor to handle longer lines.  I have a program on the PC that will tokenize long lines; you can't edit them on the machine, but they run just fine once they are loaded.  I can still add a few of the optimizations I used on the MC-10 ROM to speed it up a little.  I can even create an HD64180/Z180 version that uses its hardware multiply.

I do have several other projects I may work on.  It might be nice to write an entire project in a high level language again.  :D

The fix is in!

This took under a minute to implement.
It makes no noticeable difference on Ahl's benchmark since that depends heavily on the speed of the floating point routines.  But you can see a noticeable difference on the LIFE program I've been using to compare against the old ROM.  LIFE only uses addition and subtraction, so it's a pretty good tool for comparing general execution speed.

16 was just the first value I decided to try.  I've tried 32 and it feels okay on the emulator, but I'd want to see if it feels okay with real hardware.  I think response will suffer if I go much higher.




BSCAN   rmb     1               ; BREAK key scan counter (direct page)

LE519   dec     BSCAN           ; decrement the counter to see if we should scan for BREAK
        bne     LE519B          ; skip the scan if it's not time yet
        ldaa    #16             ; scan for BREAK once every 16 passes
        staa    BSCAN           ; reset the counter
        bsr     LE566           ; check for BREAK or PAUSE keys

And the keyboard slips through my fingers...

Someone just reminded me of another optimization I planned for the new MC-10 ROM.

It seems that when I hit the OH CRAP I HAVE TO START OVER panic button in the last few days of the contest, I just kind of... sort of... maybe... might have *cough* forgot *cough* about it since I didn't have a written to-do list and hadn't actually written the code yet.  And this is a big one.

You see, Microsoft BASIC scans for the BREAK key during program execution.  In fact, it scans for the BREAK key every time it passes through the main interpreter loop.  Every... single... time.  It is actually the first thing it does when it passes through the main interpreter loop.  You can find the code in the source/disassembly labeled as LE519.

This is one of the worst time wasters in the ROM and it's in every version of Microsoft BASIC that I've looked at.  You don't need to scan for BREAK that often.  A hack for the TRS-80 Model I that disabled this in Level II BASIC supposedly sped up the interpreter considerably, and a patch for the CoCo 3 ROM that uses the keyboard interrupt sped it up a lot as well.

The fix is actually quite easy.  Just have a counter stored on the direct page, decrement it every pass, and if it's not zero, BNE past the test.  If it is zero, reset the counter and call the check for BREAK.
That's only 4 new instructions, or around 8 bytes.
Then it's just a matter of tuning the counter setting so it eliminates as many calls as we can get away with while keeping the BREAK key feeling responsive.
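As a rough way to see what the counter setting buys, here is a small Python model of the decrement, branch, and reload logic. It only counts how often the BREAK check would be called; the function name and the pass counts are made up for illustration, and it says nothing about real cycle timings.

```python
def count_break_scans(passes, reload_value):
    """Count BREAK scans over `passes` trips through the main interpreter loop.

    Models an 8-bit counter that is decremented every pass and reloaded when it
    reaches zero. `reload_value` is assumed to be in the range 1..255.
    """
    counter = reload_value              # assumed initial value of the counter
    scans = 0
    for _ in range(passes):
        counter = (counter - 1) & 0xFF  # dec (8-bit wraparound)
        if counter != 0:                # bne past the test
            continue
        counter = reload_value          # ldaa #n / staa
        scans += 1                      # bsr to the BREAK check
    return scans


for n in (1, 16, 32, 64):
    print(n, count_break_scans(1600, n))
```

A reload value of 1 reproduces the original behavior (a scan on every pass), while 16 cuts the number of scans to one in sixteen. The trade-off is latency: the worst-case wait before a key press is noticed grows in step with the reload value.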

I believe this is commonly referred to as "low hanging fruit" and it should have been written first.
The good news is, there should still be enough free space in the ROM to implement it.
This should also fit in the original ROM just by dropping the programmer's name and/or the Microsoft Easter eggs.

FWIW, this code hasn't been implemented yet because:
1. It's simple enough I thought I could do it at the last minute
2. I was trying to free up ROM space for ELSE first and that also sped up the ROM.
3. I was trying to work out more difficult optimizations to see if I could work them into the contest.
4. I didn't write down the to-do list.


As far as space for more optimizations goes...

If I create a runtime only version of the interpreter, the break key test could be dropped entirely, CSAVE could be dropped, and part of the parser/tokenizer might go as well.  That would free up some space.  You'd have to write your code on the slower ROM and run it on the faster one after it's debugged.  Without the constant check for BREAK you'd have to wait for the program to perform keyboard input, or hit reset if you want to quit.  

If a sound chip were added, the SOUND() function could be made smaller.
The settings could probably be calculated, and by using the hardware timer, the wait to turn off the sound could be simplified.  The current routine seems really large for what it does.

At some point, I would prefer to just go to a 16K ROM and only be backwards compatible at the source level, or through machine language calls via the vectors at the end of the ROM.  I should just redo the memory setup so interrupt vectors are below video RAM, programs are stored above the screen, and variables/stack are stored below the hi-res screen.  

Other thoughts...

Video really needed to start at a higher or lower memory address to give a larger contiguous block of memory to BASIC.  $1000 would have been a much better choice and the decoding logic wouldn't have been much different.  It's like they intentionally crippled the machine in as many ways as they could think of!  :/

One of the things I want to do in the future is support the 6303, and that would require moving the floating point vector that resides at the same address as the 6303's illegal instruction interrupt vector.
Is anything even using that?  I've already broken the memory layout of the floating point registers to make the multiply faster, though that's easy to fix.  

There should really be vectors for the entire floating point library.  Multiply, divide, SQR, ^, SIN, COS, TAN, conversion between integers and floats, etc...  This is one of the major flaws I see with the ROM.  Any program that wants to use these functions must assume they are at the original address.  If you want to use them for a floating point library for C, Pascal, whatever, your program can only run on the original ROM.
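Here is a minimal sketch of why a vector table decouples programs from routine addresses. The addresses, the slot number, and the `fmul` name are all hypothetical; nothing here reflects the actual MC-10 ROM layout.

```python
# Two hypothetical ROM builds with the multiply routine at different addresses.
ROM_A = {"routines": {0xE100: "fmul"}, "vectors": {0: 0xE100}}
ROM_B = {"routines": {0xE234: "fmul"}, "vectors": {0: 0xE234}}

FMUL_VECTOR = 0  # hypothetical fixed slot that every ROM build agrees on


def call_direct(rom, address):
    """Jump straight to a hard-coded address, as programs must do today."""
    return rom["routines"].get(address, "crash")


def call_via_vector(rom, slot):
    """Look the address up in the ROM's own vector table, then jump."""
    return rom["routines"][rom["vectors"][slot]]


print(call_direct(ROM_A, 0xE100))           # fmul
print(call_direct(ROM_B, 0xE100))           # crash: the routine moved
print(call_via_vector(ROM_B, FMUL_VECTOR))  # fmul
```

The only thing a vectored caller has to know is the slot, so a ROM can rearrange or rewrite its math routines without breaking a floating point library built on top of them.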

Saturday, August 5, 2017

Something different

From another project I've been working on.  A cross platform library for old computers.
This shows off the code that lets you print text on a "hi-res" graphics screen.  The code prints 64 characters per line on a 256 pixel wide screen or 80 characters on a 320 pixel wide screen.  I have versions for several machines.  The video shows the regular 6502 Atari version, 6803 MC-10 version, 65816 enhanced Atari version, and Z80 VZ200 version. Some of the code has been ported to the 68hc11, 6809, and several other machines.  As you can see, it's pretty fast for being software based, so it shouldn't slow down projects that use it.
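The numbers work out to 4 pixels per character either way (256/64 and 320/80). Assuming a 1 bit per pixel screen with the leftmost pixel in the high bit, that puts two characters in every byte. The Python sketch below shows one way to place a single row of a 4-pixel-wide glyph under those assumptions; it is an illustration, not the library's actual code, and the function names are made up.

```python
def char_cell(column, pixels_per_char=4, bits_per_byte=8):
    """Return (byte offset within the row, True if the glyph goes in the high nibble)."""
    pixel_x = column * pixels_per_char
    return pixel_x // bits_per_byte, (pixel_x % bits_per_byte) == 0


def draw_glyph_row(row_bytes, column, glyph_nibble):
    """Write one 4-pixel glyph row into a screen row without disturbing its neighbor."""
    offset, high = char_cell(column)
    if high:
        row_bytes[offset] = (row_bytes[offset] & 0x0F) | (glyph_nibble << 4)
    else:
        row_bytes[offset] = (row_bytes[offset] & 0xF0) | glyph_nibble


row = bytearray(32)              # one 256-pixel row at 1 bit per pixel
draw_glyph_row(row, 0, 0b1010)   # even column: high nibble of byte 0
draw_glyph_row(row, 1, 0b0110)   # odd column: low nibble of byte 0
print(hex(row[0]))               # 0xa6
```

The mask-and-merge step is what makes this more work than byte-wide text: every character write has to preserve the other half of the byte.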

The original goal for this part of the library was to create portable text adventure engines for the Infocom (Zork) and similar games in C.  Being able to print in upper and lower case with a reasonable number of characters per line seemed like a good idea, since most 8 bit computers only print 32 or 40 characters per line, and many only support upper case text.  After losing a lot of code in a house fire, the original project took a major step back.  Since then I've changed goals for the cross platform library.  I'll discuss that at some point in the future.

Part of the reason this caught my interest is that it's something I wrote for an embedded system years ago. Versions of the Z80, 6803 and 6502 source code are already posted on the internet, but they aren't the most recent versions of the code.
Here is a side by side comparison of the old and new MC-10 BASIC ROMs running the same program.  Microsoft's version is on the left, and my updated version is on the right.

Friday, August 4, 2017

The fast multiply seems to be working now.  It still needs extensive testing and I still have to fix a couple related issues, but that should be easy compared to what I've already had to fix.

So, how fast is it?  Well, we need to do some benchmarking to figure that out.

Ahl's benchmark was a popular BASIC benchmark back in the day.  It was one of the few speed comparisons for personal computers in the late 70s and early 80s.  It's not really a great benchmark by today's standards, but it did a decent job of measuring the speed and accuracy of the BASIC math library.  It does zero testing of things like string handling, array manipulation, etc.  Math yes, just about everything else no.

Ahl's Benchmark currently lists the accuracy of the new math library at 8.66413117E-4.  That's a higher error than the original ROM, but it's still better than the standard 6502 versions. (FWIW, the accuracy might improve before the ROM image is released.)  The benchmark takes 67 seconds, which is 52 seconds faster than the factory ROM.  That's almost 44% less run time.  In the Creative Computing article that published results for 140 different computers from fastest to slowest, the MC-10 was originally listed about 2/3 of the way down.  With this and the other optimizations I've made, it jumps past more than 30 machines on the list and into the top half.  Radio Shack was closing these out for as little as $10 when it was discontinued, even though it is faster than machines that cost thousands at the time.
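For anyone who wants to check the arithmetic, the short Python snippet below works it out from the two timings quoted above. The roughly 44% figure is the reduction in run time; expressed as a speed ratio, the new ROM is running the benchmark a bit under 1.8 times as fast.

```python
new_time = 67                 # seconds, new ROM
saved = 52                    # seconds saved versus the factory ROM
old_time = new_time + saved   # factory ROM time in seconds

time_reduction = saved / old_time   # fraction of the run time removed
speedup = old_time / new_time       # how many times as fast

print(f"factory ROM: {old_time} s")
print(f"time reduction: {time_reduction:.1%}")
print(f"speedup: {speedup:.2f}x")
```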

I also ran a sieve test.  With the first test, the new version is only slightly faster because sieve uses division rather than multiplication.  It totally depends on the other optimizations I've made, which make a much smaller difference.  This brings up the possibility of another optimization.  It is normally faster to calculate the reciprocal of the divisor and multiply rather than to use division. Thanks to the hardware multiply, the difference should be even greater.  Example: if we want to compute C = A/B, it will be faster to compute C = A*(1/B).  But isn't 1/B division?  It is.  This is mostly for pre-calculating constants you divide by, or where the reciprocal can be calculated once outside a loop that uses the value repeatedly.
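The idea is easier to see in a loop. The Python sketch below (function names are made up) does the one unavoidable division up front and then multiplies inside the loop. One thing to watch in BASIC itself: * and / are evaluated left to right, so the reciprocal has to be parenthesized or held in its own variable, otherwise the division still happens on every pass.

```python
def scale_with_divide(values, b):
    """One division per element."""
    return [a / b for a in values]


def scale_with_reciprocal(values, b):
    """One division in total, then one multiply per element."""
    r = 1.0 / b                     # hoisted out of the loop
    return [a * r for a in values]


data = [1.0, 2.0, 10.0]
print(scale_with_divide(data, 4.0))
print(scale_with_reciprocal(data, 4.0))
```

The two results can differ in the last bit for divisors whose reciprocal isn't exactly representable, because the reciprocal is rounded once and the product is rounded again. That is usually an acceptable price, but it is worth remembering when comparing accuracy figures.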

That brings up another issue.  As of now, there are only 14 bytes of free space remaining in the ROM, and that's without printing "MICROCOLOR BASIC", so the 8K version is pretty much done unless I find some additional code size optimizations.  That's too bad, the fast SQR function would probably put the Ahl results under 60 seconds.  Even a faster ^ (power) function might do that.  If I could get it under 50 seconds, that's normally the realm of systems clocked twice as fast.  That's around what the Model 4 benchmarked at and it runs a 4MHz Z80.

I have to admit, it felt good to get this working!  This was one of my original personal goals for the project and it was a bit disappointing that I didn't have time to put it in the contest release.  Also, things weren't looking good this morning when Ahl's Benchmark listed accuracy at over 4000... that's with no exponent.  The error was that bad!  I tried looking up 6803, 6809, or 68hc11 floating point routines that use the hardware integer multiply and found nothing.  The CPU and its siblings have been out almost 40 years and nobody has even tried this?  I couldn't even find something for other processors.  I'm not saying it doesn't exist, but I eventually gave up looking for it.  I deleted some code and wrote it from scratch... problem solved.

The code is pretty straightforward, but people might wonder why I did a couple things.   If you get to look at the code, just be aware there is one semi-clever thing in the multiply I did to save a few instructions, and you'll be wondering what the heck I was thinking until you figure it out.

Once a "final" release version is out, maybe I'll take some time to make this blog look a little neater.
I'm just not that into the whole blogging thing.


*edit*

I finally found a floating point library that uses the multiply instruction.  It's in the 68hc12 floating point library posted in the GNU 68HC11/HC12 group on Yahoo.  The code makes similar optimizations to the multiply as my code and uses the additional divide instructions from the 68hc12.

Thursday, August 3, 2017

After a couple month hiatus, I'm working on the last few optimizations that were written during the retro challenge contest but didn't make it into the code.  That includes a fast multiply that uses the MUL instruction in place of the usual looped add and shift code, and storing the pointer to the next line to speed up line parsing a bit among other things.
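To show the general idea behind replacing the shift-and-add loop with MUL, here is a Python model. It treats the operands as 4-byte unsigned integers, which is an assumption for illustration rather than a description of the ROM's floating point format, and it is not the ROM code itself. Each 8 x 8 multiply yields a 16 bit partial product that is added in at the right byte offset.

```python
def mul8(a, b):
    """Model of an 8 x 8 -> 16 bit hardware multiply, like the 6803's MUL."""
    assert 0 <= a <= 0xFF and 0 <= b <= 0xFF
    return a * b


def to_bytes_le(value, length):
    """Split an unsigned integer into `length` bytes, least significant first."""
    return [(value >> (8 * i)) & 0xFF for i in range(length)]


def multiply_with_mul(x, y, length=4):
    """Multiply two `length`-byte unsigned integers from byte-wise partial products."""
    result = 0
    for i, xi in enumerate(to_bytes_le(x, length)):
        for j, yj in enumerate(to_bytes_le(y, length)):
            result += mul8(xi, yj) << (8 * (i + j))
    return result


def multiply_shift_add(x, y, length=4):
    """The classic loop: one shift and conditional add per multiplier bit."""
    result = 0
    for bit in range(8 * length):
        if (y >> bit) & 1:
            result += x << bit
    return result


a, b = 0x12345678, 0x9ABCDEF0
print(multiply_with_mul(a, b) == multiply_shift_add(a, b) == a * b)  # True
```

For 4-byte operands that is 16 hardware multiplies in place of 32 shift-and-add iterations. A real floating point routine only keeps the high part of the product, so some of the low-order partial products can likely be skipped, at some cost in accuracy.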

I don't think I said much if anything about the fast multiply during the contest because I didn't know if I had enough time or ROM space for it.  In other words, don't promise what you can't deliver.  I really wanted it to be in the contest release because that's where the most obvious speed improvement was going to come from.  Sadly, there wasn't enough time to finish it.  It still needs work, but initial tests show Ahl's Benchmark finishing in about 1 minute 7 seconds.  That's 52 seconds less than the Microsoft ROM, and over 30 seconds faster than the 1MHz 6502 machines.

The code that saves the address of the next line probably works, I just need to make sure the stack frame is correct.  This would have been in earlier, but I wanted to get the math optimizations right first.  Too many changes were introduced in a short time at the end of the contest and I had to back everything out and debug what I had time for.  I'm not sure it will fit anymore though.

There have been some other optimizations that may work their way into a release eventually.
I experimented with 16 bit memory moves.  Just one of those cut a full second off the startup time of a game.  It had a bug though, so it's on hold.  
The fast SQR function is way too big.  It will require a larger ROM, but it would make a noticeable difference on Ahl's benchmark and anything else that uses SQR.  Maybe if I create a 16K ROM.
A more intelligent ^ (power) function might offer a speedup as well, but I haven't looked at that yet.

If I just had another 1K-2K to work with.