Friday, August 18, 2017

Google is building a new OS from scratch

Just a few ramblings related to Google's new OS, Fuchsia.  This isn't related to my current projects, but it could prove to be important in the future.

Why a new OS?  Companies have been basing more and more systems on Linux.  It has almost become the default for building new systems.  But should it be that way, and if not, what should they use?

I've personally spent years developing for various flavors of Unix (Unix, Solaris, Linux, etc...) and several years working on embedded systems.  I really like Unix/Linux as a development environment, but it seems inappropriate as the basis for many systems, especially embedded systems.  It can also present a rather complex environment from a programming standpoint depending on what parts of the OS your code has to interact with.

Linux is based on Unix.  There's no getting around it, Linux is pretty much a free implementation of Unix.  Unix was designed for the computing world of the late 60s and early 70s.  Most programs ran in batch mode, they didn't require much if any networking, they all ran from the console, and they weren't written in an object oriented programming language.  The flexibility provided for console based applications is nothing short of amazing, but then it was a console oriented world, and Unix was designed as a development environment.  Don't get me wrong, Unix has seen a lot of development since then, but those are its roots.  GUIs didn't exist, the internet didn't exist, object oriented programming didn't really catch on until over a decade after Unix was first released, and personal computers didn't even exist yet.  Unix was designed for mini-computers, not for something that fits in the palm of your hand.  The fact that it does work for such applications is a testimony to the flexibility and robustness of Unix's design.  It's also a testimony to the power of modern processors and to how small large amounts of RAM and mass storage have become.

Unix is neither lightweight nor simple.  The number of separate modules that must be loaded just to boot is a bit mind numbing.  Anyone who's watched the stream of messages when Linux boots should recognize that there must be a faster and simpler way, especially for embedded systems.  And from a programming standpoint, the Unix internals can require you to write more code, which requires more development time, which requires more testing and debugging, etc... than a lightweight OS design would.  Programmers have addressed this by using libraries of code, templates, isolating the programmer from the OS with languages like Java, etc... but ultimately, you are building more and more code, which requires more and more memory and more and more CPU time.  What if you could eliminate a lot of that by dealing with it at the OS level in the first place, and by simplifying the programming interface?

I pose a few simple questions.  First, why should a 25MHz Amiga from 1990 boot from an old, slow hard drive in the same or less time than a 1+ GHz router booting from FLASH or a high speed SD interface?  And second, does it need to be this way?  The answer to the first question is simply, it shouldn't, and the answer to the second question is obviously, it doesn't.  It is simply that way because companies don't want to invest the time and money to develop an alternative OS, and there really aren't any existing free options that compete with Linux capability-wise.  One embedded project I worked on did develop its own real time OS.  It was written from scratch to duplicate the APIs of a commercial real time OS.  The license fees for the commercial OS were more expensive than paying a programmer to write it from scratch.  But then, Linux didn't exist yet, so there wasn't a free alternative.  One of the takeaways from that project was that the box booted in a matter of seconds, even with the self test during startup.  Linux would have taken over a minute and would have required more memory.  How many products would work better if they were based on something else?  Should it really take that long to boot a router?  How about a TV?  Seriously, why does my smart TV take 30 seconds to boot?

Google's new OS may or may not address these issues, but additional options are certainly welcome.  It certainly won't replace Linux for everything, but hopefully it will provide an alternative for systems where Linux doesn't seem appropriate.  If it can reduce development times, memory footprint, boot times, etc... it will be a very attractive alternative.  More importantly, it will stir things up a bit in a world that has become entrenched around Linux.  And while Fuchsia itself may not be the answer to these problems, maybe an offshoot of it will be.  It is open source, after all.  It will be interesting to see what this project leads to.

Article Link

Wednesday, August 16, 2017

The 68hc11 board doesn't like talking to PC USB serial adapters.  This could be a problem.

Tuesday, August 15, 2017


My new-to-me 68hc11 board is waiting at the post office!  It's a much wanted (I don't NEED it, mind you) replacement for the one that was damaged in a fire.  At under $15 shipped off of eBay, that's a bargain if it works!  I'm guessing this is based on the Buffalo monitor ROM.

Expect a 68hc11 version of Microsoft-like BASIC in the future (yet another project to divide my time further).  There aren't going to be a lot of changes to the main interpreter.  Memory moves won't need to use the stack pointer, and I can dedicate the Y register to pointing to the next character for large sections of the interpreter.  I/O will have to completely change, though.  The bit banger cassette I/O will probably be replaced with an IDE port and a simple DOS.  Video out and keyboard input will go through a terminal (Buffalo calls?).  I may decide to stick on a V9958 and a PC keyboard interface.

The hc11 has a few added instructions that may come in handy.  It supports XGDX like the 6803, plus XGDY, which might allow using X and Y for temporary storage without having to use direct page RAM in a few places.  It's only a space saving optimization if I'm storing 8 bit registers, as XGDX takes 3 clock cycles and XGDY takes 4.  It would be faster or the same speed to save D in X, though.  Y requires 2 byte opcodes, so it's slower than X and slower than using the stack pointer, but you don't have to disable interrupts like you do with the stack pointer.  It's definitely faster than constantly reloading X.

There is a version of GCC for the 68hc11, so that's a definite improvement over the 6803.

Monday, August 14, 2017

You can find a good comparison of different square root algorithms at this link.

A couple tweaks later...

After a couple of small tweaks to the normalization code and one long test later... there is a much greater speed increase than with the previous test.  I'm not sure the small change that was made can account for this.  Maybe I had the wrong ROM loaded before.  Perhaps it was a bad build and I missed it?

The Life comparison shows the new version to be at least 3/4 of a generation ahead after 100 generations.  I'll take another 3/4 of 1% for a single change if this is consistent across the board.
However, tests with the circle drawing code indicate there are far too many instances where minimal normalization is required, and the code is obviously slower.  This leaves my previous version as the better choice overall.

The previous post reflects the changes I just tested.

Some normalization code

Here is the original MC-10 ROM routine for normalizing a floating point number 8 bits at a time.  After this it rotates 1 bit at a time.

;* Normalize FPA0
LEFD6     clrb                          ; exponent modifier = 0
LEFD7     ldaa      FPA0                ; get hi-order byte of mantissa
          bne       LF00F               ; branch if <> 0  (shift one bit at a time)

;* Shift FPA0 mantissa left by 8 bits (whole byte at a time)
          ldaa      FPA0+1              ; byte 2 into..
          staa      FPA0                ; ..byte 3
          ldaa      FPA0+2              ; byte 1 into..
          staa      FPA0+1              ; ..byte 2
          ldaa      FPA0+3              ; byte 0 into..
          staa      FPA0+2              ; ..byte 1
          ldaa      FPSBYT              ; sub-precision byte..
          staa      FPA0+3              ; ..into byte 0
          clr       FPSBYT              ; 0 into sub-precision byte
          addb      #8                  ; add 8 to the exponent modifier
          cmpb      #5*8                ; has the mantissa been cleared (shifted 40 bits)?
          blt       LEFD7               ; loop if less than 40 bits shifted
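For reference, the byte-at-a-time loop above can be sketched in C.  This is a hypothetical model (the function name is mine, and the mantissa and FPSBYT are folded into a single 5-byte array), not the ROM code itself:

```c
#include <stdint.h>

/* Hypothetical C model of the ROM's byte-at-a-time normalization.
   m[0] is the hi-order mantissa byte (FPA0), m[4] plays the role of
   FPSBYT, the sub-precision byte.  Returns the exponent modifier
   (number of bits shifted), or 40 if the mantissa is zero. */
int normalize_bytes(uint8_t m[5])
{
    int modifier = 0;             /* clrb */
    while (m[0] == 0) {           /* fall out when hi byte <> 0 */
        m[0] = m[1];              /* shift mantissa left one whole byte */
        m[1] = m[2];
        m[2] = m[3];
        m[3] = m[4];              /* sub-precision byte shifts in */
        m[4] = 0;                 /* 0 into sub-precision byte */
        modifier += 8;            /* 8 bits per byte */
        if (modifier >= 5 * 8)    /* shifted 40 bits: mantissa cleared */
            break;
    }
    return modifier;
}
```

The returned value corresponds to what the B register accumulates in the ROM routine; the remaining single-bit shifting would pick up from there.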

Here is the current test code.
FPSBYT only needs to be cleared on the first pass, but I don't see a way of getting rid of the clear without unrolling the loop once.  Positioning FPSBYT after FPA0 would let us speed this up, but at the cost of a few clock cycles in the multiply.  Normalization should take place more often than multiplication, though, so I may test that at some point to see if it's an improvement.  The ldab at the end is commented out because it isn't needed; if we get that far, the mantissa is zero.

;* Normalize FPA0
          ldx       #5                  ; loop a maximum of 5 times (cleared mantissa)
LEFD7     ldaa      FPA0                ; get hi-order byte of mantissa
          bne       LEFFFa              ; branch if <> 0  (calculate modifier, shift one bit at a time)

;* Shift FPA0 mantissa left by 8 bits (whole byte at a time)
          ldd       FPA0+1              ; bytes 2 and 1 into..
          std       FPA0                ; ..bytes 3 and 2
          ldaa      FPA0+3              ; byte 0 and..
          ldab      FPSBYT              ; ..sub-precision byte into..
          std       FPA0+2              ; ..bytes 1 and 0
          clr       FPSBYT              ; 0 into sub-precision byte
          dex                           ; has the mantissa been cleared (shifted 40 bits)?
          bne       LEFD7               ; loop if X hasn't reached zero
; ldab    #8*5                          ; set the exponent modifier where x = 0 (not needed)

Then we calculate the exponent modifier right before the single bit-shift code.  If A is negative, the mantissa is already normalized and we skip the shift code; if not, we fall directly into shifting bits.

LEFFFa                                  ; calculate exponent modifier
          ldab      #5                  ; max # of times we could shift the mantissa
          subb      TEMPM+1             ; minus the remaining loop count
                                        ; no need to test the result, always > 0
          rolb                          ; * 2
          rolb                          ; * 4
          rolb                          ; * 8 for 8 bits per byte
          tsta                          ; is A positive or negative?
          bmi       LF00Fa              ; negative = bit 7 set, no single-bit shifts needed
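In C terms, the modifier calculation amounts to the following sketch (hypothetical names; it assumes the value subtracted is the remaining loop count):

```c
#include <stdint.h>

/* Hypothetical model of the modifier calculation: the loop counter
   starts at 5 and counts down, so (5 - remaining) is the number of
   byte shifts performed.  Three left shifts multiply by 8, matching
   the three rolb instructions (safe because the value is at most 5,
   so nothing is ever rotated out into the carry). */
uint8_t exponent_modifier(uint8_t remaining)
{
    uint8_t b = 5 - remaining;  /* ldab #5 / subb */
    b <<= 1;                    /* * 2 */
    b <<= 1;                    /* * 4 */
    b <<= 1;                    /* * 8, one byte = 8 bits */
    return b;
}
```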

If we use the 6303 we can do this.  The xgdx instruction exchanges the contents of D and X, so with a single XGDX we load the address of the table minus 1 into X and the table offset into B.  The table address minus 1 is used because this code is never reached when X = 0.  The table itself is only 5 bytes of pre-calculated exponent modifier values, and it can reside anywhere in ROM.  Technically, an 8*5 entry is never needed because we don't get here when X has counted down to zero, so we leave it off, and loading the xpmtable pointer minus 1 saves having to subtract 1 from X or B.

LEFFFa                                  ; calculate exponent modifier, 6303 version
          ldd       #xpmtable-1         ; get the address of the exponent modifier table, minus 1
          xgdx                          ; exchange D and X (X = table-1, B = old X = remaining count)
          abx                           ; point X to the modifier in the table
          ldab      ,x                  ; load it
          tsta                          ; is A positive or negative?
          bmi       LF00Fa

xpmtable  fcb       8*4, 8*3, 8*2, 8, 0 ; modifiers for X = 1 through 5
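The same lookup can be modeled in C (a sketch, with my own names; the array duplicates the fcb table, and indexing at x-1 mirrors loading from xpmtable-1 plus the count left in X):

```c
#include <stdint.h>

/* Hypothetical C model of the 6303 table lookup.  x is the remaining
   loop count, 1..5 (this path is never taken when x reaches 0), and
   the table holds the pre-calculated exponent modifiers. */
static const uint8_t xpmtable[5] = { 8*4, 8*3, 8*2, 8, 0 };

uint8_t exponent_modifier_lut(uint8_t x)
{
    return xpmtable[x - 1];   /* the abx + "ldab ,x" equivalent */
}
```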

Sunday, August 13, 2017

Faster Normalization

I tested a new floating point normalization routine last night.
The loop that performs the normalization is definitely faster, but the adjustment to the exponent is calculated outside the loop.  This makes the routine slightly slower if no byte oriented normalization is needed, only slightly faster if one pass is needed, and definitely faster if 2-5 passes are needed.

The problem with an optimization like this is that it's difficult to tell which is faster in real world use.  You can't just count clock cycles.   The only sure test is benchmarks.  I ran the Life program I've been using side by side with the previous ROM version.  After running overnight... the new version is definitely faster.  But the Life status bar is only about 2 blocks different after over 200 generations.  More testing and a size comparison will be needed to see if it stays.  If it's always faster and within a few bytes of the previous version, I'll keep it in.  It's definitely faster and smaller than the Microsoft version which only shifted the mantissa a byte at a time with the A register.  It's very obvious the original 6800 code was used here.