Here's a nice little story related to the 6809. It shows off one of the more interesting optimizations you can use with the 6809. It's also neat to see that people came up with similar solutions completely isolated from each other.
My MC-10 (6803) 64 column graphics text code's screen scroll also uses the stack register as the destination pointer for similar reasons, but there are differences vs the 6809.
Each register PUSHed or PULLed requires a separate instruction, where the 6809 can PUSH or PULL multiple registers with a single instruction. As a result, the 6803 code looks more like their earlier code.
With only one stack pointer, you have to use the index register for the other source or destination pointer, and the offset is only 1 byte, so you can only go up to an offset of 254 with LDD n,X before you have to change X. The code looks like this, and it's unrolled for a 256 byte section of the screen:
LDD 254,X ; 2 bytes, 5 clock cycles
PSHB ; 1 byte, 3 clock cycles
PSHA ; 1 byte, 3 clock cycles
LDD 252,X ; 2 bytes, 5 clock cycles
PSHB ; 1 byte, 3 clock cycles
PSHA ; 1 byte, 3 clock cycles
etc...
LDX ROWADDRESS+254 ; 3 bytes, 5 clock cycles
PSHX ; 1 byte, 4 clock cycles
LDX ROWADDRESS+252
PSHX
LDX ROWADDRESS+250
PSHX
etc...
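Going by the cycle counts above, the LDD/PSHB/PSHA sequence costs 5 + 3 + 3 = 11 clock cycles per pair of bytes moved and LDX/PSHX costs 5 + 4 = 9, so using PSHX saves 11 - 9 = 2 clock cycles per pair, or 2 * ((32/2)*(192-8)) = 5,888 clock cycles per scroll. Code size is the same 4 bytes per pair of bytes moved either way (2 + 1 + 1 versus 3 + 1).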
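As a cross-check, the per-instruction cycle counts from the listing comments can simply be added up. This is only a rough sketch in Python: the constant and function names are mine, and it ignores the cost of reloading X or the stack pointer between sections.

```python
# Cycle counts copied from the comments in the two listings.
LDD_INDEXED = 5    # LDD n,X
PSHB = 3
PSHA = 3
LDX_EXTENDED = 5   # LDX address
PSHX = 4

BYTES_PER_ROW = 32      # bytes per screen row, from the formula in the post
ROWS_MOVED = 192 - 8    # pixel rows copied on each scroll


def cycles_per_pair_ldd():
    """Cycles to move two bytes with LDD n,X / PSHB / PSHA."""
    return LDD_INDEXED + PSHB + PSHA


def cycles_per_pair_pshx():
    """Cycles to move two bytes with LDX address / PSHX."""
    return LDX_EXTENDED + PSHX


def pairs_per_scroll():
    """Number of two-byte moves needed for one scroll."""
    return (BYTES_PER_ROW // 2) * ROWS_MOVED


def cycles_saved_per_scroll():
    """Total cycles saved per scroll by using the PSHX version."""
    return (cycles_per_pair_ldd() - cycles_per_pair_pshx()) * pairs_per_scroll()


if __name__ == "__main__":
    print(cycles_per_pair_ldd(), cycles_per_pair_pshx(),
          pairs_per_scroll(), cycles_saved_per_scroll())
```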
So why didn't I do that?
That may not be a big deal if you have a large RAM expansion, but it's not practical for most MC-10s. However, if you wanted to implement 4 rows of text at the bottom of the screen similar to the Apple II and several other 8-bit machines, then it's not so bad.
The latest code generates the scroll code on the fly at startup, so I could generate either version of the code depending on the hardware you have. We'll see.
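To give a rough idea of what generating either version could look like, here is a sketch in Python rather than 6803 assembly. It is not my actual generator: the function names are made up for illustration, the address in the usage note is a placeholder, and the stack pointer setup and the final return are left out. The opcodes are the standard 6801/6803 ones.

```python
# 6801/6803 opcodes for the instructions used in the two listings.
OP_LDD_INDEXED = 0xEC   # LDD n,X
OP_PSHB = 0x37
OP_PSHA = 0x36
OP_LDX_EXTENDED = 0xFE  # LDX address
OP_PSHX = 0x3C


def gen_ldd_version(section_bytes=256):
    """Emit LDD n,X / PSHB / PSHA for offsets section_bytes-2 down to 0."""
    code = bytearray()
    for offset in range(section_bytes - 2, -1, -2):
        code += bytes([OP_LDD_INDEXED, offset, OP_PSHB, OP_PSHA])
    return bytes(code)


def gen_pshx_version(src_start, length):
    """Emit LDX address / PSHX for each byte pair, highest address first."""
    code = bytearray()
    for addr in range(src_start + length - 2, src_start - 1, -2):
        code += bytes([OP_LDX_EXTENDED, addr >> 8, addr & 0xFF, OP_PSHX])
    return bytes(code)


if __name__ == "__main__":
    print(gen_ldd_version(4).hex())            # two LDD/PSHB/PSHA groups
    print(gen_pshx_version(0x4100, 4).hex())   # two LDX/PSHX groups
```

For example, gen_pshx_version(0x4100, 4) produces the bytes for LDX $4102, PSHX, LDX $4100, PSHX. Note that the LDX version bakes absolute addresses into the code, so a generated block can't be pointed at another part of the screen just by changing X the way the indexed version can.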