If you play with some graph paper drawing lines, you will see that all X-Major lines have run lengths that are either L or L+1 pixels (Put another way: the length of the runs are one of two lengths, and they vary by one pixel at the most). Therefore, all we have to do is calculate L and figure out how to decide whether to use L or L+1 when drawing runs.
If we drew YDelta-1 runs that were L long, the length of the line would be short by Rem pixels. All we have to do to get a great looking line then, is to distribute those Rem pixels over the length of the line. To do so, we will use the following fraction:
Here is my almost final implementation of the routine... I'm still trying to find a way to get rid of one of those DIV instructions, though. (at least they're only 16 bit DIVs...)
Adding an fast X-Major line to our library is the final part we need to have a fast general purpose line. Already, we have vertical, horizontal, and Y-major lines. With this final part, we can assemble all of the previous pieces into a single easy to use "Line" procedure.
In all of the previous pieces, we have stepped along the major axis of the line and filled in pixels. This eliminates the chance of gaps in the line, and is very intuitive. This procedure, however, goes against that conventional wisdom and steps along the minor axis of the line. It is very fast, because instead of plotting individual pixels, we will draw whole runs of pixels at a time. Here is an illustration to show what I mean:
| *******
M | *******
I A | *******
N X | *******
O I | *******
R S | *******
| *******
---------------------------------------------------
MAJOR AXIS
The first thing that we have to do, (Like almost all of our line functions) is to calculate DeltaX and DeltaY. This is done as follows:
DeltaX := X2-X1+1; { Calculate deltas }
DeltaY := Y2-Y1+1;
We know the number of runs from this (DeltaY-1), and we can also figure out L. It turns out that L is the reciprocal of the slope of the line:
Slope := DeltaY / DeltaX; { Rise over run }
L := 1 / Slope; { Run over rise }
Of course, in this modern society of ours, we would never use floating point numbers. Besides, the above simplifies to this:
L := DeltaX DIV DeltaY; { Throw away decimal... }
Rem : DeltaX MOD DeltaY; { Rem = the remainder }
Notice that these two lines compact to one DIV in assembler, so the MOD is just temporary...
Incr := Rem / DeltaY;
Every time we go through the loop, we add this to the total. When the total overflow (goes over one), we draw the run with an extra pixel on it. It works great! Here is a working example of the above...
PROCEDURE RunLine(X1, Y1, X2, Y2 : INTEGER; Color : BYTE);
VAR
DeltaX, DeltaY : INTEGER;
ShortLen : INTEGER;
X, Y : INTEGER;
Rem : INTEGER;
Total, Incr : REAL;
PROCEDURE DrawRun(Length : INTEGER);
VAR I : INTEGER;
BEGIN
FOR I := X TO Length+X-1 DO
SetPixel(I, Y, Color);
INC(Y);
X := X + Length;
END;
BEGIN
DeltaX := X2-X1+1;
DeltaY := Y2-Y1+1;
X := X1; Y := Y1;
ShortLen := DeltaX DIV DeltaY; { These two lines }
Rem := DeltaX MOD DeltaY; { become a single DIV }
Total := 0.5; { Start half way in }
Incr := Rem / DeltaY;
WHILE Y <= Y2 DO { Draw DeltaY-1 runs }
BEGIN
Total := Total + Incr;
IF Total <= 1 THEN
DrawRun(ShortLen)
ELSE
BEGIN
DrawRun(ShortLen+1);
Total := Total - 1.0;
END;
END;
END;
This is great and everything, but what's with all of the real numbers! Well, this is where the fixed point part comes in... Notice the "overflowing" in this routine... it converts nicely to fixed point. Here is a pseudo-code fixed point version:
PROCEDURE RunLine(X1, Y1, X2, Y2 : INTEGER; Color : BYTE);
VAR
DeltaX, DeltaY : INTEGER;
ShortLen : INTEGER;
X, Y : INTEGER;
Rem : INTEGER;
Total, Incr : WORD;
BEGIN
DeltaX := X2-X1+1;
DeltaY := Y2-Y1+1;
X := X1; Y := Y1;
ShortLen := DeltaX DIV DeltaY; { These two lines }
Rem := DeltaX MOD DeltaY; { become a single DIV }
Total := 32768; { .5 SHL 16 }
Incr := (LONGINT(Rem) SHL 16) DIV DeltaY;
WHILE Y <= Y2 DO
BEGIN
Total := Total + Incr;
IF Overflow THEN
DrawRun(ShortLen+1) { This uses the same DrawRun as before }
ELSE
DrawRun(ShortLen);
END;
END;
Notice now that Total and Incr become WORD variables instead of REAL variables, and we lose a floating-point divide. Now here's a nice loop to replace the WHILE loop in the above:
les DI, Screen.Buffer
mov BX, Y1
add BX, BX
add DI, WORD [BX+Screen.YTable]
mov BX, 32768 { 0.5 SHL 16 }
mov SI, Incr
mov DX, ShortLen
mov AH, BYTE [DeltaY]
mov AL, Color
xor CX, CX
@InnerLoop:
add BX, SI { Update our fraction... }
adc CX, DX { If it overflows, then draw a long line! }
rep stosb { Draw the run }
add DI, Screen.Width { Next Y! }
dec AH
jnz @InnerLoop { Try again }
Another nice thing about fixed point numbers, is that (if implemented right) they tend to get rid of conditional jumps... Compared to the original floating point routine, this one SCREAMS! You may notice that this routine doesn't use 32 bit registers. Instead, it uses 0.16 fixed point format. That means that no bits are allocated to holding the integer portion of the number. When the number in BX exceeds (or equals) one, the bits overflow (setting the overflow bit), but fall into the void of nothingness... Which is very convienient for us. (But the poor little bits!)
PROCEDURE RunLine(X1, Y1, X2, Y2 : INTEGER; Color : BYTE); ASSEMBLER;
ASM
les DI, Screen.Buffer
mov BX, Y1
add BX, BX
add DI, WORD [BX+Screen.YTable] { DI := Y1*320+X1 }
add DI, X1
mov AX, X2 { DeltaX := X2-X1+1 }
mov BX, Y2 { DeltaY := Y2-Y1+1 }
sub AX, X1
sub BX, Y1
inc AX
inc BX
{ ShortLen := DeltaX DIV DeltaY
Rem := DeltaX MOD DeltaY }
xor DX, DX { Clear upper word }
div BX { AX = ShortLen DX = Rem }
mov CX, AX { CX = ShortLen }
xor AX, AX { Rem is already SHL 16 }
div BX { Incr := (LONGINT(Rem) SHL 16) DIV DeltaY }
mov SI, AX { SI := Incr }
mov DX, CX { DX = ShortLen }
mov AH, BL { AH = DeltaY }
mov AL, Color
xor CX, CX { Initialize CX to zero }
mov BX, 32768 { BX = 0.5 SHL 16 }
@InnerLoop:
add BX, SI { Update our fraction... }
adc CX, DX { If it overflows, then draw a long line! }
rep stosb { Draw the run (Sets CX to zero) }
add DI, Screen.Width { Next Y! }
dec AH
jnz @InnerLoop { Try again }
END;
Things to be careful of though: Make sure that the direction flag is clear, and this won't draw
a line that goes up... Until the next section....
You know the drill... Now we use self modifying code, and unroll the loop so it's super fast... Here's the nifty self-modifying version:
PROCEDURE RunLine(X1, Y1, X2, Y2 : INTEGER; Color : BYTE); ASSEMBLER;
ASM
les DI, Screen.Buffer
mov BX, Y1
add BX, BX
add DI, WORD [BX+Screen.YTable] { DI := Y1*320+X1 }
add DI, X1
mov BX, X2 { DeltaX := X2-X1+1 }
mov AX, Y2 { DeltaY := Y2-Y1+1 }
sub BX, X1
sub AX, Y1
cwd
xor AX, DX
sub AX, DX { DeltaX := ABS(DeltaX) }
and DL, 00101000b { IF negative, DX is -1 }
{ Here, SUB has 5 in the Opcode field, where add has 0 }
and BYTE [CS:@ModifyAdd+1], 11000111b
or BYTE [CS:@ModifyAdd+1], DL
mov CX, Screen.Width { Modify the opcode field }
mov WORD [CS:@ModifyAdd+2], CX { Stick in the correct Screen.Width }
inc BX
inc AX
xchg AX, BX
{ ShortLen := DeltaX DIV DeltaY
Rem := DeltaX MOD DeltaY }
xor DX, DX { Clear upper word }
div BX { AX = ShortLen DX = Rem }
mov CX, AX { CX = ShortLen }
xor AX, AX { Rem is already SHL 16 }
div BX { Incr := (LONGINT(Rem) SHL 16) DIV DeltaY }
mov SI, AX { SI := Incr }
mov DX, CX { DX = ShortLen }
mov AH, BL { AH = DeltaY }
mov AL, Color
xor CX, CX { Initialize CX to zero }
mov BX, 32768 { BX = 0.5 SHL 16 }
@InnerLoop:
add BX, SI { Update our fraction... }
adc CX, DX { If it overflows, then draw a long line! }
rep stosb { Draw the run (Sets CX to zero) }
@ModifyAdd:
add DI, 12345 { Next Y! }
dec AH
jnz @InnerLoop { Try again }
END;
Notice that the self modifying portion doesn't use a jump in this version... All that differs between an ADD REG16, IMM16 and a SUB REG16, IMM16 is the opcode field of the ModR/M byte. This means that we just have to change that byte... Anyways, the next routine is the very, very fast one that we'll put in our library... It uses loop unrolling:
PROCEDURE RunLine(X1, Y1, X2, Y2 : INTEGER; Color : BYTE); ASSEMBLER;
ASM
les DI, Screen.Buffer
mov BX, Y1
add BX, BX
add DI, WORD [BX+Screen.YTable] { DI := Y1*320+X1 }
add DI, X1
mov BX, X2 { DeltaX := X2-X1+1 }
mov AX, Y2 { DeltaY := Y2-Y1+1 }
sub BX, X1
sub AX, Y1
cwd
xor AX, DX
sub AX, DX { DeltaX := ABS(DeltaX) }
and DL, 00101000b { IF negative, DX is -1 }
{ Here, SUB has 5 in the Opcode field, where add has 0 }
add DL, $C7 { Rest of the MOD R/M byte }
mov BYTE [CS:@ModifyAdd1+1], DL
mov BYTE [CS:@ModifyAdd2+1], DL
mov BYTE [CS:@ModifyAdd3+1], DL
mov BYTE [CS:@ModifyAdd4+1], DL
mov CX, Screen.Width { Modify the opcode field }
mov WORD [CS:@ModifyAdd1+2], CX { Stick in the correct Screen.Width }
mov WORD [CS:@ModifyAdd2+2], CX { Stick in the correct Screen.Width }
mov WORD [CS:@ModifyAdd3+2], CX { Stick in the correct Screen.Width }
mov WORD [CS:@ModifyAdd4+2], CX { Stick in the correct Screen.Width }
inc BX
inc AX
xchg AX, BX
{ ShortLen := DeltaX DIV DeltaY
Rem := DeltaX MOD DeltaY }
xor DX, DX { Clear upper word }
div BX { AX = ShortLen DX = Rem }
mov CX, AX { CX = ShortLen }
xor AX, AX { Rem is already SHL 16 }
div BX { Incr := (LONGINT(Rem) SHL 16) DIV DeltaY }
mov SI, AX { SI := Incr }
mov DX, CX { DX = ShortLen }
mov AH, BL { AH = DeltaY }
mov AL, Color
xor CX, CX { Initialize CX to zero }
push BP
mov BP, BX
mov BX, 32768 { BX = 0.5 SHL 16 }
and BP, 3 { Get remainder of CX DIV 4 }
shr AH, 2 { Repeated four times }
add BP, BP { Index into a word table }
jmp WORD PTR CS:[@JumpTable+BP] { Clears the cache }
@JumpTable:
dw OFFSET @Iteration5 { Enter at the correct place... }
dw OFFSET @Iteration4
dw OFFSET @Iteration3
dw OFFSET @Iteration2
@Iteration1:
add BX, SI { Update our fraction... }
adc CX, DX { If it overflows, then draw a long line! }
rep stosb { Draw the run (Sets CX to zero) }
@ModifyAdd1:
add DI, 12345 { Next Y! }
@Iteration2:
add BX, SI { Update our fraction... }
adc CX, DX { If it overflows, then draw a long line! }
rep stosb { Draw the run (Sets CX to zero) }
@ModifyAdd2:
add DI, 12345 { Next Y! }
@Iteration3:
add BX, SI { Update our fraction... }
adc CX, DX { If it overflows, then draw a long line! }
rep stosb { Draw the run (Sets CX to zero) }
@ModifyAdd3:
add DI, 12345 { Next Y! }
@Iteration4:
add BX, SI { Update our fraction... }
adc CX, DX { If it overflows, then draw a long line! }
rep stosb { Draw the run (Sets CX to zero) }
@ModifyAdd4:
add DI, 12345 { Next Y! }
@Iteration5:
dec AH
jns @Iteration1 { Try again }
pop BP { 6.2 cycles per pixel! }
END;
Although cycle counts depend on how long the run lengths are, we can estimate it. Lets say that
the pixel run length averages to about five pixels long. This means that it takes:
@Iteration1: { 0 cycles }
add BX, SI { 1 cycle }
adc CX, DX { 1 cycle }
rep stosb { 27 cycles = 7+4*CX }
@ModifyAdd1: { 0 cycles }
add DI, 12345 { 1 cycle }
{ Total: 30 cycles }
dec AH { 1 cycle }
jns @Iteration1 { 3 cycles if taken }
{ Total: 4 cycles }
This means that it takes 30*4 (Unrolled four times) + 4 (Loop code) cycles = 124 cycles. That is
124 cycles for 20 pixels. Total: 6.2 cycles per pixel. Not bad!
PROGRAM RunLengthTester;
USES GraphPro;
PROCEDURE RunLine(X1, Y1, X2, Y2 : INTEGER; Color : BYTE); ASSEMBLER;
ASM
les DI, Screen.Buffer
mov BX, Y1
add BX, BX
add DI, WORD [BX+Screen.YTable] { DI := Y1*320+X1 }
add DI, X1
mov BX, X2 { DeltaX := X2-X1+1 }
mov AX, Y2 { DeltaY := Y2-Y1+1 }
sub BX, X1
sub AX, Y1
cwd
xor AX, DX
sub AX, DX { DeltaX := ABS(DeltaX) }
and DL, 00101000b { IF negative, DX is -1 }
{ Here, SUB has 5 in the Opcode field, where add has 0 }
add DL, $C7 { Rest of the MOD R/M byte }
mov BYTE [CS:@ModifyAdd1+1], DL
mov BYTE [CS:@ModifyAdd2+1], DL
mov BYTE [CS:@ModifyAdd3+1], DL
mov BYTE [CS:@ModifyAdd4+1], DL
mov CX, Screen.Width { Modify the opcode field }
mov WORD [CS:@ModifyAdd1+2], CX { Stick in the correct Screen.Width }
mov WORD [CS:@ModifyAdd2+2], CX { Stick in the correct Screen.Width }
mov WORD [CS:@ModifyAdd3+2], CX { Stick in the correct Screen.Width }
mov WORD [CS:@ModifyAdd4+2], CX { Stick in the correct Screen.Width }
inc BX
inc AX
xchg AX, BX
{ ShortLen := DeltaX DIV DeltaY
Rem := DeltaX MOD DeltaY }
xor DX, DX { Clear upper word }
div BX { AX = ShortLen DX = Rem }
mov CX, AX { CX = ShortLen }
xor AX, AX { Rem is already SHL 16 }
div BX { Incr := (LONGINT(Rem) SHL 16) DIV DeltaY }
mov SI, AX { SI := Incr }
mov DX, CX { DX = ShortLen }
mov AH, BL { AH = DeltaY }
mov AL, Color
xor CX, CX { Initialize CX to zero }
push BP
mov BP, BX
mov BX, 32768 { BX = 0.5 SHL 16 }
and BP, 3 { Get remainder of CX DIV 4 }
shr AH, 2 { Repeated four times }
add BP, BP { Index into a word table }
jmp WORD PTR CS:[@JumpTable+BP] { Clears the cache }
@JumpTable:
dw OFFSET @Iteration5 { Enter at the correct place... }
dw OFFSET @Iteration4
dw OFFSET @Iteration3
dw OFFSET @Iteration2
@Iteration1:
add BX, SI { Update our fraction... }
adc CX, DX { If it overflows, then draw a long line! }
rep stosb { Draw the run (Sets CX to zero) }
@ModifyAdd1:
add DI, 12345 { Next Y! }
@Iteration2:
add BX, SI { Update our fraction... }
adc CX, DX { If it overflows, then draw a long line! }
rep stosb { Draw the run (Sets CX to zero) }
@ModifyAdd2:
add DI, 12345 { Next Y! }
@Iteration3:
add BX, SI { Update our fraction... }
adc CX, DX { If it overflows, then draw a long line! }
rep stosb { Draw the run (Sets CX to zero) }
@ModifyAdd3:
add DI, 12345 { Next Y! }
@Iteration4:
add BX, SI { Update our fraction... }
adc CX, DX { If it overflows, then draw a long line! }
rep stosb { Draw the run (Sets CX to zero) }
@ModifyAdd4:
add DI, 12345 { Next Y! }
@Iteration5:
dec AH
jns @Iteration1 { Try again }
pop BP
END;
VAR I : INTEGER;
BEGIN
InitGraph;
FOR I := 0 TO 199 DO
BEGIN
RunLine(0, 199, 319, 199-I, I AND 1);
RunLine(0, 0, 319, I, I AND 1 SHL 2);
END;
READLN;
CloseGraph;
END.