Speeding up SetPixel

Table of Contents:
Review and reminders
Getting rid of the multiply
Using Shifts
Using a Table
A example program for pixels


Review and Reminders

In the last installment, we learned how to set the graphics mode and draw pixels. Although the routine we provided was adequate, it was no where near fast enough for real time games. Here is where we start squeezing cycles out of the Turbo Pascal compiler. Then we convert it to inline assembler, and watch it scream!


Getting rid of the multiply

To optimize for speed, simpler is better. Below is the routine that we used last time to set a pixel:
PROCEDURE SetPixel(X, Y : INTEGER; Color : BYTE);
BEGIN
  Screen^[Y*320+X] := Color;
END;
As you can see, this is only one line of Pascal code. Although this may seem to be very simple, there is an operation that uses up most of the time spent in the routine. That is the *. Multiplying on a 486 costs a wopping 26 cycles. If we can rid ourselves of this multiply, we can greatly speed this up. There are two ways to remove the multiply:

Each is examined below.


Using Shifts

One way to kill some of the cycles involved in the shift is to use shifts. In assembly language on a 486, they only cost two cycles a piece. Since each SHL (shift left) by one multiplies by two, I will use the following formula:

  Address := (Y SHL 8)+(Y SHL 6)+X;

  Y SHL 8 = Y*256
  Y SHL 6 = Y*64

  Y*256+Y*64 = Y*(256+64) = Y*320
Simple, isn't it? Well, now to convert it to assembly language, we must make a decision. Should we support the 8086 or not? If we drop 8086 support, we can realize a great speed increase. The 286 supports combined shifts. This means that...

Instead of this:

shl AX, 1   ;3 cycles
shl AX, 1   ;3 cycles
shl AX, 1   ;3 cycles
shl AX, 1   ;3 cycles
shl AX, 1   ;3 cycles
shl AX, 1   ;3 cycles
            ;18 cycles total
We can use this:

shl AX, 6   ;2 cycles
            ;2 cycles total
Well, I have decided to drop the 8086, since who has one anymore anyways! (Those people that do do not expect realtime graphics anyway) Below is the assembly language version of this routine. Notice that I combine shifts to save repetitive shifting:

PROCEDURE SetPixel(X, Y : INTEGER; Color : BYTE); ASSEMBLER;
ASM
  les DI, Screen    { load the Screen buffer pointer } 
  add DI, X         { Add in the X offset }

  mov AX, Y         { Load Y into AX      }
  shl AX, 6         { AX = Y*64           }
  add DI, AX        { DI = X+Y*64         }
  shl AX, 2         { AX = Y*64*4 = Y*256 }
  add DI, AX        { DI = X+Y*64+Y*256   }

  mov AL, Color     { Load the color into AX } 
  mov ES:[DI], AL   { Write to the VGA }
END;
This is a huge improvement, but there still is more that can be done. With tables, we can remove the shifts entirely, and therefore take less time. To fill a screen, we must call this routine 64,000 times. For each cycle that we can eek out of this procedure, we save our program 64,000 cycles that can be put to better use.


Using Tables

Another way to get rid of the multiply is to precalculate 320*Y. We store the result in a table with one entry for each Y value. From the start, I must point out that this uses more memory (640 bytes), but it adds flexibility and speed. This is the method that we will use for the rest of the series...

To start, we need to initialize a table. We can define it as follows:

VAR
  ScreenY : ARRAY[0..199] OF WORD;   {Y can range from 0 - 199 }
And we can precalculate it with this routine:
PROCEDURE CalcScreenY(Width : WORD);  { Allow for future expansion } 
VAR I : INTEGER;
BEGIN
  FOR I := 0 TO 199 DO
    ScreenY[I] := I*Width;
END;
Although this routine could be sped up by incremental calculations, it only happens at the initialization of our program, so it would not be worth the extra confusion it might cause. Anyway, we need to call this routine from InitGraph, so InitGraph becomes this:
PROCEDURE InitGraph;
BEGIN
  ASM
    mov AX, 0013h    { Function 0, mode 13h }
    int 10h
  END;
  Screen := PTR($A000, 0);   { Set up the screen buffer pointer }
  CalcScreenY(320);
END;
And now our SetPixel routine becomes this:
PROCEDURE SetPixel(X, Y : INTEGER; Color : BYTE);
BEGIN
  Screen^[ScreenY[Y]+X] := Color;
END;
This is much more elegant than our previous two routines, because it lends itself well to assembler. There is no multiplication, so all we have is adds. The following is the optimized version of the routine:
PROCEDURE SetPixel(X, Y : INTEGER; Color : BYTE); ASSEMBLER;
ASM
  les DI, Screen
  add DI, X
  mov BX, Y       { Y is an index into the ScreenY array        }
  add BX, BX      { Multipy by two because each entry is a word }
  add DI, DS:[BX+OFFSET ScreenY]

  mov AL, Color
  mov ES:[DI], AL
END;
THIS is a very fast routine. By using:
  add BX, BX   ;1 cycle
instead of:
  shl BX, 1    ;3 cycles
We save two cycles. This procedure is all about sacrificing memory for speed. It always comes down to a conflict between the memory needs of a program and the speed needs. YOU, the programmer must make decisions based on how reasonable it would be to sacrifice the memory. For instance, we could have made a VERY fast routine by having a 2 dimensional array. Just plug in X and Y, and get out an address. This would have taken up 128k of memory and I didn't think that that was a reasonable sacrifice.


Example Program

The program below is the same program from previous lessons, except that now we use the table method of puting pixels. At this point, the RANDOM function becomes the bottleneck of the program. Here it is:

-----------------] Example Starts here [-----------------
PROGRAM PixelExample;
USES CRT;   { Include the "KeyPressed" function }
TYPE
  ScreenBufferType = ARRAY[0..63999] OF BYTE;
  ScreenBufferPtr = ^ScreenBufferType;

VAR
  Screen : ScreenBufferPtr; 
  ScreenY : ARRAY[0..199] OF WORD;   {Y can range from 0 - 199 }

PROCEDURE CalcScreenY(Width : WORD);  { Allow for future expansion } 
VAR I : INTEGER;
BEGIN
  FOR I := 0 TO 199 DO
    ScreenY[I] := I*Width;
END;

PROCEDURE InitGraph;
BEGIN
  ASM
    mov AX, 0013h    { Function 0, mode 13h }
    int 10h
  END;
  Screen := PTR($A000, 0);   { Set up the screen buffer pointer }
  CalcScreenY(320);
END;

PROCEDURE CloseGraph; ASSEMBLER;
ASM
  mov AX, 0003h      { Function 0, Mode 3}
  int 10h
END;

PROCEDURE SetPixel(X, Y : INTEGER; Color : BYTE); ASSEMBLER;
ASM
  les DI, Screen
  add DI, X
  mov BX, Y       { Y is an index into the ScreenY array        }
  add BX, BX      { Multipy by two because each entry is a word }
  add DI, DS:[BX+OFFSET ScreenY]

  mov AL, Color
  mov ES:[DI], AL
END;

BEGIN
  InitGraph;
  WHILE NOT KeyPressed DO
    SetPixel(
       RANDOM(320),     { Random X (Range 0-319)     }
       RANDOM(200),     { Random Y (Range 0-199)     }
       RANDOM(256));    { Random Color (Range 0-255) } 
  CloseGraph;
END.
-----------------] Example Ends here [-----------------


  • Created by Chris Lattner