For comparison, the original version of the algorithm is kept in a comment block in the source code. The best way to see how I've improved the calculation speed is to read the code, but in summary:
In the original version of the algorithm, the inner loop made 9 reads from and 5 writes to main memory. After applying the above optimisations, this became 3 reads from and 5 writes to main memory, plus one read from the data cache. The table below shows that, depending on the optimisation options chosen, the program generated by GCC runs 2-4 times faster than the one produced by CodeWarrior! At the higher optimisation levels, the most aggressive options of each compiler were used. For reference, the unoptimised version takes 566 HSyncs when compiled with GCC at optimisation level 2.
The timings for GCC show that there is no tangible benefit to be had from level 3 optimisations. GCC's level 3 also has some undesirable optimisations, such as loop unrolling. Although the results for the two compilers are presented alongside one another, the table should not be read as implying that their optimisation levels are equivalent; they are not.
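As a rough illustration of how an inner loop's memory reads can be cut in the way summarised above, the sketch below uses a hypothetical three-point averaging loop. It is not the algorithm from the source code, and the function names `smooth_naive` and `smooth_carried` are invented for this example. The first version re-reads three inputs on every iteration; the second carries the two values it has already loaded in local variables, so each iteration issues only one new read.

```c
#include <stddef.h>

/* Hypothetical example, not the article's algorithm.
   Naive form: every iteration reads three neighbouring inputs from memory. */
void smooth_naive(const int *in, int *out, size_t n)
{
    for (size_t i = 1; i + 1 < n; i++) {
        out[i] = (in[i - 1] + in[i] + in[i + 1]) / 3;
    }
}

/* Reworked form: the two values already loaded are carried over in locals,
   so each iteration performs a single new read and the same single write. */
void smooth_carried(const int *in, int *out, size_t n)
{
    if (n < 3) {
        return; /* nothing to do: no element has two neighbours */
    }

    int prev = in[0];
    int curr = in[1];

    for (size_t i = 1; i + 1 < n; i++) {
        int next = in[i + 1];          /* the only read in the loop body */
        out[i] = (prev + curr + next) / 3;
        prev = curr;                   /* slide the window along by one */
        curr = next;
    }
}
```

Both functions produce identical output when `in` and `out` do not overlap. A compiler may perform this transformation itself, but it often cannot when it is unable to prove that the two pointers refer to separate memory, which is one reason to write it out by hand.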