Path: chuka.playstation.co.uk!toby
From: toby@angst.forefront.com.au (Toby Sargeant)
Newsgroups: scee.yaroze.freetalk.english
Subject: Re: Speed Optimisation
Date: 24 Feb 1998 04:32:15 GMT
Organization: PlayStation Net Yaroze (SCEE)
Lines: 124
Message-ID:
References: <34F1D481.5FFF@mdx.ac.uk> <34F22032.A62E0C6C@netmagic.net>
NNTP-Posting-Host: ns.forefront.com.au
X-Newsreader: slrn (0.9.4.6 UNIX)

On Mon, 23 Feb 1998 17:19:46 -0800, Elliott Lee wrote:
>[...]
>>   - Only draw the part of the world that is immediately visible
>>     (in 3D games).
>
>If you can divide your world into logical units/buckets, you'll be able
>to save rendering time by identifying only those within a certain visual
>range. You could set the fog parameter to obscure the farther objects so
>they don't "pop" into the visual space. I liked Tomb Raider II's
>approach---distant objects go completely black.

Has anyone considered implementing a BSP renderer for the Yaroze? The
big trick is going to be getting away, at least partially, from those
all-pervasive ordering tables. One big question, I guess, is whether BSP
is geared towards platforms that are render bound rather than compute
bound. If it can be done, though, the results could be very nice indeed.
Freedom from z-sorting artifacts and intersecting polygons is a definite
plus.

>>   - Define small, often-called functions as macros wherever
>>     possible.
>
>That's great if you don't mind larger code. Used sparingly, yeah, that
>works well.
>
>>   - Use variables which are the same size as the registers (eg
>>     unsigned long) wherever possible.
>>   - Use lookup tables to replace calculations where calculations
>>     are expensive (eg: floating point, trig functions etc).
>>   - Otherwise avoid any floating point arithmetic.
>
>You mean things like fixed-point tables?
>
> [...]
>> And it's possible that some traditional optimisation techniques may be
>> inappropriate:
>>   - eg: loop unrolling was a good optimisation on primitive
>>     architectures but not necessarily on more modern ones.
>>
>> Peter.
>
>Actually, some compilers (when you specify certain optimisation flags)
>will do a few tests on their own and unroll loops up to a certain
>threshold. If you are really desperate for speed, I suppose you could
>do a little loop unrolling...

I would imagine that you can get some speed increases out of loop
unrolling on an R3000, because of the delay slots introduced by the
branch. If it's a loop that iterates over a small amount of code a large
number of times, that delay slot can have a very large effect on the
speed of execution, if the compiler can't find anything useful to do
with it. You have to weigh this up against the size of the instruction
cache, though, and I haven't found any information about the on-chip
caches of the R3000.

>Something commonly overlooked is pointer dereferencing in things like
>arrays. If you're going to be doing lots of testing, it's usually
>best to store the values into temporary variables. e.g.
>
>Unless the compiler is really smart, every test must calculate
>the offset in the ground[][] array. Do all the dereferencing once
>and get some good savings:

If your compiler doesn't pick this up and optimise it to death, then
switch compilers. Using gcc -O3, the test code at the bottom of this
post produces ix86 code that looks pretty much spot on in terms of
possible optimisations (at least to my somewhat untrained eye). There's
certainly no duplicated calculation. The difference between using and
not using -O3 is pretty dramatic.
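(For illustration, here's roughly the kind of hand-hoisting Elliott is
describing. His original snippet didn't survive the quoting, so the
ground[][] array, its dimensions and the tests below are invented:)

/* Sketch only: hoist the row address and the tested value into locals
 * so the offset into ground[][] is calculated once per iteration
 * instead of once per test. GROUND_W, GROUND_H and test_cell are
 * made-up names for this example.
 */
#define GROUND_W 64
#define GROUND_H 64

static int ground[GROUND_H][GROUND_W];

void test_cell(int x, int y) {
    int *row = ground[y];   /* &ground[y][0], computed once */
    int  g   = row[x];      /* the value, fetched once      */

    if (g == 0) row[x] += 1;
    if (g == 1) row[x] += 2;
    if (g == 2) row[x] += 3;
    /* ...and so on, reusing g and row instead of ground[y][x] */
}

Written this way or the naive way, gcc -O3 should end up generating much
the same code, which is the point below.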
With -O3, each line in the inner loop of that test was cut from 39 lines
of assembler to 3. In fact, when I added explicit variables to hold the
address of a[y][x] and the value of a[x][y], the compiler produced
_worse_ code at -O3 (but much better code with optimisation off).

Depending on your target CPU, a bigger issue is cache hit rate.
Effective use of the D-cache should be very important. I'm a bit wary of
the belief that sticking the stack in the D-cache is the best use it can
be put to. If a section of code does a lot of manipulation of an array,
for example, it would be a big advantage to have that array stored in
the cache, whereas parameters passed to functions can often be kept in
registers. Having your stack in the cache speeds up all of your code by
a little bit, but theoretically, using it to store data can speed up
small sections of your code a lot. And given the old adage that 10% of
the code takes 90% of the time, speeding up that 10% a lot is much more
useful.

The other thing is, of course, that 90% of the possible optimisations in
almost any code are at the level of algorithms and data structures.
There's very little point going after the 10% until you're sure that
you've already optimised the first 90%.

When optimising the last 10%, gcc -S is your friend. It'll produce
assembly code from your .c files in the corresponding .s files. Matching
the two together, it becomes pretty obvious where the compiler is
producing bad code. Then, just play around until the compiler generates
better code, or rewrite it yourself in assembler.

int main(void) {
    int x, y;
    static int a[10][10];

    /* fill a[0..8][0..8] with small values */
    for (x = 0; x < 9; x++) {
        for (y = 0; y < 9; y++) {
            a[x][y] = (x + y) % 5;
        }
    }

    /* each test re-indexes a[x][y]; see the -O3 discussion above */
    for (x = 0; x < 9; x++) {
        for (y = 0; y < 9; y++) {
            if (a[x][y] == 0) a[y][x]++;
            if (a[x][y] == 1) a[y][x] += 2;
            if (a[x][y] == 2) a[y][x] += 3;
            if (a[x][y] == 3) a[y][x] += 4;
            if (a[x][y] == 4) a[y][x] += 5;
            if (a[x][y] > 5)  a[y][x] = 0;
        }
    }

    return 0;
}

>
>My $0.02,
>- e!
>  tenchi@netmagic.net
>  http://www.netmagic.net/~tenchi/yaroze/

and mine..

Toby. (S, rather than H :) )
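P.S. In case the D-cache comment above reads as pure hand-waving, the
sort of thing I have in mind is copying a hot table into the scratchpad
and indexing that copy inside the inner loops. This is only a sketch:
treat the base address and the 1K size as things to check against the
library docs, and sintab as an invented stand-in for whatever table your
code actually hammers.

/* Sketch only: the 1K scratchpad (the D-cache used as fast RAM) is,
 * as far as I remember, mapped at 0x1f800000 -- check the docs before
 * relying on it. sintab is a made-up "hot" table for this example.
 */
#define SCRATCH ((short *)0x1f800000)

static short sintab[256];            /* 512 bytes, fits in the 1K pad */

void use_scratch(void) {
    short *fast = SCRATCH;
    int i;

    /* copy the table into the scratchpad once, before the hot loops */
    for (i = 0; i < 256; i++)
        fast[i] = sintab[i];

    /* ...inner loops then index fast[] instead of sintab[]... */
}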