A Short Guide to 3D Graphics Performance testing
- 1 Understand the system
- 2 Find the bottleneck
- 3 Eliminate the bottleneck
- 4 Rinse and repeat
- 5 This is an over-simplification
- 6 Low-end hardware
Understand the system
In most 3D applications, whether under OpenGL, OpenGL ES, WebGL or Direct3D, there are four principal places where speed bottlenecks can happen:
- The CPU - you have to calculate what meshes you're going to draw and set them up for rendering. This tends to be more or less a fixed cost per mesh.
- The transmission link between CPU and GPU which is a cost that depends on the number of vertices you send multiplied by the number of per-vertex attributes - plus the cost of updating textures and shaders that you might change during that frame.
- The GPU's vertex processor. This is the per-vertex processing cost of transforming/lighting your vertex data - without shaders, the cost roughly depends on the number of vertices times the number of lights you have turned on - with shaders, the cost roughly depends on the number of vertices times the complexity of your shader.
- The GPU's pixel/fragment processor. This is the per-pixel cost for pixels that pass clipping. The cost roughly depends on the number of pixels you draw onto the screen times the number of textures you use and/or the complexity of your fragment shader.
In general, these four parts of the process are happening in parallel. While the GPU's pixel processor is drawing the pixels of one set of triangles, the GPU vertex processor is transforming the vertices of the ones that came along a little later, the transmission system is shoving the next set of vertices after that into the input queue of the GPU - and the CPU is working on setting up the next mesh to draw.
When one of these four systems is processing meshes slower than the others, the ones before it will be stuck waiting for it to complete - and the ones after it will be idling while they wait for more polygons to process.
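As a rough illustration of why the slowest stage sets the pace, here is a toy model in Python. It assumes the four stages overlap perfectly, so the steady-state time per frame is simply the largest of the four costs. The stage names and millisecond figures are made up for the example, not measurements.

```python
# Toy model: the four stages run in parallel, so the slowest one sets the pace.
# The millisecond figures are invented illustration values.

def frame_time_ms(stage_costs_ms):
    """Approximate steady-state time per frame as the cost of the slowest stage."""
    return max(stage_costs_ms.values())

def bottleneck(stage_costs_ms):
    """Return the name of the stage with the highest cost."""
    return max(stage_costs_ms, key=stage_costs_ms.get)

costs = {"cpu": 4.0, "transmission": 2.5, "vertex": 3.0, "pixel": 9.0}
print(bottleneck(costs), frame_time_ms(costs))  # pixel 9.0

# Halving the CPU cost changes nothing, because the pixel stage still dominates.
faster_cpu = dict(costs, cpu=2.0)
print(frame_time_ms(faster_cpu))  # 9.0
```

In this model, only work on the bottleneck stage changes the frame time, which is the reason the rest of this guide focuses on finding that stage first.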
Since all four systems are working in parallel, you can't simply use the CPU clock to time how long a particular mesh takes to draw. If something other than the CPU is the bottleneck, you may actually be measuring how long some previous mesh took to draw. In the end, you really can't disentangle the timings like that.
The best thing you can do is to measure the long-term average frame rate of your system. Render (say) 100 frames and calculate the average time per frame.
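A minimal sketch of that measurement in Python follows. The `render_frame` callable is hypothetical: it stands in for whatever your application does to draw and present one complete frame. The sketch also assumes that nothing else (such as vertical sync) is capping the frame rate, since that would hide the real cost.

```python
import time

def average_frame_time(render_frame, frames=100):
    """Call render_frame() `frames` times and return mean wall-clock seconds per frame."""
    start = time.perf_counter()
    for _ in range(frames):
        render_frame()  # hypothetical: draws and presents one full frame
    return (time.perf_counter() - start) / frames

# Stand-in workload so the sketch runs without a graphics context.
def fake_frame():
    time.sleep(0.001)

avg = average_frame_time(fake_frame, frames=20)
print(f"{avg * 1000:.2f} ms per frame")
```

Timing the whole run and dividing, rather than timing each frame separately, avoids attributing one frame's GPU work to another.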
Find the bottleneck
If your application is running slower than you'd hoped - then you need to establish which of these four things is the biggest problem.
GPU Pixel processing
A pixel processing bottleneck is easy to test for: reduce the size of the window you're rendering to (keeping everything else the same). If your program goes faster in rough proportion to the area of the window (height x width), then pixel processing is the bottleneck.
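The window-size test can be written down as a small check. This is only a sketch: the function name, the 25% tolerance used for "rough proportion", and the sample timings are all assumptions chosen for illustration.

```python
def looks_pixel_bound(t_full, t_small, area_full, area_small, tolerance=0.25):
    """True if the speedup from shrinking the window roughly matches the area ratio."""
    speedup = t_full / t_small          # how much faster the small window renders
    area_ratio = area_full / area_small # how many times fewer pixels it covers
    return abs(speedup - area_ratio) <= tolerance * area_ratio

full = 1280 * 720
small = 640 * 360  # one quarter of the pixels

# Frame time drops from 20 ms to 5.5 ms: close to the 4x area ratio.
print(looks_pixel_bound(20.0, 5.5, full, small))   # True

# Frame time barely moves: something other than pixel work is the limit.
print(looks_pixel_bound(20.0, 19.0, full, small))  # False
```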
If you have ruled out pixel processing, the next test relies on the fact that CPU time generally doesn't depend on the number of vertices you draw. Keep rendering to a tiny window (to more or less eliminate pixel processing costs) and, just as a test, deliberately halve the number of triangles in each mesh. If your application's performance increases by roughly a factor of two, then you were not limited by the CPU's per-mesh costs. If your performance hardly changes, then you're probably drawing too many objects or doing too much per-mesh work in the CPU, and you need to improve your code.
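Interpreting the triangle-halving test might look like the sketch below. The labels, the function name and the 25% tolerance are assumptions made for the example; "vertex-bound" here covers both transmission and GPU vertex processing, since this test cannot tell them apart.

```python
def classify_after_halving(t_before, t_after, tolerance=0.25):
    """Interpret frame times measured before and after halving the triangle count."""
    speedup = t_before / t_after
    if speedup >= 2.0 * (1 - tolerance):
        return "vertex-bound"  # time scaled with vertex count
    if speedup <= 1.0 + tolerance:
        return "cpu-bound"     # time did not care about vertex count
    return "mixed"             # somewhere in between: more than one limit in play

print(classify_after_halving(16.0, 8.5))   # vertex-bound
print(classify_after_halving(16.0, 15.5))  # cpu-bound
print(classify_after_halving(16.0, 12.0))  # mixed
```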
GPU Vertex transmission/processing
Figuring out whether the transmission costs (the second item in the list above) or the GPU's vertex processing costs (the third item) are your problem is tricky. But since both depend mostly on the number of vertices, you probably don't need to.
Eliminate the bottleneck
If CPU time is the culprit
You either need to optimize your code so that other CPU time-sinks are reduced (e.g. make physics, collision, AI, etc. faster) - or you need to improve your field-of-view culling so you draw fewer meshes that are off-screen - or you need to reduce the number of meshes in your art (e.g. by combining multiple parts of an object into a single object using tricks like texture atlasing).
In my experience, the last of these is the first thing most people should be looking at.
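To make the idea of combining parts concrete, here is a sketch of merging several indexed meshes into one. It assumes each part is a (vertices, indices) pair, that all parts already share one coordinate space, and that they can share one texture (for example through an atlas). The data layout is hypothetical, not taken from any particular engine.

```python
def merge_meshes(meshes):
    """Combine (vertices, indices) pairs into one mesh that can be drawn as a unit."""
    vertices, indices = [], []
    for mesh_vertices, mesh_indices in meshes:
        base = len(vertices)  # this part's indices must skip past earlier vertices
        vertices.extend(mesh_vertices)
        indices.extend(base + i for i in mesh_indices)
    return vertices, indices

tri_a = ([(0, 0, 0), (1, 0, 0), (0, 1, 0)], [0, 1, 2])
tri_b = ([(2, 0, 0), (3, 0, 0), (2, 1, 0)], [0, 1, 2])

merged_vertices, merged_indices = merge_meshes([tri_a, tri_b])
print(len(merged_vertices), merged_indices)  # 6 [0, 1, 2, 3, 4, 5]
```

Two meshes become one, so the fixed per-mesh CPU cost is paid once instead of twice while the vertex count stays the same.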
If GPU vertex transmission/processing is the culprit
Then your meshes are too complex or you have too many (fixed function) light sources or your vertex shader is too complex. Use level-of-detail to reduce the number of vertices in meshes that are further from the eyepoint. Consider doing occlusion queries to reduce the number of meshes you draw. Optimize light source culling.
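A level-of-detail scheme can be as simple as picking a mesh version by distance. The sketch below assumes a sorted list of switch distances, with index 0 as the most detailed version. The distances are invented for the example.

```python
import bisect

def select_lod(distance, switch_distances):
    """Return the LOD index: 0 is most detailed, rising at each switch distance reached."""
    return bisect.bisect_right(switch_distances, distance)

# Three switch distances give four levels of detail (0 to 3).
switches = [10.0, 50.0, 200.0]

print(select_lod(5.0, switches))     # 0
print(select_lod(75.0, switches))    # 2
print(select_lod(1000.0, switches))  # 3
```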
If GPU pixel processing is the culprit
Simplify your pixel shaders. Consider doing a depth-only pass before your 'beauty' pass. Can you reduce the resolution of your textures? Can you draw to a smaller window? Can you make better use of approximate front-to-back rendering order?
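Approximate front-to-back ordering can be sketched as a sort on distance from the eyepoint. This assumes opaque meshes with depth testing enabled, so that nearer surfaces drawn first let the hardware reject hidden pixels before they are shaded. The mesh records and their "center" field are hypothetical.

```python
def sort_front_to_back(meshes, eye):
    """Order meshes by squared distance from the eye to each mesh's center."""
    ex, ey, ez = eye

    def squared_distance(mesh):
        cx, cy, cz = mesh["center"]
        return (cx - ex) ** 2 + (cy - ey) ** 2 + (cz - ez) ** 2

    return sorted(meshes, key=squared_distance)

scene = [
    {"name": "far", "center": (0.0, 0.0, -90.0)},
    {"name": "near", "center": (1.0, 0.0, -2.0)},
    {"name": "mid", "center": (0.0, 5.0, -20.0)},
]

ordered = sort_front_to_back(scene, eye=(0.0, 0.0, 0.0))
print([m["name"] for m in ordered])  # ['near', 'mid', 'far']
```

Sorting by center point is only approximate for large or overlapping meshes, which is usually good enough for this purpose.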
Rinse and repeat
When you've done some kind of optimization, re-do the testing phase to see if you've sped things up - and also (very important) to see if you moved the bottleneck somewhere else. If the CPU was the limiting factor and you improved that, then perhaps the GPU is now the limiting factor. If so, then speeding up your CPU code still further probably won't result in better frame rates. So after each round of optimization, see which part of the system needs more work next.
Also, if you're getting good frame rates - then you can probably improve the quality of your graphics by drawing more meshes, more polygons, or having more complicated lighting or something. When things are humming along fast enough, you can use "reverse optimization" to understand where you could be making things look nicer at little to no cost to performance.
This is an over-simplification
There are times when (for example) your CPU is spending too long getting to the point in the frame cycle where it starts drawing meshes...then, once it starts drawing them, the meshes are too complicated and the CPU is held up waiting for the GPU to get done. In such cases, you have multiple bottlenecks, and reducing either the CPU work or the vertex count will improve performance.
Low-end hardware
On things like cellphones and very low-end Intel GPUs, it's likely that the "GPU vertex processing" stage is actually happening in software on the CPU. In this case, the cost per-vertex for "transmission" is essentially zero because everything is sitting in the CPU's main memory - and the per-vertex cost for "GPU vertex processing" is actually slowing the CPU down.
In such circumstances, you can still figure out whether it's per-pixel costs (by shrinking the window size down) - but halving the number of vertices in each mesh won't tell you much about whether you have too many vertices or too many meshes. In a sense, it doesn't matter because improving one will relieve the situation for the other - and whichever one is easiest to improve should be the one you attack first.