Profiling, timers, compiled code

Optimizing GPU code, as for parallel programming (but worse), is difficult and unintuitive. Several tools can help in this task.
Alas, none work in webGL, only on desktop. But you can easily port your webGLSL to desktop, and even more easily using some glue tools like libShadertoy (see page Applications compatible with Shadertoy ).

Profiling tools:

nVidia insight (different features depending you want the windows VisualStudio, linux Eclipse, or standalone version).

Getting timers:

  • C-side: timer Query tells you the precise real duration (in ms) of rendering of your drawcall.
  • GLSL-side: Shader clock lets you mesure the current SM clock (in ticks) any time you want in your shader.
    It’s often useful to check the current SM ID, warp ID, and corresponding ranges.

Getting compiled code:

two methods:

  • Compile the shader apart, using cgc compiler or shaderPlayground app(windows).
    Problem: you must choose the target language (openGL, openGL-ES, webGL, …) and language version. Hard to be sure which will be used in your app, especially for webGL in a browser.
  • Getting the assembly from your app: GetProgramBinary()

Now, it’s always interesting to see the generated code, but it’s often not easy to deduce right things from it in terms of how to optimize (apart for very basic arithmetic or algorithmic). For instance key aspects for perfs are number of registers used (because it constraints how many alternate threads can be piled-up to mask wait-states), divergence (conditionally executed parts that will differ for different warp pixels), consequence of wait-states (because waiting for maths or texture or worse, dependent chain – that optimizer can improved by shuffling commands), all things that not easily read in the code or that optimizer could improve by producing a code looking strange at first glance. Also, optimizer tend to produce long code by unrolling everything in order to resolve more operations at compilation time, but this can also yields apparently ugly complex code nonetheless more performant.

In the scope of webGL, remind that upstream of that windows will by default transpile the code to HLSL, using different version of D3D depending you run Firefox or Chrome, instead of GLSL compilation. And the layer Angle transforms your GLSL code to try to fix errors and instabilities of webGL, but different browsers activates different Angle patches or don’t use it at all.
On the other end, the “assembly code” is indeed an intermediate code, that is compiled further in the GPU.
That’s why profiling tools and timers are probably more useful for optimization 😉

Leave a comment