Profiling, timers, compiled code

Optimizing GPU code, as for parallel programming (but worse), is difficult and unintuitive. Several tools can help in this task.
Alas, none work in webGL, only on desktop. But you can easily port your webGLSL to desktop, and even more easily using some glue tools like libShadertoy (see page Applications compatible with Shadertoy ).

Profiling tools:

nVidia insight (different features depending you want the windows VisualStudio, linux Eclipse, or standalone version).

Getting timers:

  • C-side: timer Query tells you the precise real duration (in ms) of rendering of your drawcall.
  • GLSL-side: Shader clock lets you mesure the current SM clock (in ticks) any time you want in your shader.
    It’s often useful to check the current SM ID, warp ID, and corresponding ranges.

Getting compiled code:

two methods:

  • Compile the shader apart, using cgc compiler or shaderPlayground app(windows).
    Problem: you must choose the target language (openGL, openGL-ES, webGL, …) and language version. Hard to be sure which will be used in your app, especially for webGL in a browser.
  • Getting the assembly from your app: GetProgramBinary()

Now, it’s always interesting to see the generated code, but it’s often not easy to deduce right things from it in terms of how to optimize (apart for very basic arithmetic or algorithmic). For instance key aspects for perfs are number of registers used (because it constraints how many alternate threads can be piled-up to mask wait-states), divergence (conditionally executed parts that will differ for different warp pixels), consequence of wait-states (because waiting for maths or texture or worse, dependent chain – that optimizer can improved by shuffling commands), all things that not easily read in the code or that optimizer could improve by producing a code looking strange at first glance. Also, optimizer tend to produce long code by unrolling everything in order to resolve more operations at compilation time, but this can also yields apparently ugly complex code nonetheless more performant.

In the scope of webGL, remind that upstream of that windows will by default transpile the code to HLSL, using different version of D3D depending you run Firefox or Chrome, instead of GLSL compilation. And the layer Angle transforms your GLSL code to try to fix errors and instabilities of webGL, but different browsers activates different Angle patches or don’t use it at all.
On the other end, the “assembly code” is indeed an intermediate code, that is compiled further in the GPU.
That’s why profiling tools and timers are probably more useful for optimization 😉

Advertisements

Readings (shaders, maths, 3D)

People often ask where to start, and which readings to help starting or progressing.
Some just want to learn shaders, others want to get more fluent in the maths behind, some are specifically interested into 3D rendering.

Here are a sample of some online resource, free books and pay books that I often saw mentioned as helpful :
[ Disclaimer:   the purpose of this page is NOT to catalog the full list of books and webpages about 3D. Moreover, it target beginners, not university level 3D. With a focus on “graphics in fragment shaders“, as expected on a Shadertoy/GLSL blog 😉   ]

Basics:

More advanced:

Puzzling compilation errors in shadertoy

Typically, a cryptic error message appears on top of the source window, or even just a red frame around the tab name, but no command or line number is pointed.

The fact is that your source is included by ShaderToy Javascript into the true GLSL shader, with parts added before and after, and this is the real thing that is compiled: bad things can happen out of your section, even if caused by you. Also, this involves string manipulations that can also fail. In addition, the compiler in the driver can express weirdly when bad things occurs, such as exhausting of resources. This might even cause first a long freeze.

  • Nothing but the tab framed in red:

    • You probably forgot a } somewhere, and the error line doesn’t appears since it occurs… past your source, in the part that Shadertoy adds after. Indeed such {} mismatch can even sometime cause an infinite loop.

    • You played code golfing with #define mainImage: since the introduction of Common tab no error message will ever be displayed in this case, you have to guess. (But if you are in code golfing commando, you can read through the matrix so it’s not a problem 😀 )

  • Comments in #define :
    several special character like $ ‘ ” @ or UTF8 char like é ç  will cause
    Unknown Error: ERROR: 0:? : '' : syntax error

     
  • Array to big for memory… accounting the way the compiler possibly manage it ultra-badly. For instance if you do bilinear interpolation of array values, OpenGL compiler store data 4 times. Registers used for the assembly langage also count in the resource.
    Untitled.png

  • Untitled.pngBut if you are right at the limit, and possibly overwhelm the resource only because of the registers, then you can get even stranger messages with no hint at all but the hundred of error followed by whole compiled code result ! see example ( for OpenGl ).

  • Ultra long compilation time (because of you long loops and nested functions, all to be unrolled) can also result in awkward messages after some freeze time.

 

 

Embedding shadertoys in website

NB: In code snippets below, replace { by < . WordPress is unable to display code. 😦

Just as clickable image:

Copy-paste the shader URL, and build the one for the corresponding icon:

{ a href="https://www.shadertoy.com/view/SHADER_ID" >{ img src="https://www.shadertoy.com/media/shaders/SHADER_ID.jpg" /> My shader { /a >

My shader

Functional shader:

If you click the “share” button below a shader, you get the piece of code to copy-paste:

{ iframe src="https://www.shadertoy.com/embed/SHADER_ID?gui=true&t=10&paused=true&muted=false" width="640" height="360" frameborder="0" allowfullscreen="allowfullscreen" >{ /iframe >

( no example, for WordPress doesn’t accept iframes ).

… As webpage background:

To the code above in your html, just add a css file or entry telling to map the iframe as full-window background. See the 3 tabs html, css and result in  example.

Minimal version (would be better to specify a class or id name):

iframe {
 position: fixed;
 width: 100%;
 height: 100%;
 top: 0;
 right: 0;
 bottom: 0;
 left: 0;
 z-index; -1;
 pointer-events: none;
}

Fetching Shadertoy database via javascript:

See the API manual here.
You first need to register online your USER_KEY to be mentioned in your scripts. Then you can fetch queries to the Shadertoy data base to recover as JSON files lists of shader ids via search criterion, or shaders description and contents, to be used in you javascript. For instance, this is how I did my own shaders browser.

Note that only shaders saved as “public+API” can be managed. In particular, you can’t access unlisted private shaders.

Avoiding compiler crash (or endless compilation)

Sometime, shader compilation is long. Or ultra-long. Or freezing the browser. Or even crashing it after a timeout. Worse: this can happen for other peoples (often under another OS) on your shaders while it was okay for you, then your shader can be unlisted because of something you don’t experience (very frustrating).

Before suggesting solutions and what to care about, it’s important to understand…

What happens at early compilation

  • Functions do not really exist on GPU, because there is no stack to jump out then go back (that’s why recursivity is not allowed). This is just a writing aid, like macros. So all functions are first inlined.
  • Loops used to be fake as well. But even now that dynamic loops do exist, optimizers strongly prefer to keep unrolling them for performances: loop content is duplicated as many times as loop steps, with loop variable replaced by its successive const values. One problem is that optimizers don’t foresee that it might overwhelm resources (starting with final code length).
  • Branching vs divergence: when in a same warp (i.e. 32 pixels neighborhood) different conditional (“if“) branches are followed, SIMD parallelism force each thread to run them all (masking the result when not the right branch for a given thread), as shown in these demos.  For variable length loops (for, while) or early exist (conditional break in a loop) this can be even more involved.
    This firstly impact runtime performances, but branches obviously also lengthen the inlined code length (e.g. if big functions are called in branches).
    Also,  while dFdx, dFdy, fwidth might just give silly values or get unset/reset across diverging pixels, on some systems the function texture() try to do better to find the MIPmap LOD to use,  which may consist in evaluating the whole code 4 time to recover a 2×2 neighborhood on which evaluating the derivatives of texture coordinates.
  • The resulting functionless and (almost) loopless simplified but very longer GLSL code is then really compiled and optimized. But the code length and compile duration might overwhelm resources and fail, causing a crash.
  • Note that before compilation, Angle applies various code modifications to turn around some bugs occurring on some drivers/boards, then on Windows it transpiles GLSL to HLSL. And after, both shading languages are compiled into intermediate assembly language ARB, to be compiled and optimized again to get a true GPU executable. So in total there is a full stack of code rewrite and optimization.

Now, just consider the big figure: e.g. for a ray-marching code with a long stepping loop, containing branches (e.g. “if hit”) calling functions (to get the normal, the textures values, etc), that might themselves contain loops on function (e.g. for procedural texturing). Worse: the shading part launching shadow rays (or reflected/refracted rays) with a brand new marching loop (yes, it would be duplicated for each step of the main loop). In addition to the “map” function testing the whole scene for ray-intersection at every step, and this one is likely to also contain loops and further functions call.
The true code length before the true compilation is the huge combinatory of all this. You have no idea how long it could be. Well, indeed you have to.

What can we do ?

Think

  • Do you really need 1000000 steps ? sure ?
  • Do you really need to detail procedural texture (or shape) down to nanometer ? (think about where falls the pixel size limit).
  • Do you really need to compute the texture also for shadow evaluation ?
  • Can’t you first test a raw hit, then inspect the details once this step is reached ?
  • Can’t some part be done in as separate buffer (i.e., stored for the whole frame rather than evaluated for each pixel) ? BTW, does it really need to be re-evaluated at each time step ?
  • Can’t a repeated pattern be done implicitly with a simple mod/fract rather than with an explicit loop ?
  • Or can’t you find the only one (or few) items that can really meet the pixel ?

Within the unroll & inline logic

  • Deferred heavy  processing out of loops:
    Replace
        if (end_condition) { process; break; }
    with
        if (end_condition) { set_parameters; break; }
    then process the parameters after the loop.
    Typically: shading evaluation, shadows, reflected rays…
  • Deferred heavy  processing out of branches:
    Replace
        ...else if ( cond_N ) do_action(params);
    with
        ...else if ( cond_N ) set_parameters;
    then process the parameters after the loop.
  • Specialize functions, or use branches inside only if triggered by const params.
    Worst case would be an shape(P, kind, params)  implementing a whole bank of possible shapes called into a ray-marching loop: if kind is not const, the whole shape() source will be multi-duplicated.
  • Don’t call texture() in any divergence-prone area (“if” branch, variable-length or early breakable loop), at least if MIPmap is activated. Or use explicit LOD via textureLod() or textureGrad() .

Keep your critical judgement: the above advices are not always possible, and not always useful. Small loops, small processes don’t deserve special action, plus the GPU *is* powerful enough to deliver good performances on complicated code. Just, learn to recognize the coding patterns that make two “similarly complicated” shaders (by the number of lines or functions)  having totally different fate by how the compiler react. And avoid blindly following the dark path, the one nastily looking “as you would have done on CPU”.

Fighting the unroll & inline logic

You can also fight loop unrolling by making the compiler unable to know the length. E.g.:
    for (int i=0; i<N+min(0,iFrame); i++)

You can forbid optimizations [ which, exactly ? and is it really working ? ] by adding at the top of each code, or later but still outside functions definition :
    #pragma optimize(on)
    #pragma optimize(off)

Compilation can be a lot faster, but of course runtime perfs will be impacted.

 

More

See also: Programming in GLSL is not programming in C

Playable games in Shadertoy !

Yes, even if we are stuck in pixel shader and have no access to input data, many people managed to program games in Shadertoys ! From basic to huge (might not always compile on your machine 🙂 ), from pro to amateur, 108 are there :
( recommanded: HoverZoom plugin over previews )

Action games (platform, shooter, exploration)

Adventure & strategy games

Simulation, driving, sport games

Arcade & casual games

Puzzle games

Board & paper games

But so many classical games are still missing, comprising simple ones : will you dare to program one ?  🙂 ( remind to add the tag “game” ).

More:

Key shortcuts

Did you know that ShaderToy GUI as well as it’s CodeMirror shader editor had many key-shortcuts ? Seems that most people ignore it (despite GUI ones pop-up as tool-tips… if you try).

Reminders:

  • Cmd = “microsoft” or “apple” key.
  • Ctrl  might be overridden by your browser.
  • Mac: replace Ctrl by Cmd, or sometime by Alt or Ctrl-Alt.

GUI key shortcuts

  • “Down”, “Cmd Down”:  resetTime
  • “Alt Up”, “Cmd Up”:       pauseTime

  • right menu:                   save or copy image
  • “Alt+r” :                          video record

  • “Ctrl S”, “Cmd S”:         Save shader
  • “Alt Enter”, “Cmd-Enter”: Compile shader
  • “F5” :                              refresh page ( indeed it’s a browser shortcut )

Editor key shortcuts

Editor GUI

“Alt F”:                                     Editor in FullScreen
“Alt Right”, “Cmd Right”:     Go to Right Tab
“Alt Left”, “Cmd Left”:          Go to Left Tab
“Alt -“, “Cmd -“:                     decrease FontSize
“Alt =”,” Cmd =”:                   increase FontSize

Basic obvious keys

Left, Right, Up, Down,      Home (of line), End (of line),      PageUp, PageDown
Delete, Backspace, Shift-Backspace,       Tab,       Enter,       Insert

extra key shortcuts

  • basic emacs shortcut

  • “Ctrl-S”: “save”,

  • “Ctrl-F”: find,    “Ctrl-G”: find Next,    “Shift-Ctrl-G”: find Prev,
    “Shift-Ctrl-F”:  replace,    “Shift-Ctrl-R”: replace All,

  • “Ctrl-Z”: undo,    “Shift-Ctrl-Z”,”Ctrl-Y”: redo,

  • “Ctrl-A”: select All,    “Ctrl-U”: undo Selection,    “Shift-Ctrl-U”,”Alt-U”: redo Selection, “Esc”: single Selection ( ? )

  • “Ctrl-Home”,”Ctrl-Up”: Doc Start, ”   Ctrl-End”,”Ctrl-Down”: Doc End,
    “Ctrl-Left”: Word Left,    “Ctrl-Right”: Word Right,
    “Alt-Left”: Line Start,      “Alt-Right”: LineEnd,

  • “Ctrl-D”: delete Line,
    “Ctrl-Backspace”: del Word Before,    “Ctrl-Delete”: del Word After,

  • “Ctrl-[“: indent Less,    “Ctrl-]”: indent More,  ( NB: not working for me )
    “Shift-Tab”: indent Auto (and replace tabs by spaces)

 

2 more for mac only:

  • “Cmd-Backspace”: del Wrapped Line Left (?), “Cmd-Delete”: del Wrapped Line Right (?)