Case study: making a dot pattern loopless

Many beginners tend to program in Shadertoy as they would in C: by explicitly drawing every element. But a shader is called for every pixel, so drawing the full scene each time can be very costly; it is far more efficient to determine the single element (or the few elements) that may cover the current pixel, when feasible. This is often the case for repetitive patterns, e.g. a spiral of dots. Once the candidate element is determined, we can draw it (which might itself be costly, if it is not a simple dot as here).

[Image: spiral of dots]

Let's see how to get from one form of the algorithm to the other.
A Shadertoy user once proposed a “C-like” shader similar to this:

#define rot(a) mat2( cos(a),-sin(a),sin(a),cos(a) )

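void mainImage( out vec4 O, vec2 u ) {
    vec2 R = iResolution.xy,
         U = ( 2.*u - R ) / R.y,          // normalized coords: [-1,1] vertically
         p = vec2(.01,0);                 // center of the first dot

    float PI = 3.14159,
         phi = iTime * .01 + 0.1,         // or (1. + sqrt(5.))/2.,
           a = phi * 2.*PI,               // angle step between successive dots
           d = 1e9;                       // smallest distance found so far

    for( int i = 0; i < 1400; i++ ) {     // visits every dot, for every pixel
        d = min( d, length(U - p) - .001 );
        p *= rot(a);                              // rotate by a
        p = normalize(p) * ( length(p) + .0015 ); // move .0015 further from the center
    }

    O = vec4( smoothstep(3./iResolution.y, 0., d - .01) );
}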

p starts at vec2(.01,0); at each step it is rotated by a and its distance to the center is increased by .0015. The loop runs 1400 times for every pixel, which can be quite costly (especially in fullscreen), while at most one dot will cover the pixel. Can we determine which one?

 

First, let's make the current coordinates explicit, from the description above (NB: I personally prefer to loop on floats to avoid loads of casts):

p = ( .01+ .0015*i ) *CS(i*a);
with 
#define CS(a)  vec2(cos(a),sin(a))

Now, which i yields the dot closest to U ?

Let’s do the maths !

Writing p ~= U in polar coordinates yields  .01 + .0015*i ~= length(U)  and  i*a ~= atan(U.y,U.x)  (modulo 2.*PI).
The first gives i0 ~= ( length(U) - .01 ) / .0015 and the second gives i1 ~= atan(U.y,U.x) / a, or more generally i1' = i1 + k*2.*PI/a, where k is the (unknown) index of the spiral turn.
We can find k such that i0 ~= i1':  k ~= ( i0 - i1 ) / ( 2.*PI/a ).
k must be an integer, so we should consider either the floor or the ceil of this value. round would do for a small dot, but with the larger one here we need both values to cover the full dot (and for an even larger one we should visit a few more integers around).
Then, i = round( i1 + k * 2.*PI/a )

 

Which gives the loopless version of the shader (the 2-value loop on n just visits the floor and the ceil):

#define CS(a)  vec2(cos(a),sin(a))

void mainImage( out vec4 O, vec2 u )
{
    vec2 R = iResolution.xy, 
         U = ( 2.*u - R ) / R.y;

    float PI = 3.14159,
        phi = iTime*0.001 + 0.1, // or phi = (1. + sqrt(5.))/2.,
          a = phi * 2.*PI,
         i0 = ( length(U) - .01 ) /.0015,
         i1 = ( mod( atan(U.y,U.x) ,2.*PI) )/ a, // + k*2PI/a
          k = floor( (i0-i1) / (2.*PI/a) ), 
          i, d = 1e9;
    
    for (float n = 0.; n < 2.; n++) {             // two candidates: floor, then ceil
        i = round( i1 + k++ * 2.*PI/a );          // index of the candidate dot
        vec2 p = ( .01+ 0.0015*i ) *CS(i*a);      // its center
        d = min( d, length(U - p) - .001 );
    }
  
    O = vec4(smoothstep(3./iResolution.y, 0., d - .01));
}

Not only is the cost per pixel hugely decreased (from 1400 loop iterations to 2), but the cost is now also totally independent of the number of dots! Which is good news when designing a pattern.
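For larger dots, more than two candidate turns must be visited. The following is only a sketch of that generalization, not the original author's code: the helper name spiralDist, its parameter N (number of turns tested) and the dot radius r are assumptions introduced for illustration.

```glsl
#define CS(a)  vec2(cos(a),sin(a))

// Hypothetical helper: distance from U to the nearest dot center,
// testing N consecutive candidate turns around the estimated one.
float spiralDist( vec2 U, float a, int N ) {
    const float PI = 3.14159;
    float T  = 2.*PI/a,                               // index increment per spiral turn
          i0 = ( length(U) - .01 ) / .0015,           // index estimated from the radius
          i1 = mod( atan(U.y,U.x), 2.*PI ) / a,       // index estimated from the angle
          k  = floor( (i0-i1)/T ) - float( (N-1)/2 ), // first turn to test
          d  = 1e9;
    for (int n = 0; n < N; n++) {
        float i = round( i1 + (k + float(n)) * T );   // candidate dot index
        vec2  p = ( .01 + .0015*i ) * CS(i*a);        // its center
        d = min( d, length(U - p) );
    }
    return d;
}

void mainImage( out vec4 O, vec2 u ) {
    vec2 R = iResolution.xy,
         U = ( 2.*u - R ) / R.y;
    float a = ( iTime*.001 + .1 ) * 2.*3.14159,
          r = .03;                                    // assumed (larger) dot radius
    O = vec4( smoothstep( 1.5/R.y, -1.5/R.y, spiralDist(U,a,4) - r ) );
}
```

With N = 2 this tests the same two turns as the shader above; the cost grows with N but remains independent of the number of dots.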

See more ( including simpler ) loopless shaders.

 

Note that when designing a spiral shader “the procedural way” from scratch, we could follow a totally different strategy: converting U to spiral coordinates, then splitting them into chunks using fract. The non-trivial part is then drawing the dot, since it must not be done in spiral coordinates or it would appear deformed. So we must convert the center and its neighborhood back to screen coordinates. See example.


Programming tricks in Shadertoy / GLSL

Many people start writing GLSL shaders as they would write C/C++ programs, not accounting for the fact that shaders are massively parallel programs and that the GLSL language offers many useful goodies such as vector ops and more. Or, not knowing how to deal with some basic issues, many people copy-paste very ugly or ill-suited code designs, or reinvent the wheel (generally ending up with worse solutions than the evolutionarily polished ones 😉 ).

Here we address some basic patterns/tasks. For even more basic aspects related to the good use of the GLSL language and parallelism, please first read usual-tricks-in-shadertoy/GLSL . And for non-intuitive issues causing huge costs (or even crashes) at runtime or compilation time, read avoiding-compiler-crash-or-endless-compilation .

Normalizing coordinates

The window is a rectangle that can have various sizes (icon, sub-window in a browser tab, fullscreen, on different computers including tablets and smartphones) and various aspect ratios (fullscreen vs sub-window, or different hardware, including long smartphone screens in landscape or portrait mode). So we usually start by normalizing coordinates. For some reason, many people use a very ugly pattern: first normalizing and distorting the window coordinates to [0,1]x[0,1] (minus 0.5 if centered), then applying an aspect ratio to undistort. Basic clean solutions are:

vec2 R = iResolution.xy,
  // U = fragCoord / R.y;                     // [0,1] vertically
     U = ( 2.*fragCoord - R ) / R.y;          // [-1,1] vertically
  // U = ( fragCoord - .5*R ) / R.y;          // [-1/2,1/2] vertically
  // U = ( 2.*fragCoord - R ) / min(R.x,R.y); // [-1,1] along the shortest side
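As a minimal usage sketch (the disc and its radius are just an illustrative choice), the centered normalization can be checked by drawing a shape that must stay round whatever the window shape:

```glsl
void mainImage( out vec4 fragColor, in vec2 fragCoord ) {
    vec2 R = iResolution.xy,
         U = ( 2.*fragCoord - R ) / R.y;     // [-1,1] vertically, centered, undistorted
    float d = length(U) - .5;                // signed distance to a disc of radius .5
    fragColor = vec4( d < 0. ? 1. : 0. );    // white inside, black outside (not antialiased)
}
```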

Displaying textures and videos

Note that if you want to map an image onto the full window, thus with distortions, you do need to use fragCoord/R.
But if you want to map an undistorted rectangular image (typically, a video), things are a little more involved: see here. Since the typical video ratio happens to be not too far from the window ratio (on regular screens), most people blindly rely on the “map to full window” above, but on smartphones it then looks totally distorted.
( Note that texelFetch avoids texture distortion in a simpler way, but then you no longer benefit from the hardware interpolation, rescaling and wrapping features. )
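One possible way to do the undistorted mapping is sketched below. It is an assumption about how this can be done, not necessarily the linked solution: it fits the whole image of iChannel0 inside the window (letterboxing), using the standard Shadertoy uniform iChannelResolution for the image size.

```glsl
void mainImage( out vec4 O, vec2 u ) {
    vec2 R = iResolution.xy,
         T = iChannelResolution[0].xy,        // image/video size in pixels
         U = u / R;                           // [0,1]x[0,1] over the window (distorted)
    float s = (R.x/R.y) / (T.x/T.y);          // window aspect ratio / image aspect ratio
    // stretch the coordinates along the axis where the window is "too long"
    U = .5 + (U - .5) * ( s > 1. ? vec2(s,1) : vec2(1,1./s) );
    O = texture( iChannel0, U );
    if ( U != clamp(U,0.,1.) ) O = vec4(0);   // black bars outside the image
}
```

The texture fetch is done before the test so that it does not sit inside a branch.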

Managing colors

Don’t forget sRGB / gamma !

Don’t forget that the channel intensities of image textures and videos are encoded in sRGB, and that the final shader color must be re-encoded by you in sRGB, while most synthesis and processing done in shaders is assumed to happen in linear (“flat”) space.
This is especially important for antialiasing, since returning 0.5 is really not perceived as mid-grey (test here), for color interpolation (see counter-example, and another), and for luminance computation on texture images and videos (NB: this encoding of intensity was historically chosen to account for the non-linear intensity distortion of CRT screens, then kept as a cheap perception-based compression, then as a normalization so that colors are understood the same way across multiple input and output devices).
Fortunately sRGB is close to a gamma 2.2 conversion: do fragColor = pow(col, vec4(1./2.2) ) at the very end of your program, and col = pow(tex,vec4(2.2)) after reading a texture image to be processed or combined (this does not apply to noise textures). Note that just doing fragColor = sqrt(col), resp. col = tex*tex, is a pretty good approximation.
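A minimal sketch of the full decode / process / encode round trip, assuming an image is bound to iChannel0 (the tint and the mix factor are arbitrary illustrative values):

```glsl
void mainImage( out vec4 fragColor, in vec2 fragCoord ) {
    vec2 U   = fragCoord / iResolution.xy;
    vec3 tex = texture( iChannel0, U ).rgb;          // sRGB-encoded values
    vec3 col = pow( tex, vec3(2.2) );                // decode to linear space
    col = mix( col, vec3(1,0,0), .5 );               // processing done in linear space
    fragColor = vec4( pow( col, vec3(1./2.2) ), 1 ); // re-encode at the very end
}
```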

Hue

Many people rely on full costly RGB2HSV conversion just to get a hue value.
This can be made a lot simpler using (see ref):

#define hue(v) ( .6 + .6 * cos( 2.*PI*(v) + vec4(0,-2.*PI/3.,2.*PI/3.,0) ) )  // looks better with a bit of saturation
// code golfed version:
// #define hue(v) ( .6 + .6 * cos( 6.3*(v) + vec4(0,23,21,0) ) )

For full RGB2HSV/HSL and back, see classical and iq references.
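A small usage sketch of the hue macro: note that it relies on PI being defined, which is assumed here to be done with a #define.

```glsl
#define PI 3.14159265
#define hue(v) ( .6 + .6 * cos( 2.*PI*(v) + vec4(0,-2.*PI/3.,2.*PI/3.,0) ) )

void mainImage( out vec4 O, vec2 u ) {
    O = hue( u.x / iResolution.x );   // one full hue cycle across the window width
}
```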

Drawing thick bars

step( x0, x ) transitions from 0 to 1 at x0.
smoothstep( .0 , .01, x-x0 ) does the same with smooth transition.
To make a thick bar, rather than multiplying a 0-to-1 with a 1-to-0 transition, just do:

step(r/2., abs(x-x0) )
smoothstep(.0, .01 , abs(x-x0)-r/2. )  // smooth version

NB: above, 1 is outside. If you want 1 inside use 1.- above, or:

step( abs(x-x0), r/2. )
smoothstep( .01, .0,  abs(x-x0)-r/2. )  // smooth version

Antialiasing

Aliasing in space or in time is ugly and makes your shader look very newbie 😀 . Oversampling inside each pixel is very costly and gives little improvement unless you use hundreds of samples per pixel. For algorithms like ray-tracing you have few alternatives (apart from complex techniques like the screen-space, time-based denoising used in game programming). But for simple 2D shaders it’s often easy to get very good antialiasing almost for free, by using 1-pixel-wide smooth transitions at all boundaries. More generally, the idea is to return a floating point “normalized distance” rather than a binary “inside or outside”.
Typically, instead of if (x>x0) v=1.; else v=0. ( or  v = x>x0 ? 1. : 0. ), which are equivalent to v = step( x0, x ), just use v = smoothstep( x0-pix, x0+pix, x ) where pix is the pixel width measured in your coordinates (e.g. pix = 2./R.y if the vertical coordinate is normalized to [-1,1]). ( Or simply clamp( .5 + (x-x0)/(2.*pix), 0., 1. ). Note that smoothstep eats part of the transition interval, so you need to compensate by using at least pix = 1.5*pixelWidth. ) cf Example code.

// antialiased 2D ring or 1D bar of radius r around v0. (2D disc: v0 = 0 )
#define S(v,v0,r)  smoothstep( 1.5/R.y, -1.5/R.y, length(v-(v0)) - (r) )

When you see magic numbers like 0.01 in smoothsteps, tell the code author that it won’t scale (aliased as an icon, blurry in fullscreen) and that they should just use the true pixel width instead. Note that for 1-pixel-thin features, the result will look aliased if you forget the final sRGB conversion at the end of the shader.
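As a usage sketch of the S macro (positions and sizes are arbitrary illustrative values), here is an antialiased disc and an antialiased vertical bar, followed by the final gamma encoding. The macro expects a vec2 named R to be in scope.

```glsl
// antialiased 2D disc/ring or 1D bar of radius r around v0
#define S(v,v0,r)  smoothstep( 1.5/R.y, -1.5/R.y, length(v-(v0)) - (r) )

void mainImage( out vec4 O, vec2 u ) {
    vec2 R = iResolution.xy,
         U = ( 2.*u - R ) / R.y;                 // pixel width is 2./R.y in these units
    float disc = S( U,   vec2(-.5,0), .3  ),     // disc of radius .3
          bar  = S( U.x, .5,          .05 );     // vertical bar of half-width .05
    O = vec4( max(disc,bar) );
    O = pow( O, vec4(1./2.2) );                  // final sRGB (gamma) encoding
}
```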

Nastier functions are floor, fract and mod, since there is no simple way(*) to smooth their discontinuity as we did for step. Still, they are often used with some final thresholding, which just has to not fall right on the discontinuity: e.g., fract(x+.5)-.5 no longer has a discontinuity at x = 0 (or at any integer x).
(*) :  E.g. 1: see smoothfloor/smoothfract . E.g. 2: you might sometimes use clamp( sin(Pi*x)/Pi / pix, 0.,1. ) instead of int(x)%2 .

If the parameter value is not a simple scaling of the coordinates, it can be difficult to know the pixel size in these units. But GLSL hardware derivatives can do it for you: pix = fwidth(x), at least if x is not oscillating crazily faster than the pixel rate. But since it is a derivative, any discontinuity will cause an issue, while you were only interested in the coarse gradient. If x contains discontinuities, as in x = fract(x’) or x = mod(x’,m), then simply use x’ instead of x in fwidth, since it has just the same gradient without the discontinuity. cf Example code.
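A minimal sketch of the fwidth trick, under the assumption of an arbitrary non-linear parameter x (here 8 times the squared distance to the center): thin rings are drawn at the integer values of x, with the pixel size taken from the continuous x rather than from its fract.

```glsl
void mainImage( out vec4 O, vec2 u ) {
    vec2 R = iResolution.xy,
         U = ( 2.*u - R ) / R.y;
    float x   = 8. * dot(U,U),                // continuous, non-linear parameter
          pix = max( fwidth(x), 1e-6 ),       // pixel size in x units (taken on x, not on fract(x))
          d   = abs( fract(x+.5) - .5 );      // distance to the nearest integer value of x
    O = vec4( clamp( 1.5 - d/pix, 0., 1. ) ); // about 1 pixel on each side, with a 1-pixel fade
    O = sqrt(O);                              // cheap approximation of sRGB encoding
}
```

The rings keep the same apparent thickness although their spacing in U varies.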

Drawing lines

People solved this long ago, so you don’t need to reinvent the wheel 😉 .
The principle is to return the distance to a segment, then to use the “antialiased thick bar” trick above (cf #define S). Note that for a complex drawing you can first compute the min distance to all features then apply the antialiased-bar (and optional coloring) at the very end. You might even use dot(,) rather than length() so as to compute sqrt only once.

float line(vec2 p, vec2 a,vec2 b) { // --- distance to segment with caps
    p -= a, b -= a;
    float h = clamp(dot(p, b) / dot(b, b), 0., 1.);// proj coord on line
    return length(p - b * h);                      // dist to segment
    // We might directly return smoothstep( 3./R.y, 0., dist),
    //     but it's more efficient to factor this out for all lines.
    // We can even return dot(,) and take sqrt at the end of polyline:
    // p -= b*h; return dot(p,p);
}

Depending on the use case, you might want the distance to an isolated segment (including caps at ends) or just to the capless segment.  cf Example code.
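As a usage sketch (the polyline points and the half-thickness w are arbitrary illustrative values), the distances to several segments are combined with min, and the antialiased thresholding is applied only once at the end:

```glsl
float line( vec2 p, vec2 a, vec2 b ) {             // distance to segment with caps
    p -= a, b -= a;
    float h = clamp( dot(p,b) / dot(b,b), 0., 1. ); // projection coordinate on the line
    return length( p - b*h );
}

void mainImage( out vec4 O, vec2 u ) {
    vec2 R = iResolution.xy,
         U = ( 2.*u - R ) / R.y,
         A = vec2(-.8,-.4), B = vec2(-.2,.5), C = vec2(.3,-.3), D = vec2(.8,.4);
    float d = min( line(U,A,B), min( line(U,B,C), line(U,C,D) ) ), // min distance to the polyline
          w = .01;                                                 // half thickness
    O = vec4( smoothstep( 1.5/R.y, -1.5/R.y, d - w ) );            // single antialiased threshold
}
```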

Blending / compositing

When you splat semi-transparent objects, or as soon as you use antialiasing, rather than setting or adding colors you must composite these semi-transparent layers, or you will suffer artifacts.
Below, C is the pure object color in RGB and its opacity in A; O is the current and final color.

Drawing assumed to go from front to back (i.e. closest first):
(which allows stopping as soon as opacity reaches 100%, or some threshold like 99.5%)

O += (1.-O.a) * vec4( C.rgb, 1 ) *C.a;

Drawing assumed to go from back to front (i.e. closest last):

O = mix( O, vec4( C.rgb, 1), C.a );
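A sketch of front-to-back compositing with early exit. The function layer() is a hypothetical stand-in for whatever draws one object: here it returns a colored, antialiased, 70% opaque disc per layer index.

```glsl
// Hypothetical layer: color in rgb, opacity in a, for layer number i at position U.
vec4 layer( vec2 U, float i ) {
    float d = length( U - .3*vec2(cos(i),sin(i)) ) - .4;   // disc of radius .4
    float a = .7 * smoothstep( 1.5/iResolution.y, -1.5/iResolution.y, d );
    return vec4( .5 + .5*cos( i + vec3(0,2,4) ), a );
}

void mainImage( out vec4 O, vec2 u ) {
    vec2 R = iResolution.xy,
         U = ( 2.*u - R ) / R.y;
    O = vec4(0);
    for (float i = 0.; i < 5.; i++) {                      // closest layer first
        vec4 C = layer(U,i);
        O += (1.-O.a) * vec4( C.rgb, 1 ) * C.a;            // front-to-back compositing
        if ( O.a > .995 ) break;                           // nearly opaque: stop early
    }
}
```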

Vector maths

First, a reminder that GLSL directly knows about vectors, matrices, vector geometry operations and blending operations; even most ordinary math functions work on vectors: see here. Besides geometry, vectors can also be used for RGBA colors, for complex numbers, etc. Each time you want to do the same thing on x,y,z (for instance), use them! The performance won’t be much better, but the code will be a lot more readable, which helps reasoning, bug chasing and code evolution.

In addition it’s often convenient to add some more vector constructors like:

#define CS(a)        vec2( cos(a), sin(a) )
#define cart2pol(U)  vec2( length(U), atan((U).y,(U).x) )
#define pol2cart(U) ( (U).x * CS( (U).y ) )

Some operations on complexes: ( vec2 Z  means  Z.x + i Z.y  )

// add, sub;  mul or div by float : just use +, -, *, /
#define cmod(Z)     length(Z)
#define carg(Z)     atan( (Z).y, (Z).x )
#define cmul(A,B) ( mat2( A, -(A).y, (A).x ) * (B) )  // by deMoivre formula
#define cinv(Z)   ( vec2( (Z).x, -(Z).y ) / dot(Z,Z) ) 
#define cdiv(A,B)   cmul( A, cinv(B) )
#define cpow(Z,v)   pol2cart( vec2( pow(cmod(Z),v) , (v) * carg(Z) ) )
#define cexp(Z)     pol2cart( vec2( exp((Z).x), (Z).y ) )
#define clog(Z)     vec2( log(cmod(Z)), carg(Z) )
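As a usage sketch of the complex macros, here is the classic z = z*z + c iteration written with cmul (the framing constants and the iteration count of 64 are arbitrary illustrative values):

```glsl
#define cmul(A,B) ( mat2( A, -(A).y, (A).x ) * (B) )

void mainImage( out vec4 O, vec2 u ) {
    vec2 R = iResolution.xy,
         c = 1.5 * ( 2.*u - R ) / R.y - vec2(.5,0),  // point of the complex plane
         z = vec2(0);
    float n = 0.;
    for (int i = 0; i < 64; i++) {
        if ( dot(z,z) > 4. ) break;                  // |z| > 2: the series diverges
        z = cmul(z,z) + c;                           // complex square, then add c
        n++;
    }
    O = vec4( n / 64. );                             // grey level = iterations before escape
}
```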

Rotations:
the simplest is to just return the 2D matrix (even for 3D axial rotations):

#define rot(a)      mat2( cos(a), -sin(a), sin(a), cos(a) )
// use cases:
vec2 v = ... ; v *= rot(a); // attention: left-multiply reverses angle 
vec3 p = ... ; p.xy *= rot(a.z); p.yz*= rot(a.x); ...

Note that the optimizer recognizes identical formulas and won’t evaluate sin and cos twice.

Just for fun, the code golfed version 🙂 :  mat2( cos( a + vec4(0,33,11,0)) )

Computing random values

    • Sometimes we need the equivalent of drand(), i.e. a linear congruential series, which can easily be reimplemented explicitly. cf wikipedia.
    • But most of the time what we really need is a hash value, i.e. a different random value for each pixel, or grid cell, or 3D coord, or 2D+time, etc. And this hash might be a scalar or a vector.
      • For simple use cases, you might rely on the shadertoy 2D or 3D noise textures in grey or RGBA, see special-shadertoy-features . (Take care not to interpolate, and to hit texel centers, if you really want a hash, possibly using the nearest flag or texelFetch.) Still, the precision is limited (8 bit textures, 64 or 256 resolution).
      • Early integer-less shading languages popularized old-school cheap float-based hashes relying on the chaotic least-significant bits after a non-linear operation. (The magic values are important and come from the dawn of the computer science age.)
        #define hash21(p) fract(sin(dot(p, vec2(12.9898, 78.233))) * 43758.5453)
        #define hash33(p) fract(sin( (p) * mat3( 127.1,311.7,74.7 , 269.5,183.3,246.1 , 113.5,271.9,124.6) ) *43758.5453123)
        ...

        see many variants here. A problem is that precision is hardware (and compiler) dependent, so random values can vary between users. Plus, p must be neither too small nor too big: on poor 16 or 24 bit hardware the random value might just always be zero.

      • Since webGL2 we can now rely on robust, precise (but a bit costlier) integer-based hashes: see reference code , especially the GlibC or NRC refs in Integer Hash – II.
        They usually eat an unsigned, so take care when casting from floats around zero (since [u]int(-0.5) = [u]int(0.5) ).
      • Attention: the variant introduced by Perlin, based on permutation tables, is very inefficient in shaders since arrays and texture fetches are ultra-costly, and the cascaded dependent accesses of the 3D-to-1D wrap are not pipeline-friendly either.
    • You might not want a hash, but a continuous random noise function. Depending on your needs,
      • you might then be happy with a simple value noise (e.g. simple noise texture with interpolation, or analytic using ref codes),
      • splined value noise,
      • or more costly gradient noise (see ref codes),
      • up to full Perlin noise (gradient + spline interpolation + fractal. NB: Perlin published 3 different algorithms over time: Classical, Improved, Simplex).
        Attention: many shaders or blogs named “Perlin noise” in fact just fake it with a simple gradient or even value noise, with random rotations through scales to mask artifacts. This might be ok for you, but don’t mistake it for what it is not. Conversely, it’s not a good idea for performance to use the permutation tables for the hashes.
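To illustrate the integer-based route, here is a sketch of a per-cell hash. The mixing function and its constants are an assumption (a generic multiply-xorshift mixer), not the reference code mentioned above, and the names mix32 and hash21i are made up for this example. The floor before the cast avoids the collision around zero.

```glsl
// Assumed integer mixer (multiply-xorshift style); constants are illustrative.
uint mix32( uint x ) {
    x ^= x >> 16;  x *= 0x7feb352dU;
    x ^= x >> 15;  x *= 0x846ca68bU;
    x ^= x >> 16;
    return x;
}

// Hypothetical 2D-to-1D hash in [0,1], constant within each unit cell.
float hash21i( vec2 p ) {
    ivec2 c = ivec2( floor(p) );                   // floor first: -0.5 and 0.5 fall in different cells
    uint  h = mix32( uint(c.x) + mix32( uint(c.y) ) ); // int-to-uint keeps the bit pattern
    return float(h) / 4294967295.;
}

void mainImage( out vec4 O, vec2 u ) {
    vec2 R = iResolution.xy,
         U = 8. * ( 2.*u - R ) / R.y;              // grid of cells, including negative coordinates
    O = vec4( hash21i(U) );
}
```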


Profiling, timers, compiled code

Optimizing GPU code, as for parallel programming in general (but worse), is difficult and unintuitive. Several tools can help in this task.
Alas, none work in webGL, only on desktop. But you can easily port your webGL GLSL to desktop, and even more easily using some glue tools like libShadertoy (see page Applications compatible with Shadertoy ).

Profiling tools:

NVIDIA Nsight (with different features depending on whether you want the Windows Visual Studio, Linux Eclipse, or standalone version).

Getting timers:

  • C-side: timer Query tells you the precise real duration (in ms) of rendering of your drawcall.
  • GLSL-side: Shader clock lets you measure the current SM clock (in ticks) any time you want in your shader.
    It’s often useful to check the current SM ID, warp ID, and corresponding ranges.

Getting compiled code:

two methods:

  • Compile the shader separately, using the cgc compiler or the shaderPlayground app (windows).
    Problem: you must choose the target language (openGL, openGL-ES, webGL, …) and language version. Hard to be sure which will be used in your app, especially for webGL in a browser.
  • Getting the assembly from your app: GetProgramBinary()

Now, it’s always interesting to see the generated code, but it’s often not easy to deduce the right things from it in terms of how to optimize (apart from very basic arithmetic or algorithmic aspects). For instance, key aspects for performance are the number of registers used (because it constrains how many alternate threads can be piled up to mask wait-states), divergence (conditionally executed parts that will differ between pixels of a warp), and the consequences of wait-states (waiting for maths or textures or, worse, for a dependent chain, which the optimizer can improve by shuffling instructions): all things that are not easily read in the code, or that the optimizer could improve by producing code that looks strange at first glance. Also, optimizers tend to produce long code by unrolling everything in order to resolve more operations at compilation time, which can yield apparently ugly, complex code that is nonetheless faster.

In the scope of webGL, remember that upstream of all this, on Windows the code is by default transpiled to HLSL instead of being compiled as GLSL, using a different version of D3D depending on whether you run Firefox or Chrome. And the Angle layer transforms your GLSL code to try to fix errors and instabilities of webGL, but different browsers activate different Angle patches, or don’t use it at all.
At the other end, the “assembly code” is in fact an intermediate code, which is compiled further in the GPU.
That’s why profiling tools and timers are probably more useful for optimization 😉

Readings (shaders, maths, 3D)

People often ask where to start, and which readings would help them start or progress.
Some just want to learn shaders, others want to get more fluent in the underlying maths, some are specifically interested in 3D rendering.

Here is a sample of online resources, free books and paid books that I have often seen mentioned as helpful:
[ Disclaimer:   the purpose of this page is NOT to catalog the full list of books and webpages about 3D. Moreover, it targets beginners, not university-level 3D, with a focus on “graphics in fragment shaders”, as expected on a Shadertoy/GLSL blog 😉   ]

Basics:

More advanced:

Puzzling compilation errors in shadertoy

Typically, a cryptic error message appears on top of the source window, or even just a red frame around the tab name, but no instruction or line number is pointed at.

The fact is that your source is included by the Shadertoy Javascript into the true GLSL shader, with parts added before and after, and this is what really gets compiled: bad things can happen outside your section, even if caused by you. This also involves string manipulations, which can fail too. In addition, the compiler in the driver can express itself weirdly when bad things occur, such as resource exhaustion. This might even first cause a long freeze.

  • Nothing but the tab framed in red:

    • You probably forgot a } somewhere, and the error line doesn’t appear since the error occurs… past your source, in the part that Shadertoy adds after. Indeed, such a {} mismatch can sometimes even cause an infinite loop.

    • You played code golfing with #define mainImage: since the introduction of the Common tab, no error message will ever be displayed in this case; you have to guess. (But if you are in code golfing commando, you can read through the matrix so it’s not a problem 😀 )

  • Comments in #define :
    several special characters like $ ‘ ” @ or UTF8 characters like é ç  will cause
    Unknown Error: ERROR: 0:? : '' : syntax error

     
  • Array too big for memory… taking into account that the compiler may manage it ultra-badly. For instance, if you do bilinear interpolation of array values, the OpenGL compiler stores the data 4 times. Registers used by the assembly language also count in the resources.

  • But if you are right at the limit, and possibly exceed the resources only because of the registers, then you can get even stranger messages with no hint at all except hundreds of errors followed by the whole compiled code! See example ( for OpenGL ).

  • Ultra-long compilation time (because of your long loops and nested functions, all to be unrolled) can also result in awkward messages after some freeze time.

 

 

Embedding shadertoys in website

NB: In code snippets below, replace { by < . WordPress is unable to display code. 😦

Just as clickable image:

Copy-paste the shader URL, and build the one for the corresponding icon:

{ a href="https://www.shadertoy.com/view/SHADER_ID" >{ img src="https://www.shadertoy.com/media/shaders/SHADER_ID.jpg" /> My shader { /a >

My shader

Functional shader:

If you click the “share” button below a shader, you get the piece of code to copy-paste:

{ iframe src="https://www.shadertoy.com/embed/SHADER_ID?gui=true&t=10&paused=true&muted=false" width="640" height="360" frameborder="0" allowfullscreen="allowfullscreen" >{ /iframe >

( no example, for WordPress doesn’t accept iframes ).

… As webpage background:

In addition to the code above in your html, just add a css file or entry that maps the iframe as a full-window background. See the 3 tabs html, css and result in  example.

Minimal version (would be better to specify a class or id name):

iframe {
 position: fixed;
 width: 100%;
 height: 100%;
 top: 0;
 right: 0;
 bottom: 0;
 left: 0;
 z-index: -1;
 pointer-events: none;
}

Fetching Shadertoy database via javascript:

See the API manual here.
You first need to register your USER_KEY online, to be mentioned in your scripts. Then you can send queries to the Shadertoy database to retrieve, as JSON, lists of shader ids matching search criteria, or shader descriptions and contents, to be used in your javascript. For instance, this is how I did my own shader browser.

Note that only shaders saved as “public+API” can be accessed. In particular, you can’t access unlisted or private shaders.

Avoiding compiler crash (or endless compilation)

Sometimes, shader compilation is long. Or ultra-long. Or it freezes the browser. Or even crashes it after a timeout. Worse: this can happen to other people (often under another OS) with your shaders while it was okay for you; your shader can then be unlisted because of something you don’t experience (very frustrating).

Before suggesting solutions and what to care about, it’s important to understand…

What happens at early compilation

  • Functions do not really exist on the GPU, because there is no stack to jump out and then come back (that’s why recursion is not allowed). They are just a writing aid, like macros. So all functions are first inlined.
  • Loops used to be fake as well. But even now that dynamic loops do exist, optimizers strongly prefer to keep unrolling them for performance: the loop content is duplicated as many times as there are loop steps, with the loop variable replaced by its successive const values. One problem is that optimizers don’t foresee that this might overwhelm resources (starting with the final code length).
  • Branching vs divergence: when different conditional (“if”) branches are followed within the same warp (i.e. a 32-pixel neighborhood), SIMD parallelism forces each thread to run them all (masking the result when it is not the right branch for a given thread), as shown in these demos.  For variable-length loops (for, while) or early exits (conditional break in a loop) this can be even more involved.
    This firstly impacts runtime performance, but branches obviously also lengthen the inlined code (e.g. if big functions are called in branches).
    Also, while dFdx, dFdy, fwidth might just give silly values or get unset/reset across diverging pixels, on some systems the function texture() tries to do better to find the MIPmap LOD to use, which may consist in evaluating the whole code 4 times to recover a 2×2 neighborhood on which to evaluate the derivatives of the texture coordinates.
  • The resulting functionless and (almost) loopless GLSL code, simplified but much longer, is then really compiled and optimized. But the code length and compile duration might overwhelm resources and fail, causing a crash.
  • Note that before compilation, Angle applies various code modifications to work around some bugs occurring on some drivers/boards, then on Windows it transpiles GLSL to HLSL. Afterwards, both shading languages are compiled into an intermediate assembly language (ARB), to be compiled and optimized again to get a true GPU executable. So in total there is a full stack of code rewrites and optimizations.

Now, just consider the big picture: e.g. a ray-marching code with a long stepping loop, containing branches (e.g. “if hit”) calling functions (to get the normal, the texture values, etc), which might themselves contain loops over functions (e.g. for procedural texturing). Worse: the shading part may launch shadow rays (or reflected/refracted rays) with a brand new marching loop (yes, it would be duplicated for each step of the main loop). All this in addition to the “map” function testing the whole scene for ray intersection at every step, which is itself likely to contain loops and further function calls.
The true code length before the true compilation is the huge combinatorial product of all this. You have no idea how long it could be. Well, actually, you have to.

What can we do ?

Think

  • Do you really need 1000000 steps ? sure ?
  • Do you really need to detail procedural texture (or shape) down to nanometer ? (think about where falls the pixel size limit).
  • Do you really need to compute the texture also for shadow evaluation ?
  • Can’t you first test a raw hit, then inspect the details once this step is reached ?
  • Can’t some parts be done in a separate buffer (i.e., stored for the whole frame rather than evaluated for each pixel) ? BTW, does it really need to be re-evaluated at each time step ?
  • Can’t a repeated pattern be done implicitly with a simple mod/fract rather than with an explicit loop ?
  • Or can’t you find the only one (or few) items that can really meet the pixel ?
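As an illustration of the implicit repetition mentioned in the list above, here is a minimal sketch of a grid of antialiased dots drawn without any loop (grid density and dot radius are arbitrary illustrative values): fract folds every cell onto the same local coordinates, so a single dot is tested per pixel.

```glsl
void mainImage( out vec4 O, vec2 u ) {
    vec2 R = iResolution.xy,
         U = 5. * ( 2.*u - R ) / R.y,        // 10 cells vertically
         F = fract(U) - .5;                  // local coordinates in the cell, in [-.5,.5]
    float pix = 10./R.y;                     // pixel width in U units
    O = vec4( smoothstep( .75*pix, -.75*pix, length(F) - .3 ) ); // one dot per cell
}
```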

Within the unroll & inline logic

  • Defer heavy processing out of loops:
    Replace
        if (end_condition) { process; break; }
    with
        if (end_condition) { set_parameters; break; }
    then process the parameters after the loop.
    Typically: shading evaluation, shadows, reflected rays…
  • Defer heavy processing out of branches:
    Replace
        ...else if ( cond_N ) do_action(params);
    with
        ...else if ( cond_N ) set_parameters;
    then process the parameters after the branches.
  • Specialize functions, or use branches inside them only if triggered by const params.
    The worst case would be a shape(P, kind, params) function implementing a whole bank of possible shapes and called in a ray-marching loop: if kind is not const, the whole shape() source will be duplicated many times.
  • Don’t call texture() in any divergence-prone area (“if” branch, variable-length or early breakable loop), at least if MIPmap is activated. Or use explicit LOD via textureLod() or textureGrad() .
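The “deferred processing out of loops” advice from the list above can be sketched as follows. The scene (a unit sphere), the camera and the shading are hypothetical placeholders: the point is only that the marching loop records the hit, and the heavier shade() is called once, after the loop.

```glsl
// Hypothetical scene: a unit sphere at the origin.
float map( vec3 p ) { return length(p) - 1.; }

// Assumed "heavy" part: normal estimation + diffuse lighting.
vec3 shade( vec3 p ) {
    vec2 e = vec2(.001,0);
    vec3 n = normalize( vec3( map(p+e.xyy) - map(p-e.xyy),
                              map(p+e.yxy) - map(p-e.yxy),
                              map(p+e.yyx) - map(p-e.yyx) ) );
    return vec3( .1 + .9 * max( 0., dot( n, normalize(vec3(1,1,-1)) ) ) );
}

void mainImage( out vec4 O, vec2 u ) {
    vec2 R = iResolution.xy;
    vec3 D = normalize( vec3( ( 2.*u - R ) / R.y, 2. ) ),  // ray direction
         p = vec3(0,0,-3);                                 // camera position
    bool hit = false;
    for (int i = 0; i < 100; i++) {
        float d = map(p);
        if ( d < .001 ) { hit = true; break; }             // only record the hit
        p += d * D;                                        // sphere-tracing step
    }
    O = vec4( hit ? shade(p) : vec3(0), 1 );               // heavy part evaluated once, after the loop
}
```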

Keep your critical judgement: the above advice is not always applicable, and not always useful. Small loops and small processes don’t deserve special action, plus the GPU *is* powerful enough to deliver good performance on complicated code. Just learn to recognize the coding patterns that give two “similarly complicated” shaders (by number of lines or functions) totally different fates depending on how the compiler reacts. And avoid blindly following the dark path, the one that nastily looks “as you would have done on CPU”.

Fighting the unroll & inline logic

You can also fight loop unrolling by making the compiler unable to know the length. E.g.:
    for (int i=0; i<N+min(0,iFrame); i++)

You can forbid optimizations [ which, exactly ? and is it really working ? ] by adding at the top of each code, or later but still outside functions definition :
    #pragma optimize(on)
    #pragma optimize(off)

Compilation can be a lot faster, but of course runtime perfs will be impacted.

 

More

See also: