About

Shadertoy is one of the online tools letting you play interactively with GLSL ES shaders (and browse others’ shaders). But soon, several questions will arise:

  • How to use Shadertoy features (are you sure you really know them? 🙂 ) → see here.
  • GLSL is not C, and GLSL ES is not GLSL. How to program, and program well, on this computing model? → see here.
  • As a rather immature technology, it comes with many compatibility pitfalls 😦 . What are they, and how to avoid them? → see here.
  • Shadertoy has some specificities. How to exploit them best? → see here.
  • Shadertoy is now WebGL 2.0. What’s new? → see here.

In this blog, we will explore some of these questions.

Disclaimer: I’m not related to the development team. I’m just a heavy user of the tool 🙂

Performance tricks

[WIP]
Many things can impact performance:

Choice of algorithm and data structure

There is often more than one way to do a thing, especially in a given situation.
→ Choose wisely. ( Typically, in 3D rendering, ray tracing to the intersection is faster than sphere surfing, which is faster than ray marching with regular small steps – generally. )

For instance, some quadratic algorithms can be made linear if you split them into 2 passes, the first one being stored in BuffA. ( Examples: Gaussian filter, Fourier transform… )

A simpler sanity check: choose wisely the necessary precision and loop lengths 😉

Also, a reminder that shader programming is about massive parallelism, where the program is called for each pixel: the worst thing to do is to loop over the drawing of each element as you would in CPU programming. Especially for 2D shaders, check whether you can determine which element may cover the pixel, or which subset of ids.
– See making a shader loopless.
– For splatting particles, see Voronoï particle tracking.
– For regular symmetries and repetitions (in 2D or 3D), use domain folding ( abs or mod on pixel coordinates – in linear or polar form; see the sketch below ).
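For illustration, a minimal domain-folding sketch ( U is assumed to be centered, resolution-normalized coordinates ):

// fold space so that one evaluation covers all the symmetric copies of a shape
vec2 f = abs(U);                          // mirror symmetry: 4 copies for the price of 1
f = mod(f, .2) - .1;                      // linear repetition: one evaluation per grid cell
float n = 7.,                             // polar repetition: n sectors around the center
      r = length(U),
      t = mod( atan(U.y, U.x), 6.2832/n ) - 3.1416/n;
vec2 g = r * vec2(cos(t), sin(t));        // back to cartesian: draw a single sector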

Another consequence is that it makes no sense to write a shader with a first loop computing data into an array, then a second loop consuming the data linearly: either do everything in the same loop, or precompute the data in BuffA.

Maths

Simplify formulas mathematically ( e.g. using trigonometry, geometry, algebraic properties… ): optimizers don’t know about symbolic calculus 🙂
Conversely, this is useless on consts, since the compiler will evaluate them at compile time.

About factoring expressions, and pre- or post-computing whatever does not depend on a loop:
yes, the optimizer can do a lot of smart work you wouldn’t imagine… but sometimes it misses the obvious.
→ Don’t tempt the devil: do factor and avoid redundancy. Plus it helps readability and debugging.
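For instance, a small sketch of manual factoring ( rot() being the usual 2D rotation define, shown later on this page ):

// wasteful: the loop-invariant rot(iTime) – hence sin and cos – may be re-evaluated at every iteration
for (int i = 0; i < 64; i++) p.xy *= rot(iTime);
// factored: computed once, reused 64 times
mat2 M = rot(iTime);
for (int i = 0; i < 64; i++) p.xy *= M;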

Shadertoy tricks

Your costly precomputations in BuffA won’t change over time? Compute them just once:
→ if (iFrame==0) {do the computation} else fragColor = texelFetch(iChannel0, ivec2(fragCoord), 0)
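As a complete minimal sketch of this pattern ( costlyPrecomputation() is hypothetical; iChannel0 must be bound to BuffA itself ):

// in BuffA:
void mainImage( out vec4 O, in vec2 u ) {
    if (iFrame == 0) O = costlyPrecomputation(u);
    else             O = texelFetch(iChannel0, ivec2(u), 0);  // then just recopy forever
}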

Your ultra-costly shader won’t change over time? Compute it just once:
→ if(iFrame!=0){ discard; }

In both cases, this won’t work if you access a texture, because textures are loaded asynchronously: you first have to wait for them to load.
→ see how to.

You still want to update one of those when the resolution changes (e.g. going fullscreen)?
→ see how to. ( It’s even simpler for mouse moves or clicks: just test them. )

Your costly precomputations in BuffA do change over time, but don’t require full resolution?
→ Compute only in a corner of BuffA, then access it via bilinear interpolation or MIPmapping (done for you by the hardware, almost for free: just activate MIPmap filtering in the texture binder).
Similarly, you can use MIPmaps to approximate any spatial integration: see the General Purpose MIPmap page and examples. E.g. here we compute a Gaussian blur with a mix of crude sampling (using proper filter weights) and MIPmapping at small scale to get the sample values (from smaller scales); in addition, the 2D filter is separated into two 1D passes. (Attention: beyond the MIPmap approximation itself, it can be pretty biased on non-power-of-2 textures. If you need precision, better rely on CubeMaps, currently the only way to get square power-of-2 textures in Shadertoy :-/ )
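A hypothetical sketch of the corner trick, at quarter resolution ( costlyStuff() is illustrative ):

// in BuffA: compute only in the bottom-left quarter
vec2 uv = 4. * fragCoord / iResolution.xy;      // corner pixel → full [0,1] range
if (uv.x > 1. || uv.y > 1.) { fragColor = vec4(0); return; }
fragColor = costlyStuff(uv);
// in Image: the hardware bilinear interpolation upscales for free
fragColor = texture(iChannel0, .25 * fragCoord / iResolution.xy);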

GPU/SIMD issues and goodies

Parallelism is powerful, but it comes with constraints. Divergence in SIMD is a big one: a neighborhood of 32 pixels ( an nVidia warp ) is computed in total sync at the assembly-instruction level, so branching ( conditions, variable-length loops, switch, etc. ) forces “visiting” each configuration one after the other as soon as one pixel in the warp enters it.
A common irrational belief is then to fear and avoid any “if”, and to replace it with workarounds… which can be worse (and obfuscating).
→ The problem is not the “if” per se, but how you use it (in situations that are truly divergent). Typically, doing the same thing with different parameters in the 2 diverged branches is not parallelism-friendly, while just setting different parameters and then applying them after the branching is.
Note that bad accounting of parallelism can also cause ultra-long or crashing compilation.
Read more here.
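A small sketch of the difference ( hit, shade(), key and fill are illustrative names ):

// divergence-unfriendly: the heavy shade() is visited once per branch
col = hit ? shade(p, key) : shade(p, fill);
// friendlier: only the cheap parameter choice diverges, the heavy call is shared
vec3 L = hit ? key : fill;
col = shade(p, L);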

Also, on GPU a huge number of computing units must share caches and memory accesses: the ratio is ridiculous compared to CPU programming, so coding the same way can be disastrous for perfs. In particular, arrays can be ultra-costly (they prevent the parallel covering of wait states by eating all the registers), and re-computing an expression can be faster than fetching it from memory or texture, possibly even if it is already in cache!
→ Be wise, but also test, if it’s critical in your case (alas, the optimum is probably GPU-dependent).

Hardware derivatives:
If you need the screen-space derivative of anything, you can get it (almost) for free using dFdx, dFdy, fwidth, since the SIMD sync allows the GPU to compute finite differences for you.
Still, note that the precision is limited (non-centered finite differences evaluated every two pixels, which is OK for slowly varying quantities), that they are killed by divergence (be smart and compute them upstream), and that they are not implemented on low-end devices.
For instance, you can often get antialiasing (almost) for free by drawing the normalized distance v/fwidth(v) to a shape. ( But in case of discontinuity or divergence in v you have to be a bit smarter: see here. )
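E.g., a minimal sketch for a disc of radius r ( U: centered, resolution-normalized coordinates ):

float v = length(U) - r;                      // signed distance to the circle
float c = clamp( .5 - v/fwidth(v), 0., 1. );  // ≈1-pixel smooth edge, whatever the zoom
fragColor = vec4(c);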

( To be continued )

Galleries and collections

Art reproduction

The tag “reproduction” gathers shaders reproducing any kind of art form:
famous paintings, photos, drawings, gifs or Processing animations, music albums, various tributes, logos, retro games, icons, …

Examples ( assuming authors didn’t forget to set the appropriate tags ):
Escher, Vasarely, Magritte, Giger, Warhol, Turner, E. Jacob, Tyler Hobbs, …
More, but with poor tagging: photos, Mondrian, origami, …

Inktober

Various sessions of the Inktober challenge ( one shader a day during October, from a given prompt list ).

Optical illusions

Optical illusions, testing perceptions

Misc

See also

Advanced tricks

Smart antialiasing

Antialiasing for free is easy in 2D shaders, as long as you can express shapes as distances: just blur the borders over one pixel using smoothstep ( treated here ).
But this solution doesn’t work for 3D shaders, especially ray-tracing / casting / marching / surfing ones. Besides advanced techniques such as voxelization (volumes are naturally smooth) or cone-tracing (many variants, quite involved), the basic solution is oversampling: consider a sub-resolution within each pixel and average (the theory of the perfect filtering kernel is a bit more complicated, but let’s keep it simple 🙂 ). This also applies to complex 2D shaders not simply based on distances.
The inconvenience is that it multiplies the cost of the shader by the number of samples N, with N = 3×3 being already costly and yet not perfectly antialiased. Still, most of the time aliasing only occurs in a small proportion of the pixels in the window: at shape silhouettes, shadow borders, highlights, etc. Oversampling only these regions should thus do the trick at a reasonable cost: this is adaptive sampling.

One way to implement it is to define your render function in the Common tab, call it once as usual in BufferA, and, in the Image tab, either just display the result or add more samples if there is too much difference with the neighbor pixels (your choice of criterion here). Some examples.

Another way is to rely on the marvelous hardware derivatives again (a reminder that they apply to any variable or expression, and are almost free). It’s not perfect ( 2×2 block neighborhood ) but the code is then ultra simple: just rename your main as mainImage0 and append this at the end of your code:

// === easy adaptive sampling. === https://shadertoyunofficial.wordpress.com/2021/03/09/advanced-tricks/
//                           more: https://www.shadertoy.com/results?query=easy+adaptive+sampling
void mainImage(out vec4 O, vec2 U) {
    mainImage0(O,U);                      // first sample, at the pixel center
    if ( fwidth(length(O)) > .01 ) {      // difference threshold between neighbor pixels
        vec4 o;
        for (int k=0; k < 9; k+= k==3?2:1 )  // 8 extra samples ( k==4, the center, is skipped )
          { mainImage0(o,U+vec2(k%3-1,k/3-1)/3.); O += o; }
        O /= 9.;                          // average of the 9 samples
     // O.r++;                            // uncomment to see where the oversampling occurs
    }
}

To be adapted to higher sampling rates if you wish. Jittering the samples would further improve this basic scheme. You might even cascade it twice via a buffer, to allow higher sampling on tough areas only.

Note that if your main was ending with an sRGB conversion pow(col,1./2.2) − as it should −, move it to the end of the new main: antialiasing looks better in the proper final sRGB space ( a reminder that otherwise a 0.5 grey is not perceptually perceived as the mean intensity between black and white ).

Antialias existing shaders easily

Variant: antialias existing shaders (if they don’t use extra buffers) by overloading mainImage in the Common tab. See here.

Motion-blur existing shaders easily

Just overload iTime in Common to jitter it over one frame duration, using white noise or, better, blue noise. This alone just time-dithers the pixels, but you can combine it with antialiasing in order to integrate over the inter-frame time span. See here.
Using poor-man’s cheap & easy blue noise:

#define iTime ( iTime + blue(gl_FragCoord.xy)* iTimeDelta )
#define H(p)  fract(sin(mod(dot(p, vec2(12.9898, 78.233)),6.283)) * 43758.5453)
#define blue(p) (  \
          (  H(p+vec2(-1,-1)) + H(p+vec2(0,-1)) + H(p+vec2(1,-1))  \
           + H(p+vec2(-1, 0)) - 8.* H( p )      + H(p+vec2(1, 0))  \
           + H(p+vec2(-1, 1)) + H(p+vec2(0, 1)) + H(p+vec2(1, 1))  \
          ) *.5/9. *2.1 +.5 )

Smart Gaussian blur

Hardware MIPmapping is a cheap way to get blurred images (even of your buffers: just activate the MIPmap filter at binding time), but the filter is multi-crappy: a box filter, axis-aligned, on power-of-two blocks (bilinearly interpolated otherwise), evaluated at power-of-two scales (blending the 2 nearest levels otherwise).
Conversely, convolving an image with a large kernel is very costly.

But many filters are smooth: you can consider them as piecewise box filters. Meaning: instead of applying them to the full-resolution image, you can apply them to an intermediate MIPmap LOD. Choose the quality/cost compromise you wish by setting the number of samples you can afford, then select the LOD to use accordingly. Using LOD L saves the cost by a factor 4^L.

Second trick: Gaussians are lovely functions, the unique radially symmetric family with the separability property: a 2D Gaussian is the product of two 1D Gaussians in x and y. Meaning: you can 1D-filter horizontally in one buffer, then vertically in the next buffer: cost 2N instead of N². On top of that, the previous trick saves by a factor 2^L per pass.

See example.
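To make the recipe concrete, a hypothetical sketch of one separable pass ( horizontal here, then the same with dir = vec2(0,1) in the next buffer; the LOD, tap count and sigma are illustrative ):

vec4 gauss1D( sampler2D ch, vec2 uv, vec2 dir, float lod ) {
    vec4  s = vec4(0);
    float w, sum = 0., sigma = 4.;                // sigma in units of LOD-level texels
    for (float i = -8.; i <= 8.; i++) {           // 17 taps instead of 17 * 2^lod
        w = exp( -.5 * i*i / (sigma*sigma) );
        s += w * textureLod( ch, uv + i * exp2(lod) * dir / iResolution.xy, lod );
        sum += w;
    }
    return s / sum;
}
// usage: fragColor = gauss1D( iChannel0, fragCoord/iResolution.xy, vec2(1,0), 2. );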

Note that the hardware-built MIPmap is very buggy on non-power-of-two textures (and horribly buggy on Windows), so very large kernels might show artifacts if you use the coarsest levels. A future Shadertoy implementation will allow custom image sizes (then choose powers of two) and manually built MIPmaps, but for now only CubeMap(A) is MIPmap-friendly up to the coarsest level.

General Purpose MIPmap

Massively parallel programming with CUDA or compute shaders allows more complex algorithms than GLSL, for instance by cascading passes of very different resolutions. This is typically required by some non-parallel-friendly operations like sorting, compacting zeros, or extracting & compacting values above a level (or the local max). Managing this in parallel leads to the smart stream-reduction algorithms.
In GLSL we can at least access the single built-in one: the hardware-built MIPmap. The trick is to use it for many general purposes, considering that it allows very fast integrals over pixel blocks (axis-aligned power-of-2 blocks, at power-of-2 sizes, but still). Examples:

  • Imaging:
  • Physics / Geometry:
    • Center of mass of a grey image ( whole image, or per block ):
      Make a buffer containing uv*grey. Center of mass = MIPmap(uv*I) / MIPmap(I) ( see the sketch after this list ).
    • Order-2 moments / inertia matrix J / fitting ellipse / covariance matrix:
      Make another buffer containing vec3(u²,v²,u*v)*grey. Then Jxx = MIPmap(u²I)/MIPmap(I) - (MIPmap(uI)/MIPmap(I))². Same for Jyy and Jxy → J = mat2(Jxx,Jxy,Jxy,Jyy)
      → Other applications: iceberg simulator
    • Surface of an object in a mask / counting the set pixels: it’s just the coarsest MIPmap level ( times the total number of pixels, since the MIPmap gives the mean ).
    • Collisions / intersection:
      Put the 2 masks in the R and G channels, and make B = R*G. The coarsest MIPmap gives the size of the intersecting surfaces: collision if > 0.
      → Some applications: packing shapes with dart throwing, heat/salient-map ordered algos (example)…
  • Signal processing, calculus:
    • Building a min/max pyramid:
      using ( MIPmap(I^(-big)) )^(-1/big) and ( MIPmap(I^(big)) )^(1/big) – the power-mean trick: a large exponent makes the average select the extremum.
      → Other applications: find extrema, …
    • Poor man’s low-pass filter, high-pass filter, difference-of-“Gaussians” filter (to select a frequency band):
      just a LOD, image − LOD, the difference of two LODs.
      → Other applications: cartooning, …
    • Cheap high quality true Gaussian filter (see above):
      Do the convolution, but on a coarser image than the full resolution.
    • Fast radial convolution: ultra-cheap for large radial filters (but a bit blocky),
      by summing weighted MIPmap rings.
    • Analyzing by equidistant rings (e.g. if you have to weight by f(dist), or to normalize).
    • Detecting high variations in a neighborhood:
      Compare the buffer value to the MIPmapped value (with the LOD depending on the neighborhood size to consider; for a 2×2 neighborhood, you can also consider the hardware derivatives, i.e., fwidth() values ).
      → Some applications: adaptive antialiasing (see above), …
    • Cheap Laplacian (see Common), e.g. for reaction-diffusion simulations.
      Principle: 9·LOD1 − LOD0 ≈ crown contribution. Lapl = (crown − 8·LOD0)/8.
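As a concrete illustration of the center-of-mass item above, a hypothetical sketch ( iChannel0 = a buffer storing vec4(uv*I, I, 1), with MIPmap filtering activated ):

float maxLod = ceil( log2( max(iResolution.x, iResolution.y) ) );
vec4  m   = textureLod( iChannel0, vec2(.5), maxLod );  // one fetch averages the whole buffer
vec2  com = m.xy / m.z;                                 // = MIPmap(uv*I) / MIPmap(I)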

More: All General Purpose MIPmap examples.

Note that the hardware-built MIPmap is very buggy on non-power-of-two textures (and horribly buggy on Windows), which makes the coarsest MIPmap levels unusable on buffers. A future Shadertoy implementation will allow custom image sizes (then choose powers of two) and manually built MIPmaps, but for now use CubeMapA (providing six 1024×1024 buffers in half floats) if you need a correct coarsest level.

Splatting thousands of particles: Voronoï particle tracking

In Shadertoy we are stuck with fragment shaders. We are also stuck with WebGL2 / GLSL ES 3.0 (with some restrictions), and thus we don’t have access to the cool GLSL 4.2 image load/store feature allowing writes to any location in an output buffer: imageStore(buff, destCoord, data).

It is part of the beauty of the constraint and style of efficient massive parallelism to design procedural shaders in a pull rather than push spirit, forcing newbies to turn their mind inside out like a sock rather than ultra-inefficiently programming a GPU the way they would code in Python, C/C++ or Processing. But when it comes to particle-based simulations we have no choice: we have a huge bunch of elements at random, evolving locations, and to render them or find their neighbors we have to search them all. And there, your smart real-time simulation is ruined by even the poorest final rendering with its monstrously long splatting loop, while generally 0 and sometimes 1 particle will cover the current pixel. Can we do better?

Relying on acceleration data structures to organize the chaos is classical: in a first pass, store the elements in a grid, then whenever needed fetch the grid cell at the required location to get the nearby elements. But in fragment shaders, this first pass is just another splatting-random-data problem. Still, we have ways to do it iteratively.

Chameleon storage: one way is to try to store each element in the particle data buffer as close as possible to its real spatial location, by swapping elements over time. At render time, you then only have to fetch data in a small block around the target pixel. But it is quite impossible to make such an approach robust: without coordination, several elements might want to go to the same data entry, so a given particle is better saved many times (waste of resources, risk of overdrawing).

Voronoï particle tracking ( or the 1-buffer version ): more complicated to understand, but way more robust and efficient. The particle data buffer stays organized as usual: no special order. As an acceleration structure, we introduce a Voronoï buffer storing in RGBA the ids of the 4 (sorted) closest particles to the current pixel location, which can directly be used at render time (use only value 1, or up to 4 if you want to manage superimposition or connectivity). The whole point is to dynamically improve, then update, the choice of ids in the structure. This is done on a slow-flooding basis (but surprisingly fast in practice), by testing whether the ids stored in the direct neighbor cells are even closer, and updating our local best-4 (testing the 4 neighbors seems enough). To initialize the process, and to recover from occasional particle losses (e.g. when too many sit at the same place), we also test a few random ids (in practice, just 1 per pixel and per time step is generally enough).
A side effect of the method is that we get the 1-4 closest particles at any screen pixel 🙂 . The extension to 3D is easy, but cells must be larger, and thus storing only 4 neighbors allows less dense distributions than in 2D.
Besides the reference implementations above (links in the title), some examples ( Michael0884 gathered a playlist ):
2D: gas (interacting with video), vortex simulation, massive collisions, SPH, massive boids swarm, Voronoï triangulation, closest graph, same in fluid particles, edge tracking, Delaunay triangulation, borders tracking with snakes.
3D: Lorentz attractor, 3D boids, 3D fluid particles, 3D SPH, 3D graph, massive collisions, cloth simulation.
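To give the flavor of the update pass, a very schematic sketch keeping a single id per texel instead of the best-4 ( getPos() – reading the particle buffer –, iChannel1 – the Voronoï buffer – and N_PART are assumptions ):

#define hash(p) fract( sin(dot(p, vec2(12.9898,78.233))) * 43758.5453 )

void mainImage( out vec4 O, vec2 u ) {
    float best  = texelFetch(iChannel1, ivec2(u), 0).x,   // current candidate id
          dBest = length( getPos(best) - u );
    for (int k = 0; k < 4; k++) {                         // flooding: test the 4 direct neighbors
        ivec2 n = ivec2(u) + ivec2( k==0 ? 1 : k==1 ? -1 : 0,
                                    k==2 ? 1 : k==3 ? -1 : 0 );
        float id = texelFetch(iChannel1, n, 0).x,
              d  = length( getPos(id) - u );
        if (d < dBest) { best = id; dBest = d; }
    }
    float rnd = floor( hash(u + iTime) * N_PART );        // 1 random test: seeding & recovery
    if ( length( getPos(rnd) - u ) < dBest ) best = rnd;
    O = vec4(best);
}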

( To be continued )

Simulated computers & languages

Not only do some people program whole games in fragment shaders (like the Quakes or others), but some also simulate whole computers or programming-language interpreters!

For now there are 8 of them:

See all emulators in Shadertoy.
See all interpreters in Shadertoy.

Widgets & GUI toolkits (and more)

In their shadertoys, programmers have to handle everything, and redo everything from scratch every time. For repetitive basic utility features like GUIs or font display this can be very tedious, so many just don’t embed any GUI, and users have to tweak defines.
Fortunately, various community members have published useful base elements that you can reuse!
( PS: coders, please do add the tag “gui” to your utility shaders: for now it gathers very few! )

Bags of features:

  •   Super Shader GUI 98 : Windows 98 inside !
    Windows (move, iconify, scroll), sliders, checkboxes, color picker.

Isolated features:

Sliders & widgets:

Color picker:

Mouse management:

Keyboard management:

Text & int/float numbers display:

Specials:

Graphics:

  •   Mode 7 WASD Walkaround : basic WASD move on 3D terrain.
    ( No longer compiling: replace vec2 BUFF_RES with a #define. )
    Remark: coders, do account for non-QWERTY keyboards: replace or add the arrow keys! (codes 37-40; see the keyboard sketch after this list)
  • Missing! : Some easy-to-reuse code for 2D and/or 3D walk-throughs.
    → Which shader would you advise?
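Meanwhile, a minimal keyboard-reading sketch for the remark above ( Shadertoy exposes the keyboard as a 256×3 texture whose row 0 holds the currently-held state; it is assumed bound to iChannel3 ):

#define key(c) ( texelFetch( iChannel3, ivec2(c, 0), 0 ).x > 0. )   // is the key of code c held ?
// support both the arrows (codes 37-40) and WASD (65,68,87,83):
vec2 move = vec2( float( key(39) || key(68) ) - float( key(37) || key(65) ),
                  float( key(38) || key(87) ) - float( key(40) || key(83) ) );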

Shadertoy media files

Sometimes you need to access the Shadertoy built-in media (textures, videos, music…) outside of Shadertoy (e.g. when porting your shader to desktop, or writing a paper or a blog post about your shader).
Here they are:

2D textures:

NB: font generator (CC0) by Octavio Good
Cubemaps:
Volumes:

Videos:

Music:

Classical corner cases

The following addresses more specific issues, simple or more involved:

Accessing texel center

Textures are strange arrays, indexed by floats. Once rescaled by the texture size, texel centers correspond to coordinates integer+0.5, as for window pixels (but unlike iMouse coordinates). When the texture is supposed to contain non-interpolatable data, it is easy to misfetch the value. Turning the texture interpolation flag to nearest is often a bad idea: if you misfetched the precise texel location, the rounding to integer can easily fetch the wrong plain value. So you do have to get your formula exact!
Note that if you really want to access the texture as an array, there is now texelFetch( iChannel0, ivec2(U), lod ): integer index, no interpolation or wrapping (so possibly faster too), and much easier to fetch the right place with 🙂 .
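E.g., two exact ways to fetch texel (i,j) of a data texture ( i and j being illustrative ints ):

vec4 a = texture( iChannel0, ( vec2(i,j) + .5 ) / iChannelResolution[0].xy ); // aim at the texel center
vec4 b = texelFetch( iChannel0, ivec2(i,j), 0 );                              // array-style access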

Please no black icons: Demo mode on mouse-tuned shaders

Many shaders react to the mouse, which is nice. But a lot of them forget that iMouse==vec4(0) at start and for the icon display, leading to a black or uninteresting image.
→ For your mouse-tuned shader, always think about detecting the zero case, and then setting special demo values – possibly animated!

Two possibilities (among many):

if (iMouse.xy==vec2(0)) set-nice-params; else set-params(mouse);
vec2 M = iMouse.xy; M = length(M)<10. ? fake-M-value(t) : normalization(M);

The second option lets the user easily get back to demo mode by clicking in the bottom-left corner.

One more thing about the initial mouse value:
some shaders detect that the mouse is currently clicked from the sign of iMouse.zw. Just take care that in the icon and at start iMouse.zw == 0 as well, which is not negative despite not being clicked 😉 . More details about the Shadertoy iMouse uniform here.

More about icon, start image, preview

  • A reminder that live icons start at t=10″: your shader had better look nice at that moment, and at least not plain black!
  • An image preview is made when you save your shader. Some people browse using previews instead of live icons (faster, safer), and previews are used in various places (e.g. when I list & link to Shadertoy games 😉 , or when you use ShadertoyPlugin to check whether what you see is what the author intended).
    So when your shader gets mature, remember to wait for a cool look before hitting the “Save” button!

iFrame  vs  iTime  vs  iDate(.w)

  • When you really want the viewer experience to be smooth, i.e. synchronized with their own time (camera motion, simulation), use iTime. Using iFrame would yield a motion with stops and changing speed, which would differ between users’ systems and with the screen resolution.
  • Speaking of simulation: in physical simulations, all equations should refer to real units of space and time, to account for the various window resolutions and shader framerates, and thus should rely on some dx, dy and dt variables. iTimeDelta lets you know the real dt since the last frame. NB: it’s usually better to use a relaxation to smooth it over several frames, so that isolated micro-freezes don’t corrupt the simulation: dt = mix(dt, iTimeDelta, .05); // smooth over ~20 frames
  • iDate.w, giving (micro)seconds since midnight, is useful when you really need unique seeds even at shader initialization – during the run, iTime is generally sufficient since the FPS is never exactly regular. If you use the fractional part of iDate.w, take care that at the end of the day (~86400 sec) the precision collapses.
  • iFrame is better suited to managing the internal states of your shader: initialization time, actions delayed to the next frame or later, periodic switching between 2 states, etc.

Attention: in the icon, iTime starts at 10 seconds, while iFrame is really 0. This is one more reason not to use iTime to trigger initializations. It’s also a reason to take care of what your shader looks like at t=10″, since that will be the look of your icon. (Always check the look of your icon in the “latest shaders” view.)
More details about Shadertoy special timing uniforms here.

Video sync, video time

Avoid binding the same video multiple times in different buffers: it’s very easy to get the versions out of sync. → Store it in the first buffer, and access only this version.

Also, you can’t be sure when the video will be loaded and started, so it’s very imprecise to rely on iTime for timed video effects. If you want your effects to be well synchronized, use the special Shadertoy uniform iChannelTime[0].

Loading texture at initialization

In buffers, we can do some initialization with if ( iFrame == 0 ).
Alas, it won’t work if you need a texture (e.g. the noise texture), since HTML is asynchronous and textures need a few frames to load. To handle this, some people do if ( iFrame < 30 ) instead, but when the network gods are unhappy it can take longer, and you don’t want your shader to wait rigidly at start.
The solution is to store and compare iChannelResolution[] through frames, since this Shadertoy uniform is set to the texture resolution only once it is loaded. Example here. Note that this also allows the dynamic detection of texture binding by the user.

if (u==R-.5) { O.a = iChannelResolution[1].x; return; }          // store the seen texture resolution in the top-right pixel
if ( iFrame == 0 || texture(iChannel1, 1.-.5/R, 0. ).a == 0. ) { // init... (texture not loaded at the previous frame)

NB: above, testing pixel R-.5 instead of .5 also allows detecting the swap to fullscreen.

Attention: you could use textureSize().xy as well, but level 0 is never 0 (it is set to ivec2(1) if there is no texture). And anyway, if you need MIPmap levels at initialization, it is not enough to test level 0: you must test the required (or the maximum) LOD with textureSize! cf here.

Video or sound doesn’t loaded at start (or randomly)

This is not a bug, it’s a “feature” of the stupid new web policy forbidding the auto-play of multimedia if the user didn’t generate some event (and it doesn’t even work all the time). Pressing the “rewind” button may or may not work.
The only solution: the user must quickly click-and-drag the mouse or press a key in the Shadertoy window before the loading is finished.
At least you can test whether the locking occurred by comparing iChannelResolution and textureSize: in such a case the first is set (meaning the media is around) but the second sticks to 1 (for lod 0), so you can at least display a message to the user. See here.

Detecting resolution change / waiting for fullscreen

At resolution change you might want to redraw the background or reorganize data. A typical use case is letting the user go fullscreen before doing the meaningful initialization. Pausing the shader 1 sec at start is quite rigid, and requiring a keypress may not be transparent enough. The solution, again, is to store and compare iResolution through frames. Example here.
Note that detecting the switch to fullscreen lets you zoom and refit the content if you wish, but there is currently no way to do the same on a size shrink, since the buffer is already trimmed. 😦
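A minimal sketch of the detection ( iChannel0 = the buffer itself, which stores last frame’s resolution in its pixel (0,0) ):

vec2 old = texelFetch( iChannel0, ivec2(0), 0 ).xy;
if ( old != iResolution.xy ) { /* re-init here: resolution changed (or first frame) */ }
if ( ivec2(fragCoord) == ivec2(0) ) fragColor = vec4(iResolution.xy, 0, 1); // store for next frame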

MIPmap and dFdx/dFdy/fwidth artifacts at discontinuities

Hardware derivatives are useful for antialiasing or for drawing constant-thickness curves. They are also used by the hardware to compute the LOD to apply to MIPmapped textures. But if the mapped value has a discontinuity somewhere, the look will be ugly there, showing segment or curvy artifacts.

A classical situation bringing discontinuity is fract(coords) or mod. ( NB: I hope you know that textures can repeat directly without using fract, by just setting the wrap flag 🙂 .)
A solution consists in computing the LOD manually and using either textureLod() (for simple LOD-level tuning) or textureGrad() (providing the full pixel footprint). Base API (note that this would reproduce the artifact if the derivatives are used as-is):
textureLod( iChannel0, U, log2(length(fwidth(U*iResolution.xy))) );
textureGrad( iChannel0, U, dFdx(U), dFdy(U) );

As with what we did for antialiasing, we can use the hardware derivatives for the task and just skip the fract or mod, since they don’t change the gradient (other than by adding discontinuities). When the scaling is simple, the LOD level is directly known, without computing derivatives. Or in some cases you might prefer computing the derivatives analytically.

Another classical situation is caused by atan(y,x) (e.g. polar coordinates), at the jump 2π→0 or -π→π. But since the jump amount is known, and is easy to detect (a big value, while usual gradients are small), it is easy to trim:
da = fwidth(a); if ( da > 3. ) da = abs(da - 2.*PI);  cf example.

Something more puzzling: sometimes the bug doesn’t show until you resize the window or go fullscreen (or somebody else sees it and you don’t). Cf example. The reason is that hardware derivatives use an approximation trick: pixels are organized in tiles of 2×2 neighborhoods, and the derivatives are evaluated only once per tile, as right − left for dFdx and top − bottom for dFdy. So only discontinuities occurring within a tile are detected. If the fract(coords) 1→0 jump or the polar -π→π jump happens between 2 tiles (typically for centered polar coordinates, or for a texture tiled 2×2, with the window height a multiple of 4), then the artifact won’t occur. So a hacky trick could be to center your coordinates so as to control this, e.g. offsetting the screen coords by 2.*floor(iResolution.xy/4.) 🙂 .

A nastier case is the use of MIPmaps or derivatives inside loops, if or switch blocks. As we have seen here, in SIMD parallel computation such situations correspond to divergence: the neighbor pixels might not be executing the same code (i.e. they may sit idle while the current pixel follows its own combinatory branch). Since hardware derivatives are based on comparing the values of neighbor pixels, they no longer work there (and can return 0, NaN or silly things depending on the system). A call to texture() in a loop might seem harmless, but if the loop has a varying length then we are in such a case: if the current pixel loops further than one of its neighbors, MIPmap and hardware derivatives are wrong for it.
Here again the solution is to compute the LOD manually and, if using derivatives, to evaluate them before the if or switch. Or to save the useful value in the loop, and compute the derivative or access the MIPmap after the loop.
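A sketch of the safe pattern ( advance() and n are illustrative ):

vec2 dUdx = dFdx(U), dUdy = dFdy(U);   // footprint evaluated before any branching: all pixels in sync
vec3 col = vec3(0);
for (int i = 0; i < n; i++) {          // n varies per pixel → divergence
    col += textureGrad( iChannel0, U, dUdx, dUdy ).rgb;   // explicit footprint: no implicit derivative
    U = advance(U);
}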

Case study: making dot pattern loopless

Many beginners tend to program in Shadertoy as they would program in C: by explicitly drawing every element. But shaders are called for every pixel, so drawing the full scene each time can be very costly; it’s way more efficient to determine which is the only – or which are the few – element(s) that may cover the current pixel, when feasible. This is often the case for repetitive patterns, e.g. a spiral of dots. Once the candidate element is determined, we can draw it (which might by itself be costly, if it were not a simple dot like here).

( Figure: a spiral of dots. )

Let’s see how we can get from one form of the algorithm to the other.
A Shadertoy user once proposed a “C-like” shader similar to this:

#define rot(a) mat2( cos(a),-sin(a),sin(a),cos(a) )

void mainImage( out vec4 O, vec2 u ) {
    vec2 R = iResolution.xy,
    U = ( 2.*u - R ) / R.y,
    p = vec2(.01,0);

    float PI = 3.14159,
        phi = iTime * .01 + 0.1, // or (1. + sqrt(5.))/2.,    
          a = phi * 2.*PI,
          d = 1e9;

    for( int i = 0; i < 1400; i++) {
        d = min( d, length(U - p) - .001 );
        p *= rot(a);
        p = normalize(p) * ( length(p) + .0015 );
    }
    
    O = vec4( smoothstep(3./iResolution.y, 0., d - .01) );
}

p starts at vec2(.01,0); then at each step it is rotated by a, and its distance to the center is increased by .0015. The loop runs 1400 times for every pixel, which can be quite costly (especially in fullscreen), while at most one dot will cover the pixel. Can we determine which one?

 

First, let’s make the current coordinates explicit, from the description above (NB: I personally prefer to loop on floats, to avoid loads of casting):

p = ( .01+ .0015*i ) *CS(i*a);
with 
#define CS(a)  vec2(cos(a),sin(a))

Now, which i yields the dot closest to U?

Let’s do the maths!

p ~= U considered in polar coordinates yields .01 + .0015*i ~= length(U)  and  i*a ~= atan(U.y,U.x) modulo 2π.
The first gives i0 ~= ( length(U) - .01 ) / .0015, and the second gives i1 ~= atan(U.y,U.x) / a – or rather, i1' = i1 + k*2.*PI/a, where k is the – unknown – turn number.
We can find k such that i0 ~= i1':  k ~= ( i0 - i1 ) / ( 2.*PI/a ).
k must be an integer, so we should consider either the floor or the ceil of this value. round would do for a small dot, but with the larger one here we need both values to cover the full dot (and for an even larger one, we should visit a few more integers around).
Then, i = round( i1 + k * 2.*PI/a )

 

Which gives the loopless version of the shader ( the 2-value loop on n just parses the floor and the ceil ):

#define CS(a)  vec2(cos(a),sin(a))

void mainImage( out vec4 O, vec2 u )
{
    vec2 R = iResolution.xy, 
         U = ( 2.*u - R ) / R.y;

    float PI = 3.14159,
        phi = iTime*0.001 + 0.1, // or phi = (1. + sqrt(5.))/2.,
          a = phi * 2.*PI,
         i0 = ( length(U) - .01 ) /.0015,
         i1 = ( mod( atan(U.y,U.x) ,2.*PI) )/ a, // + k*2PI/a
          k = floor( (i0-i1) / (2.*PI/a) ), 
          i, d = 1e9;
    
    for (float n = 0.; n < 2.; n++) {
        i = round( i1 + k++ * 2.*PI/a );
        vec2 p = ( .01+ 0.0015*i ) *CS(i*a);
        d = min( d, length(U - p) - .001 );
    }
  
    O = vec4(smoothstep(3./iResolution.y, 0., d - .01));
}

Not only is the cost per pixel hugely decreased (by a factor of about 1000), but the cost is now totally independent of the number of dots! Which is good news when designing a pattern.

See more ( including simpler ) loopless shaders.

 

Note that when designing a spiral shader “the procedural way” from scratch, we could follow a totally different strategy: converting U to spiral coordinates, then splitting them into chunks using fract. The non-trivial part is then the drawing of the dot, since it must not be done in spiral coordinates or it would appear deformed. So we must convert the center and its neighborhood back to screen coordinates. See example.

Programming tricks in Shadertoy / GLSL

Many people start writing GLSL shaders as they would write C/C++ programs, not accounting for the fact that shaders are massively parallel computing, and that the GLSL language offers many useful goodies such as vector ops and more. Or, not knowing how to deal with some basic issues, many people copy-paste very ugly or unadapted code designs, or re-invent the wheel (generally ending up with less-good solutions than the evolutionarily polished ones 😉 ).

Here we address some basic patterns/tasks. For even more basic aspects related to the good use of the GLSL language and of parallelism, please first read usual-tricks-in-shadertoy/GLSL. And for the non-intuitive issues causing huge costs (or even crashes) at runtime or compile time, read avoiding-compiler-crash-or-endless-compilation.

Normalizing coordinates

The window is a rectangle that can have various sizes: icon, sub-window in a browser tab, fullscreen, seen from different computers (including tablets and smartphones), and various aspect ratios (fullscreen vs sub-window, or different hardware, including smartphone long screens in landscape or portrait mode). So we usually start by normalizing the coordinates. For some reason, many people use the very ugly pattern of first normalizing+distorting the window coordinates to [0,1]×[0,1] ( −0.5 if centered ), then applying an aspect ratio to undistort. Basic clean solutions are:

vec2 R = iResolution.xy,
  // U = fragCoord / R.y;                     // [0,1] vertically
     U = ( 2.*fragCoord - R ) / R.y;          // [-1,1] vertically
  // U = ( fragCoord - .5*R ) / R.y;          // [-1/2,1/2] vertically
  // U = ( 2.*fragCoord - R ) / min(R.x,R.y); // [-1,1] along the shortest side

Displaying textures and videos

Note that if you want to map an image to the full window, thus with distortions, you then do need to use fragCoord/R.
But if you want to map an undistorted rectangular image – typically, a video – things are a little more involved: see here. Since the typical video ratio is accidentally not too far from the window ratio (on a regular screen), most people blindly relied on the “map to full window” above, but on smartphones it then looks totally distorted.
( Note that texelFetch avoids texture distortion in a simpler way, but then you no longer benefit from the hardware interpolation, rescaling and wrapping features. )

Managing colors

Don’t forget sRGB / gamma !

Don’t forget that image-texture and video channel intensities are encoded in sRGB, and that the final shader color is to be re-encoded by you in sRGB, while most synthesis and processing done in shaders is assumed to happen in linear space.
This is especially important for antialiasing, since returning 0.5 is really not perceived as mid-grey (test here), for color interpolation (see a counter-example, and another), and for the luminance computation of texture images and videos (NB: this encoding of intensity was historically chosen to account for the non-linear intensity response of CRT screens, then kept as a perception-based cheap compression, and as a normalization to understand colors the same way across multiple input and output devices).
Fortunately, sRGB is close to a gamma 2.2 conversion: do fragColor = pow(col, vec4(1./2.2)) at the very end of your program, and col = pow(tex, vec4(2.2)) after reading a texture image that is to be processed or combined (this does not apply to noise textures). Note that just doing fragColor = sqrt(col), resp. col = tex*tex, is a pretty good approximation.

Hue

Many people rely on a full, costly RGB2HSV conversion just to get a hue-based color.
This can be made a lot simpler using (see ref):

#define hue(v) ( .6 + .6 * cos( 2.*PI*(v) + vec4(0,-2.*PI/3.,2.*PI/3.,0) ) )  // looks better with a bit of saturation
// code golfed version:
// #define hue(v) ( .6 + .6 * cos( 6.3*(v) + vec4(0,23,21,0) ) )

For full RGB2HSV/HSL and back, see classical and iq references.

Drawing thick bars

step( x0, x ) transitions from 0 to 1 at x0.
smoothstep( .0, .01, x-x0 ) does the same with a smooth transition.
To make a thick bar, rather than multiplying a 0-to-1 by a 1-to-0 transition, just do:

step(r/2., abs(x-x0) )
smoothstep(.0, .01 , abs(x-x0)-r/2. )  // smooth version

NB: above, 1 is outside. If you want 1 inside, use 1.- the above, or:

step( abs(x-x0), r/2. )
smoothstep( .01, .0,  abs(x-x0)-r/2. )  // smooth version

Antialiasing

Aliasing in space or in time is ugly and makes your shader look very newbie 😀 . Oversampling inside each pixel is very costly and gives not-so-good improvement even with hundreds of samples per pixel. For algorithms like ray-tracing you have few alternatives (apart from complex techniques like game-programming screen-space time-based denoising). But for simple 2D shaders it’s often easy to get very good antialiasing almost for free, by using 1-pixel-smooth transitions at all boundaries. More generally, the idea is to return a floating-point “normalized distance” rather than a binary “inside or outside”.
Typically, instead of if (x>x0) v=0.; else v=1.; ( or v = x>x0 ? 0. : 1.; ), which are equivalent to v = step( x0, x ), just use v = smoothstep( x0-pix, x0+pix, x ), where pix is the pixel width measured in your coordinate system (e.g. pix = 2./R.y if the vertical coord is normalized to [-1,1]). ( Or simply clamp( (x-x0)/(2.*pix), 0., 1. ). Note that smoothstep eats part of the transition interval, so you need to compensate by using at least pix = 1.5*pixelWidth. ) cf Example code.

// Antialiased 2D ring or 1D bar of radius r around v0 (2D disc: v0 = 0 )
// normalized coords version:
#define S(v,v0,r)  smoothstep( 1.5/R.y, -1.5/R.y, length(v-(v0)) - (r) )  
// pixel coords version:
#define S(v,v0,r)  smoothstep( 1.5, -1.5, R.y* ( length(v-(v0)) - (r) )) 

( You might find the 2nd formulation more intuitive: turn v back to pixel coordinates, then proceed with the black/white transition over 2-3 pixels. )

When you see magic numbers like 0.01 in smoothsteps, tell the code author that it won’t scale (aliased in the icon, blurry in fullscreen), and to just use the true pixel width instead. Note that for 1-pixel-thin features, the result will look aliased if you forget the final sRGB conversion at the end of the shader.

Nastier functions are floor, fract and mod, since there is no simple way(*) to smooth their discontinuity the way we did for step. Still, these are often used with some final thresholding, which just has to not sit right on the discontinuity: e.g., fract(x+.5)-.5 no longer has a discontinuity at x = 0 (or at x = integer). If you need to handle both discontinuities at 0 and 1 (e.g. series of bars as above), abs(fract(x+.25)-.5)-.25 puts them in the continuous part of fract. cf example code.
(*): E.g. 1: see smoothfloor/smoothfract. E.g. 2: you might sometimes use clamp( sin(PI*x)/PI / pix, 0., 1. ) instead of int(x)%2.

If the parameter value is not a simple scaling of coordinates, it can be difficult to know the pixel size in its units. But GLSL hardware derivatives can do it for you: pix = fwidth(x), at least if x is not oscillating crazily faster than the pixel rate. But then, being a derivative, any discontinuity will cause an issue while you were only interested in the coarse gradient. If x contains discontinuities like x=fract(x’) or x=mod(x’), then simply use x’ instead of x in fwidth, since it’s just the same gradient without the discontinuity. cf Example code.

Drawing lines

People solved this long ago, so you don’t need to reinvent the wheel 😉 .
The principle is to return the distance to a segment, then use the “antialiased thick bar” trick above (cf #define S). Note that for a complex drawing you can first compute the min distance to all the features, then apply the antialiased-bar (and optional coloring) at the very end. You might even use dot(,) rather than length(), so as to compute the sqrt only once.

float line(vec2 p, vec2 a,vec2 b) { // --- distance to segment with caps
    p -= a, b -= a;
    float h = clamp(dot(p, b) / dot(b, b), 0., 1.);// proj coord on line
    return length(p - b * h);                      // dist to segment
    // We might directly return smoothstep( 3./R.y, 0., dist),
    //     but its more efficient to factor all lines.
    // We can even return dot(,) and take sqrt at the end of polyline:
    // p -= b*h; return dot(p,p);
}

Depending on the use case, you might want the distance to an isolated segment (including the caps at its ends), or just to the capless segment. cf Example code.
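Typical usage, with the min-then-smoothstep factoring mentioned above ( A, B and C being your polyline points ):

float d = min( line(U, A, B), line(U, B, C) );  // min distance over all the segments
fragColor = vec4( smoothstep( 3./iResolution.y, 0., d - .005 ) ); // one antialiased stroke of radius .005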

Blending / compositing

When you splat semi-transparent objects, or once you use antialiasing, rather than setting or adding colors you must compose the semi-transparent layers, or you will suffer artifacts.
Below, C holds the pure object color in RGB and the opacity in A; O is the current (and final) color.

Drawing assumed to be from front to back (i.e. closest first):
( which allows stopping as soon as the opacity is 100%, or above some threshold like 99.5% )

O += (1.-O.a) * vec4( C.rgb, 1 ) *C.a;

Drawing assumed to be from back to front (i.e. closest last):

O = mix( O, vec4( C.rgb, 1), C.a );

Vector maths

First, a reminder that GLSL natively knows about vectors, matrices, vector geometry operations and blending operations; even most ordinary math functions work on vectors: see here. Besides geometry, vectors can also be used for RGBA colors, for complex numbers, etc. Each time you want to do the same thing on x, y and z (for instance), use them! The perf won’t be much better, but the readability of the code will be – easing the reasoning, the bug chasing and the code evolution.

In addition, it’s often convenient to add some more vector constructors, like:

#define CS(a)        vec2( cos(a), sin(a) )
#define cart2pol(U)  vec2( length(U), atan((U).y,(U).x) )
#define pol2cart(U) ( (U).x * CS( (U).y ) )

Some operations on complex numbers: ( vec2 Z  means  Z.x + i Z.y )

// add, sub;  mul or div by float : just use +, -, *, /
#define cmod(Z)     length(Z)
#define carg(Z)     atan( (Z).y, (Z).x )
#define cmul(A,B) ( mat2( A, -(A).y, (A).x ) * (B) )  // by deMoivre formula
#define cinv(Z)   ( vec2( (Z).x, -(Z).y ) / dot(Z,Z) ) 
#define cdiv(A,B)   cmul( A, cinv(B) )
#define cpow(Z,v)   pol2cart( vec2( pow(cmod(Z),v) , (v) * carg(Z) ) )
// #define cpow(A,B)   cexp( cmul( B, clog(A) ) )   // variant for a complex exponent ( pick one of the two cpow )
#define cexp(Z)     pol2cart( vec2( exp((Z).x), (Z).y ) )
#define clog(Z)     vec2( log(cmod(Z)), carg(Z) )

Rotations:

the simplest is to just return the 2D matrix (even for 3D axial rotations):

#define rot(a)      mat2( cos(a), -sin(a), sin(a), cos(a) )
// use cases:
vec2 v = ... ; v *= rot(a); // attention: left-multiply reverses angle 
vec3 p = ... ; p.xy *= rot(a.z); p.yz*= rot(a.x); ...

Note that the optimizer recognizes identical formulas and won’t evaluate sin and cos twice.

Just for fun, the code golfed version 🙂 :  mat2( cos( a + vec4(0,33,11,0)) )

3D rotation around a given axis A: 
#define rot(P,A,a)  ( mix( dot(P,A)*A , P, cos(a) ) + sin(a)*cross(P,A)  )

Matrices:

Most basic vector and matrix operations are available in Shadertoy.
Some goodies are offered in the initializers ( summarized here ):
You can initialize a matrix from:

  • a full list of floats
  • a full list of column vectors
  • a mix of floats and vectors ( taken as a raw series of numbers )
  • a single float → this sets the diagonal values
  • a larger matrix → it takes the top-left submatrix
  • a smaller matrix → this sets the top-left submatrix, the rest being initialized from the identity matrix

You can also set or access an individual column with mat[i].

You can normalize a matrix via vec and mat conversions:
mat2(normalize(vec4(m2)))  or  mat2(m2/length(vec4(m2)))
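A quick illustration of these goodies ( rot() as defined above ):

mat3 A = mat3(1.);         // identity: a single float sets the diagonal
mat2 B = mat2(A);          // top-left 2×2 submatrix of A
mat4 C = mat4(rot(.5));    // 2×2 block in the top-left, identity elsewhere
vec3 c0 = A[0];            // get a column;  A[1] = vec3(...) would set one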

Computing random values

    • Sometimes we need the equivalent of drand(), i.e. a linear congruential series, which can easily be reimplemented explicitly. cf wikipedia.
    • But most of the time what we really need is a hash value, i.e. a different random value for each pixel, or grid cell, or 3D coord, or 2D+time, etc. And this hash might be a scalar or a vector.
      • For simple use cases, you might rely on the Shadertoy 2D or 3D noise textures, in grey or RGBA: see special-shadertoy-features. (Take care not to interpolate, and to reach the texel centers, if you really want a hash – possibly using the nearest flag or texelFetch.) Still, the precision is limited (8-bit textures, 64 or 256 resolution).
      • Early integer-less shading languages popularized old-school cheap float-based hashes relying on the chaotic least-significant bits after a non-linear operation. (The magic values are important, and come from the dawn of the computer-science age.)
        #define hash21(p) fract(sin(dot(p, vec2(12.9898, 78.233))) * 43758.5453)
        #define hash33(p) fract(sin( (p) * mat3( 127.1,311.7,74.7 , 269.5,183.3,246.1 , 113.5,271.9,124.6) ) *43758.5453123)
        ...

        see many variants here. A problem is that the precision is hardware- (and compiler-) dependent, so the random values can vary between users. Plus, p must be neither too small nor too big: on poor 16- or 24-bit hardware the random value might just always be zero.

      • Since WebGL2, we can now rely on robust, precise (but a bit costlier) integer-based hashes: see the reference code, especially the GlibC or NRC refs in Integer Hash – II, and the sketch after this list.
        They usually eat an unsigned int, so take care when casting floats around zero (since [u]int(-0.5) = [u]int(0.5) ).
      • Attention: the variant introduced by Perlin, based on permutation tables, is very inefficient in shaders, since arrays and texture fetches are ultra-costly, and the cascading dependent accesses of the 3D-to-1D wrap are not pipeline-friendly either.
    • You might not want a hash, but a continuous random noise function. Depending on your needs,
      • you might be happy with a simple value noise (e.g. the simple noise texture with interpolation, or analytic, using the ref codes),
      • a splined value noise,
      • or a more costly gradient noise (see ref codes),
      • up to full Perlin noise (gradient + spline interpolation + fractal. NB: Perlin published 3 different algorithms over time: Classical, Improved, Simplex).
        Attention: many shaders or blogs named “Perlin noise” really just fake a simple gradient or even a value noise, with random rotations through the scales to mask artifacts. This might be OK for you, but don’t mistake it for what it is not. Conversely, it’s not a good idea perf-wise to use the permutation tables for the hashes.
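As announced in the integer-hash item above, a hedged sketch ( the mixing constants are Chris Wellons’ lowbias32; the float wrapper is illustrative ):

uint lowbias32(uint x) {
    x ^= x >> 16;  x *= 0x7feb352du;
    x ^= x >> 15;  x *= 0x846ca68bu;
    x ^= x >> 16;  return x;
}
float hash12(vec2 p) {                        // beware of the casts around zero !
    uvec2 q = uvec2( ivec2( floor(p) ) );
    return float( lowbias32( q.x + lowbias32(q.y) ) ) / 4294967295.;
}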

Profiling, timers, compiled code

Optimizing GPU code, as with parallel programming in general (but worse), is difficult and unintuitive. Several tools can help with this task.
Alas, none work in WebGL, only on desktop. But you can easily port your WebGL GLSL to desktop, and even more easily using some glue tools like libShadertoy (see the page Applications compatible with Shadertoy).

Profiling tools:

nVidia Nsight (with different features depending on whether you want the Windows VisualStudio, Linux Eclipse, or standalone version).

Getting timers:

  • C-side: a timer query tells you the precise real duration (in ms) of the rendering of your draw call.
  • GLSL-side: the shader clock lets you measure the current SM clock (in ticks) at any point in your shader.
    It’s often useful to also check the current SM ID, warp ID, and their corresponding ranges.

Getting compiled code:

Two methods:

  • Compile the shader separately, using the cgc compiler or the Shader Playground app (Windows).
    Problem: you must choose the target language (OpenGL, OpenGL ES, WebGL, …) and the language version, and it is hard to be sure which one will be used in your app, especially for WebGL in a browser.
  • Get the assembly from your app: glGetProgramBinary()

Now, it’s always interesting to see the generated code, but it’s often not easy to deduce from it the right things in terms of how to optimize (apart from very basic arithmetic or algorithmics). For instance, key aspects for perfs are: the number of registers used (because it constrains how many alternate threads can be piled up to mask the wait states); divergence (conditionally executed parts that will differ for different pixels of a warp); the consequences of wait states (waiting for maths or texture results or, worse, for dependent chains – which the optimizer can improve by shuffling instructions). All things that are not easily read in the code, or that the optimizer could improve by producing code that looks strange at first glance. Also, the optimizer tends to produce long code by unrolling everything, in order to resolve more operations at compilation time; this can also yield apparently ugly, complex, yet more performant code.

In the scope of WebGL, remember that upstream of all that, Windows will by default transpile the code to HLSL – using a different version of D3D depending on whether you run Firefox or Chrome – instead of compiling the GLSL. And the ANGLE layer transforms your GLSL code to try to fix errors and instabilities of WebGL, but different browsers activate different ANGLE patches, or don’t use it at all.
At the other end, the “assembly code” is really an intermediate code, which is compiled further in the GPU.
That’s why profiling tools and timers are probably more useful for optimization 😉