Advanced tricks

Smart antialiasing

Antialiasing for free is easy in 2D shaders, as long as you can express shapes as distances: just blur borders over a pixel using smoothstep ( treated here ).
But this solution doesn’t work for 3D shaders, especially ray-tracing / casting / marching / surfing ones. Beside advanced complex techniques such as voxelization (since volumes are naturally smooth) or cone-tracing (many variants, quite involved), the basic solution is oversampling: consider a subresolution in each pixel and average (the theory of the perfect filtering kernel is a bit more complicated, but let keep it simple 🙂 ). This also applies to complex 2D shaders not simply based on distances.
The inconvenient is that it multiplies the cost of the shader by the sampling size N, with N = 3×3 being already costly and not perfectly anti-aliased. Still, most of the time the aliasing only occurs in a small proportion of the pixels in the window: at shape silhouettes, shadows border, highlights, etc. Oversampling these regions only should thus do the trick for not too expansive: this is adaptive sampling.

One way to implement it is to define your render function in Common tab, to call it once as usual in BufferA, and in Image tab to either just display it or add more samples if there is too much difference with neighbor pixels (your choice of criterion here). Some examples.

Another way is to relate on the marvelous hardware derivatives again (a reminder that they apply to any variable or expression and are almost free). It’s not perfect ( 2×2 block neighborhood ) but the code is then ultra simple: just rename your main as mainImage0 and append this at the end of your code:

// === easy adaptive sampling. === https://shadertoyunofficial.wordpress.com/2021/03/09/advanced-tricks/
//                           more: https://www.shadertoy.com/results?query=easy+adaptive+sampling
void mainImage(out vec4 O, vec2 U) {
    mainImage0(O,U);
    if ( fwidth(length(O)) > .01 ) {  // difference threshold between neighbor pixels
        vec4 o;
        for (int k=0; k < 9; k+= k==3?2:1 )
          { mainImage0(o,U+vec2(k%3-1,k/3-1)/3.); O += o; }
        O /= 9.;
     // O.r++;                        // uncomment to see where the oversampling occurs
    }
}

To be adapted to higher sampling if you wish. Jittering the samples would even improve this basic scheme. You might even cascade this twice via a buffer to allow higher sampling on tough areas only.

Note that if your main was ending with a sRGB conversion pow(col,1./2.2) − as it should − , move it at the end of the new main. Antialiasing looks better with the proper final sRGB space ( a reminder that otherwise a 0.5 grey don’t perceptually appears as the mean intensity between black and white ).

Smart Gaussian blur

Hardware MIPmap is a cheap way to get blurred images (even on your Buffers: just activate the MIPmap filter at binding), but the filter is multi-crappy: box filter, axis-aligned, on power-of-two blocks (otherwise bilinearly interpolated), evaluated at power-of-two scales (and blending the 2 nearest levels otherwise).
Oppositely, convolving an image with a large kernel is very costly.

But many filters are smooth: you can consider them as piecewise box filters. Meaning, instead of applying them on the full resolution image, you can apply them on an intermediate MIPmap LOD: choose the quality/cost compromise you wish by setting the number of samples you want to afford, then select the LOD to use accordingly. Using LOD L means saving the cost by a factor 4L .

Second trick: Gaussians are lovely functions, and the unique family that has the separability property: a 2D Gaussian is the product of two 1D Gaussians in x and y. Meaning: you can 1D filter horizontally in a buffer, then vertically in the next buffer: cost N instead of N². On top of that, the previous trick saves by a factor 2L.

See example.

Note that hardware built-MIPmap is very buggy on non-power-of-two textures (and horribly buggy on Windows) , so very large kernels kernels might show artifacts if you use the coarsest levels. Future shadertoy implementation will allow custom image size (then chose powers-of-two) and manually built MIPmap, but for now only CubeMap(A) is MIPmap friendly up to the coarsest level.

General Purpose MIPmap

Massively parallel programming with Cuda or Compute shaders allows more complex algorithm than GLSL, for instance by cascading passes of very different resolutions. This is typically required by some not parallel-friendly operations like sorting, compacting zeros, extracting&compacting values above a level (or the local max). Managing this in parallel leads to the smart stream reduction algorithms.
In GLSL we can at least access the single built-in one: hardware built-MIPmap. The trick is trying to use that for many general purposes, considering that is allows very fast integrals on pixels blocks (axis aligned, power-of-2 blocks, for power-of-2 sizes, but still). Examples:

  • Average color of an image ( = coarsest MIPmap ).
    Or average in large local blocks ( = less coarse level in nearest mode or via texelFetch ).
  • Standard deviation or variance of colors ( whole image, or per blocks ):
    Make a buffer containing the image values squared. Then, Variance = MIPmap(I²) - MIPmap(I)²
    Other applications: auto-constrast (another), white balance (better), select frequency, filtering LUTs, height/bump map filtering
  • Center of Mass of a grey image ( whole image, or per blocks ):
    Make a buffer containing uv*grey. Center of mass = MIPmap(uv*I) / MIPmap(I)
  • Order-2 Moments / inertia matrix J / fitting ellipse / covariant matrix :
    Make another buffer containing vec3(u²,v²,u*v)*grey. Then Jxx = MIPmap(u²I)/MIPmap(I) - (MIPmap(uI)/MIPmap(I))² . Same for Jyy and Jxy → J = mat2(Jxx,Jxy,Jxy,Jyy)
    Other applications: iceberg simulator
  • Surface of an object in a mask / counting set pixels : it’s just the coarsest MIPmap level.
  • Collisions / Intersection:
    Have the 2 masks in R and G channel, make B = R*G. Coarsest MIPmap gives size of intersecting surfaces : collision is > 0.
    Some applications: packing shape with dart throwing, heat/salient map ordered algos (example)…
  • Building a min/max pyramid:
    using ( MIPmap(I-big) )big and ( MIPmap(Ibig) )-big
    Other applications: find extrema,…
  • Poor man Low-pass Filter, High-pass Filter, Differential of “Gaussians” filter (select a frequency band):
    just LOD, LOD – image, substract two LODs.
    Other applications: cartooning, …
  • Cheap high quality true Gaussian filter (see above):
    Do the convolution, but on a coarser image than the full resolution.
  • Detect high variations in neighborhood:
    Compare buffer value to MIPmapped value (LOD depending on neighbor size to consider. For 2×2 neighborhood, you can also consider hardware derivatives, i.e., fwidth() values ).
    Some applications: adaptive antialiasing (see above), …

More: All General Purpose MIPmap examples.

Note that hardware built-MIPmap is very buggy on non-power-of-two textures (and horribly buggy on Windows), which makes coarsest MIPmap levels unusable on buffers. Future shadertoy implementation will allow custom image size (then chose powers-of-two) and manually built MIPmap, but for now use CubeMapA (providing six 1024×1024 buffers in half floats ) if you need correct coarsest level.

Splatting thousands of particles: Voronoï particle tracking

In Shadertoy we are stuck in fragments shaders. We are also stuck in WebGL2 / GLSL ES 3.0 (with some restriction), and thus we don’t have access to the cool GLSL 4.2 Image Load/Store feature allowing to write to any location in an output buffer: imageStore(buff, destCoord, data).

It is part of the beauty of the constraint and style of efficient massive parallelism to design procedural shaders in the pull rather than push spirit, forcing newbies to return their mind inside out as a sock rather than ultra-inefficiently programming a GPU as they code in Python, C/C++ or Processing. Now when it comes to particle-based simulations we have no choice: we have a huge bunch of elements at random evolving locations, and to render them or find their neighbor we have to search them all. And there, your smart real-time simulation is ruined by even the poorest final rendering with its monstrously long splatting loop, while generally 0 and sometime 1 particle will cover the current pixel. Can we do better ?

Relying on acceleration data-structures to structure the chaos is classical: in a first pass, store the elements in a grid, then whenever needed fetch the grid cell at the required location to get the nearby elements. But in fragment shaders, this first pass is just another splatting random data problem. Still, we have ways to do that iteratively.

Cameleon storage: One way is to try to store an element in the particle data buffer as close as possible to its real spatial location, by swapping elements along time. Then at render time you only have to fetch data in a small block around the target pixel. But this is quite impossible to make such approach robust: without coordination, several elements might want to go to the same data entry, so a same particle is better saved many time (spoil of resource, risk of overdrawing).

Voronoï particle tracking ( or 1-buffer version ) : More complicated to understand, but way more robust and efficient. The particles data buffer stays organized as usual: no special order. As acceleration structure we introduce a Voronoï buffer storing in RGBA the id of the 4 (sorted) closest particles to the current pixel location, that can directly be used at render time (use only value 1, or up to 4 if you want to manage superimposition or connectivity ). The whole point is to dynamically improve then update the choice of ids in the structure. This is done on a slow flooding basis (but surprisingly fast in practice), by testing whether the ids stored in the direct neighbor cells are even closer, and updating our local best-4 (testing the 4 neighbors seems enough). To initialize the process, and recover from occasional particle loss (e.g. when too many at the same place), we also test a few random ids (in practice just 1 per pixel and per time step is generally enough).
A side effect of the method is that we get the 1-4 closest particles at any screen pixel 🙂 . Extension to 3D is easy, but cells must be larger and thus storing 4 neighbors only allows less dense distributions than in 2D.
Beside the reference implementations above (links in title), some example: ( Michael0884 gathered a playlist. )
2D: gas (interacting with video), vortex simulation, massive collisions, SPH, massive boids swarm, Voronoï triangulation, closest graph, same in fluid particles, edge tracking, Delaunay triangulation, borders tracking with snakes.
3D: Lorentz attractor, 3D boids, 3D fluid particles, 3D SPH, 3D graph, massive collisions, cloth simulation.

( To be continued )

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s