Applied in lithos-emit.mjs before GLSL emission. These run once per camera update, not per frame.
1
Frustum Culling
server
Objects behind the camera or beyond farPlane are filtered out before the shader is emitted. The server computes dot(objectPos - camPos, camForward) and rejects objects with negative results or distance > farPlane.
Only visible objects appear in the generated GLSL. A full 180-degree turn can remove every object; a narrow canyon view might keep all of them. The frustum check is a single dot product per object — negligible server cost, potentially massive GPU savings.
Typically 0–50% of objects removed, depending on view direction
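The check described above can be sketched as follows. This is a minimal illustration, not the code in lithos-emit.mjs: the object shape ({ name, pos }) and the function name cullObjects are assumptions, and camForward is assumed to be normalized.

```javascript
// Hypothetical object shape: { name, pos: [x, y, z] }. camForward is assumed normalized.
function cullObjects(objects, camPos, camForward, farPlane) {
  return objects.filter((obj) => {
    const dx = obj.pos[0] - camPos[0];
    const dy = obj.pos[1] - camPos[1];
    const dz = obj.pos[2] - camPos[2];
    // Signed distance along the view direction: negative means behind the camera.
    const along = dx * camForward[0] + dy * camForward[1] + dz * camForward[2];
    if (along < 0) return false;
    // Reject anything farther away than the far plane.
    return Math.hypot(dx, dy, dz) <= farPlane;
  });
}

const visibleDemo = cullObjects(
  [
    { name: "tree", pos: [0, 0, -5] },   // in front, near: kept
    { name: "rock", pos: [0, 0, 5] },    // behind the camera: culled
    { name: "birch", pos: [0, 0, -80] }, // in front, beyond farPlane: culled
  ],
  [0, 0, 0],
  [0, 0, -1],
  50
);
console.log(visibleDemo.map((o) => o.name)); // [ 'tree' ]
```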
2
Distance-Based LOD
server
Far objects are simplified before shader emission. Complex multi-primitive SDFs collapse to single bounding spheres; noise detail is stripped from distant rocks.
Trees and birches beyond the LOD threshold: 6 SDFs → 1 bounding sphere. Rocks beyond threshold lose FBM noise displacement. The visual difference at distance is subpixel.
Threshold: 15–20 meters. Up to 6x fewer SDF evaluations per distant object.
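A rough sketch of the LOD pass, under stated assumptions: the object fields (kind, pos, radius, primitives, noise) and the 15 m default are illustrative choices, since the text only gives a 15–20 m range.

```javascript
// Hypothetical object shape: { kind, pos, radius, primitives: [...], noise: boolean }.
const LOD_THRESHOLD = 15; // meters; assumed value inside the 15–20 m range from the text

function applyLod(obj, camPos, threshold = LOD_THRESHOLD) {
  const dist = Math.hypot(
    obj.pos[0] - camPos[0],
    obj.pos[1] - camPos[1],
    obj.pos[2] - camPos[2]
  );
  if (dist <= threshold) return { ...obj, lod: 0 }; // near: keep full detail
  if (obj.kind === "rock") {
    // Distant rocks keep their base shape but lose FBM noise displacement.
    return { ...obj, noise: false, lod: 1 };
  }
  // Distant multi-primitive objects collapse to a single bounding sphere.
  return {
    ...obj,
    primitives: [{ type: "sphere", center: obj.pos, radius: obj.radius }],
    lod: 1,
  };
}
```

The function returns a new object instead of mutating its input, so the full-detail description stays available when the camera moves back inside the threshold.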
3
Adaptive March Steps
server
Scene complexity determines the ray march step budget. Fewer visible objects = fewer steps needed to resolve the scene. The server counts post-cull objects and selects the tier.
43 steps: ≤2 objects (45%)
62 steps: ≤5 objects (65%)
82 steps: ≤10 objects (85%)
96 steps: all visible (100%)
Base step count: 96. Percentage is fraction of base budget. Looking at empty sky = 43 steps. Surrounded by forest = 96.
Up to 55% fewer march iterations per pixel
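The tier selection can be written as a small lookup. The names BASE_STEPS, TIERS and selectStepCount are illustrative; the thresholds and fractions are the ones listed above.

```javascript
const BASE_STEPS = 96;
const TIERS = [
  { maxObjects: 2, fraction: 0.45 },
  { maxObjects: 5, fraction: 0.65 },
  { maxObjects: 10, fraction: 0.85 },
];

// visibleCount is the number of objects left after culling.
function selectStepCount(visibleCount) {
  const tier = TIERS.find((t) => visibleCount <= t.maxObjects);
  // Above the last tier the full budget is used.
  return Math.round(BASE_STEPS * (tier ? tier.fraction : 1));
}

console.log([0, 4, 8, 20].map(selectStepCount)); // [ 43, 62, 82, 96 ]
```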
4
Constant Baking
server
Terrain heights are computed once on the server and embedded as float literals in the emitted GLSL. Instead of calling terrainH(pos.xz) per march step, the shader reads a constant.
Every object that sits on terrain needs its y-coordinate. Without baking, each pixel pays N terrainH() calls per frame just to place objects, where N = object count × march steps. With baking, that placement cost is zero.
Eliminates N × terrainH() placement calls per pixel per frame (N = objects × steps)
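A sketch of what baking might look like on the emitting side. The terrainH below is a stand-in (the real one is FBM-based and not shown in the text), and glslFloat, emitObjectCenter and the objPosN naming are assumptions.

```javascript
// Stand-in height function; the real terrainH is FBM-based.
function terrainH(x, z) {
  return 0.5 * Math.sin(x * 0.3) + 0.5 * Math.cos(z * 0.2);
}

// GLSL float literals need a decimal point, so format with fixed precision.
function glslFloat(v) {
  return v.toFixed(4);
}

function emitObjectCenter(obj, index) {
  const y = terrainH(obj.x, obj.z); // evaluated once, on the server
  return `const vec3 objPos${index} = vec3(${glslFloat(obj.x)}, ${glslFloat(y)}, ${glslFloat(obj.z)});`;
}

console.log(emitObjectCenter({ x: 0, z: 0 }, 0));
// const vec3 objPos0 = vec3(0.0000, 0.5000, 0.0000);
```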
5
Height Early-Out
server
A single branch in the emitted shader, if (p.y > maxH) return ground, skips all object SDFs for ray positions above the canopy height; only the terrain distance is returned.
The server computes maxH as the tallest visible object plus margin. Sky-facing rays hit this early-out on every step. Ground-facing rays bypass it when they descend below canopy.
~35% of pixel march budgets saved (sky and horizon pixels)
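One way the server could compute maxH and emit the branch. The topY field, the 0.5 margin and the function names are assumptions; the text only says "tallest visible object plus margin".

```javascript
// Hypothetical object shape: { topY } = world-space height of the object's highest point.
function computeMaxH(visibleObjects, margin = 0.5) {
  const tallest = visibleObjects.reduce((m, o) => Math.max(m, o.topY), -Infinity);
  return tallest + margin;
}

function emitEarlyOut(visibleObjects) {
  // With no visible objects there is nothing to skip: always return terrain distance.
  if (visibleObjects.length === 0) return "return ground;";
  const maxH = computeMaxH(visibleObjects);
  return `if (p.y > ${maxH.toFixed(3)}) return ground;`;
}

console.log(emitEarlyOut([{ topY: 4 }, { topY: 7.25 }]));
// if (p.y > 7.750) return ground;
```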
6
Shader Caching
server
Emitted shaders are memoized by a key composed of: visible object set + LOD levels + step count. If the camera moves but the same objects remain visible at the same LODs, the cached shader is returned instantly.
Cache key = hash(visibleSet, lodLevels, stepCount). Cache hit = zero-cost re-emission. Cache invalidation is automatic when the visible set changes (camera movement beyond LOD thresholds or frustum boundary).
Cache hit = zero server work, zero recompilation
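The memoization can be sketched with a Map. This version uses a canonical string as the key instead of a hash, which is a simplification; cacheKey and getShader are illustrative names.

```javascript
const shaderCache = new Map();

function cacheKey(visibleIds, lodLevels, stepCount) {
  // Pair each id with its LOD, then sort so the key ignores traversal order.
  const parts = visibleIds.map((id, i) => `${id}:${lodLevels[i]}`).sort();
  return `${parts.join(",")}|${stepCount}`;
}

// emit is only called on a cache miss.
function getShader(visibleIds, lodLevels, stepCount, emit) {
  const key = cacheKey(visibleIds, lodLevels, stepCount);
  if (!shaderCache.has(key)) shaderCache.set(key, emit());
  return shaderCache.get(key);
}
```

Invalidation needs no extra code here: a change in the visible set, a LOD level or the step count produces a different key, so the next lookup misses and re-emits.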
GPU-Side Optimizations
Applied in the GLSL shader template. These run every frame, every pixel, every march step.
7
Lipschitz Step Factor 0.879
gpu
Mathematically derived from the Lipschitz constant of FBM terrain. The terrain SDF p.y - terrainH(p.xz) has gradient magnitude √(1 + |∇terrainH|²), which is bounded by L = 1.137 for the FBM parameters used. The safe step factor is 1/L = 0.879.
Previous heuristic: 0.68, safe but about 23% smaller than necessary, so it wasted steps. Every ray march step now advances t += d * 0.879 instead of t += d * 0.68. Same safety, 29% larger steps, fewer iterations to converge.
29% larger steps vs heuristic 0.68. Fewer iterations to hit surface.
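The arithmetic can be checked in a few lines. The slope bound below is back-computed from L = 1.137; how Lithos derives that bound from its FBM parameters is not shown in the text.

```javascript
// Step factor for a heightfield SDF p.y - h(p.xz), given a bound on |grad h|.
function safeStepFactor(maxSlope) {
  const L = Math.sqrt(1 + maxSlope * maxSlope); // Lipschitz constant of the SDF
  return 1 / L;
}

const maxSlope = Math.sqrt(1.137 ** 2 - 1); // slope bound implied by L = 1.137
console.log(safeStepFactor(maxSlope).toFixed(4)); // 0.8795
console.log((safeStepFactor(maxSlope) / 0.68).toFixed(2)); // 1.29
```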
8
AABB Guards
gpu
Expensive SDF objects are wrapped in axis-aligned bounding box checks. Before evaluating the full SDF, the shader tests: if(abs(p.x - cx) < r && abs(p.z - cz) < r). If the ray position is far from the object, the SDF is never called.
Cost of AABB check: 2–4 ops (abs + compare). Cost of skipped SDF: 20–200 ops (tree, rock, creature). The guard itself still runs once per object, but the expensive SDF evaluations drop from O(objects) per step to O(nearby objects). Most march steps are far from most objects.
Skips 80–95% of SDF evaluations per march step
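A CPU-side reference of the guard, written in JavaScript so the skipped evaluations can be counted. The object fields (cx, cz, r, sdf) and the counter are assumptions. How the emitted shader bounds the step when the guard fails is not stated in the text; this sketch simply skips the object.

```javascript
// objects: [{ cx, cz, r, sdf: (p) => distance }]; counter.sdfCalls counts real SDF work.
function guardedSceneDistance(p, objects, counter) {
  let d = Infinity;
  for (const o of objects) {
    // Cheap guard: only evaluate the SDF inside the object's xz bounding square.
    if (Math.abs(p[0] - o.cx) < o.r && Math.abs(p[2] - o.cz) < o.r) {
      counter.sdfCalls++;
      d = Math.min(d, o.sdf(p));
    }
  }
  return d; // Infinity if every guard failed; a real shader would min() this with terrain
}
```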
9
Nearest-Row Domain Repetition
gpu
Vineyard rows use domain repetition: instead of evaluating all 6 rows, the shader computes the nearest row index from the z-coordinate and only evaluates that row. Adjacent rows are checked only near boundaries.
float rowIdx = round(p.z / rowSpacing); reduces 6 SDF evaluations to 1–2. The modular arithmetic is exact — no visual artifacts. Used in Taurus vineyard scene.
6 row evaluations → 1–2. ~70% fewer vine SDFs.
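The row lookup, mirrored in JavaScript. Clamping the index to the existing rows is an assumption (the text only shows the round), as are the function name and the 0-based row numbering.

```javascript
// Maps a world z-coordinate to the nearest row and the offset from that row's center line.
function nearestRowLocalZ(z, rowSpacing, rowCount) {
  // Clamp so points beyond the first or last row still map to an existing row.
  const idx = Math.min(Math.max(Math.round(z / rowSpacing), 0), rowCount - 1);
  return { idx, localZ: z - idx * rowSpacing };
}

console.log(nearestRowLocalZ(4.0, 2, 6)); // { idx: 2, localZ: 0 }
```

The row SDF is then evaluated once at localZ. When |localZ| is close to rowSpacing / 2 the neighbouring row is almost as near, which is where the second evaluation mentioned above comes in.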
10
Analytical Terrain Normal
gpu
Quilez technique: derive the terrain gradient analytically from the noise function instead of using finite-difference epsilon probes. The normal is computed from the partial derivatives of terrainH.
Finite-difference normal: 6 SDF calls (central differences in x, y, z). Analytical normal: 1 call that returns both height and gradient. Used in Taurus terrain shading.
6x fewer SDF calls for terrain normals
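To illustrate the idea, here is a height function that returns its own partial derivatives, with the normal built from them. The sine-based terrain is a stand-in for the FBM noise, and terrainHD and analyticNormal are assumed names.

```javascript
const AMP = 0.5, FREQ = 0.7; // stand-in terrain parameters

// Returns height and both partial derivatives from a single evaluation.
function terrainHD(x, z) {
  return {
    h: AMP * Math.sin(FREQ * x) * Math.cos(FREQ * z),
    dhdx: AMP * FREQ * Math.cos(FREQ * x) * Math.cos(FREQ * z),
    dhdz: -AMP * FREQ * Math.sin(FREQ * x) * Math.sin(FREQ * z),
  };
}

// Normal of the surface y = h(x, z) is normalize(-dh/dx, 1, -dh/dz).
function analyticNormal(x, z) {
  const { dhdx, dhdz } = terrainHD(x, z);
  const len = Math.hypot(dhdx, 1, dhdz);
  return [-dhdx / len, 1 / len, -dhdz / len];
}
```

For noise-based terrain the same pattern applies octave by octave: each octave contributes its value to h and its derivative to the gradient.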
11
Distance-Adaptive AO
gpu
Ambient occlusion step count is reduced for distant surfaces. Near geometry gets full-quality AO; far geometry gets fast AO; flat terrain above camera skips AO entirely.
Near surfaces (<10m): 5 AO taps.
Far surfaces (>10m): 3 AO taps.
Flat terrain above camera: 0 AO taps (skip entirely).
Each tap is a full sceneSDF call, so reducing from 5 to 3 saves 40% of AO cost.
40–100% AO cost reduction depending on distance
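The tap selection reduces to a small function. The name aoTapCount and the boolean flag are assumptions, and treating exactly 10 m as "far" is a choice the text leaves open.

```javascript
function aoTapCount(distance, isFlatTerrainAboveCamera) {
  if (isFlatTerrainAboveCamera) return 0; // skip AO entirely
  return distance < 10 ? 5 : 3;           // near: full quality, far: fast
}

// Each tap is one sceneSDF call, so the far tier costs 3/5 of the near tier.
console.log(aoTapCount(25, false) / aoTapCount(5, false)); // 0.6
```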
12
Terrain Shadow
gpu
Cheap 4-step shadow march against terrain heightfield + object bounding spheres. Not a full sceneSDF shadow — only checks if terrain or a bounding sphere occludes the light.
Full shadow: march sceneSDF toward light (96 steps × full SDF cost). Terrain shadow: 4 steps × (terrainH + sphere checks). Captures terrain self-shadowing and large-object shadows at ~5% of full shadow cost.
~95% cheaper than full sceneSDF shadow march
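A CPU-side sketch of a 4-sample shadow test. The even spacing of the samples, the maxDist default and the sphere shape ({ c, r }) are assumptions; lightDir is assumed normalized and terrainH is passed in as a function.

```javascript
// Returns 0 if terrain or a bounding sphere blocks the light, 1 otherwise.
function terrainShadow(p, lightDir, terrainH, spheres, maxDist = 20) {
  const STEPS = 4;
  for (let i = 1; i <= STEPS; i++) {
    const t = (maxDist * i) / STEPS;
    const q = [p[0] + lightDir[0] * t, p[1] + lightDir[1] * t, p[2] + lightDir[2] * t];
    // Heightfield test: the sample is below the terrain surface.
    if (q[1] < terrainH(q[0], q[2])) return 0;
    // Bounding-sphere test: the sample is inside an object's sphere.
    for (const s of spheres) {
      if (Math.hypot(q[0] - s.c[0], q[1] - s.c[1], q[2] - s.c[2]) < s.r) return 0;
    }
  }
  return 1;
}
```

With only four samples, thin occluders between samples are missed. That fits the stated goal of capturing terrain self-shadowing and large-object shadows only.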
13
Material ID System
gpu
sceneSDF_id returns (distance, materialID) in a single march. After the ray hits, the material ID selects color/roughness/emission without re-evaluating the geometry.
Without material IDs: after hitting a surface, you must re-evaluate each object SDF to determine which one was hit (N SDF calls). With IDs: the closest object's ID is tracked during marching at zero extra cost (just a conditional assignment). Shading is a single switch(id).
Eliminates post-hit geometry re-evaluation
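The pattern, mirrored in JavaScript: the ID of the closest object is carried alongside the distance, and shading is a single switch. The material names and object shape are placeholders.

```javascript
// objects: [{ id, sdf: (p) => distance }]
function sceneSdfId(p, objects) {
  let best = { d: Infinity, id: -1 }; // -1 = nothing hit
  for (const o of objects) {
    const d = o.sdf(p);
    // The conditional assignment is the only extra work per object.
    if (d < best.d) best = { d, id: o.id };
  }
  return best;
}

function shade(id) {
  switch (id) {
    case 1: return "bark";  // placeholder material
    case 2: return "stone"; // placeholder material
    default: return "sky";
  }
}
```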
14
Dot-Fract Hash
gpu
Taurus uses fract((p3.x + p3.y) * p3.z) instead of the conventional sin(dot(p, k)) hash. Avoids the GPU's Special Function Unit (SFU) for sin, which is a shared resource.
SFU operations (sin, cos, exp) serialize on most GPUs — only 1–2 SFUs per compute unit. fract() and multiply are full-throughput ALU ops. Saves ~2 cycles per noise call. In a terrain with 6 octaves of FBM noise evaluated per march step, this adds up.
~2 cycles saved per noise call. Avoids SFU contention.
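For reference, a JavaScript port of a sine-free hash that ends in the expression above. The text only gives the final fract; the way p3 is built here follows the widely used "hash without sine" form and is an assumption about Taurus. JavaScript computes in double precision, so values will not match GPU float results bit for bit.

```javascript
const fract = (x) => x - Math.floor(x);

// 2D input, 1D output in [0, 1). Uses only multiply, add and fract.
function hash12(x, y) {
  let p3 = [fract(x * 0.1031), fract(y * 0.1031), fract(x * 0.1031)];
  // dot(p3, p3.yzx + 33.33)
  const d =
    p3[0] * (p3[1] + 33.33) +
    p3[1] * (p3[2] + 33.33) +
    p3[2] * (p3[0] + 33.33);
  p3 = p3.map((v) => v + d);
  return fract((p3[0] + p3[1]) * p3[2]);
}
```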
ARM64 Benchmark Optimizations
Applied in the Lithos benchmark harness for ARM64 (Apple Silicon). These optimize CPU-side math for native rendering and tooling.
15
4-Accumulator FMLA Chains
arm64
Dot products are unrolled into 4 independent FMLA (fused multiply-accumulate) chains. Each chain feeds a separate accumulator register, breaking the data dependency that would otherwise serialize the pipeline.
Without flags that permit floating-point reassociation (such as -ffast-math), GCC emits a single-accumulator loop: each FMLA depends on the previous result (4-cycle latency). With 4 accumulators, the CPU issues 4 FMLA instructions in parallel across execution ports. Final result = horizontal add of 4 accumulators.
11x faster than GCC-generated dot product
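The dependency structure can be shown in scalar form. This JavaScript sketch illustrates the unrolling pattern only: it does not emit FMLA, and the speedup quoted above applies to the native ARM64 version, not to this code.

```javascript
function dot4Acc(a, b) {
  // Four independent accumulators: no addition waits on the previous iteration's sum.
  let s0 = 0, s1 = 0, s2 = 0, s3 = 0;
  const n = a.length - (a.length % 4);
  for (let i = 0; i < n; i += 4) {
    s0 += a[i] * b[i];
    s1 += a[i + 1] * b[i + 1];
    s2 += a[i + 2] * b[i + 2];
    s3 += a[i + 3] * b[i + 3];
  }
  // Horizontal add of the four partial sums, then the leftover elements.
  let sum = (s0 + s1) + (s2 + s3);
  for (let i = n; i < a.length; i++) sum += a[i] * b[i];
  return sum;
}

console.log(dot4Acc([1, 2, 3, 4, 5, 6], [2, 2, 2, 2, 2, 2])); // 42
```

Splitting the sum changes the order of floating-point additions, so results can differ from the single-accumulator loop in the last bits.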
16
NEON Polynomial SFU
arm64
sin/cos computed via NEON SIMD polynomial approximation (Chebyshev or minimax). Processes 4 floats per instruction instead of calling scalar libm functions.
Scalar sinf() from libm: ~40 cycles, one value. NEON polynomial: ~12 cycles, four values simultaneously. The polynomial is tuned for SDF-relevant precision (24-bit mantissa, <1 ULP error in the range used).
12–17x faster than scalar libm sin/cos
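The shape of a 4-lane polynomial evaluation, sketched in JavaScript. This uses a degree-7 Taylor polynomial in Horner form, which is a deliberate simplification: it is not the tuned Chebyshev or minimax polynomial described above, it is only intended for inputs in [-π/2, π/2], and it is far less accurate than the <1 ULP figure quoted.

```javascript
// Evaluates sin on four lanes with the same polynomial, as a SIMD unit would.
function sin4(lanes) {
  const out = new Float32Array(4);
  for (let i = 0; i < 4; i++) {
    const x = lanes[i];
    const x2 = x * x;
    // Horner form of x - x^3/6 + x^5/120 - x^7/5040
    out[i] = x * (1 + x2 * (-1 / 6 + x2 * (1 / 120 + x2 * (-1 / 5040))));
  }
  return out;
}
```

On NEON each line of the Horner chain becomes one fused multiply-add over all four lanes, which is where the per-value cost advantage over scalar libm comes from.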
17
Post-Increment Addressing
arm64
Load/store pairs (ldp/stp) use post-increment addressing mode with pre-computed offsets. This avoids Read-After-Write (RAW) hazards on the address register.
Pre-indexed mode: ldp x0, x1, [x2, #16]! updates x2 before the load completes, a potential pipeline stall. Post-indexed mode: ldp x0, x1, [x2], #16 reads first and increments afterwards. The address register is free for the next instruction immediately.
Eliminates RAW pipeline stalls in memory-intensive loops
18
N-Gram Fusion
arm64
128 adjacent-opcode patterns are fused to eliminate redundant store-load pairs. When instruction A stores to memory and instruction B immediately loads from the same address, the pair is fused into a register-to-register move.
Compiler-generated code often stores intermediate results to the stack and immediately reloads them. The fusion pass identifies these patterns and replaces them with register forwarding. 128 fusion rules cover the most common SDF evaluation patterns.
Eliminates store-load pairs. Reduces memory traffic in hot loops.
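A toy version of one fusion rule. The instruction format ({ op, src, dst, addr }) is invented for illustration and does not reflect the real opcode patterns. The sketch also assumes the stored slot is not read again later, which is what makes dropping the store legal.

```javascript
// Toy IR: { op: "store", src, addr }, { op: "load", dst, addr }, { op: "mov", dst, src }, ...
function fuseStoreLoad(instrs) {
  const out = [];
  for (let i = 0; i < instrs.length; i++) {
    const a = instrs[i];
    const b = instrs[i + 1];
    if (a.op === "store" && b && b.op === "load" && a.addr === b.addr) {
      // Adjacent store + load of the same address becomes a register move.
      out.push({ op: "mov", dst: b.dst, src: a.src });
      i++; // the load has been consumed by the fusion
    } else {
      out.push(a);
    }
  }
  return out;
}
```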
19
SME2 Matrix Coprocessor
arm64
Apple Silicon M4's Scalable Matrix Extension 2 (SME2) coprocessor used for matrix multiply operations. Dedicated silicon for outer-product accumulation at maximum throughput.
SME2 operates on the ZA tile array (512-bit streaming vector length on M4). A single FMOPA instruction computes a full outer product and accumulates into a tile. For 4x4 and larger matmul: 555 GFLOPS sustained throughput on M4.
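The outer-product formulation can be sketched in scalar JavaScript: each pass over p adds one column-times-row outer product into the accumulator, which is the role FMOPA plays in hardware. Row-major layout and the function name are assumptions, and nothing here uses SME2 itself.

```javascript
// A is m×k, B is k×n, both row-major Float32Array or plain arrays. Returns C = A·B (m×n).
function matmulOuter(A, B, m, k, n) {
  const C = new Float32Array(m * n); // plays the role of the accumulator tile
  for (let p = 0; p < k; p++) {
    // Accumulate the outer product of column p of A with row p of B.
    for (let i = 0; i < m; i++) {
      const a = A[i * k + p];
      for (let j = 0; j < n; j++) {
        C[i * n + j] += a * B[p * n + j];
      }
    }
  }
  return C;
}

console.log(Array.from(matmulOuter([1, 2, 3, 4], [5, 6, 7, 8], 2, 2, 2)));
// [ 19, 22, 43, 50 ]
```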