Lithos Optimizations

Server-Side · GPU Shader · ARM64 Benchmark — 19 optimizations across the rendering pipeline

Lipschitz factor: 0.879 · Adaptive steps: 43–96

Layers: Server-side (lithos-emit.mjs) · GPU shader (GLSL template) · ARM64 CPU (benchmark)

Optimization pipeline

frustum cull → LOD reduce → bake constants → adapt steps → cache check → GLSL emit → AABB guard → height early-out → Lipschitz march → pixels

Server-Side Optimizations

Applied in lithos-emit.mjs before GLSL emission. These run once per camera update, not per frame.

Frustum Culling

server

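Objects behind the camera or beyond farPlane are filtered out before the shader is emitted. For each object the server computes dot(objectPos - camPos, camForward) and rejects it if the result is negative or if its distance from the camera exceeds farPlane.

Only objects that pass both tests appear in the generated GLSL. This is a half-space and far-plane test rather than a full six-plane frustum test, so objects off to the side of the view are kept. The saving depends on view direction: turning around culls whatever is now behind the camera, while a narrow canyon view might keep every object. The check costs one dot product and one distance comparison per object, which is negligible on the server and can save substantial GPU work.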

0–50% objects removed depending on view direction
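The culling test described above can be sketched in a few lines of JavaScript. This is a minimal illustration, not the code in lithos-emit.mjs: the object shape ({ name, pos }), the helper names, and the assumption that camForward is normalized are all hypothetical.

```javascript
// Hypothetical object shape: { name, pos: [x, y, z] }.
// camForward is assumed to be a unit vector.
function sub(a, b) {
  return [a[0] - b[0], a[1] - b[1], a[2] - b[2]];
}

function dot(a, b) {
  return a[0] * b[0] + a[1] * b[1] + a[2] * b[2];
}

function cullObjects(objects, camPos, camForward, farPlane) {
  return objects.filter((obj) => {
    const toObj = sub(obj.pos, camPos);
    if (dot(toObj, camForward) < 0) return false;      // behind the camera
    if (Math.hypot(...toObj) > farPlane) return false; // beyond the far plane
    return true;
  });
}

const scene = [
  { name: "tree", pos: [0, 0, -5] },
  { name: "rock", pos: [0, 0, 5] },
  { name: "farTree", pos: [0, 0, -500] },
];
// Camera at the origin looking down -z with a 100 m far plane.
const visible = cullObjects(scene, [0, 0, 0], [0, 0, -1], 100);
console.log(visible.map((o) => o.name)); // only "tree" passes both tests
```

Only the objects in visible would then be handed to the later pipeline stages.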

Distance-Based LOD

server

Far objects are simplified before shader emission. Complex multi-primitive SDFs collapse to single bounding spheres; noise detail is stripped from distant rocks.

Trees and birches beyond the LOD threshold: 6 SDFs → 1 bounding sphere. Rocks beyond threshold lose FBM noise displacement. The visual difference at distance is subpixel.

Threshold: 15–20 meters. Up to 6x fewer SDF evaluations per distant object.
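A rough sketch of the LOD reduction follows. The object fields (kind, radius, sdfs, noise), the lod flag, and the fixed 15 m threshold are assumptions made for illustration; the source only states that the threshold lies in the 15–20 m range.

```javascript
// Hypothetical object shape: { kind, pos: [x, y, z], radius, sdfs: [...], noise }.
const LOD_THRESHOLD = 15; // meters; assumed value within the stated 15-20 m range

function distance(a, b) {
  return Math.hypot(a[0] - b[0], a[1] - b[1], a[2] - b[2]);
}

function applyLod(obj, camPos, threshold = LOD_THRESHOLD) {
  if (distance(obj.pos, camPos) <= threshold) return { ...obj, lod: 0 };
  if (obj.kind === "tree" || obj.kind === "birch") {
    // Collapse the multi-primitive SDF to a single bounding sphere.
    return { ...obj, lod: 1, sdfs: [{ type: "sphere", radius: obj.radius }] };
  }
  if (obj.kind === "rock") {
    // Keep the base shape but strip the FBM noise displacement.
    return { ...obj, lod: 1, noise: false };
  }
  return { ...obj, lod: 0 };
}

const tree = {
  kind: "tree",
  pos: [0, 0, -30],
  radius: 2,
  sdfs: new Array(6).fill({ type: "part" }),
  noise: false,
};
const farTree = applyLod(tree, [0, 0, 0]);
console.log(farTree.sdfs.length); // 1
```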

Adaptive March Steps

server

Scene complexity determines the ray march step budget. Fewer visible objects = fewer steps needed to resolve the scene. The server counts post-cull objects and selects the tier.

≤2 objects (45%)

≤5 objects (65%)

≤10 objects (85%)

>10 objects (100%)

Base step count: 96. Each percentage is the fraction of that base budget used. Looking at empty sky costs 43 steps (45% of 96); surrounded by forest, the full 96.

Up to 55% fewer march iterations per pixel
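The tier selection can be sketched as a small lookup. The function name and the choice to round down are assumptions; rounding down is simply the choice that reproduces the 43-step figure quoted above.

```javascript
const BASE_STEPS = 96;

// Maps the post-cull object count to a march step budget.
function marchSteps(visibleCount) {
  let fraction = 1.0;
  if (visibleCount <= 2) fraction = 0.45;
  else if (visibleCount <= 5) fraction = 0.65;
  else if (visibleCount <= 10) fraction = 0.85;
  return Math.floor(BASE_STEPS * fraction); // assumed rounding mode
}

console.log(marchSteps(1), marchSteps(4), marchSteps(8), marchSteps(30)); // 43 62 81 96
```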

Constant Baking

server

Terrain heights are computed once on the server and embedded as float literals in the emitted GLSL. Instead of calling terrainH(pos.xz) per march step, the shader reads a constant.

Every object that sits on terrain needs its y-coordinate. Without baking, each pixel pays N terrainH() calls per frame, where N = object count × march steps. With baking, it pays none.

Tree positions → baked y
Rock positions → baked y
Firefly positions → 20 baked (Virgo)
Sunflower positions → 8 baked

Eliminates N × terrainH() calls per pixel per frame (N = objects × steps)
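As an illustration of baking, the sketch below evaluates a stand-in terrain function on the server and emits the results as GLSL float literals. The terrainH body, the helper names, the array name TREE_POS, and the four-decimal formatting are all assumptions; the real emitter may lay out its constants differently.

```javascript
// Hypothetical stand-in for the server-side terrain height function.
function terrainH(x, z) {
  return 0.5 * Math.sin(x * 0.3) + 0.25 * Math.cos(z * 0.7);
}

// toFixed always yields a decimal point, so the result is a valid GLSL float literal.
function glslFloat(v) {
  return v.toFixed(4);
}

// Emits a constant vec3 array with the terrain height baked into each y.
function emitBakedPositions(name, xzList) {
  const rows = xzList.map(
    ([x, z]) => `  vec3(${glslFloat(x)}, ${glslFloat(terrainH(x, z))}, ${glslFloat(z)})`
  );
  return `const vec3 ${name}[${xzList.length}] = vec3[](\n${rows.join(",\n")}\n);`;
}

const glsl = emitBakedPositions("TREE_POS", [[0, 0], [4, -2]]);
console.log(glsl);
```

The emitted text contains only literals, so the shader never calls terrainH for these positions.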

Height Early-Out

server

The emitted shader contains a single branch, if(p.y > maxH) return ground;, so for sample points above the canopy height all object SDFs are skipped and only the terrain distance is returned.

The server computes maxH as the tallest visible object plus margin. Sky-facing rays hit this early-out on every step. Ground-facing rays bypass it when they descend below canopy.

~35% of pixel march budgets saved (sky and horizon pixels)
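A possible shape for the server side of this optimization is sketched below. The object fields (baseY, height), the 0.5 margin, and the layout of the emitted sceneSDF are assumptions for illustration only.

```javascript
// Hypothetical object shape: { baseY, height }.
function computeMaxH(visibleObjects, margin = 0.5) {
  let top = -Infinity;
  for (const o of visibleObjects) top = Math.max(top, o.baseY + o.height);
  return top + margin;
}

// objectCalls is a list of GLSL expressions, one per visible object SDF.
function emitSceneSdf(visibleObjects, objectCalls) {
  const lines = [
    "float sceneSDF(vec3 p) {",
    "  float ground = p.y - terrainH(p.xz);",
  ];
  if (visibleObjects.length > 0) {
    const maxH = computeMaxH(visibleObjects);
    lines.push(`  if (p.y > ${maxH.toFixed(3)}) return ground; // above canopy: skip objects`);
    lines.push("  float d = ground;");
    for (const call of objectCalls) lines.push(`  d = min(d, ${call});`);
    lines.push("  return d;");
  } else {
    lines.push("  return ground;"); // nothing visible: terrain only
  }
  lines.push("}");
  return lines.join("\n");
}

const objs = [{ baseY: 1, height: 4 }, { baseY: 0, height: 2 }];
console.log(emitSceneSdf(objs, ["treeSDF(p)", "rockSDF(p)"]));
```

When adapting this, it may be worth checking that a steeply descending ray starting just above maxH cannot step past an object top, since the terrain distance alone does not account for objects; the margin is one place to tune that.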

Shader Caching

server

Emitted shaders are memoized by a key composed of: visible object set + LOD levels + step count. If the camera moves but the same objects remain visible at the same LODs, the cached shader is returned instantly.

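Cache key = hash(visibleSet, lodLevels, stepCount). A cache hit skips both GLSL emission and shader recompilation; the only remaining server work is the cull and LOD pass needed to build the key. Invalidation is automatic: any change to the visible set or its LOD levels (camera movement across an LOD threshold or the frustum boundary) produces a new key.

Cache hit = no re-emission, no recompilation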
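A minimal memoization sketch is shown below. It uses a plain string key in place of a hash, and the object fields (id, lod), the function names, and the emit callback are hypothetical.

```javascript
const shaderCache = new Map();

// Builds an order-independent key from the visible set, LOD levels, and step count.
function cacheKey(visible, stepCount) {
  const parts = visible.map((o) => `${o.id}:${o.lod}`).sort();
  return `${parts.join(",")}|${stepCount}`;
}

// Returns the cached shader source, calling emit only on a miss.
function getShader(visible, stepCount, emit) {
  const key = cacheKey(visible, stepCount);
  let src = shaderCache.get(key);
  if (src === undefined) {
    src = emit(visible, stepCount);
    shaderCache.set(key, src);
  }
  return src;
}

let emitCount = 0;
const fakeEmit = (visible, steps) => {
  emitCount++;
  return `// shader for ${visible.length} objects, ${steps} steps`;
};

getShader([{ id: "tree1", lod: 0 }, { id: "rock3", lod: 1 }], 62, fakeEmit);
getShader([{ id: "rock3", lod: 1 }, { id: "tree1", lod: 0 }], 62, fakeEmit); // same set: hit
getShader([{ id: "rock3", lod: 0 }, { id: "tree1", lod: 0 }], 62, fakeEmit); // LOD changed: miss
console.log(emitCount); // 2
```

Sorting the key parts means the order in which objects survive culling does not cause spurious misses.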

GPU-Side Optimizations

Applied in the GLSL shader template. These run every frame, every pixel, every march step.

Lipschitz Step Factor 0.879

gpu

Derived from the Lipschitz constant of the FBM terrain. The terrain SDF p.y - terrainH(p.xz) has gradient magnitude √(1 + |∇terrainH|²), which is bounded by L = 1.137 for the FBM parameters used. The safe step factor is 1/L = 0.879.

Previous heuristic: 0.68, which is safe but about 23% smaller than necessary and so wastes steps. Every ray march step now advances t += d * 0.879 instead of t += d * 0.68: same safety, 29% larger steps, fewer iterations to converge.

29% larger steps vs heuristic 0.68. Fewer iterations to hit surface.
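The relationship between the slope bound and the step factor can be checked with a toy march. Everything here is a stand-in: a 2D sine terrain replaces the FBM terrain, and MAX_SLOPE = 0.5411 is an assumed bound chosen so that L comes out near 1.137.

```javascript
const MAX_SLOPE = 0.5411; // assumed bound on |grad terrainH|
const L = Math.sqrt(1 + MAX_SLOPE * MAX_SLOPE);
const SAFE_FACTOR = 1 / L;

// Toy 2D terrain whose slope never exceeds MAX_SLOPE.
function terrainH(x) {
  return MAX_SLOPE * Math.sin(x);
}

function sdf(x, y) {
  return y - terrainH(x);
}

// March from (ox, oy) along the unit direction (dx, dy).
function march(ox, oy, dx, dy, factor, maxSteps = 96, eps = 1e-4) {
  let t = 0;
  for (let i = 0; i < maxSteps; i++) {
    const d = sdf(ox + dx * t, oy + dy * t);
    if (d < eps) return { hit: true, t, steps: i };
    t += d * factor; // scaled step: never overshoots when factor <= 1/L
  }
  return { hit: false, t, steps: maxSteps };
}

const heuristic = march(0, 5, 0.6, -0.8, 0.68);
const lipschitz = march(0, 5, 0.6, -0.8, SAFE_FACTOR);
console.log(SAFE_FACTOR.toFixed(3), heuristic.steps, lipschitz.steps);
```

Both runs converge to the same hit distance; the larger factor typically gets there in fewer iterations.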

AABB Guards

gpu

Expensive SDF objects are wrapped in axis-aligned bounding box checks in the xz-plane. Before evaluating the full SDF, the shader tests: if(abs(p.x - cx) < r && abs(p.z - cz) < r). If the sample point is outside the box, the SDF is never called.

Cost of AABB check: 2–4 ops (abs + compare). Cost of skipped SDF: 20–200 ops (tree, rock, creature). The cheap guard still runs for every object, but the expensive SDF runs only for nearby objects, and most march steps are far from most objects.

Skips 80–95% of SDF evaluations per march step
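The guard can be modeled on the CPU as follows. The object shape ({ cx, cz, r, sdf }), the sphere SDF helper, and the stats counter are hypothetical and exist only to make the skipped evaluations visible.

```javascript
// Hypothetical sphere SDF factory used as a stand-in for an expensive object SDF.
function sphereSdf(center, radius) {
  return (p) =>
    Math.hypot(p[0] - center[0], p[1] - center[1], p[2] - center[2]) - radius;
}

// Evaluates only the objects whose xz guard box contains p.
function guardedSceneSdf(p, objects, stats) {
  let d = Infinity;
  for (const o of objects) {
    if (Math.abs(p[0] - o.cx) < o.r && Math.abs(p[2] - o.cz) < o.r) {
      stats.sdfCalls++;
      d = Math.min(d, o.sdf(p));
    }
  }
  return d; // Infinity when every guard fails
}

const objects = [
  { cx: 0, cz: 0, r: 2, sdf: sphereSdf([0, 0, 0], 1) },
  { cx: 10, cz: 0, r: 2, sdf: sphereSdf([10, 0, 0], 1) },
  { cx: 0, cz: 10, r: 2, sdf: sphereSdf([0, 0, 10], 1) },
];
const stats = { sdfCalls: 0 };
const d = guardedSceneSdf([0.5, 1, 0.5], objects, stats);
console.log(stats.sdfCalls); // 1
```

In a shader the result would be combined with the terrain distance. A more conservative variant could return the distance to the guard box instead of skipping the object outright, which keeps the value a valid lower bound for the march.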

Nearest-Row Domain Repetition

gpu

Vineyard rows use domain repetition: instead of evaluating all 6 rows, the shader computes the nearest row index from the z-coordinate and only evaluates that row. Adjacent rows are checked only near boundaries.

float rowIdx = round(p.z / rowSpacing); reduces 6 SDF evaluations to 1–2. The modular arithmetic is exact — no visual artifacts. Used in Taurus vineyard scene.

6 row evaluations → 1–2. ~70% fewer vine SDFs.
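The nearest-row selection can be compared against the brute-force version in a small model. The row geometry (a cylinder along x at each row), the spacing, and the clamp to the six existing rows are assumptions; with rows this symmetric the adjacent-row check mentioned above is not needed, so the sketch omits it.

```javascript
const ROW_COUNT = 6;
const ROW_SPACING = 2.0; // assumed spacing; row i sits at z = i * ROW_SPACING

let rowEvals = 0;

// Hypothetical row shape: an infinite cylinder of radius 0.3 running along x.
function rowSdf(p, rowIdx) {
  rowEvals++;
  return Math.hypot(p[1], p[2] - rowIdx * ROW_SPACING) - 0.3;
}

// Brute force: evaluate every row and keep the minimum.
function allRows(p) {
  let d = Infinity;
  for (let i = 0; i < ROW_COUNT; i++) d = Math.min(d, rowSdf(p, i));
  return d;
}

// Domain repetition: pick the nearest row index and evaluate only that row.
function nearestRow(p) {
  const idx = Math.round(p[2] / ROW_SPACING);
  const clamped = Math.min(ROW_COUNT - 1, Math.max(0, idx));
  return rowSdf(p, clamped);
}

const p = [1, 0.2, 4.9];
console.log(allRows(p) === nearestRow(p)); // true
```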

Analytical Terrain Normal

gpu

Quilez technique: derive the terrain gradient analytically from the noise function instead of using finite-difference epsilon probes. The normal is computed from the partial derivatives of terrainH.

Finite-difference normal: 6 SDF calls (central differences in x, y, z). Analytical normal: 1 call that returns both height and gradient. Used in Taurus terrain shading.

6x fewer SDF calls for terrain normals
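The call-count difference can be illustrated with a terrain whose derivatives are known in closed form. The sin·cos heightfield here is a stand-in for the noise-based terrainH, and the function names are hypothetical; the point is only that one call returning height and gradient replaces six SDF probes.

```javascript
let heightCalls = 0;

// Stand-in terrain; the real terrainH is FBM noise.
function terrainH(x, z) {
  heightCalls++;
  return Math.sin(x) * Math.cos(z);
}

// Returns height and analytic partial derivatives in one call.
function terrainHD(x, z) {
  heightCalls++;
  return {
    h: Math.sin(x) * Math.cos(z),
    dhdx: Math.cos(x) * Math.cos(z),
    dhdz: -Math.sin(x) * Math.sin(z),
  };
}

function normalize(v) {
  const len = Math.hypot(v[0], v[1], v[2]);
  return [v[0] / len, v[1] / len, v[2] / len];
}

// Central differences on the terrain SDF: six evaluations.
function normalFD(p, e = 1e-3) {
  const sdf = (x, y, z) => y - terrainH(x, z);
  const [x, y, z] = p;
  return normalize([
    sdf(x + e, y, z) - sdf(x - e, y, z),
    sdf(x, y + e, z) - sdf(x, y - e, z),
    sdf(x, y, z + e) - sdf(x, y, z - e),
  ]);
}

// Gradient of y - h(x, z) is (-dh/dx, 1, -dh/dz): one evaluation.
function normalAnalytic(p) {
  const { dhdx, dhdz } = terrainHD(p[0], p[2]);
  return normalize([-dhdx, 1, -dhdz]);
}

console.log(normalFD([0.7, 0.3, 1.1]), normalAnalytic([0.7, 0.3, 1.1]));
```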

Distance-Adaptive AO

gpu

Ambient occlusion step count is reduced for distant surfaces. Near geometry gets full-quality AO; far geometry gets fast AO; flat terrain above camera skips AO entirely.

Near surfaces (<10m): 5 AO taps. Far surfaces (>10m): 3 AO taps. Flat terrain above camera: 0 AO taps (skip entirely). Each tap is a full sceneSDF call, so reducing from 5 to 3 saves 40% of AO cost.

40–100% AO cost reduction depending on distance

Terrain Shadow

gpu

Cheap 4-step shadow march against terrain heightfield + object bounding spheres. Not a full sceneSDF shadow — only checks if terrain or a bounding sphere occludes the light.

Full shadow: march sceneSDF toward light (96 steps × full SDF cost). Terrain shadow: 4 steps × (terrainH + sphere checks). Captures terrain self-shadowing and large-object shadows at ~5% of full shadow cost.

~95% cheaper than full sceneSDF shadow march

Material ID System

gpu

sceneSDF_id returns (distance, materialID) in a single march. After the ray hits, the material ID selects color/roughness/emission without re-evaluating the geometry.

Without material IDs: after hitting a surface, you must re-evaluate each object SDF to determine which one was hit (N SDF calls). With IDs: the closest object's ID is tracked during marching at zero extra cost (just a conditional assignment). Shading is a single switch(id).

Eliminates post-hit geometry re-evaluation
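A CPU-side model of the idea is sketched below. The object list, the ID values, and the material table are hypothetical; the source only specifies that sceneSDF_id returns a distance and a material ID together.

```javascript
function sphereSdf(center, radius) {
  return (p) =>
    Math.hypot(p[0] - center[0], p[1] - center[1], p[2] - center[2]) - radius;
}

// Tracks the ID of the closest object while computing the minimum distance.
function sceneSdfId(p, objects) {
  let best = { d: Infinity, id: -1 };
  for (const o of objects) {
    const d = o.sdf(p);
    if (d < best.d) best = { d, id: o.id }; // conditional assignment, no extra SDF call
  }
  return best;
}

// Shading is a lookup on the ID; no geometry is re-evaluated.
function shade(id) {
  switch (id) {
    case 1: return { name: "tree", color: [0.2, 0.5, 0.1] };
    case 2: return { name: "rock", color: [0.5, 0.5, 0.5] };
    default: return { name: "none", color: [0, 0, 0] };
  }
}

const objects = [
  { id: 1, sdf: sphereSdf([0, 1, 0], 1) },
  { id: 2, sdf: sphereSdf([5, 0.5, 0], 0.5) },
];
const hit = sceneSdfId([5, 1.2, 0], objects);
console.log(hit.id, shade(hit.id).name); // 2 rock
```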

Dot-Fract Hash

gpu

Taurus uses fract((p3.x + p3.y) * p3.z) instead of the conventional sin(dot(p, k)) hash. Avoids the GPU's Special Function Unit (SFU) for sin, which is a shared resource.

SFU operations (sin, cos, exp) run at a fraction of ALU throughput on most GPUs because each compute unit has far fewer SFUs than ALU lanes. fract() and multiply are full-throughput ALU ops. Saves ~2 cycles per noise call, which adds up in a terrain with 6 octaves of FBM noise evaluated per march step.

~2 cycles saved per noise call. Avoids SFU contention.
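For reference, here is a JavaScript model of a sine-free hash that ends in the fract((p3.x + p3.y) * p3.z) expression quoted above. Only that final line comes from the source; the preceding steps (the 0.1031 scale and the dot-product mixing with 33.33) follow a widely used "hash without sine" pattern and are an assumption about what Taurus does.

```javascript
const fract = (x) => x - Math.floor(x);

// Maps a 2D point to a pseudo-random value in [0, 1) using only multiply, add, and fract.
function hash12(x, y) {
  // p3 = fract(vec3(p.xyx) * 0.1031)
  let p3 = [fract(x * 0.1031), fract(y * 0.1031), fract(x * 0.1031)];
  // p3 += dot(p3, p3.yzx + 33.33)
  const d =
    p3[0] * (p3[1] + 33.33) +
    p3[1] * (p3[2] + 33.33) +
    p3[2] * (p3[0] + 33.33);
  p3 = p3.map((v) => v + d);
  return fract((p3[0] + p3[1]) * p3[2]);
}

console.log(hash12(1, 2), hash12(17.5, -3.25));
```

JavaScript numbers are 64-bit, so the exact values differ from what a 32-bit GLSL float would produce; the structure of the computation is the point.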

ARM64 Benchmark Optimizations

Applied in the Lithos benchmark harness for ARM64 (Apple Silicon). These optimize CPU-side math for native rendering and tooling.

4-Accumulator FMLA Chains

arm64

Dot products are unrolled into 4 independent FMLA (fused multiply-accumulate) chains. Each chain feeds a separate accumulator register, breaking the data dependency that would otherwise serialize the pipeline.

GCC emits a single-accumulator loop: each FMLA depends on the previous result (4-cycle latency). With 4 accumulators, the CPU issues 4 FMLA instructions in parallel across execution ports. Final result = horizontal add of 4 accumulators.

11x faster than GCC-generated dot product
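The structure of the unrolled loop can be shown in scalar JavaScript. This is only a model of the dependency pattern: JavaScript gives no control over FMLA issue or register allocation, and the function name is hypothetical.

```javascript
// Dot product with four independent accumulators.
// Each accumulator forms its own dependency chain, so no add waits on the previous one.
function dot4acc(a, b) {
  let s0 = 0, s1 = 0, s2 = 0, s3 = 0;
  const n = a.length;
  const main = n - (n % 4);
  for (let i = 0; i < main; i += 4) {
    s0 += a[i] * b[i];
    s1 += a[i + 1] * b[i + 1];
    s2 += a[i + 2] * b[i + 2];
    s3 += a[i + 3] * b[i + 3];
  }
  // Leftover elements when the length is not a multiple of 4.
  let tail = 0;
  for (let i = main; i < n; i++) tail += a[i] * b[i];
  // Horizontal add of the four accumulators.
  return (s0 + s1) + (s2 + s3) + tail;
}

const a = Float64Array.from([1, 2, 3, 4, 5, 6, 7, 8, 9, 10]);
const b = new Float64Array(10).fill(2);
console.log(dot4acc(a, b)); // 110
```

Because floating-point addition is not associative, the four-way split can change the result in the last bits compared with a single-accumulator loop, which is one reason a compiler will not apply it without being told that reassociation is allowed.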

NEON Polynomial SFU

arm64

sin/cos computed via NEON SIMD polynomial approximation (Chebyshev or minimax). Processes 4 floats per instruction instead of calling scalar libm functions.

Scalar sinf() from libm: ~40 cycles, one value. NEON polynomial: ~12 cycles, four values simultaneously. The polynomial is tuned for SDF-relevant precision (24-bit mantissa, <1 ULP error in the range used).

12–17x faster than scalar libm sin/cos

Post-Increment Addressing

arm64

Load/store pairs (ldp/stp) use post-increment (post-indexed) addressing with pre-computed offsets. This avoids Read-After-Write (RAW) hazards on the address register.

Pre-indexed mode: ldp x0, x1, [x2, #16]! writes x2 before reads complete, a potential pipeline stall. Post-indexed mode: ldp x0, x1, [x2], #16 reads first and increments after. The address register is free for the next instruction immediately.

Eliminates RAW pipeline stalls in memory-intensive loops

N-Gram Fusion

arm64

128 adjacent-opcode patterns are fused to eliminate redundant store-load pairs. When instruction A stores to memory and instruction B immediately loads from the same address, the pair is fused into a register-to-register move.

Compiler-generated code often stores intermediate results to the stack and immediately reloads them. The fusion pass identifies these patterns and replaces them with register forwarding. 128 fusion rules cover the most common SDF evaluation patterns.

Eliminates store-load pairs. Reduces memory traffic in hot loops.
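One such fusion rule can be sketched as a peephole pass over a list of instructions. The instruction representation ({ op, dst, src, addr }) is hypothetical, and the sketch assumes the stored slot is a dead temporary so that the store can be dropped along with the load.

```javascript
// Hypothetical instruction form: { op, dst?, src?, addr? }.
// Replaces "str src, addr" immediately followed by "ldr dst, addr" with "mov dst, src".
function fuseStoreLoad(insns) {
  const out = [];
  let i = 0;
  while (i < insns.length) {
    const a = insns[i];
    const b = insns[i + 1];
    if (b && a.op === "str" && b.op === "ldr" && a.addr === b.addr) {
      // Assumes nothing else reads the slot, so the store is removed too.
      if (b.dst !== a.src) out.push({ op: "mov", dst: b.dst, src: a.src });
      i += 2;
    } else {
      out.push(a);
      i += 1;
    }
  }
  return out;
}

const program = [
  { op: "fmul", dst: "s0", src: "s1" },
  { op: "str", src: "s0", addr: "[sp, #8]" },
  { op: "ldr", dst: "s2", addr: "[sp, #8]" },
  { op: "fadd", dst: "s3", src: "s2" },
];
console.log(fuseStoreLoad(program).map((x) => x.op)); // [ 'fmul', 'mov', 'fadd' ]
```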

SME2 Matrix Coprocessor

arm64

Apple Silicon M4's Scalable Matrix Extension 2 (SME2) coprocessor used for matrix multiply operations. Dedicated silicon for outer-product accumulation at maximum throughput.

SME2 operates on the ZA tile storage, whose rows are one streaming vector wide (up to 512 bits here). A single FMOPA instruction computes a full outer product and accumulates it into a tile. For 4x4 and larger matmul: 555 GFLOPS sustained throughput on M4.

555 GFLOPS for matmul (100x scalar throughput)
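The outer-product formulation that FMOPA accelerates can be written out in scalar form. This JavaScript model only shows the arithmetic pattern: the function name and row-major layout are assumptions, and each iteration of the outer loop corresponds to what one outer-product-accumulate step would do to a tile.

```javascript
// Computes C = A * B as a sum of k outer products.
// A is m x k, B is k x n, both row-major; C plays the role of the accumulator tile.
function matmulOuter(A, B, m, k, n) {
  const C = new Float64Array(m * n);
  for (let p = 0; p < k; p++) {
    // One outer-product-accumulate step: C += (column p of A) x (row p of B).
    for (let i = 0; i < m; i++) {
      const a = A[i * k + p];
      for (let j = 0; j < n; j++) {
        C[i * n + j] += a * B[p * n + j];
      }
    }
  }
  return C;
}

const A = Float64Array.from([1, 2, 3, 4]); // [[1, 2], [3, 4]]
const B = Float64Array.from([5, 6, 7, 8]); // [[5, 6], [7, 8]]
console.log(Array.from(matmulOuter(A, B, 2, 2, 2))); // [ 19, 22, 43, 50 ]
```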

Summary Table

#	Optimization	Layer	Typical Speedup	Implementation
1	Frustum culling	Server	0–50% objects removed	`lithos-emit.mjs`
2	Distance-based LOD	Server	6x fewer SDFs per far object	`lithos-emit.mjs`
3	Adaptive march steps	Server	Up to 55% fewer steps	`lithos-emit.mjs`
4	Constant baking	Server	Eliminates N×terrainH()	`lithos-emit.mjs`
5	Height early-out	Server	~35% pixel budgets saved	`lithos-emit.mjs`
6	Shader caching	Server	Zero-cost re-emission on hit	`lithos-emit.mjs`
7	Lipschitz step factor	GPU	29% larger steps vs 0.68	GLSL template
8	AABB guards	GPU	80–95% SDF evals skipped	GLSL template
9	Nearest-row repetition	GPU	~70% fewer vine SDFs	GLSL template (Taurus)
10	Analytical terrain normal	GPU	6x fewer normal SDF calls	GLSL template (Taurus)
11	Distance-adaptive AO	GPU	40–100% AO cost reduction	GLSL template
12	Terrain shadow	GPU	~95% cheaper than full shadow	GLSL template
13	Material ID system	GPU	Eliminates post-hit re-eval	GLSL template
14	Dot-fract hash	GPU	~2 cycles/noise call saved	GLSL template (Taurus)
15	4-accumulator FMLA	ARM64	11x vs GCC dot product	Lithos benchmark
16	NEON polynomial SFU	ARM64	12–17x vs scalar libm	Lithos benchmark
17	Post-increment addressing	ARM64	Eliminates RAW stalls	Lithos benchmark
18	N-gram fusion	ARM64	Store-load pairs eliminated (128 fusion patterns)	Lithos benchmark
19	SME2 matrix coprocessor	ARM64	555 GFLOPS (100x scalar)	Lithos benchmark