
Render Cost Prediction

Reference sheet for computing GPU render cost of 2D scene operations before drawing. All constants and formulas are derived from GPU pipeline structure, not empirical tuning.


Core Principle: Fill Rate Dominance

2D GPU rendering is memory-bandwidth bound, not compute bound. The fragment shader for a rect fill is ~1 ALU op; even a Gaussian blur pass is ~10 ALU ops per pixel. Modern GPUs execute trillions of ALU ops/sec, but memory bandwidth is 50-200 GB/s. Each pixel read/write is 4-16 bytes.

Therefore:

frame_cost ≈ total_pixels_touched / memory_bandwidth

This relationship is linear. Double the pixels, double the time. No surprises, no non-linear scaling — as long as you stay within VRAM and don't hit texture cache thrashing (rare in 2D; access is spatially coherent).
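As a rough illustration of the relationship above, the sketch below converts a pixel count into milliseconds. The function name and both parameters (`bytes_per_pixel`, `bandwidth_gb_per_s`) are assumptions chosen for this example; real values have to come from the target device.

```rust
/// Estimated time in milliseconds to move `pixels_touched` pixels through memory.
///
/// Assumptions (not measured values): `bytes_per_pixel` is the combined
/// read + write traffic per pixel (4-16 bytes per the text above), and
/// `bandwidth_gb_per_s` is the sustained memory bandwidth in GB/s.
pub fn frame_cost_ms(pixels_touched: f64, bytes_per_pixel: f64, bandwidth_gb_per_s: f64) -> f64 {
    let bytes = pixels_touched * bytes_per_pixel;
    // GB/s -> bytes per millisecond
    let bytes_per_ms = bandwidth_gb_per_s * 1e9 / 1e3;
    bytes / bytes_per_ms
}
```

For a 1920 × 1080 surface at 8 bytes per pixel and an assumed 100 GB/s, this gives about 0.17 ms per full-screen pass, and twice that for twice the pixels.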

This means render cost can be pre-computed as a pixel budget: count the pixels the GPU will touch, apply structural multipliers per effect, and compare against a calibrated device budget.


Effect Cost Constants

These are not magic numbers or tuning parameters. They are the structural pass counts of each rendering operation — how many full-area read-write cycles the GPU performs.

| Effect | Pixel Multiplier | Derivation |
|---|---|---|
| Plain shape (rect, ellipse, polygon) | 1× | Single fill pass |
| Additional fill (N fills on one node) | +1× per extra fill | Each fill is a separate pass |
| Additional stroke | +1× per stroke | Separate pass |
| Non-rect clip path | +1× | Mask pass + masked content |
| Rect clip | +0× | Hardware scissor, free |
| Blend mode (non-normal) | +1× | Requires offscreen isolation layer |
| Group opacity (alpha < 1.0 on group) | +1× | save_layer for isolated compositing |
| Gaussian blur | +3× | Downsample pyramid (~1.33×) + blur + upsample + composite |
| Drop shadow | +5× | Draw shape (1×) + blur pipeline (3×) + composite back (1×) |
| Inner shadow | +5× | Same as drop shadow, inverted mask |
| Backdrop filter (background blur) | +3× | Snapshot dst + blur + composite |
| Layer blur (on node itself) | +3× | Offscreen + blur + composite |
| Image fill | +0× over base | Texture sample replaces color fill, same bandwidth |
| Multiple shadows | +5× per shadow | Each shadow is independent |

Blur Radius Independence

Skia (and most GPU frameworks) implement Gaussian blur via a downsample pyramid, not a brute-force kernel convolution:

large sigma → downsample 2× → downsample 2× → ... → blur at reduced size → upsample

Total pixel work = area × (1 + 1/4 + 1/16 + ...) ≈ area × 1.33 (geometric series), plus the blur pass at reduced resolution. The cost is approximately constant regardless of blur radius. The pyramid absorbs the radius.
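The geometric series can be checked numerically. In the sketch below, `pyramid_work_factor` sums the series for a given number of halvings, and `levels_for_sigma` is a hypothetical rule (halve until sigma is at or below a threshold) used only to show how a larger radius maps to more levels. It is not Skia's actual level selection.

```rust
/// Pixel work of a downsample pyramid with `levels` 2x halvings, expressed as a
/// multiple of the full-resolution area: 1 + 1/4 + 1/16 + ... (levels + 1 terms).
pub fn pyramid_work_factor(levels: u32) -> f64 {
    (0..=levels).map(|i| 0.25f64.powi(i as i32)).sum()
}

/// Hypothetical level selection: halve sigma until it is at most `max_sigma`.
/// The threshold is an assumption for illustration, not a Skia constant.
pub fn levels_for_sigma(sigma: f64, max_sigma: f64) -> u32 {
    assert!(max_sigma > 0.0, "max_sigma must be positive");
    let mut levels = 0;
    let mut s = sigma;
    while s > max_sigma {
        s /= 2.0;
        levels += 1;
    }
    levels
}
```

With an assumed threshold of 4, sigma 8 needs one level (factor 1.25) and sigma 64 needs four (factor about 1.332). An 8× larger radius changes the pyramid work by under 7%, and the factor never exceeds 4/3.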

save_layer / save_layer_alpha — The Hidden Spike Source

save_layer is the single most expensive primitive in Skia. It allocates an offscreen surface, renders content into it, then composites back.

save_layer_cost = layer_bounds_area × zoom² × 2  (write to offscreen + read back)

Critical: nested layers stack. Each nesting level adds its own offscreen write and read-back, so the cost grows with nesting depth.

save_layer                  ← offscreen A (full group bounds)
  save_layer                ← offscreen B (child bounds)
    save_layer              ← offscreen C (grandchild bounds)
      draw rect
    restore                 → composite C into B
  restore                   → composite B into A
restore                     → composite A into target

Three nested layers on the same area = area × 6 bandwidth, not area × 2.
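A minimal sketch of that arithmetic, applying the save_layer_cost formula to each layer in a nested stack and summing. The function name and the slice-of-areas input are assumptions made for this example.

```rust
/// Bandwidth, in pixels, for a stack of nested save_layer calls.
///
/// `layer_areas` holds each layer's bounds area in unzoomed units (outermost
/// first; order does not affect the sum). Every layer pays one write to its
/// offscreen surface and one read back when composited, hence the factor 2.
pub fn nested_save_layer_pixels(layer_areas: &[f64], zoom: f64) -> f64 {
    layer_areas
        .iter()
        .map(|area| area * zoom * zoom * 2.0)
        .sum()
}
```

Three layers of area 100 at zoom 1 give 600, matching the area × 6 figure above. At zoom 2 the same stack costs four times as much, because every layer's area scales with zoom².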

Implicit save_layer triggers

Skia inserts save_layer implicitly for these conditions. The cost estimator must account for them even when the application code does not call save_layer explicitly:

| Trigger | Reason |
|---|---|
| Non-normal blend mode on a group | Isolated offscreen to blend against dst |
| Group opacity (alpha < 1.0 with children) | Children must composite together first, then alpha applied once |
| Blur / backdrop filter | Reads from dst, needs snapshot |
| Clip + antialiasing on groups | Soft-edge mask requires offscreen |
| ColorFilter on a group | Applied after children composite |

Per-Node Cost Formula

fn estimated_fill_pixels(node: &Node, zoom: f32, viewport: &Rect) -> f64 {
    let screen_area = clipped_area(&node.bounds, viewport) * (zoom * zoom) as f64;

    // Base draw
    let mut passes: f64 = 1.0;

    // Extra fills/strokes beyond the first
    passes += (node.fill_count.saturating_sub(1)) as f64;
    passes += node.stroke_count as f64;

    // Effects
    for shadow in &node.shadows {
        if shadow.visible {
            passes += 5.0; // shape + blur pipeline + composite
        }
    }
    if node.has_blur() {
        passes += 3.0; // downsample + blur + composite
    }
    if node.has_backdrop_blur() {
        passes += 3.0;
    }

    // Isolation layers (implicit save_layer)
    if node.blend_mode != BlendMode::Normal {
        passes += 1.0; // offscreen + composite
    }
    if node.opacity < 1.0 && node.has_children() {
        passes += 1.0; // group opacity isolation
    }

    // Clip
    if node.has_non_rect_clip() {
        passes += 1.0; // mask pass
    }

    screen_area * passes
}
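The formula above depends on `Node`, `Rect`, and `clipped_area`, which are defined elsewhere. To make the arithmetic easy to check in isolation, here is a self-contained sketch over a flattened stand-in struct. `NodeSummary` and its field names are hypothetical and exist only for this example.

```rust
/// Hypothetical flattened view of a node; not the real `Node` type.
pub struct NodeSummary {
    /// Clipped bounds area multiplied by zoom squared, in screen pixels.
    pub screen_area: f64,
    pub fill_count: u32,
    pub stroke_count: u32,
    pub visible_shadows: u32,
    pub has_blur: bool,
    pub has_backdrop_blur: bool,
    pub non_normal_blend: bool,
    pub group_opacity: bool,
    pub non_rect_clip: bool,
}

/// Structural pass count, using the multipliers from the effect table.
pub fn pass_count(n: &NodeSummary) -> f64 {
    let mut passes = 1.0; // base draw
    passes += n.fill_count.saturating_sub(1) as f64; // fills beyond the first
    passes += n.stroke_count as f64;
    passes += 5.0 * n.visible_shadows as f64; // shape + blur pipeline + composite
    if n.has_blur {
        passes += 3.0;
    }
    if n.has_backdrop_blur {
        passes += 3.0;
    }
    if n.non_normal_blend {
        passes += 1.0; // isolation layer
    }
    if n.group_opacity {
        passes += 1.0; // isolation layer
    }
    if n.non_rect_clip {
        passes += 1.0; // mask pass
    }
    passes
}

/// Estimated pixels touched: screen area times pass count.
pub fn estimated_pixels(n: &NodeSummary) -> f64 {
    n.screen_area * pass_count(n)
}
```

A 200 × 100 node at zoom 2 has a screen area of 80,000 pixels. With two fills, one stroke, one visible shadow, and a layer blur, the pass count is 1 + 1 + 1 + 5 + 3 = 11, so the estimate is 880,000 pixels.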

Cache Hit vs. Miss Cost

A compositor/picture cache hit replaces the full rasterization pipeline with a single texture blit:

| State | Effective multiplier | What happens |
|---|---|---|
| Cache miss | passes × (from table above) | Full rasterization: path tessellation, fill, effects |
| Cache hit | ~0.1× | Single texture-sampled quad draw |

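By the multipliers above, a miss costs at least 10× a hit (1 pass vs. ~0.1×) and more than 100× for effect-heavy nodes (for example, 14 passes vs. ~0.1× is 140×). Cache state is binary, hit or miss, and is the single largest contributor to per-node cost variance.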
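One way to fold cache state into the estimate is to swap the multiplier, as sketched below. The function and constant names are assumptions for this example. The 0.1 value is the approximate cache-hit multiplier from the table above.

```rust
/// Approximate multiplier for a cache hit (single textured quad), per the table.
pub const CACHE_HIT_MULTIPLIER: f64 = 0.1;

/// Effective pixels for a node: the structural pass count on a cache miss,
/// or the flat hit multiplier when a cached texture can be blitted instead.
pub fn effective_pixels(screen_area: f64, passes: f64, cache_hit: bool) -> f64 {
    let multiplier = if cache_hit { CACHE_HIT_MULTIPLIER } else { passes };
    screen_area * multiplier
}
```

For an 80,000-pixel node with 11 passes, a miss is estimated at 880,000 pixels and a hit at about 8,000.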


Device Fill Rate Reference

The total pixel budget depends on device fill rate — the one value that varies per hardware. Everything else is derived from geometry and scene structure.

Calibration

Render a known workload (e.g., full-screen solid rect) and measure:

pixels_per_ms = (screen_width × screen_height) / render_time_ms
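The calibration result can then be turned into a per-frame pixel budget. The sketch below assumes `render_time_ms` is the measured GPU time for the full-screen workload, and that the frame deadline is supplied by the caller. All function names are illustrative.

```rust
/// Calibrated fill rate from one full-screen workload.
/// `render_time_ms` is assumed to be measured GPU time, not CPU submit time.
pub fn pixels_per_ms(screen_width: u32, screen_height: u32, render_time_ms: f64) -> f64 {
    (screen_width as f64 * screen_height as f64) / render_time_ms
}

/// Total pixels that fit in one frame at the calibrated rate.
pub fn frame_pixel_budget(pixels_per_ms: f64, frame_deadline_ms: f64) -> f64 {
    pixels_per_ms * frame_deadline_ms
}

/// True when the summed per-node estimate fits within the frame budget.
pub fn fits_budget(estimated_pixels: f64, budget_pixels: f64) -> bool {
    estimated_pixels <= budget_pixels
}
```

For example, a 1920 × 1080 solid rect measured at 0.5 ms calibrates to 4,147,200 pixels per ms, which gives a budget of 66,355,200 pixels for a 16 ms frame.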

Reference Values (order-of-magnitude)

| Platform | Expected pixels_per_ms |
|---|---|
| Desktop GPU (discrete) | ~500M |
| Desktop GPU (integrated) | ~100M |
| WebGL (WASM, desktop) | ~50-100M |
| WebGL (WASM, mobile) | ~10-30M |

Chromium Reference

Chromium's cc/ compositor collects similar metrics but uses them differently:

| Metric | Chromium Location | Chromium Usage |
|---|---|---|
| TotalOpCount() | cc/paint/display_item_list.h | Solid-color analysis gate |
| num_slow_paths_up_to_min_for_MSAA() | cc/paint/display_item_list.h | Page-level GPU raster veto |
| has_save_layer_ops() | cc/paint/display_item_list.h | LCD text decision |
| has_non_aa_paint() | cc/paint/display_item_list.h | Antialiasing decisions |
| BytesUsed() / OpBytesUsed() | cc/paint/display_item_list.h | Tracing / debugging |
| AreaOfDrawText() | cc/paint/display_item_list.h | Text coverage statistics |
| Solid color analysis | cc/tiles/tile_manager.cc | Skip rasterization for uniform tiles (kMaxOpsToAnalyze = 5) |

Chromium does not perform per-tile raster cost prediction. Tile scheduling is purely spatial (viewport distance + scroll velocity) with a memory budget constraint. Their architecture tolerates stale tiles (multi-threaded raster catches up across frames). Ours cannot — we render single-threaded with a hard per-frame deadline, requiring predictive budgeting.

Local source: /Users/softmarshmallow/Documents/Github/chromium/cc/


Skia Picture Metrics (Available for Free)

Skia's Picture object exposes complexity metrics that are already computed during recording and cost nothing to query:

| Method | What it returns | Use |
|---|---|---|
| approximate_op_count() | Number of draw operations recorded | Secondary complexity signal |
| approximate_bytes_used() | Serialized size of the picture | Memory pressure / complexity proxy |

These are stored fields, not computations. They complement the pixel-area model by capturing path complexity variance (a 1000-op picture with complex beziers vs. a 3-op picture with simple rects at the same pixel area).


Linearity Bounds

The fill-rate model is linear under these conditions:

| Condition | Linear? | Notes |
|---|---|---|
| Work above ~10K pixels | Yes | Below this, GPU launch overhead dominates (flat floor) |
| Spatial texture access (normal 2D) | Yes | Bandwidth-bound, no cache thrashing |
| Random texture access | Can be super-linear | Rare in 2D rendering |
| Tile-based GPU (mobile) | Mostly | Large nodes spanning many tiles add per-tile overhead |
| Thermal throttling | N/A | Between-frame variance, not within-frame |
| VRAM pressure / swapping | Non-linear | Catastrophic; avoid by staying within budget |

For typical 2D canvas rendering (spatial access, nodes > 10K pixels), the linear model holds.
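The flat floor below ~10K pixels can be modeled by clamping the pixel count before dividing by the fill rate. This is a minimal sketch. Treating the floor as a hard clamp at exactly 10,000 pixels is an assumption, and the names are illustrative.

```rust
/// Approximate pixel count below which GPU launch overhead dominates.
/// The exact value is an assumption taken from the ~10K figure in the table.
pub const LINEAR_FLOOR_PIXELS: f64 = 10_000.0;

/// Predicted draw time: linear in pixels above the floor, flat below it.
pub fn predicted_ms(pixels: f64, pixels_per_ms: f64) -> f64 {
    pixels.max(LINEAR_FLOOR_PIXELS) / pixels_per_ms
}
```

Under this model a 100-pixel draw is predicted to cost the same as a 10,000-pixel draw, and doubling from 10,000 to 20,000 pixels doubles the prediction.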