
Architecture and design decisions

The pipeline processes images one at a time, completing all encoding variants and quality measurements for each image before moving to the next:

For each image (until time budget exhausted):
├─ Encode all variants (formats × quality levels)
├─ Measure all variants (SSIMULACRA2, Butteraugli, PSNR, SSIM)
└─ Save results to disk

This “atomic per-image” design was chosen over the alternative of encoding all images first and measuring them all afterwards:

  • Partial results are always usable. If a study is interrupted or the time budget expires, every completed image has full encoding + measurement data. There are no orphaned encodings without metrics.
  • Error isolation. A failure on one image (e.g., an encoder crash on a specific input) does not block other images.

Studies are configured with a time budget (e.g., 30m, 2h) rather than a fixed image count. The pipeline encodes images from the dataset until the budget expires, then stops. This is more practical than guessing how many images to process:

  • Different encoder/quality combinations have wildly different speeds (AVIF speed 0 is ~100× slower than speed 10).
  • Multi-format studies multiply the per-image time by the number of variants.
  • The user sets a wall-clock time they’re willing to wait, and gets as many data points as fit.
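Putting the two ideas together, the time-budgeted atomic per-image loop can be sketched as below. This is an illustrative sketch, not the project's actual code; `encode`, `measure`, and `save` are hypothetical stand-ins for the real pipeline steps.

```python
import time

def run_pipeline(images, variants, budget_seconds, encode, measure, save):
    # Hypothetical sketch of the atomic per-image loop with a wall-clock budget.
    deadline = time.monotonic() + budget_seconds
    for image in images:
        if time.monotonic() >= deadline:
            break  # budget exhausted: every saved image has complete data
        try:
            encoded = [encode(image, v) for v in variants]   # formats × quality
            results = [measure(image, e) for e in encoded]   # all metrics
            save(image, results)                             # atomic per-image
        except Exception:
            continue  # error isolation: one bad image does not block the rest
```

Because a break can only happen between images, an interruption never leaves an encoding without its metrics.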

Each image is processed by a single worker thread. The --workers flag controls parallelism. Because the encoding tools (cjpeg, cwebp, avifenc, cjxl) are CPU-intensive native binaries, the bottleneck is CPU time, and Python’s GIL does not limit throughput here — subprocess calls release the GIL.
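Since each worker spends its time waiting on a native subprocess, a thread pool is sufficient for parallelism. A minimal sketch of the idea (hypothetical helper names, not the project's API):

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

def encode_one(command):
    # subprocess.run blocks in C code and releases the GIL, so threads
    # are enough to keep all cores busy running native encoders.
    return subprocess.run(command, capture_output=True).returncode

def encode_all(commands, workers=4):
    # Hypothetical sketch of what a --workers thread pool amounts to.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(encode_one, commands))
```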

A study is the central unit of work. Each study is defined by a JSON config file in config/studies/ that specifies:

  • Which dataset to use
  • Which formats and parameter ranges to encode
  • Study metadata (name, description)
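A study config covering these fields might look like the following. The key names here are illustrative only; the authoritative field names live in the project's JSON schemas.

```json
{
  "name": "avif-sweep",
  "description": "AVIF quality sweep against a JPEG baseline",
  "dataset": "kodak",
  "formats": {
    "jpeg": { "quality": [50, 70, 85, 95] },
    "avif": { "quality": [30, 50, 70, 90], "speed": 6 }
  }
}
```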

The study ID (filename without .json) is used everywhere: directory names under data/encoded/<study-id>/, data/metrics/<study-id>/, data/analysis/<study-id>/, and in CLI commands (just pipeline <study-id> <time-budget>).

This design means:

  • Adding a new experiment is just creating a new JSON file — no code changes.
  • Reproducibility — the config file fully describes what was run.
  • Multiple studies coexist — each study’s outputs are isolated in its own subdirectory.

All data lives under data/, strictly separated from code:

data/
├── datasets/ # Raw downloads (from just fetch)
├── preprocessed/ # Resolution-scaled images (per study)
├── encoded/ # Compressed images (per study, per format)
├── metrics/ # Quality measurements (per study)
├── analysis/ # Plots and statistics (per study)
└── report/ # Generated HTML reports

Everything under data/ is git-ignored (except .gitkeep markers). This means:

  • The repository stays small regardless of how many studies are run.
  • Datasets can be re-downloaded; encoded/metrics/analysis can be regenerated.
  • Code changes are cleanly separated from data changes in version control.

Pipeline parameters live in JSON config files rather than being hardcoded:

  • config/datasets.json — dataset sources, URLs, storage types
  • config/studies/*.json — study definitions (formats, quality ranges, speeds)
  • JSON schemas (config/*.schema.json) — validate configs at load time
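In spirit, load-time validation means rejecting a config before any work starts. A stdlib-only sketch with hypothetical required keys (the real project validates against the JSON Schema files under config/):

```python
import json

REQUIRED_KEYS = {"name", "dataset", "formats"}  # hypothetical, for illustration

def load_study(path):
    # Minimal stand-in for schema validation at load time.
    with open(path) as f:
        cfg = json.load(f)
    missing = REQUIRED_KEYS - cfg.keys()
    if missing:
        raise ValueError(f"study config missing keys: {sorted(missing)}")
    return cfg
```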

There are no “future” config files for preprocessing, quality, or analysis — those behaviours are determined by the study config and the source code defaults. This keeps the config surface small and avoids speculative abstraction.

See Add a custom dataset and Create a custom study for guides on extending the configuration.

The studies cover four formats:

| Format | Role | Why included |
| ------ | ---- | ------------ |
| JPEG | Baseline | Universal support, well-understood, the format everything is compared against |
| WebP | Established alternative | Broad browser support, good compression, Google-backed |
| AVIF | Primary research target | Based on AV1, excellent low-bitrate compression, the main focus of this project |
| JPEG XL | Next-generation | Strong technical merits (progressive decode, lossless round-trip), limited browser support |

The project measures both perceptual and traditional metrics:

| Metric | Type | Why |
| ------ | ---- | --- |
| SSIMULACRA2 | Perceptual | Designed specifically for lossy compression evaluation; most accurate for this use case |
| Butteraugli | Perceptual | Models the human visual system with a different mathematical approach; complements SSIMULACRA2 |
| PSNR | Traditional | Simple, widely used, easy to compare with published literature |
| SSIM | Traditional | Better than PSNR for structural comparison, well-known baseline |

Perceptual metrics (SSIMULACRA2, Butteraugli) are prioritised in analysis because they correlate better with human judgement than PSNR/SSIM for compression artifacts.

The encoding tools (avifenc, cjxl, cwebp, cjpeg) and measurement tools (ssimulacra2, butteraugli_main, avifdec, djxl) require specific builds. A dev container ensures:

  • Reproducibility — everyone gets the exact same tool versions.
  • No host pollution — build dependencies don’t touch the host system.
  • CI parity — the same image runs in CI and locally.

This is a research project, not a distributable library. Targeting a single Python version (3.13) simplifies testing, avoids compatibility workarounds, and lets the code use the latest language features without conditional logic.

All dependencies are declared in pyproject.toml:

  • pip install -e . — production dependencies
  • pip install -e ".[dev]" — adds pytest, mypy, ruff, type stubs

There is no requirements.txt. The pyproject is the single source of truth.

The comparison.py module is decoupled from the main pipeline. The pipeline is a pure encode-and-measure step that writes quality.json; the comparison module reads that file and independently re-encodes images to produce its figures.

This separation has several benefits:

  • Tunable without re-running the pipeline — comparison targets, tile parameter, and excluded images can be changed in the study config and the comparison regenerated without a costly re-run of the encoding pipeline.
  • No pipeline artefacts required — the comparison script re-encodes from the original dataset images using interpolated quality settings, so the pipeline no longer needs to save encoded artefacts to disk.
  • Richer figure types — because quality settings are interpolated on the fly, the comparison can produce figures at arbitrary target metric values (e.g., SSIMULACRA2 = 60, 75, 90) or file-size targets (e.g., bits_per_pixel = 0.5, 1.0, 1.5) regardless of which quality levels the pipeline swept.
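The interpolation idea can be sketched with `numpy.interp`: given the swept (quality, metric) pairs for one format, estimate the encoder quality setting expected to hit an arbitrary target. This is a hypothetical helper, assuming the metric grows monotonically with the quality setting:

```python
import numpy as np

def quality_for_target(qualities, scores, target):
    # Interpolate the encoder quality expected to reach a target metric value.
    # Sketch only; assumes scores are monotonic in quality.
    qualities = np.asarray(qualities, dtype=float)
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(scores)  # np.interp needs ascending x values
    return float(np.interp(target, scores[order], qualities[order]))
```

The comparison step can then re-encode at the interpolated setting, which is why the original pipeline sweep never needs to have visited that exact quality level.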

The comparison module uses two layers of selection to maximise the visual informativeness of each figure:

  1. Cross-format CV (src/interpolation.py:select_best_image) — the source image that shows the highest relative spread (coefficient of variation = std / mean) of the output metric across encoding variants is selected. Using CV instead of raw variance avoids bias towards inherently brighter or higher-quality images.
  2. Anisotropic std map — across all target values in a group, per-pixel Butteraugli distortion maps are aggregated into a single anisotropic standard-deviation map. The crop region with the highest mean std is chosen. The same fragment is shared across all target values in the group, making visual differences directly comparable.
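The first selection layer amounts to ranking images by coefficient of variation. A simplified stand-in for src/interpolation.py:select_best_image, with a hypothetical input shape (metric values per variant, keyed by image name):

```python
import statistics

def select_best_image(metric_by_image):
    # Pick the image whose metric spreads most across variants, relative to
    # its mean (CV = std / mean), avoiding bias towards high-scoring images.
    def cv(values):
        mean = statistics.fmean(values)
        return statistics.pstdev(values) / mean if mean else 0.0
    return max(metric_by_image, key=lambda name: cv(metric_by_image[name]))
```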
The CI workflow runs four jobs:

  1. Lint & type check — runs on bare Ubuntu with Python 3.13 for fast feedback.
  2. Build image — builds the dev container and pushes to GHCR with layer caching.
  3. Test suite — runs inside the built container (depends on both above), ensuring encoding/measurement tools are available for integration tests.
  4. Markdown lint — runs independently in parallel.

Running tests inside the dev container (rather than on bare Ubuntu) ensures that all native tools are available and integration tests produce accurate results.

Beyond CI, the project also uses GitHub Actions as a public research platform — see Public research with GitHub Actions for details.

The src/ modules map to pipeline stages and post-processing:

| Module | Purpose |
| ------ | ------- |
| study.py | Load and validate study configs |
| dataset.py | Fetch and manage image datasets |
| preprocessing.py | Resize images by longest edge |
| encoder.py | Encode images via subprocess calls to native tools |
| quality.py | Measure quality metrics via subprocess calls |
| pipeline.py | Orchestrate encode → measure per image with time budget |
| interpolation.py | Interpolate encoder quality settings and output metrics from measurement data |
| analysis.py | Generate plots and statistics from quality results |
| comparison.py | Generate side-by-side visual comparison figures via interpolation-based quality matching |
| interactive.py | Build interactive HTML report |
| report_images.py | Generate report visualisation assets |

Each module is independently testable. Scripts in scripts/ provide CLI entry points that compose these modules.

For guidance on extending the codebase with new formats or metrics, see Extend formats and metrics.