
Architecture and design decisions

The pipeline processes images one at a time, completing all encoding variants and quality measurements for each image before moving to the next:

For each image (until time budget exhausted):
├─ Encode all variants (formats × quality levels)
├─ Measure all variants (SSIMULACRA2, Butteraugli, PSNR, SSIM)
└─ Save results to disk

This “atomic per-image” design was chosen over the alternative of encoding all images first and measuring them all afterwards:

  • Partial results are always usable. If a study is interrupted or the time budget expires, every completed image has full encoding + measurement data. There are no orphaned encodings without metrics.
  • Error isolation. A failure on one image (e.g., an encoder crash on a specific input) does not block other images.

Studies are configured with a time budget (e.g., 30m, 2h) rather than a fixed image count. The pipeline encodes images from the dataset until the budget expires, then stops. This is more practical than guessing how many images to process:

  • Different encoder/quality combinations have wildly different speeds (AVIF speed 0 is ~100× slower than speed 10).
  • Multi-format studies multiply the per-image time by the number of variants.
  • The user sets a wall-clock time they’re willing to wait, and gets as many data points as fit.
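Putting the two ideas together, the time-budgeted atomic per-image loop can be sketched as below. This is an illustrative sketch, not the project's actual code; `encode`, `measure`, and `save` are hypothetical stand-ins for the real pipeline steps.

```python
import time

def run_pipeline(images, variants, budget_seconds, encode, measure, save):
    # Hypothetical sketch of the atomic per-image loop with a wall-clock budget.
    deadline = time.monotonic() + budget_seconds
    for image in images:
        if time.monotonic() >= deadline:
            break  # budget exhausted: every saved image has complete data
        try:
            encoded = [encode(image, v) for v in variants]   # formats × quality
            results = [measure(image, e) for e in encoded]   # all metrics
            save(image, results)                             # atomic per-image
        except Exception:
            continue  # error isolation: one bad image does not block the rest
```

Because a break can only happen between images, an interruption never leaves an encoding without its metrics.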

Each image is processed by a single worker thread. The --workers flag controls parallelism. Because the encoding tools (cjpeg, cwebp, avifenc, cjxl) are CPU-intensive native binaries, the bottleneck is CPU time, and Python’s GIL does not limit throughput here — subprocess calls release the GIL.
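Since each worker spends its time waiting on a native subprocess, a thread pool is sufficient for parallelism. A minimal sketch of the idea (hypothetical helper names, not the project's API):

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

def encode_one(command):
    # subprocess.run blocks in C code and releases the GIL, so threads
    # are enough to keep all cores busy running native encoders.
    return subprocess.run(command, capture_output=True).returncode

def encode_all(commands, workers=4):
    # Hypothetical sketch of what a --workers thread pool amounts to.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(encode_one, commands))
```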

A study is the central unit of work. Each study is defined by a JSON config file in config/studies/ that specifies:

  • Which dataset to use
  • Which formats and parameter ranges to encode
  • Study metadata (name, description)
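A study config covering these fields might look like the following. The key names here are illustrative only; the authoritative field names live in the project's JSON schemas.

```json
{
  "name": "avif-sweep",
  "description": "AVIF quality sweep against a JPEG baseline",
  "dataset": "kodak",
  "formats": {
    "jpeg": { "quality": [50, 70, 85, 95] },
    "avif": { "quality": [30, 50, 70, 90], "speed": 6 }
  }
}
```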

The study ID (filename without .json) is used everywhere: directory names under data/encoded/<study-id>/, data/metrics/<study-id>/, data/analysis/<study-id>/, and in CLI commands (just pipeline <study-id> <time-budget>).

This design means:

  • Adding a new experiment is just creating a new JSON file — no code changes.
  • Reproducibility — the config file fully describes what was run.
  • Multiple studies coexist — each study’s outputs are isolated in its own subdirectory.

All data lives under data/, strictly separated from code:

data/
├── datasets/ # Raw downloads (from just fetch)
├── preprocessed/ # Resolution-scaled images (per study)
├── encoded/ # Compressed images (per study, per format)
├── metrics/ # Quality measurements (per study)
├── analysis/ # Plots and statistics (per study)
└── report/ # Generated HTML reports

Everything under data/ is git-ignored (except .gitkeep markers). This means:

  • The repository stays small regardless of how many studies are run.
  • Datasets can be re-downloaded; encoded/metrics/analysis can be regenerated.
  • Code changes are cleanly separated from data changes in version control.

Pipeline parameters live in JSON config files rather than being hardcoded:

  • config/datasets.json — dataset sources, URLs, storage types
  • config/studies/*.json — study definitions (formats, quality ranges, speeds)
  • JSON schemas (config/*.schema.json) — validate configs at load time
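In spirit, load-time validation means rejecting a config before any work starts. A stdlib-only sketch with hypothetical required keys (the real project validates against the JSON Schema files under config/):

```python
import json

REQUIRED_KEYS = {"name", "dataset", "formats"}  # hypothetical, for illustration

def load_study(path):
    # Minimal stand-in for schema validation at load time.
    with open(path) as f:
        cfg = json.load(f)
    missing = REQUIRED_KEYS - cfg.keys()
    if missing:
        raise ValueError(f"study config missing keys: {sorted(missing)}")
    return cfg
```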

There are no “future” config files for preprocessing, quality, or analysis — those behaviours are determined by the study config and the source code defaults. This keeps the config surface small and avoids speculative abstraction.

See Add a custom dataset and Create a custom study for guides on extending the configuration.

The studies cover four formats:

| Format | Role | Why included |
| ------ | ---- | ------------ |
| JPEG | Baseline | Universal support, well-understood, the format everything is compared against |
| WebP | Established alternative | Broad browser support, good compression, Google-backed |
| AVIF | Primary research target | Based on AV1, excellent low-bitrate compression, the main focus of this project |
| JPEG XL | Next-generation | Strong technical merits (progressive decode, lossless round-trip), limited browser support |

The project measures both perceptual and traditional metrics:

| Metric | Type | Why |
| ------ | ---- | --- |
| SSIMULACRA2 | Perceptual | Designed specifically for lossy compression evaluation; most accurate for this use case |
| Butteraugli | Perceptual | Models the human visual system with a different mathematical approach; complements SSIMULACRA2 |
| PSNR | Traditional | Simple, widely used, easy to compare with published literature |
| SSIM | Traditional | Better than PSNR for structural comparison, well-known baseline |

Perceptual metrics (SSIMULACRA2, Butteraugli) are prioritised in analysis because they correlate better with human judgement than PSNR/SSIM for compression artifacts.

The encoding tools (avifenc, cjxl, cwebp, cjpeg) and measurement tools (ssimulacra2, butteraugli_main, avifdec, djxl) require specific builds. A dev container ensures:

  • Reproducibility — everyone gets the exact same tool versions.
  • No host pollution — build dependencies don’t touch the host system.
  • CI parity — the same image runs in CI and locally.

This is a research project, not a distributable library. Targeting a single Python version (3.13) simplifies testing, avoids compatibility workarounds, and lets the code use the latest language features without conditional logic.

All dependencies are declared in pyproject.toml:

  • pip install -e . — production dependencies
  • pip install -e ".[dev]" — adds pytest, mypy, ruff, type stubs

There is no requirements.txt. The pyproject is the single source of truth.

The comparison.py module is decoupled from the main pipeline. The pipeline is a pure encode-and-measure step that writes quality.json; the comparison module reads that file and independently re-encodes images to produce its figures.

This separation has several benefits:

  • Tunable without re-running the pipeline — comparison targets, tile parameter, and excluded images can be changed in the study config and the comparison regenerated without a costly re-run of the encoding pipeline.
  • No pipeline artefacts required — the comparison script re-encodes from the original dataset images using interpolated quality settings, so the pipeline no longer needs to save encoded artefacts to disk.
  • Richer figure types — because quality settings are interpolated on the fly, the comparison can produce figures at arbitrary target metric values (e.g., SSIMULACRA2 = 60, 75, 90) or file-size targets (e.g., bits_per_pixel = 0.5, 1.0, 1.5) regardless of which quality levels the pipeline swept.
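The interpolation idea can be sketched with `numpy.interp`: given the swept (quality, metric) pairs for one format, estimate the encoder quality setting expected to hit an arbitrary target. This is a hypothetical helper, assuming the metric grows monotonically with the quality setting:

```python
import numpy as np

def quality_for_target(qualities, scores, target):
    # Interpolate the encoder quality expected to reach a target metric value.
    # Sketch only; assumes scores are monotonic in quality.
    qualities = np.asarray(qualities, dtype=float)
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(scores)  # np.interp needs ascending x values
    return float(np.interp(target, scores[order], qualities[order]))
```

The comparison step can then re-encode at the interpolated setting, which is why the original pipeline sweep never needs to have visited that exact quality level.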

The comparison module uses two layers of selection to maximise the visual informativeness of each figure:

  1. Cross-format CV (src/interpolation.py:select_best_image) — the source image that shows the highest relative spread (coefficient of variation = std / mean) of the output metric across encoding variants is selected. Using CV instead of raw variance avoids bias towards inherently brighter or higher-quality images.
  2. Anisotropic std map — across all target values in a group, per-pixel Butteraugli distortion maps are aggregated into a single anisotropic standard-deviation map. The crop region with the highest mean std is chosen. The same fragment is shared across all target values in the group, making visual differences directly comparable.
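The first selection layer amounts to ranking images by coefficient of variation. A simplified stand-in for src/interpolation.py:select_best_image, with a hypothetical input shape (metric values per variant, keyed by image name):

```python
import statistics

def select_best_image(metric_by_image):
    # Pick the image whose metric spreads most across variants, relative to
    # its mean (CV = std / mean), avoiding bias towards high-scoring images.
    def cv(values):
        mean = statistics.fmean(values)
        return statistics.pstdev(values) / mean if mean else 0.0
    return max(metric_by_image, key=lambda name: cv(metric_by_image[name]))
```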
The CI workflow runs four jobs:

  1. Lint & type check — runs on bare Ubuntu with Python 3.13 for fast feedback.
  2. Build image — builds the dev container and pushes to GHCR with layer caching.
  3. Test suite — runs inside the built container (depends on both above), ensuring encoding/measurement tools are available for integration tests.
  4. Markdown lint — runs independently in parallel.

Running tests inside the dev container (rather than on bare Ubuntu) ensures that all native tools are available and integration tests produce accurate results.

Beyond CI, the project also uses GitHub Actions as a public research platform — see Public research with GitHub Actions for details.

The src/ modules map to pipeline stages and post-processing:

| Module | Purpose |
| ------ | ------- |
| study.py | Load and validate study configs |
| dataset.py | Fetch and manage image datasets |
| preprocessing.py | Resize images by longest edge |
| encoder.py | Encode images via subprocess calls to native tools |
| quality.py | Measure quality metrics via subprocess calls |
| pipeline.py | Orchestrate encode → measure per image with time budget |
| interpolation.py | Interpolate encoder quality settings and output metrics from measurement data |
| analysis.py | Generate plots and statistics from quality results |
| comparison.py | Generate side-by-side visual comparison figures via interpolation-based quality matching |
| interactive.py | Build interactive HTML report |
| report_images.py | Generate report visualisation assets |

Each module is independently testable. Scripts in scripts/ provide CLI entry points that compose these modules.

For guidance on extending the codebase with new formats or metrics, see Extend formats and metrics.