# Architecture and design decisions
## Pipeline design

The pipeline processes images one at a time, completing all encoding variants and quality measurements for each image before moving to the next:
```
For each image (until time budget exhausted):
├─ Encode all variants (formats × quality levels)
├─ Measure all variants (SSIMULACRA2, Butteraugli, PSNR, SSIM)
└─ Save results to disk
```

This “atomic per-image” design was chosen over the alternative of encoding all images first, then measuring them all:
- Partial results are always usable. If a study is interrupted or the time budget expires, every completed image has full encoding + measurement data. There are no orphaned encodings without metrics.
- Error isolation. A failure on one image (e.g., an encoder crash on a specific input) does not block other images.
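The loop above can be sketched as follows (function names and signatures are illustrative, not the project's actual API):

```python
import time

def run_pipeline(images, variants, budget_seconds, encode, measure, save):
    """Process images atomically: all encoding and measurement for one
    image completes (and is saved) before the next image starts."""
    deadline = time.monotonic() + budget_seconds
    completed = []
    for image in images:
        if time.monotonic() >= deadline:
            break  # budget exhausted; everything in `completed` is fully usable
        try:
            encoded = [encode(image, v) for v in variants]
            results = [measure(image, e) for e in encoded]
            save(image, results)
            completed.append(image)
        except Exception:
            # Error isolation: a crash on one image does not block the rest.
            continue
    return completed
```

Because results are saved per image, interrupting this loop at any point leaves no orphaned encodings without metrics.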
## Time-budget approach

Studies are configured with a time budget (e.g., `30m`, `2h`) rather than a fixed image count. The pipeline encodes images from the dataset until the budget expires, then stops. This is more practical than guessing how many images to process:
- Different encoder/quality combinations have wildly different speeds (AVIF speed 0 is ~100× slower than speed 10).
- Multi-format studies multiply the per-image time by the number of variants.
- The user sets a wall-clock time they’re willing to wait, and gets as many data points as fit.
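A sketch of budget parsing, under the assumption that a budget is a single integer plus unit such as `30m` or `2h` (the real CLI may accept more forms):

```python
import re

def parse_budget(budget: str) -> int:
    """Parse a time budget like '30m' or '2h' into seconds."""
    match = re.fullmatch(r"(\d+)([smh])", budget.strip())
    if not match:
        raise ValueError(f"invalid time budget: {budget!r}")
    value, unit = int(match.group(1)), match.group(2)
    return value * {"s": 1, "m": 60, "h": 3600}[unit]
```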
## Worker model

Each image is processed by a single worker thread, and the `--workers` flag controls parallelism. Because the encoding tools (`cjpeg`, `cwebp`, `avifenc`, `cjxl`) are CPU-intensive native binaries, the bottleneck is CPU time, and Python’s GIL does not limit throughput here: subprocess calls release the GIL.
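A minimal sketch of this model (function names are illustrative): since each image's work is dominated by subprocess calls, a thread pool is sufficient for parallelism.

```python
from concurrent.futures import ThreadPoolExecutor

def process_all(images, process_image, workers=4):
    """Run per-image work on a thread pool.

    Threads are enough here: the heavy lifting happens in native encoder
    subprocesses, and subprocess calls release the GIL.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(process_image, images))
```

`pool.map` preserves input order, so results line up with the image list.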
## Study system

A study is the central unit of work. Each study is defined by a JSON config file in `config/studies/` that specifies:
- Which dataset to use
- Which formats and parameter ranges to encode
- Study metadata (name, description)
The study ID (the filename without `.json`) is used everywhere: in directory names under `data/encoded/<study-id>/`, `data/metrics/<study-id>/`, and `data/analysis/<study-id>/`, and in CLI commands (`just pipeline <study-id> <time-budget>`).
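For illustration, a study config might look like this (field names here are hypothetical; the authoritative shape is defined by the schemas in `config/*.schema.json`):

```json
{
  "name": "AVIF vs JPEG baseline",
  "description": "Sweep quality levels across two formats",
  "dataset": "sample-photos",
  "formats": {
    "jpeg": { "quality": [50, 70, 90] },
    "avif": { "quality": [30, 50, 70], "speed": 6 }
  }
}
```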
This design means:
- Adding a new experiment is just creating a new JSON file — no code changes.
- Reproducibility — the config file fully describes what was run.
- Multiple studies coexist — each study’s outputs are isolated in its own subdirectory.
## Data separation

All data lives under `data/`, strictly separated from code:

```
data/
├── datasets/      # Raw downloads (from `just fetch`)
├── preprocessed/  # Resolution-scaled images (per study)
├── encoded/       # Compressed images (per study, per format)
├── metrics/       # Quality measurements (per study)
├── analysis/      # Plots and statistics (per study)
└── report/        # Generated HTML reports
```

Everything under `data/` is git-ignored (except `.gitkeep` markers). This means:
- The repository stays small regardless of how many studies are run.
- Datasets can be re-downloaded; encoded/metrics/analysis can be regenerated.
- Code changes are cleanly separated from data changes in version control.
## Configuration over code

Pipeline parameters live in JSON config files rather than being hardcoded:

- `config/datasets.json` — dataset sources, URLs, storage types
- `config/studies/*.json` — study definitions (formats, quality ranges, speeds)
- JSON schemas (`config/*.schema.json`) — validate configs at load time
There are no “future” config files for preprocessing, quality, or analysis — those behaviours are determined by the study config and the source code defaults. This keeps the config surface small and avoids speculative abstraction.
See Add a custom dataset and Create a custom study for guides on extending the configuration.
## Format choices

| Format | Role | Why included |
|---|---|---|
| JPEG | Baseline | Universal support, well-understood, the format everything is compared against |
| WebP | Established alternative | Broad browser support, good compression, Google-backed |
| AVIF | Primary research target | Based on AV1, excellent low-bitrate compression, the main focus of this project |
| JPEG XL | Next-generation | Strong technical merits (progressive decode, lossless round-trip), limited browser support |
## Metric choices

The project measures both perceptual and traditional metrics:
| Metric | Type | Why |
|---|---|---|
| SSIMULACRA2 | Perceptual | Designed specifically for lossy compression evaluation; most accurate for this use case |
| Butteraugli | Perceptual | Models human visual system with a different mathematical approach; complements SSIMULACRA2 |
| PSNR | Traditional | Simple, widely used, easy to compare with published literature |
| SSIM | Traditional | Better than PSNR for structural comparison, well-known baseline |
Perceptual metrics (SSIMULACRA2, Butteraugli) are prioritised in analysis because they correlate better with human judgement than PSNR/SSIM for compression artifacts.
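A sketch of how a subprocess-based measurement might look. The exact stdout format of the `ssimulacra2` binary can vary by build, so this parses the last float-like token rather than assuming a fixed layout:

```python
import subprocess

def ssimulacra2_score(original: str, distorted: str) -> float:
    """Invoke the ssimulacra2 binary on two images and parse its score.
    Assumes the tool prints the score to stdout (behaviour may vary by build)."""
    out = subprocess.run(
        ["ssimulacra2", original, distorted],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_score(out)

def parse_score(stdout: str) -> float:
    """Return the last whitespace-separated token that parses as a float."""
    for token in reversed(stdout.split()):
        try:
            return float(token)
        except ValueError:
            continue
    raise ValueError(f"no score found in output: {stdout!r}")
```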
## Dev container

The encoding tools (`avifenc`, `cjxl`, `cwebp`, `cjpeg`) and measurement tools (`ssimulacra2`, `butteraugli_main`, `avifdec`, `djxl`) require specific builds. A dev container ensures:
- Reproducibility — everyone gets the exact same tool versions.
- No host pollution — build dependencies don’t touch the host system.
- CI parity — the same image runs in CI and locally.
## Python 3.13, single target

This is a research project, not a distributable library. Targeting a single Python version (3.13) simplifies testing, avoids compatibility workarounds, and lets the code use the latest language features without conditional logic.
## Dependency management

All dependencies are declared in `pyproject.toml`:

- `pip install -e .` — production dependencies
- `pip install -e ".[dev]"` — adds pytest, mypy, ruff, and type stubs
There is no `requirements.txt`. The `pyproject.toml` is the single source of truth.
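An illustrative fragment of such a `pyproject.toml` (the package name and dependency lists are placeholders, not the project's actual ones):

```toml
[project]
name = "codec-study"            # placeholder name
requires-python = "==3.13.*"
dependencies = [
  "numpy",                      # placeholder dependencies
  "matplotlib",
]

[project.optional-dependencies]
dev = ["pytest", "mypy", "ruff"]
```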
## Comparison module design

The `comparison.py` module is decoupled from the main pipeline. The pipeline is a pure encode-and-measure step that writes `quality.json`; the comparison module reads that file and independently re-encodes images to produce its figures.
This separation has several benefits:
- Tunable without re-running the pipeline — comparison targets, tile parameter, and excluded images can be changed in the study config and the comparison regenerated without a costly re-run of the encoding pipeline.
- No pipeline artefacts required — the comparison script re-encodes from the original dataset images using interpolated quality settings, so the pipeline no longer needs to save encoded artefacts to disk.
- Richer figure types — because quality settings are interpolated on the fly, the comparison can produce figures at arbitrary target metric values (e.g., SSIMULACRA2 = 60, 75, 90) or file-size targets (e.g., bits_per_pixel = 0.5, 1.0, 1.5) regardless of which quality levels the pipeline swept.
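A minimal sketch of the interpolation idea (not the project's actual `src/interpolation.py` API): invert the measured quality-to-metric sweep to find the encoder setting expected to hit a target score.

```python
def quality_for_target(samples, target):
    """Given (encoder_quality, metric_score) samples from the pipeline sweep,
    linearly interpolate the quality setting expected to reach `target`.

    Assumes the metric is monotonic in quality over the sampled range.
    """
    pts = sorted(samples, key=lambda p: p[1])  # sort by metric score
    for (q0, s0), (q1, s1) in zip(pts, pts[1:]):
        if s0 <= target <= s1:
            t = (target - s0) / (s1 - s0)
            return q0 + t * (q1 - q0)
    raise ValueError("target outside measured range")
```

Because this works from measurement data alone, figures can be produced at arbitrary targets regardless of which quality levels the pipeline swept.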
## Image and fragment selection

The comparison module uses two layers of selection to maximise the visual informativeness of each figure:

- Cross-format CV (`src/interpolation.py:select_best_image`) — the source image that shows the highest relative spread (coefficient of variation = std / mean) of the output metric across encoding variants is selected. Using CV instead of raw variance avoids bias towards inherently brighter or higher-quality images.
- Anisotropic std map — across all target values in a group, per-pixel Butteraugli distortion maps are aggregated into a single anisotropic standard-deviation map, and the crop region with the highest mean std is chosen. The same fragment is shared across all target values in the group, making visual differences directly comparable.
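A simplified sketch of the CV selection step (the real `select_best_image` operates on the project's own data structures; this shows only the criterion):

```python
from statistics import mean, stdev

def select_best_image(metric_by_image):
    """Pick the image whose metric scores vary most across encoding variants,
    using the coefficient of variation (std / mean) rather than raw variance."""
    def cv(scores):
        m = mean(scores)
        return stdev(scores) / m if m else 0.0
    return max(metric_by_image, key=lambda name: cv(metric_by_image[name]))
```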
## CI design

- Lint & type check — runs on bare Ubuntu with Python 3.13 for fast feedback.
- Build image — builds the dev container and pushes to GHCR with layer caching.
- Test suite — runs inside the built container (depends on both above), ensuring encoding/measurement tools are available for integration tests.
- Markdown lint — runs independently in parallel.
Running tests inside the dev container (rather than on bare Ubuntu) ensures that all native tools are available and integration tests produce accurate results.
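A skeleton of the resulting job graph in GitHub Actions syntax (job names, steps, and the container image are illustrative; only the `needs` edges reflect the description above):

```yaml
jobs:
  lint: # lint & type check on bare Ubuntu for fast feedback
    runs-on: ubuntu-latest
    steps: [{ run: echo "ruff + mypy" }]
  build-image: # build the dev container, push to GHCR with layer caching
    runs-on: ubuntu-latest
    steps: [{ run: echo "docker build + push" }]
  test: # test suite inside the built container
    needs: [lint, build-image]
    runs-on: ubuntu-latest
    container: ghcr.io/example/devcontainer:latest # illustrative image name
    steps: [{ run: echo "pytest" }]
  markdown-lint: # independent, runs in parallel
    runs-on: ubuntu-latest
    steps: [{ run: echo "markdownlint" }]
```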
Beyond CI, the project also uses GitHub Actions as a public research platform — see Public research with GitHub Actions for details.
## Module architecture

The `src/` modules map to pipeline stages and post-processing:
| Module | Purpose |
|---|---|
| `study.py` | Load and validate study configs |
| `dataset.py` | Fetch and manage image datasets |
| `preprocessing.py` | Resize images by longest edge |
| `encoder.py` | Encode images via subprocess calls to native tools |
| `quality.py` | Measure quality metrics via subprocess calls |
| `pipeline.py` | Orchestrate encode → measure per image with time budget |
| `interpolation.py` | Interpolate encoder quality settings and output metrics from measurement data |
| `analysis.py` | Generate plots and statistics from quality results |
| `comparison.py` | Generate side-by-side visual comparison figures via interpolation-based quality matching |
| `interactive.py` | Build interactive HTML report |
| `report_images.py` | Generate report visualisation assets |
Each module is independently testable. Scripts in scripts/ provide CLI entry points that compose these modules.
For guidance on extending the codebase with new formats or metrics, see Extend formats and metrics.