Data directory structure

All research data is stored under data/, organized by pipeline stage:

data/
├── datasets/       # Raw image datasets (from just fetch)
├── preprocessed/   # Preprocessed images (per study, per preprocessing mode)
├── encoded/        # Encoded images (per study, per format)
├── metrics/        # Quality measurements (per study)
├── analysis/       # Analysis plots and statistics (per study)
└── report/         # Generated HTML reports

All subdirectories are git-ignored except .gitkeep files that preserve directory structure.

datasets/

Raw, unmodified datasets from external sources.

Generated by: scripts/fetch_dataset.py (just fetch <dataset-id>)

data/datasets/
├── DIV2K_valid/
│   ├── 0801.png
│   └── ...
├── DIV2K_train/
├── LIU4K_valid/
├── LIU4K_train/
└── UHD_IQA/

Typical size: 500 MB – 10 GB depending on datasets fetched.

preprocessed/

Images after preprocessing, organized by study and preprocessing mode.

Generated by: scripts/run_pipeline.py (automatically, when an encoder config specifies resolution or crop)

data/preprocessed/
├── resolution-impact/          # Study ID
│   ├── r640/
│   │   ├── 0801_r640.png
│   │   └── ...
│   └── ...
└── avif-crop-impact/
    ├── c400/
    │   ├── 0801_c400.png
    │   └── ...
    └── ...
  • r<pixels> directories contain resized inputs with that longest edge
  • c<pixels> directories contain cropped inputs whose longest edge is that value
  • Crop studies keep a fixed analysis fragment and vary the surrounding area
  • Only created when a study uses preprocessing; studies at original resolution skip this directory
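The r<pixels> mode above scales each image so its longest edge matches the target while preserving aspect ratio. A minimal sketch of that dimension calculation (the exact rounding used by scripts/run_pipeline.py is an assumption):

```python
def resized_dims(width: int, height: int, longest_edge: int) -> tuple[int, int]:
    """Scale (width, height) so the longest edge equals longest_edge.

    Rounding to the nearest pixel is an assumption; the pipeline may
    round differently (e.g. floor, or multiples of 2 for chroma subsampling).
    """
    scale = longest_edge / max(width, height)
    return round(width * scale), round(height * scale)
```

For example, a 1920×1080 source in an r640 study becomes 640×360.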

encoded/

Compressed images in target formats, organized by study.

Generated by: scripts/run_pipeline.py (just pipeline <study-id> <time-budget>)

data/encoded/
└── format-comparison/
    ├── results.json            # Encoding results metadata
    ├── jpeg/
    │   └── original/
    │       ├── 0801_q75.jpg
    │       └── ...
    ├── webp/
    │   └── original/
    ├── avif/
    │   └── original/
    │       ├── 0801_q60_420_s4.avif
    │       └── ...
    └── jxl/
        └── original/

Naming conventions:

| Component | Pattern | Example |
| --- | --- | --- |
| Study ID | top-level directory | format-comparison/ |
| Format | subdirectory | avif/ |
| Preprocessing level | original/, r<pixels>/, or c<pixels>/ | r1920/, c800/ |
| JPEG filename | <name>_q<quality>.jpg | 0801_q75.jpg |
| WebP filename | <name>_q<quality>.webp | 0801_q85.webp |
| AVIF filename | <name>_q<quality>_<chroma>_s<speed>.avif | 0801_q60_420_s4.avif |
| JXL filename | <name>_q<quality>.jxl | 0801_q85.jxl |
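These patterns can be matched mechanically. A hedged sketch that parses an encoded filename back into its components (the regexes are derived from the conventions above, not taken from the pipeline source):

```python
import re

# One pattern per format, following the naming conventions above.
_PATTERNS = {
    ".jpg": re.compile(r"(?P<name>.+)_q(?P<quality>\d+)\.jpg$"),
    ".webp": re.compile(r"(?P<name>.+)_q(?P<quality>\d+)\.webp$"),
    ".jxl": re.compile(r"(?P<name>.+)_q(?P<quality>\d+)\.jxl$"),
    ".avif": re.compile(
        r"(?P<name>.+)_q(?P<quality>\d+)_(?P<chroma>\d+)_s(?P<speed>\d+)\.avif$"
    ),
}

def parse_encoded_name(filename: str) -> dict:
    """Split an encoded filename into its naming-convention components."""
    for suffix, pattern in _PATTERNS.items():
        if filename.endswith(suffix):
            match = pattern.match(filename)
            if match:
                return match.groupdict()
    raise ValueError(f"unrecognized encoded filename: {filename}")
```

For example, parse_encoded_name("0801_q60_420_s4.avif") yields the name, quality, chroma mode, and encoder speed as separate fields.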

results.json — records every encoding with source paths, parameters, dimensions, and file sizes. See Configuration reference for the full schema.

metrics/

Quality measurements per study.

Generated by: scripts/run_pipeline.py (measured immediately after encoding each image)

data/metrics/
└── format-comparison/
    └── quality.json

Each quality.json contains one entry per encoding with:

  • All fields from the encoding result
  • crop, analysis_fragment, and crop_region for crop-impact studies
  • ssimulacra2, psnr, ssim, butteraugli scores
  • measurement_error (null on success, error string on failure)

For metric interpretation, see Tools reference.
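As a sketch, successful measurements can be separated from failures by checking measurement_error. The field names follow the list above; that quality.json is a top-level JSON array is an assumption (see the Configuration reference for the authoritative schema):

```python
import json
from pathlib import Path
from statistics import mean

def mean_ssimulacra2(quality_json: Path) -> float:
    """Average SSIMULACRA2 over entries whose measurement succeeded.

    Assumes quality.json holds a list of entries, each with an
    `ssimulacra2` score and a `measurement_error` field that is
    null (None) on success.
    """
    entries = json.loads(quality_json.read_text())
    scores = [e["ssimulacra2"] for e in entries if e["measurement_error"] is None]
    if not scores:
        raise ValueError("no successful measurements")
    return mean(scores)
```

The same filter applies to psnr, ssim, and butteraugli: skip any entry whose measurement_error is non-null before aggregating.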

analysis/

Analysis plots and statistics per study.

Generated by: scripts/analyze_study.py (just analyze <study-id>)

data/analysis/
└── format-comparison/
    ├── format-comparison_rate_distortion.png
    ├── format-comparison_speed_analysis.png
    ├── format-comparison_statistics.csv
    ├── format-comparison_quality_distribution.png
    └── comparison/
        └── ssimulacra2/
            ├── comparison_70.webp
            ├── distortion_map_comparison_70.webp
            ├── distortion_map_anisotropic.webp
            └── original_annotated.webp

Output files are named <study-id>_<plot-type>.<ext>. Comparison outputs live under comparison/, grouped by target metric and, when needed, additional split directories such as r720/ or c800/.
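A small sketch of that naming scheme, useful when scripting over analysis outputs (the helper name is hypothetical, not part of the pipeline):

```python
from pathlib import Path

def analysis_path(study_id: str, plot_type: str, ext: str = "png") -> Path:
    """Compose the expected <study-id>_<plot-type>.<ext> path for a study plot."""
    return Path("data/analysis") / study_id / f"{study_id}_{plot_type}.{ext}"
```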

report/

Generated HTML report with embedded visualizations.

Generated by: scripts/generate_report.py (just report)

data/report/
├── index.html
└── assets/
    └── ...

Serve locally with just serve-report.

Disk usage

| Directory | Typical size | Notes |
| --- | --- | --- |
| datasets/ | 500 MB – 10 GB | Depends on datasets fetched |
| preprocessed/ | Varies | Only for studies with resolution or crop preprocessing |
| encoded/ | 2–5× dataset | Multiple formats and quality levels |
| metrics/ | < 100 MB | JSON files |
| analysis/ | < 100 MB | PNG plots and CSV statistics |
| report/ | < 50 MB | Single HTML page with assets |
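To compare actual usage against these estimates, a short stdlib-only sketch that sums file sizes per stage directory under data/:

```python
from pathlib import Path

def stage_sizes(data_dir: Path) -> dict[str, int]:
    """Total bytes per top-level stage directory under data/."""
    sizes: dict[str, int] = {}
    for stage in sorted(p for p in data_dir.iterdir() if p.is_dir()):
        # rglob walks the whole stage subtree; count regular files only.
        sizes[stage.name] = sum(
            f.stat().st_size for f in stage.rglob("*") if f.is_file()
        )
    return sizes
```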
Cleaning up

To remove generated study data:

just clean-study format-comparison   # Remove one study's data (preserves datasets)
just clean-studies                   # Remove all study data (preserves datasets)

For manual fine-grained cleanup:

rm -rf data/metrics/format-comparison # Re-measure only
rm -rf data/encoded/format-comparison/jpeg/* # Re-encode one format

There is no command to delete raw datasets — remove them manually with rm -rf data/datasets/<folder>.

Backups
# Backup metrics (small and valuable)
tar czf metrics_backup.tar.gz data/metrics/
# Backup analysis outputs (plots and statistics)
tar czf plots_backup.tar.gz data/analysis/

Understanding the data flow helps with debugging and reproducibility:

datasets → [preprocessed] → encoded → metrics → analysis
                               │          │
                               ▼          ▼
                        results.json  quality.json
                    (encoding results) (quality results)

Every stage reads its study configuration from config/studies/<study>.json.

Pipeline Flow:

  1. Dataset: Raw images downloaded from external sources
  2. Preprocessing (optional): Images resized according to study config
  3. Encoding: Images encoded per study configuration → results.json
  4. Quality Measurement: Metrics measured from results.json → quality.json
  5. Analysis: Visualizations and reports from quality.json

Each stage:

  • Reads configuration from config/studies/<study-id>.json
  • Reads input from previous stage’s output
  • Writes to its own directory under data/<stage>/<study-id>/
  • Can be re-run independently if previous stage outputs exist
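The re-run rule in the last bullet can be sketched as an existence check on the previous stage's output, using the stage order from the flow above (the helper is illustrative, not part of the pipeline; preprocessing is treated as optional):

```python
from pathlib import Path

# Stage order from the data-flow diagram; preprocessed/ is optional.
STAGES = ["datasets", "preprocessed", "encoded", "metrics", "analysis"]

def can_rerun(stage: str, study_id: str, data_dir: Path = Path("data")) -> bool:
    """True if the previous stage's output for this study already exists."""
    idx = STAGES.index(stage)
    if idx == 0:
        return True  # fetching datasets has no upstream stage
    prev = STAGES[idx - 1]
    if prev == "datasets":
        return (data_dir / "datasets").exists()
    if prev == "preprocessed":
        # Preprocessing is optional: fall back to the raw datasets.
        return (data_dir / prev / study_id).exists() or (data_dir / "datasets").exists()
    return (data_dir / prev / study_id).exists()
```

For example, re-running analysis for a study only requires that data/metrics/<study-id>/ still exists; the encoded images can have been cleaned up already.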