# Data directory structure
## Overview

All research data is stored under `data/`, organized by pipeline stage:
```
data/
├── datasets/      # Raw image datasets (from `just fetch`)
├── preprocessed/  # Preprocessed images (per study, per preprocessing mode)
├── encoded/       # Encoded images (per study, per format)
├── metrics/       # Quality measurements (per study)
├── analysis/      # Analysis plots and statistics (per study)
└── report/        # Generated HTML reports
```

All subdirectories are git-ignored except the `.gitkeep` files that preserve the directory structure.
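As a sanity check, the layout above can be verified from Python. This is an illustrative sketch, not part of the pipeline scripts; `missing_stage_dirs` is a hypothetical helper.

```python
from pathlib import Path

# The six pipeline-stage directories described above.
STAGES = ["datasets", "preprocessed", "encoded", "metrics", "analysis", "report"]

def missing_stage_dirs(root: str = "data") -> list[str]:
    """Return the stage subdirectories that do not exist under `root`."""
    base = Path(root)
    return [name for name in STAGES if not (base / name).is_dir()]
```

On a fresh checkout this should return an empty list, since each directory is preserved by its `.gitkeep` file.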
## `data/datasets/`

Raw, unmodified datasets from external sources.

Generated by: `scripts/fetch_dataset.py` (`just fetch <dataset-id>`)
```
data/datasets/
├── DIV2K_valid/
│   ├── 0801.png
│   └── ...
├── DIV2K_train/
├── LIU4K_valid/
├── LIU4K_train/
└── UHD_IQA/
```

Typical size: 500 MB – 10 GB depending on datasets fetched.
## `data/preprocessed/`

Images after preprocessing, organized by study and preprocessing mode.

Generated by: `scripts/run_pipeline.py` (automatically, when an encoder config specifies resolution or crop)
```
data/preprocessed/
├── resolution-impact/        # Study ID
│   ├── r640/
│   │   ├── 0801_r640.png
│   │   └── ...
│   └── ...
└── avif-crop-impact/
    ├── c400/
    │   ├── 0801_c400.png
    │   └── ...
    └── ...
```

- `r<pixels>` directories contain resized inputs with that longest edge
- `c<pixels>` directories contain cropped inputs whose longest edge is that value
- Crop studies keep a fixed analysis fragment and vary the surrounding area
- Only created when a study uses preprocessing; studies at original resolution skip this directory
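The preprocessed filenames follow a simple `<stem>_<mode><pixels>` pattern. A sketch of a helper that mirrors it (`preprocessed_name` is hypothetical, not a pipeline function):

```python
from pathlib import Path

def preprocessed_name(source: str, mode: str, pixels: int) -> str:
    """Build a preprocessed filename matching the layout above.

    mode is "r" for resized or "c" for cropped; the suffix encodes the
    longest-edge size, e.g. 0801.png -> 0801_r640.png.
    """
    if mode not in ("r", "c"):
        raise ValueError("mode must be 'r' (resize) or 'c' (crop)")
    p = Path(source)
    return f"{p.stem}_{mode}{pixels}{p.suffix}"
```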
## `data/encoded/`

Compressed images in target formats, organized by study.

Generated by: `scripts/run_pipeline.py` (`just pipeline <study-id> <time-budget>`)
```
data/encoded/
└── format-comparison/
    ├── results.json          # Encoding results metadata
    ├── jpeg/
    │   └── original/
    │       ├── 0801_q75.jpg
    │       └── ...
    ├── webp/
    │   └── original/
    ├── avif/
    │   └── original/
    │       ├── 0801_q60_420_s4.avif
    │       └── ...
    └── jxl/
        └── original/
```

Naming conventions:
| Component | Pattern | Example |
|---|---|---|
| Study ID | top-level directory | `format-comparison/` |
| Format | subdirectory | `avif/` |
| Preprocessing level | `original/`, `r<pixels>/`, or `c<pixels>/` | `r1920/`, `c800/` |
| JPEG filename | `<name>_q<quality>.jpg` | `0801_q75.jpg` |
| WebP filename | `<name>_q<quality>.webp` | `0801_q85.webp` |
| AVIF filename | `<name>_q<quality>_<chroma>_s<speed>.avif` | `0801_q60_420_s4.avif` |
| JXL filename | `<name>_q<quality>.jxl` | `0801_q85.jxl` |
`results.json` records every encoding with source paths, parameters, dimensions, and file sizes. See the Configuration reference for the full schema.
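The naming conventions above are regular enough to parse back into encoding parameters. A sketch, assuming only the four formats in the table (`parse_encoded` and `PATTERNS` are illustrative helpers, not part of the pipeline scripts):

```python
import re

# One pattern per format, mirroring the naming table above.
PATTERNS = {
    ".jpg":  re.compile(r"^(?P<name>.+)_q(?P<quality>\d+)\.jpg$"),
    ".webp": re.compile(r"^(?P<name>.+)_q(?P<quality>\d+)\.webp$"),
    ".jxl":  re.compile(r"^(?P<name>.+)_q(?P<quality>\d+)\.jxl$"),
    ".avif": re.compile(
        r"^(?P<name>.+)_q(?P<quality>\d+)_(?P<chroma>\d{3})_s(?P<speed>\d+)\.avif$"
    ),
}

def parse_encoded(filename: str) -> dict:
    """Extract encoding parameters from a filename such as 0801_q60_420_s4.avif."""
    for suffix, pattern in PATTERNS.items():
        if filename.endswith(suffix):
            m = pattern.match(filename)
            if m:
                return m.groupdict()
    raise ValueError(f"unrecognized encoded filename: {filename}")
```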
## `data/metrics/`

Quality measurements per study.

Generated by: `scripts/run_pipeline.py` (measured immediately after encoding each image)
```
data/metrics/
└── format-comparison/
    └── quality.json
```

Each `quality.json` contains one entry per encoding with:
- All fields from the encoding result
- `crop`, `analysis_fragment`, and `crop_region` for crop-impact studies
- `ssimulacra2`, `psnr`, `ssim`, `butteraugli` scores
- `measurement_error` (null on success, error string on failure)
For metric interpretation, see Tools reference.
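Assuming `quality.json` holds a flat list of per-encoding entries as described, failed measurements can be filtered out like this (an illustrative helper, not a pipeline script):

```python
import json
from pathlib import Path

def failed_measurements(quality_path) -> list[dict]:
    """Return entries whose measurement_error is set.

    Assumes quality.json is a JSON array of per-encoding objects,
    each carrying a measurement_error field (null on success).
    """
    entries = json.loads(Path(quality_path).read_text())
    return [e for e in entries if e.get("measurement_error") is not None]
```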
## `data/analysis/`

Analysis plots and statistics per study.

Generated by: `scripts/analyze_study.py` (`just analyze <study-id>`)
```
data/analysis/
└── format-comparison/
    ├── format-comparison_rate_distortion.png
    ├── format-comparison_speed_analysis.png
    ├── format-comparison_statistics.csv
    ├── format-comparison_quality_distribution.png
    └── comparison/
        └── ssimulacra2/
            ├── comparison_70.webp
            ├── distortion_map_comparison_70.webp
            ├── distortion_map_anisotropic.webp
            └── original_annotated.webp
```

Output files are named `<study-id>_<plot-type>.<ext>`. Comparison outputs live under `comparison/`, grouped by target metric and, when needed, additional split directories such as `r720/` or `c800/`.
## `data/report/`

Generated HTML report with embedded visualizations.

Generated by: `scripts/generate_report.py` (`just report`)
```
data/report/
├── index.html
└── assets/
    └── ...
```

Serve locally with `just serve-report`.
## Disk space

| Directory | Typical size | Notes |
|---|---|---|
| `datasets/` | 500 MB – 10 GB | Depends on datasets fetched |
| `preprocessed/` | Varies | Only for studies with resolution or crop preprocessing |
| `encoded/` | 2–5× dataset | Multiple formats and quality levels |
| `metrics/` | < 100 MB | JSON files |
| `analysis/` | < 100 MB | PNG plots and CSV statistics |
| `report/` | < 50 MB | Single HTML page with assets |
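To compare actual usage against these estimates, per-stage sizes can be summed with the standard library (a sketch; `dir_size_bytes` and `report_usage` are hypothetical helpers):

```python
import os

STAGES = ["datasets", "preprocessed", "encoded", "metrics", "analysis", "report"]

def dir_size_bytes(path: str) -> int:
    """Total size of all regular files under `path` (0 if it doesn't exist)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            fp = os.path.join(root, name)
            if os.path.isfile(fp):
                total += os.path.getsize(fp)
    return total

def report_usage(data_root: str = "data") -> None:
    """Print one line per pipeline stage with its on-disk size in MB."""
    for stage in STAGES:
        size_mb = dir_size_bytes(os.path.join(data_root, stage)) / 1e6
        print(f"{stage:14s} {size_mb:10.1f} MB")
```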
## Cleanup

```bash
just clean-study format-comparison   # Remove one study's data (preserves datasets)
just clean-studies                   # Remove all study data (preserves datasets)
```

For manual fine-grained cleanup:

```bash
rm -rf data/metrics/format-comparison         # Re-measure only
rm -rf data/encoded/format-comparison/jpeg/*  # Re-encode one format
```

There is no command to delete raw datasets; remove them manually with `rm -rf data/datasets/<folder>`.
```bash
# Backup metrics (small and valuable)
tar czf metrics_backup.tar.gz data/metrics/

# Backup key plots
tar czf plots_backup.tar.gz data/analysis/plots/
```

## Data Lineage

Understanding the data flow helps with debugging and reproducibility:
```
config/studies/<study>.json  (read by every stage)
        ↓
datasets → [preprocessed] → encoded → metrics → analysis
                               ↓         ↓
                        results.json  quality.json
```

Pipeline flow:

1. Dataset: Raw images downloaded from external sources
2. Preprocessing (optional): Images resized according to study config
3. Encoding: Images encoded per study configuration → `results.json`
4. Quality Measurement: Metrics measured from `results.json` → `quality.json`
5. Analysis: Visualizations and reports from `quality.json`
Each stage:

- Reads configuration from `config/studies/<study-id>.json`
- Reads input from the previous stage's output
- Writes to its own directory under `data/<stage>/<study-id>/`
- Can be re-run independently if the previous stage's outputs exist
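That re-run rule can be expressed as a small preflight check. This is a sketch under the layout described above; `STAGE_INPUTS` and `can_rerun` are hypothetical and cover only the two stages whose input is a single file:

```python
from pathlib import Path

# Hypothetical mapping of each re-runnable stage to the output it needs
# from the previous stage (paths follow the directory layout above).
STAGE_INPUTS = {
    "metrics":  "data/encoded/{study}/results.json",
    "analysis": "data/metrics/{study}/quality.json",
}

def can_rerun(stage: str, study: str, root: Path = Path(".")) -> bool:
    """True if the previous stage's output for `study` is already on disk."""
    template = STAGE_INPUTS.get(stage)
    if template is None:
        raise ValueError(f"unknown stage: {stage}")
    return (root / template.format(study=study)).exists()
```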
## See Also

- Configuration Files Reference - Config file formats
- Architecture and Design - Design decisions
- How to Fetch Datasets - Practical usage