# Data directory structure
## Overview

All research data is stored under `data/`, organized by pipeline stage:
```
data/
├── datasets/      # Raw image datasets (from `just fetch`)
├── preprocessed/  # Preprocessed images (per study, per preprocessing mode)
├── encoded/       # Encoded images (per study, per format)
├── metrics/       # Quality measurements (per study)
├── analysis/      # Analysis plots and statistics (per study)
└── report/        # Generated HTML reports
```

All subdirectories are git-ignored except the `.gitkeep` files that preserve the directory structure.
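As a sanity check, the layout above can be verified from Python. This is an illustrative sketch, not part of the pipeline scripts; `missing_stage_dirs` is a hypothetical helper.

```python
from pathlib import Path

# The six pipeline-stage directories described above.
STAGES = ["datasets", "preprocessed", "encoded", "metrics", "analysis", "report"]

def missing_stage_dirs(root: str = "data") -> list[str]:
    """Return the stage subdirectories that do not exist under `root`."""
    base = Path(root)
    return [name for name in STAGES if not (base / name).is_dir()]
```

On a fresh checkout this should return an empty list, since each directory is preserved by its `.gitkeep` file.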
## `data/datasets/`

Raw, unmodified datasets from external sources.

Generated by: `scripts/fetch_dataset.py` (`just fetch <dataset-id>`)
```
data/datasets/
├── DIV2K_valid/
│   ├── 0801.png
│   └── ...
├── DIV2K_train/
├── LIU4K_valid/
├── LIU4K_train/
└── UHD_IQA/
```

Typical size: 500 MB – 10 GB depending on datasets fetched.
## `data/preprocessed/`

Images after preprocessing, organized by study and preprocessing mode.

Generated by: `scripts/run_pipeline.py` (automatically, when an encoder config specifies resolution or crop)
```
data/preprocessed/
├── resolution-impact/        # Study ID
│   ├── r640/
│   │   ├── 0801_r640.png
│   │   └── ...
│   └── ...
└── avif-crop-impact/
    ├── c400/
    │   ├── 0801_c400.png
    │   └── ...
    └── ...
```

- `r<pixels>` directories contain resized inputs with that longest edge
- `c<pixels>` directories contain cropped inputs whose longest edge is that value
- Crop studies keep a fixed analysis fragment and vary the surrounding area
- Only created when a study uses preprocessing; studies at original resolution skip this directory
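The preprocessed filenames follow a simple `<stem>_<mode><pixels>` pattern. A sketch of a helper that mirrors it (`preprocessed_name` is hypothetical, not a pipeline function):

```python
from pathlib import Path

def preprocessed_name(source: str, mode: str, pixels: int) -> str:
    """Build a preprocessed filename matching the layout above.

    mode is "r" for resized or "c" for cropped; the suffix encodes the
    longest-edge size, e.g. 0801.png -> 0801_r640.png.
    """
    if mode not in ("r", "c"):
        raise ValueError("mode must be 'r' (resize) or 'c' (crop)")
    p = Path(source)
    return f"{p.stem}_{mode}{pixels}{p.suffix}"
```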
## `data/encoded/`

Compressed images in target formats, organized by study.

Generated by: `scripts/run_pipeline.py` (`just pipeline <study-id> <time-budget>`)
```
data/encoded/
└── format-comparison/
    ├── results.json          # Encoding results metadata
    ├── jpeg/
    │   └── original/
    │       ├── 0801_q75.jpg
    │       └── ...
    ├── webp/
    │   └── original/
    ├── avif/
    │   └── original/
    │       ├── 0801_q60_420_s4.avif
    │       └── ...
    └── jxl/
        └── original/
```

Naming conventions:
| Component | Pattern | Example |
|---|---|---|
| Study ID | top-level directory | `format-comparison/` |
| Format | subdirectory | `avif/` |
| Preprocessing level | `original/`, `r<pixels>/`, or `c<pixels>/` | `r1920/`, `c800/` |
| JPEG filename | `<name>_q<quality>.jpg` | `0801_q75.jpg` |
| WebP filename | `<name>_q<quality>.webp` | `0801_q85.webp` |
| AVIF filename | `<name>_q<quality>_<chroma>_s<speed>.avif` | `0801_q60_420_s4.avif` |
| JXL filename | `<name>_q<quality>.jxl` | `0801_q85.jxl` |
`results.json` records every encoding with source paths, parameters, dimensions, and file sizes. See the Configuration reference for the full schema.
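The naming conventions above are regular enough to parse back into encoding parameters. A sketch, assuming only the four formats in the table (`parse_encoded` and `PATTERNS` are illustrative helpers, not part of the pipeline scripts):

```python
import re

# One pattern per format, mirroring the naming table above.
PATTERNS = {
    ".jpg":  re.compile(r"^(?P<name>.+)_q(?P<quality>\d+)\.jpg$"),
    ".webp": re.compile(r"^(?P<name>.+)_q(?P<quality>\d+)\.webp$"),
    ".jxl":  re.compile(r"^(?P<name>.+)_q(?P<quality>\d+)\.jxl$"),
    ".avif": re.compile(
        r"^(?P<name>.+)_q(?P<quality>\d+)_(?P<chroma>\d{3})_s(?P<speed>\d+)\.avif$"
    ),
}

def parse_encoded(filename: str) -> dict:
    """Extract encoding parameters from a filename such as 0801_q60_420_s4.avif."""
    for suffix, pattern in PATTERNS.items():
        if filename.endswith(suffix):
            m = pattern.match(filename)
            if m:
                return m.groupdict()
    raise ValueError(f"unrecognized encoded filename: {filename}")
```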
## `data/metrics/`

Quality measurements per study.

Generated by: `scripts/run_pipeline.py` (measured immediately after encoding each image)
```
data/metrics/
└── format-comparison/
    └── quality.json
```

Each `quality.json` contains one entry per encoding with:
- All fields from the encoding result
- `crop`, `analysis_fragment`, and `crop_region` for crop-impact studies
- `ssimulacra2`, `psnr`, `ssim`, `butteraugli` scores
- `measurement_error` (null on success, error string on failure)
For metric interpretation, see Tools reference.
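Assuming `quality.json` holds a flat list of per-encoding entries as described, failed measurements can be filtered out like this (an illustrative helper, not a pipeline script):

```python
import json
from pathlib import Path

def failed_measurements(quality_path) -> list[dict]:
    """Return entries whose measurement_error is set.

    Assumes quality.json is a JSON array of per-encoding objects,
    each carrying a measurement_error field (null on success).
    """
    entries = json.loads(Path(quality_path).read_text())
    return [e for e in entries if e.get("measurement_error") is not None]
```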
## `data/analysis/`

Analysis plots and statistics per study.

Generated by: `scripts/analyze_study.py` (`just analyze <study-id>`)
```
data/analysis/
└── format-comparison/
    ├── format-comparison_rate_distortion.png
    ├── format-comparison_speed_analysis.png
    ├── format-comparison_statistics.csv
    ├── format-comparison_quality_distribution.png
    └── comparison/
        └── ssimulacra2/
            ├── comparison_70.webp
            ├── distortion_map_comparison_70.webp
            ├── distortion_map_anisotropic.webp
            └── original_annotated.webp
```

Output files are named `<study-id>_<plot-type>.<ext>`. Comparison outputs live under `comparison/`, grouped by target metric and, when needed, additional split directories such as `r720/` or `c800/`.
## `data/report/`

Generated HTML report with embedded visualizations.

Generated by: `scripts/generate_report.py` (`just report`)
```
data/report/
├── index.html
└── assets/
    └── ...
```

Serve locally with `just serve-report`.
## Disk space

| Directory | Typical size | Notes |
|---|---|---|
| `datasets/` | 500 MB – 10 GB | Depends on datasets fetched |
| `preprocessed/` | Varies | Only for studies with resolution or crop preprocessing |
| `encoded/` | 2–5× dataset | Multiple formats and quality levels |
| `metrics/` | < 100 MB | JSON files |
| `analysis/` | < 100 MB | PNG plots and CSV statistics |
| `report/` | < 50 MB | Single HTML page with assets |
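To compare actual usage against these estimates, per-stage sizes can be summed with the standard library (a sketch; `dir_size_bytes` and `report_usage` are hypothetical helpers):

```python
import os

STAGES = ["datasets", "preprocessed", "encoded", "metrics", "analysis", "report"]

def dir_size_bytes(path: str) -> int:
    """Total size of all regular files under `path` (0 if it doesn't exist)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            fp = os.path.join(root, name)
            if os.path.isfile(fp):
                total += os.path.getsize(fp)
    return total

def report_usage(data_root: str = "data") -> None:
    """Print one line per pipeline stage with its on-disk size in MB."""
    for stage in STAGES:
        size_mb = dir_size_bytes(os.path.join(data_root, stage)) / 1e6
        print(f"{stage:14s} {size_mb:10.1f} MB")
```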
## Cleanup

```bash
just clean-study format-comparison   # Remove one study's data (preserves datasets)
just clean-studies                   # Remove all study data (preserves datasets)
```

For manual fine-grained cleanup:

```bash
rm -rf data/metrics/format-comparison         # Re-measure only
rm -rf data/encoded/format-comparison/jpeg/*  # Re-encode one format
```

There is no command to delete raw datasets; remove them manually with `rm -rf data/datasets/<folder>`.
```bash
# Backup metrics (small and valuable)
tar czf metrics_backup.tar.gz data/metrics/

# Backup key plots
tar czf plots_backup.tar.gz data/analysis/plots/
```

## Data Lineage

Understanding the data flow helps with debugging and reproducibility:
```
config/studies/<study>.json  (read by every stage)
        ↓
datasets → [preprocessed] → encoded → metrics → analysis
                               ↓         ↓
                        results.json  quality.json
```

Pipeline flow:

1. Dataset: Raw images downloaded from external sources
2. Preprocessing (optional): Images resized according to study config
3. Encoding: Images encoded per study configuration → `results.json`
4. Quality Measurement: Metrics measured from `results.json` → `quality.json`
5. Analysis: Visualizations and reports from `quality.json`
Each stage:

- Reads configuration from `config/studies/<study-id>.json`
- Reads input from the previous stage's output
- Writes to its own directory under `data/<stage>/<study-id>/`
- Can be re-run independently if the previous stage's outputs exist
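That re-run rule can be expressed as a small preflight check. This is a sketch under the layout described above; `STAGE_INPUTS` and `can_rerun` are hypothetical and cover only the two stages whose input is a single file:

```python
from pathlib import Path

# Hypothetical mapping of each re-runnable stage to the output it needs
# from the previous stage (paths follow the directory layout above).
STAGE_INPUTS = {
    "metrics":  "data/encoded/{study}/results.json",
    "analysis": "data/metrics/{study}/quality.json",
}

def can_rerun(stage: str, study: str, root: Path = Path(".")) -> bool:
    """True if the previous stage's output for `study` is already on disk."""
    template = STAGE_INPUTS.get(stage)
    if template is None:
        raise ValueError(f"unknown stage: {stage}")
    return (root / template.format(study=study)).exists()
```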
## See Also

- Configuration Files Reference - Config file formats
- Architecture and Design - Design decisions
- How to Fetch Datasets - Practical usage