pipeline
module src.pipeline
Merged encode+measure pipeline with time-budget support.
This module provides a unified pipeline that processes images with a worker-per-image architecture, encoding and measuring quality in a single pass. This eliminates the need to store intermediate encoded files on disk and allows time-budget-based processing where the pipeline processes as many images as possible within a given time constraint.
Architecture:
- Each worker processes one complete image at a time (all encoding tasks sequentially)
- Workers pull the next image from a queue when they finish
- This keeps all workers fully utilized throughout the pipeline
- Memory-intensive operations are naturally staggered across workers, reducing peak memory usage
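The worker-per-image pattern described above can be sketched with a shared queue and a few threads. This is a minimal illustration, not the module's actual implementation: `process_image`, `worker`, `run_all`, and the task names are hypothetical stand-ins, and the real pipeline may use processes rather than threads.

```python
from __future__ import annotations

import queue
import threading


def process_image(image: str) -> dict:
    """Hypothetical stand-in: run every encoding task for one image, in order."""
    tasks = ["task-a", "task-b", "task-c"]  # illustrative task names only
    return {"image": image, "variants": [f"{image}:{task}" for task in tasks]}


def worker(images: queue.Queue, results: list, lock: threading.Lock) -> None:
    """Pull one image at a time until the queue is empty."""
    while True:
        try:
            image = images.get_nowait()
        except queue.Empty:
            return  # no images left, so this worker exits
        result = process_image(image)  # all tasks for this image run sequentially
        with lock:
            results.append(result)


def run_all(image_names: list[str], num_workers: int = 4) -> list[dict]:
    """Start num_workers workers that drain a shared image queue."""
    images: queue.Queue = queue.Queue()
    for name in image_names:
        images.put(name)
    results: list[dict] = []
    lock = threading.Lock()
    threads = [
        threading.Thread(target=worker, args=(images, results, lock))
        for _ in range(num_workers)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results
```

Because each worker takes a new image only when it finishes the previous one, workers drift out of phase with each other, which is what staggers the memory-heavy steps.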
Key advantages over the separate encode → measure workflow:
- Time-budget control: Set a wall-clock time limit instead of guessing how many images to process. The pipeline processes as many images as possible within the budget.
- Full worker utilization: Workers always have work available while images remain in the queue, so none sit idle waiting for other tasks to complete.
- Reduced peak memory: Tasks are staggered across workers rather than synchronized, preventing memory spikes from parallel execution of memory-intensive tools.
- Reduced disk IO: Encoded files are written to temporary storage and cleaned up after measurement. The optional save_artifacts flag persists them to disk.
- Per-image error isolation: All operations for one image are grouped within a single worker. If encoding or measurement fails, the worker logs the error and moves to the next image.
Global Variables
- UTC
function parse_time_budget
parse_time_budget(value: str) → float
Parse a human-readable time budget string into seconds.
Accepted formats:
- Plain number: interpreted as seconds ("3600" → 3600.0)
- Duration suffixes: "1h", "30m", "90s", "1h30m", "2h15m30s"
Args:
value: Time budget string.
Returns: Duration in seconds.
Raises:
ValueError: If the format cannot be parsed.
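One way to parse the accepted formats is with two regular expressions, shown below as a sketch. The name `parse_time_budget_sketch` is hypothetical, and several behaviors are assumptions not confirmed by the documentation above: fractional plain numbers are accepted, input is case-insensitive, and units must appear in h, m, s order.

```python
import re

# Plain number such as "3600" or "12.5" (fractional support is an assumption).
_PLAIN_RE = re.compile(r"\d+(?:\.\d+)?")
# Optional hours, minutes, and seconds, in that order: "1h", "1h30m", "2h15m30s".
_DURATION_RE = re.compile(r"(?:(\d+)h)?(?:(\d+)m)?(?:(\d+)s)?")


def parse_time_budget_sketch(value: str) -> float:
    """Convert a time budget string into seconds, or raise ValueError."""
    text = value.strip().lower()
    if _PLAIN_RE.fullmatch(text):
        return float(text)
    match = _DURATION_RE.fullmatch(text)
    # The duration pattern also matches the empty string, so reject that case.
    if not text or match is None:
        raise ValueError(f"Cannot parse time budget: {value!r}")
    hours, minutes, seconds = (int(group) if group else 0 for group in match.groups())
    return float(hours * 3600 + minutes * 60 + seconds)
```

With this sketch, "1h30m" yields 5400.0 and "2h15m30s" yields 8130.0, while a string such as "soon" raises ValueError.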
class PipelineRunner
Merged encode+measure pipeline with time-budget support.
Uses a worker-per-image architecture where each worker processes one complete image before moving to the next. For each image, the worker:
1. Preprocesses (resize) for every configured resolution.
2. Encodes all parameter combinations sequentially.
3. Measures quality of each encoded variant.
4. Pulls the next image from the queue if time budget allows.
This architecture keeps all workers fully utilized and naturally staggers memory-intensive operations across workers, reducing peak memory usage.
Time budget behavior:
- Initial batch fills all available workers (max throughput at start)
- Budget is checked before submitting additional images
- When budget expires, new submissions stop but in-flight work completes
- Note: In-flight images process sequentially on their assigned workers, which may leave some workers idle during the finish phase. A future optimization could switch to task-level parallelism after budget expiry.
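The submission rules above can be sketched with `concurrent.futures`. This is an illustration under stated assumptions, not the PipelineRunner code: `run_with_budget` and its `process` and `clock` parameters are hypothetical, threads stand in for whatever worker type the pipeline uses, and error handling is omitted.

```python
import itertools
import time
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait

_NO_MORE = object()  # sentinel marking an exhausted image iterator


def run_with_budget(images, process, time_budget=None, num_workers=4,
                    clock=time.monotonic):
    """Process images until time_budget seconds have elapsed (None = no limit)."""
    start = clock()
    remaining = iter(images)
    results = []
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        # The initial batch fills every worker regardless of the budget.
        in_flight = {
            pool.submit(process, image)
            for image in itertools.islice(remaining, num_workers)
        }
        while in_flight:
            done, in_flight = wait(in_flight, return_when=FIRST_COMPLETED)
            for future in done:
                results.append(future.result())
                # Check the budget before handing the freed worker a new image.
                within_budget = time_budget is None or clock() - start < time_budget
                if within_budget:
                    image = next(remaining, _NO_MORE)
                    if image is not _NO_MORE:
                        in_flight.add(pool.submit(process, image))
        # Once the budget expires nothing new is submitted, but the loop keeps
        # waiting, so in-flight images still complete.
    return results
```

A budget of zero therefore still processes the initial batch, one image per worker, which matches the rule that in-flight work always completes.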
Encoded files live in a temporary directory and are discarded after measurement unless save_artifacts=True.
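The temporary-file lifecycle can be sketched with the standard `tempfile` module. Everything here is a hypothetical stand-in: `encode_and_measure`, the `.bin` file name, and the `artifacts_dir` parameter (playing the role of save_artifacts=True) are illustrative, and the placeholder bytes and size check substitute for real encoding and quality measurement.

```python
from __future__ import annotations

import shutil
import tempfile
from pathlib import Path


def encode_and_measure(image_name: str, artifacts_dir: Path | None = None) -> dict:
    """Encode into a temp dir, measure, and optionally keep a copy."""
    with tempfile.TemporaryDirectory() as tmp:
        encoded = Path(tmp) / f"{image_name}.bin"
        encoded.write_bytes(b"encoded-bytes")  # placeholder for a real encoder
        size = encoded.stat().st_size  # placeholder for a quality measurement
        if artifacts_dir is not None:
            # Persist a copy before the temporary directory is removed.
            artifacts_dir.mkdir(parents=True, exist_ok=True)
            shutil.copy2(encoded, artifacts_dir / encoded.name)
    # The temporary directory and the encoded file no longer exist here.
    return {"image": image_name, "bytes": size, "temp_path": encoded}
```

Measuring inside the `with` block keeps the encoded file's lifetime as short as possible, so disk usage stays bounded by the number of images in flight.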
method __init__
__init__(project_root: Path) → None
method run
run(config: StudyConfig, time_budget: float | None = None, save_artifacts: bool = False, num_workers: int | None = None) → QualityResults
Run the merged encode+measure pipeline.
Args:
config: Study configuration describing dataset, encoders, and optional preprocessing.
time_budget: Maximum wall-clock seconds to spend. When set, the pipeline processes images until this budget is exhausted (always completing the current image). None means process all available images.
save_artifacts: If True, persist encoded files to data/encoded/<study_id>/.
num_workers: Parallel workers (default: CPU count).
Returns:
QualityResults ready for analysis / report.
Raises:
ValueError: If dataset is not found in configuration.
FileNotFoundError: If dataset is not downloaded or has no images.