dataset
module src.dataset
Dataset fetching and management module.
This module handles downloading and organizing test image datasets for the research project. It provides an extensible architecture for supporting multiple dataset sources through a JSON configuration file.
class DatasetConfig
Configuration for a dataset.
method __init__
__init__(id: str, name: str, description: str, type: str, url: str, size_mb: float, image_count: int, resolution: str, format: str, extracted_folder: str | None = None, rename_to: str | None = None, license: str | None = None, source: str | None = None, storage_type: str = 'direct', folder_id: str | None = None, post_process: str | None = None) → None
classmethod from_dict
from_dict(data: dict) → DatasetConfig
Create a DatasetConfig from a dictionary.
Args:
data: Dictionary with dataset configuration
Returns: DatasetConfig instance
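As a rough illustration of how a dictionary-based constructor like this can work, the sketch below uses a stand-in dataclass with a subset of the documented fields. The class name `DatasetConfigSketch`, the JSON entry, and the choice to ignore unknown keys are all assumptions made for the example; the real `from_dict` and the real layout of datasets.json may differ.

```python
import json
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class DatasetConfigSketch:
    """Stand-in with a subset of the documented fields (not the real class)."""

    id: str
    name: str
    url: str
    size_mb: float
    image_count: int
    storage_type: str = "direct"
    rename_to: Optional[str] = None

    @classmethod
    def from_dict(cls, data: dict) -> "DatasetConfigSketch":
        # Assumption: keys that do not match a declared field are dropped.
        # The real implementation may reject them instead.
        known = {f.name for f in fields(cls)}
        return cls(**{k: v for k, v in data.items() if k in known})


# Hypothetical configuration entry, invented for this example.
raw = json.loads(
    """
    {
      "id": "example-valid",
      "name": "Example validation set",
      "url": "https://example.com/example_valid.zip",
      "size_mb": 12.5,
      "image_count": 100,
      "comment": "ignored by the sketch"
    }
    """
)

config = DatasetConfigSketch.from_dict(raw)
print(config.id, config.storage_type)
```

Optional fields that are missing from the dictionary fall back to their defaults, which mirrors the defaulted parameters in the `__init__` signature above.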
class DatasetFetcher
Handles fetching and managing image datasets.
This class provides methods for downloading datasets from various sources, extracting archives, and managing the local dataset storage. It uses a JSON configuration file to define available datasets.
method __init__
__init__(base_dir: Path, config_file: Path | None = None) → None
Initialize the dataset fetcher.
Args:
base_dir: Base directory where datasets will be stored
config_file: Path to the datasets.json configuration file. If None, looks for config/datasets.json.
method download_file
download_file(url: str, output_path: Path, description: str | None = None) → bool
Download a file from a URL with a progress bar.
Args:
url: URL of the file to download
output_path: Path where the file will be saved
description: Optional description for the progress bar
Returns: True if download succeeded, False otherwise
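The boolean return contract can be sketched with the standard library alone. The function below is an illustrative stand-in, not the module's code: the name `download_file_sketch`, the `.part` temporary file, and the plain percentage printout (in place of whatever progress bar the project uses) are assumptions.

```python
import urllib.error
import urllib.request
from pathlib import Path


def download_file_sketch(url: str, output_path: Path, chunk_size: int = 1 << 16) -> bool:
    """Stream url to output_path; return True on success, False on failure."""
    # Write to a temporary name first so a failed download never leaves a
    # half-written file at the final location (an assumption of this sketch).
    tmp_path = output_path.with_name(output_path.name + ".part")
    try:
        output_path.parent.mkdir(parents=True, exist_ok=True)
        with urllib.request.urlopen(url, timeout=30) as response, open(tmp_path, "wb") as out:
            total = int(response.headers.get("Content-Length") or 0)
            done = 0
            while chunk := response.read(chunk_size):
                out.write(chunk)
                done += len(chunk)
                if total:
                    # Minimal progress indicator; only shown when the size is known.
                    print(f"\r{done * 100 // total}%", end="", flush=True)
        print()
        tmp_path.replace(output_path)
        return True
    except (urllib.error.URLError, OSError, ValueError):
        # Network errors, filesystem errors, and malformed URLs all map to False.
        tmp_path.unlink(missing_ok=True)
        return False
```

Returning False instead of raising lets a caller try the next dataset or report the failure without wrapping every call in its own try block.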
method download_from_dropbox
download_from_dropbox(url: str, output_path: Path, description: str | None = None) → bool
Download a file from Dropbox.
Args:
url: Dropbox sharing URL
output_path: Path where the file will be saved
description: Optional description for the progress bar
Returns: True if download succeeded, False otherwise
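Dropbox sharing links usually open a preview page unless the `dl` query parameter is set to `1`. A minimal sketch of that rewrite follows; the helper name `dropbox_direct_url` is hypothetical, and whether the method above performs exactly this step is an assumption.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def dropbox_direct_url(share_url: str) -> str:
    """Return the sharing URL with dl=1 so the file itself is served."""
    parts = urlsplit(share_url)
    # Keep every existing parameter (for example rlkey) except any old dl value.
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != "dl"
    ]
    query.append(("dl", "1"))
    return urlunsplit(parts._replace(query=urlencode(query)))


print(dropbox_direct_url("https://www.dropbox.com/s/abc123/data.zip?dl=0"))
```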
method download_from_google_drive
download_from_google_drive(url: str, output_path: Path, description: str | None = None, is_folder: bool = False) → bool
Download a file or folder from Google Drive.
Args:
url: Google Drive URL or file ID
output_path: Path where the file/folder will be saved
description: Optional description for the download
is_folder: Whether this is a folder download
Returns: True if download succeeded, False otherwise
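Because the `url` argument accepts either a full URL or a bare ID, some normalization step is needed before downloading. The sketch below shows one way to do it, assuming the common `/file/d/<id>`, `/folders/<id>`, and `?id=<id>` link shapes. The helper name `google_drive_id` and the regular expressions are illustrative and are not taken from the module.

```python
import re
from typing import Optional

# Link shapes assumed for this sketch; other shapes may exist.
_ID_PATTERNS = (
    re.compile(r"/file/d/([A-Za-z0-9_-]+)"),
    re.compile(r"/folders/([A-Za-z0-9_-]+)"),
    re.compile(r"[?&]id=([A-Za-z0-9_-]+)"),
)
_BARE_ID = re.compile(r"[A-Za-z0-9_-]+")


def google_drive_id(url_or_id: str) -> Optional[str]:
    """Extract a Drive file or folder ID from a URL, or accept a bare ID."""
    for pattern in _ID_PATTERNS:
        match = pattern.search(url_or_id)
        if match:
            return match.group(1)
    # No URL pattern matched: treat the input as an ID only if it looks like one.
    if _BARE_ID.fullmatch(url_or_id):
        return url_or_id
    return None


print(google_drive_id("https://drive.google.com/file/d/1AbC_dEf-123/view?usp=sharing"))
```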
method extract_archive
extract_archive(archive_path: Path, extract_dir: Path) → bool
Extract a ZIP or TAR archive.
Args:
archive_path: Path to the archive file
extract_dir: Directory where contents will be extracted
Returns: True if extraction succeeded, False otherwise
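A minimal sketch of ZIP/TAR extraction with the same True/False contract, using only the standard library. Detecting the format by content rather than by file extension is a choice made for this example, and `extract_archive_sketch` is a hypothetical name.

```python
import tarfile
import zipfile
from pathlib import Path


def extract_archive_sketch(archive_path: Path, extract_dir: Path) -> bool:
    """Extract a ZIP or TAR archive; return True on success, False otherwise."""
    try:
        extract_dir.mkdir(parents=True, exist_ok=True)
        if zipfile.is_zipfile(archive_path):
            with zipfile.ZipFile(archive_path) as zf:
                zf.extractall(extract_dir)
        elif tarfile.is_tarfile(archive_path):
            # tarfile.open auto-detects gzip, bzip2, and xz compression.
            # For untrusted archives, consider the extraction filters
            # available in recent Python versions.
            with tarfile.open(archive_path) as tf:
                tf.extractall(extract_dir)
        else:
            return False  # neither a ZIP nor a TAR archive
    except (zipfile.BadZipFile, tarfile.TarError, OSError):
        return False
    return True
```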
method extract_multipart_zips
extract_multipart_zips(dataset_dir: Path) → bool
Extract multi-part zip archives in a directory.
LIU4K v2 datasets use multi-part zips (.zip, .z01, .z02, etc.) organized by category. This method finds and extracts all such archives.
Args:
dataset_dir: Directory containing multi-part zip files
Returns: True if all extractions succeeded, False otherwise
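Python's built-in `zipfile` module does not handle multi-disk ZIP files, so extracting these archives needs an external tool or a merge step; which approach this method takes is not stated here. The discovery half can still be sketched: the hypothetical helper below groups each `.zip` file with its sibling `.z01`, `.z02`, ... segments, assuming segments share the archive's base name and directory.

```python
import re
from pathlib import Path

# Matches segment suffixes such as .z01, .z02, .z10 (naming is an assumption).
_PART_SUFFIX = re.compile(r"\.z\d{2,}$", re.IGNORECASE)


def find_multipart_sets(dataset_dir: Path) -> dict[Path, list[Path]]:
    """Map each final .zip segment to its sorted .z01, .z02, ... segments."""
    sets: dict[Path, list[Path]] = {}
    for zip_path in sorted(dataset_dir.rglob("*.zip")):
        parts = sorted(
            p
            for p in zip_path.parent.iterdir()
            if p.stem == zip_path.stem and _PART_SUFFIX.search(p.name)
        )
        # A .zip with no numbered siblings is a single-file archive; skip it.
        if parts:
            sets[zip_path] = parts
    return sets
```

Searching recursively with `rglob` reflects the "organized by category" layout described above, where each category may sit in its own subdirectory.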
method extract_zips
extract_zips(dataset_dir: Path) → bool
Extract single-file zip archives in a directory.
LIU4K v1 datasets use single zip files containing all images. This method finds and extracts all such archives.
Args:
dataset_dir: Directory containing zip files
Returns: True if all extractions succeeded, False otherwise
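The "True only if all extractions succeeded" contract can be illustrated as follows. Extracting into the same directory, scanning only the top level, and continuing after a failed archive are assumptions of this sketch, not documented behavior.

```python
import zipfile
from pathlib import Path


def extract_zips_sketch(dataset_dir: Path) -> bool:
    """Extract every top-level .zip in dataset_dir; True only if none failed."""
    all_ok = True
    for zip_path in sorted(dataset_dir.glob("*.zip")):
        try:
            with zipfile.ZipFile(zip_path) as zf:
                zf.extractall(dataset_dir)
        except (zipfile.BadZipFile, OSError):
            # Record the failure but keep going, so one corrupt archive
            # does not block the remaining ones.
            all_ok = False
    return all_ok
```

With this design a directory containing no zip files returns True, since nothing failed.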
method fetch_dataset
fetch_dataset(dataset_id: str, cleanup_archive: bool = True) → Path | None
Fetch a dataset using its configuration.
Args:
dataset_id: Dataset identifier from datasets.json
cleanup_archive: Whether to delete the downloaded archive after extraction
Returns: Path to the extracted dataset directory, or None if fetch failed
method fetch_div2k
fetch_div2k(split: Literal['train', 'valid'] = 'valid', cleanup_archive: bool = True) → Path | None
Fetch the DIV2K dataset.
This is a convenience method that maps to the configuration-based fetch.
Args:
split: Dataset split to download, either "train" or "valid"
cleanup_archive: Whether to delete the downloaded archive after extraction
Returns: Path to the extracted dataset directory, or None if fetch failed
method get_dataset_config
get_dataset_config(dataset_id: str) → DatasetConfig | None
Get configuration for a specific dataset.
Args:
dataset_id: Dataset identifier
Returns: DatasetConfig if found, None otherwise
method get_dataset_info
get_dataset_info(dataset_name: str) → dict[str, int | Path] | None
Get information about a downloaded dataset.
Args:
dataset_name: Name of the dataset directory
Returns: Dictionary with dataset information (path, image count), or None if not found
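A sketch of what such a lookup might compute. The dictionary keys `path` and `image_count`, the set of image suffixes, and the recursive count are assumptions, since the page only states that the result holds a path and an image count.

```python
from pathlib import Path
from typing import Optional, Union

# Suffixes treated as images in this sketch; the real list is not documented here.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".bmp", ".tif", ".tiff", ".webp"}


def dataset_info_sketch(
    base_dir: Path, dataset_name: str
) -> Optional[dict[str, Union[int, Path]]]:
    """Return the dataset path and image count, or None if it is not downloaded."""
    dataset_path = base_dir / dataset_name
    if not dataset_path.is_dir():
        return None
    image_count = sum(
        1
        for p in dataset_path.rglob("*")
        if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
    )
    return {"path": dataset_path, "image_count": image_count}
```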
method list_available_datasets
list_available_datasets() → list[DatasetConfig]
List all datasets available in the configuration.
Returns: List of DatasetConfig objects for available datasets
method list_datasets
list_datasets() → list[str]
List all datasets present in the base directory.
Returns: List of dataset names (directory names in the base directory)
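Unlike list_available_datasets, which reads the configuration, this method reports what is on disk. A minimal sketch of such a directory scan follows; the sorted order and the skipping of hidden directories are assumptions made for the example.

```python
from pathlib import Path


def list_datasets_sketch(base_dir: Path) -> list[str]:
    """Return the names of dataset directories found directly under base_dir."""
    if not base_dir.is_dir():
        return []
    return sorted(
        p.name
        for p in base_dir.iterdir()
        if p.is_dir() and not p.name.startswith(".")
    )
```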