dataset

Dataset fetching and management module.

This module handles downloading and organizing test image datasets for the research project. It provides an extensible architecture for supporting multiple dataset sources through a JSON configuration file.


DatasetConfig

Configuration for a dataset.

__init__(
id: str,
name: str,
description: str,
type: str,
url: str,
size_mb: float,
image_count: int,
resolution: str,
format: str,
extracted_folder: str | None = None,
rename_to: str | None = None,
license: str | None = None,
source: str | None = None,
storage_type: str = 'direct',
folder_id: str | None = None,
post_process: str | None = None
) → None

from_dict(data: dict) → DatasetConfig

Create DatasetConfig from dictionary.

Args:

  • data: Dictionary with dataset configuration

Returns: DatasetConfig instance


Handles fetching and managing image datasets.

This class provides methods for downloading datasets from various sources, extracting archives, and managing the local dataset storage. It uses a JSON configuration file to define available datasets.

__init__(base_dir: Path, config_file: Path | None = None) → None

Initialize the dataset fetcher.

Args:

  • base_dir: Base directory where datasets will be stored
  • config_file: Path to datasets.json configuration file. If None, looks for config/datasets.json.

download_file(
url: str,
output_path: Path,
description: str | None = None
) → bool

Download a file from URL with progress bar.

Args:

  • url: URL of the file to download
  • output_path: Path where the file will be saved
  • description: Optional description for the progress bar

Returns: True if download succeeded, False otherwise
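The core of `download_file` is a chunked stream-to-disk copy; a real implementation would wrap something like `urllib.request.urlopen(url)` and advance a progress bar per chunk (the progress-bar library is an assumption). This sketch isolates the stream-copy step so it needs no network:

```python
from pathlib import Path
from typing import BinaryIO

def stream_to_file(src: BinaryIO, output_path: Path, chunk_size: int = 8192) -> int:
    """Copy a binary stream to disk in chunks; return bytes written.

    A real download_file would pass the HTTP response body as `src`
    and update a progress bar after each chunk.
    """
    output_path.parent.mkdir(parents=True, exist_ok=True)
    written = 0
    with output_path.open("wb") as dst:
        while chunk := src.read(chunk_size):
            dst.write(chunk)
            written += len(chunk)
    return written
```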


download_from_dropbox(
url: str,
output_path: Path,
description: str | None = None
) → bool

Download a file from Dropbox.

Args:

  • url: Dropbox sharing URL
  • output_path: Path where the file will be saved
  • description: Optional description for the progress bar

Returns: True if download succeeded, False otherwise
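Dropbox sharing links normally render a preview page; Dropbox serves the raw file when the `dl` query parameter is `1`. A plausible URL-normalization step (the exact rewriting rules used by the module are an assumption):

```python
def to_direct_dropbox_url(url: str) -> str:
    """Rewrite a Dropbox sharing link into a direct-download link
    by forcing dl=1 in the query string."""
    if "dl=0" in url:
        return url.replace("dl=0", "dl=1")
    if "dl=" not in url:
        sep = "&" if "?" in url else "?"
        return f"{url}{sep}dl=1"
    return url
```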


download_from_google_drive(
url: str,
output_path: Path,
description: str | None = None,
is_folder: bool = False
) → bool

Download a file or folder from Google Drive.

Args:

  • url: Google Drive URL or file ID
  • output_path: Path where the file/folder will be saved
  • description: Optional description for download
  • is_folder: Whether this is a folder download

Returns: True if download succeeded, False otherwise


extract_archive(archive_path: Path, extract_dir: Path) → bool

Extract a ZIP or TAR archive.

Args:

  • archive_path: Path to the archive file
  • extract_dir: Directory where contents will be extracted

Returns: True if extraction succeeded, False otherwise
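Dispatching on archive type with the standard library's `zipfile` and `tarfile` is one straightforward way to implement this method (a sketch under that assumption; the real error handling may differ):

```python
import tarfile
import zipfile
from pathlib import Path

def extract_archive(archive_path: Path, extract_dir: Path) -> bool:
    """Extract a ZIP or TAR (.gz/.bz2/.xz) archive, choosing the
    extractor by sniffing the file, not by its extension."""
    extract_dir.mkdir(parents=True, exist_ok=True)
    try:
        if zipfile.is_zipfile(archive_path):
            with zipfile.ZipFile(archive_path) as zf:
                zf.extractall(extract_dir)
        elif tarfile.is_tarfile(archive_path):
            with tarfile.open(archive_path) as tf:
                tf.extractall(extract_dir)
        else:
            return False  # unrecognized format
        return True
    except (OSError, zipfile.BadZipFile, tarfile.TarError):
        return False
```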


extract_multipart_zips(dataset_dir: Path) → bool

Extract multi-part zip archives in a directory.

LIU4K v2 datasets use multi-part zips (.zip, .z01, .z02, etc.) organized by category. This method finds and extracts all such archives.

Args:

  • dataset_dir: Directory containing multi-part zip files

Returns: True if all extractions succeeded, False otherwise
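The discovery step for multi-part sets can be sketched as grouping `.z01`, `.z02`, … volumes with their final `.zip` by shared stem. Note that Python's `zipfile` cannot read split archives directly, so actual extraction typically concatenates the parts first or shells out to a tool like `7z` (an assumption about this module's approach):

```python
import re
from collections import defaultdict
from pathlib import Path

def find_multipart_sets(dataset_dir: Path) -> dict[str, list[Path]]:
    """Group multi-part zip volumes (.z01, .z02, ..., plus the
    final .zip) by their shared stem; single .zip files are skipped."""
    parts: dict[str, list[Path]] = defaultdict(list)
    for p in sorted(dataset_dir.glob("*")):
        if p.suffix == ".zip" or re.fullmatch(r"\.z\d{2}", p.suffix):
            parts[p.stem].append(p)
    # Keep only true multi-part sets (more than the single .zip volume).
    return {stem: files for stem, files in parts.items() if len(files) > 1}
```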


extract_zips(dataset_dir: Path) → bool

Extract single-file zip archives in a directory.

LIU4K v1 datasets use single zip files containing all images. This method finds and extracts all such archives.

Args:

  • dataset_dir: Directory containing zip files

Returns: True if all extractions succeeded, False otherwise


fetch_dataset(dataset_id: str, cleanup_archive: bool = True) → Path | None

Fetch a dataset using its configuration.

Args:

  • dataset_id: Dataset identifier from datasets.json
  • cleanup_archive: Whether to delete the downloaded archive after extraction

Returns: Path to the extracted dataset directory, or None if fetch failed


fetch_div2k(
split: Literal['train', 'valid'] = 'valid',
cleanup_archive: bool = True
) → Path | None

Fetch the DIV2K dataset.

This is a convenience method that maps to the configuration-based fetch.

Args:

  • split: Dataset split to download - “train” or “valid”
  • cleanup_archive: Whether to delete the downloaded archive after extraction

Returns: Path to the extracted dataset directory, or None if fetch failed


get_dataset_config(dataset_id: str) → DatasetConfig | None

Get configuration for a specific dataset.

Args:

  • dataset_id: Dataset identifier

Returns: DatasetConfig if found, None otherwise


get_dataset_info(dataset_name: str) → dict[str, int | Path] | None

Get information about a downloaded dataset.

Args:

  • dataset_name: Name of the dataset directory

Returns: Dictionary with dataset information (path, image count), or None if not found


list_available_datasets() → list[DatasetConfig]

List all datasets available in configuration.

Returns: List of DatasetConfig objects for available datasets


list_datasets() → list[str]

List all downloaded datasets.

Returns: List of dataset names (directory names in the base directory)
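Since the return value is the set of directory names under the base directory, the method reduces to a filtered directory listing; sorting the result is an assumption, as the reference does not specify an order:

```python
from pathlib import Path

def list_dataset_dirs(base_dir: Path) -> list[str]:
    """Sketch of list_datasets: names of subdirectories of base_dir.
    Non-directories (stray archives, config files) are skipped."""
    if not base_dir.is_dir():
        return []
    return sorted(p.name for p in base_dir.iterdir() if p.is_dir())
```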