dataset

Dataset fetching and management module.

This module handles downloading and organizing test image datasets for the research project. It provides an extensible architecture for supporting multiple dataset sources through a JSON configuration file.


DatasetConfig

Configuration for a dataset.

__init__(
id: str,
name: str,
description: str,
type: str,
url: str,
size_mb: float,
image_count: int,
resolution: str,
format: str,
extracted_folder: str | None = None,
rename_to: str | None = None,
license: str | None = None,
source: str | None = None,
storage_type: str = 'direct',
folder_id: str | None = None,
post_process: str | None = None
) → None

from_dict(data: dict) → DatasetConfig

Create DatasetConfig from dictionary.

Args:

  • data: Dictionary with dataset configuration

Returns: DatasetConfig instance


Handles fetching and managing image datasets.

This class provides methods for downloading datasets from various sources, extracting archives, and managing the local dataset storage. It uses a JSON configuration file to define available datasets.

__init__(base_dir: Path, config_file: Path | None = None) → None

Initialize the dataset fetcher.

Args:

  • base_dir: Base directory where datasets will be stored
  • config_file: Path to datasets.json configuration file. If None, looks for config/datasets.json.

download_file(
url: str,
output_path: Path,
description: str | None = None
) → bool

Download a file from URL with progress bar.

Args:

  • url: URL of the file to download
  • output_path: Path where the file will be saved
  • description: Optional description for the progress bar

Returns: True if download succeeded, False otherwise
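The core of `download_file` is a chunked stream-to-disk copy; a real implementation would wrap something like `urllib.request.urlopen(url)` and advance a progress bar per chunk (the progress-bar library is an assumption). This sketch isolates the stream-copy step so it needs no network:

```python
from pathlib import Path
from typing import BinaryIO

def stream_to_file(src: BinaryIO, output_path: Path, chunk_size: int = 8192) -> int:
    """Copy a binary stream to disk in chunks; return bytes written.

    A real download_file would pass the HTTP response body as `src`
    and update a progress bar after each chunk.
    """
    output_path.parent.mkdir(parents=True, exist_ok=True)
    written = 0
    with output_path.open("wb") as dst:
        while chunk := src.read(chunk_size):
            dst.write(chunk)
            written += len(chunk)
    return written
```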


download_from_dropbox(
url: str,
output_path: Path,
description: str | None = None
) → bool

Download a file from Dropbox.

Args:

  • url: Dropbox sharing URL
  • output_path: Path where the file will be saved
  • description: Optional description for the progress bar

Returns: True if download succeeded, False otherwise
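Dropbox sharing links normally render a preview page; Dropbox serves the raw file when the `dl` query parameter is `1`. A plausible URL-normalization step (the exact rewriting rules used by the module are an assumption):

```python
def to_direct_dropbox_url(url: str) -> str:
    """Rewrite a Dropbox sharing link into a direct-download link
    by forcing dl=1 in the query string."""
    if "dl=0" in url:
        return url.replace("dl=0", "dl=1")
    if "dl=" not in url:
        sep = "&" if "?" in url else "?"
        return f"{url}{sep}dl=1"
    return url
```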


download_from_google_drive(
url: str,
output_path: Path,
description: str | None = None,
is_folder: bool = False
) → bool

Download a file or folder from Google Drive.

Args:

  • url: Google Drive URL or file ID
  • output_path: Path where the file/folder will be saved
  • description: Optional description for download
  • is_folder: Whether this is a folder download

Returns: True if download succeeded, False otherwise


extract_archive(archive_path: Path, extract_dir: Path) → bool

Extract a ZIP or TAR archive.

Args:

  • archive_path: Path to the archive file
  • extract_dir: Directory where contents will be extracted

Returns: True if extraction succeeded, False otherwise
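Dispatching on archive type with the standard library's `zipfile` and `tarfile` is one straightforward way to implement this method (a sketch under that assumption; the real error handling may differ):

```python
import tarfile
import zipfile
from pathlib import Path

def extract_archive(archive_path: Path, extract_dir: Path) -> bool:
    """Extract a ZIP or TAR (.gz/.bz2/.xz) archive, choosing the
    extractor by sniffing the file, not by its extension."""
    extract_dir.mkdir(parents=True, exist_ok=True)
    try:
        if zipfile.is_zipfile(archive_path):
            with zipfile.ZipFile(archive_path) as zf:
                zf.extractall(extract_dir)
        elif tarfile.is_tarfile(archive_path):
            with tarfile.open(archive_path) as tf:
                tf.extractall(extract_dir)
        else:
            return False  # unrecognized format
        return True
    except (OSError, zipfile.BadZipFile, tarfile.TarError):
        return False
```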


extract_multipart_zips(dataset_dir: Path) → bool

Extract multi-part zip archives in a directory.

LIU4K v2 datasets use multi-part zips (.zip, .z01, .z02, etc.) organized by category. This method finds and extracts all such archives.

Args:

  • dataset_dir: Directory containing multi-part zip files

Returns: True if all extractions succeeded, False otherwise
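The discovery step for multi-part sets can be sketched as grouping `.z01`, `.z02`, … volumes with their final `.zip` by shared stem. Note that Python's `zipfile` cannot read split archives directly, so actual extraction typically concatenates the parts first or shells out to a tool like `7z` (an assumption about this module's approach):

```python
import re
from collections import defaultdict
from pathlib import Path

def find_multipart_sets(dataset_dir: Path) -> dict[str, list[Path]]:
    """Group multi-part zip volumes (.z01, .z02, ..., plus the
    final .zip) by their shared stem; single .zip files are skipped."""
    parts: dict[str, list[Path]] = defaultdict(list)
    for p in sorted(dataset_dir.glob("*")):
        if p.suffix == ".zip" or re.fullmatch(r"\.z\d{2}", p.suffix):
            parts[p.stem].append(p)
    # Keep only true multi-part sets (more than the single .zip volume).
    return {stem: files for stem, files in parts.items() if len(files) > 1}
```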


extract_zips(dataset_dir: Path) → bool

Extract single-file zip archives in a directory.

LIU4K v1 datasets use single zip files containing all images. This method finds and extracts all such archives.

Args:

  • dataset_dir: Directory containing zip files

Returns: True if all extractions succeeded, False otherwise


fetch_dataset(dataset_id: str, cleanup_archive: bool = True) → Path | None

Fetch a dataset using its configuration.

Args:

  • dataset_id: Dataset identifier from datasets.json
  • cleanup_archive: Whether to delete the downloaded archive after extraction

Returns: Path to the extracted dataset directory, or None if fetch failed


fetch_div2k(
split: Literal['train', 'valid'] = 'valid',
cleanup_archive: bool = True
) → Path | None

Fetch the DIV2K dataset.

This is a convenience method that maps to the configuration-based fetch.

Args:

  • split: Dataset split to download - “train” or “valid”
  • cleanup_archive: Whether to delete the downloaded archive after extraction

Returns: Path to the extracted dataset directory, or None if fetch failed


get_dataset_config(dataset_id: str) → DatasetConfig | None

Get configuration for a specific dataset.

Args:

  • dataset_id: Dataset identifier

Returns: DatasetConfig if found, None otherwise


get_dataset_info(dataset_name: str) → dict[str, int | Path] | None

Get information about a downloaded dataset.

Args:

  • dataset_name: Name of the dataset directory

Returns: Dictionary with dataset information (path, image count), or None if not found


list_available_datasets() → list[DatasetConfig]

List all datasets available in configuration.

Returns: List of DatasetConfig objects for available datasets


list_datasets() → list[str]

List all downloaded datasets.

Returns: List of dataset names (directory names in the base directory)
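Since the return value is the set of directory names under the base directory, the method reduces to a filtered directory listing; sorting the result is an assumption, as the reference does not specify an order:

```python
from pathlib import Path

def list_dataset_dirs(base_dir: Path) -> list[str]:
    """Sketch of list_datasets: names of subdirectories of base_dir.
    Non-directories (stray archives, config files) are skipped."""
    if not base_dir.is_dir():
        return []
    return sorted(p.name for p in base_dir.iterdir() if p.is_dir())
```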