Add a custom dataset
Adding a dataset only requires editing `config/datasets.json`.
No Python code changes are needed.
Add a direct-download dataset
Open `config/datasets.json` and add an entry to the `datasets` array:

    {
      "id": "my-dataset",
      "name": "My Custom Dataset",
      "description": "A collection of high-resolution test images",
      "type": "zip",
      "url": "https://example.com/my-dataset.zip",
      "size_mb": 200,
      "image_count": 50,
      "resolution": "2K",
      "format": "PNG",
      "extracted_folder": "my-dataset-images",
      "license": "CC BY 4.0",
      "source": "My Organization"
    }

Required fields
| Field | Description |
|---|---|
| `id` | Unique identifier (lowercase letters, digits, and hyphens only; pattern `^[a-z0-9-]+$`). Used in CLI commands. |
| `name` | Human-readable name. |
| `description` | Brief description. |
| `type` | Archive format: `zip`, `tar`, `tar.gz`, `tgz`, or `folder` (Google Drive folders). |
| `url` | Download URL (HTTP/HTTPS). |
| `size_mb` | Approximate download size in MB. |
| `image_count` | Number of images. |
| `resolution` | Resolution description (e.g., `"2K"`, `"4K"`). |
| `format` | Image format (e.g., `"PNG"`, `"TIFF"`). |
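
Before fetching, it can help to check a new entry against the rules in the table above. The following is a minimal sketch, not part of the project: the helper name `check_required_fields` is hypothetical, and the field names, `id` pattern, and `type` values are copied from the table. The project's own schema remains the authoritative check.

```python
import re

# Field names, the id pattern, and the type values come from the table above.
REQUIRED_FIELDS = (
    "id", "name", "description", "type", "url",
    "size_mb", "image_count", "resolution", "format",
)
ID_PATTERN = re.compile(r"^[a-z0-9-]+$")
ARCHIVE_TYPES = ("zip", "tar", "tar.gz", "tgz", "folder")


def check_required_fields(entry: dict) -> list[str]:
    """Return a list of problems found in one dataset entry (hypothetical helper)."""
    # Report every required field that is absent from the entry.
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in entry]

    # The id must consist of lowercase letters, digits, and hyphens only.
    dataset_id = entry.get("id")
    if isinstance(dataset_id, str) and not ID_PATTERN.fullmatch(dataset_id):
        problems.append(f"invalid id: {dataset_id!r}")

    # The type must be one of the documented archive formats.
    if "type" in entry and entry["type"] not in ARCHIVE_TYPES:
        problems.append(f"unknown type: {entry['type']!r}")

    # The url must use HTTP or HTTPS.
    url = entry.get("url")
    if isinstance(url, str) and not url.startswith(("http://", "https://")):
        problems.append(f"url is not HTTP/HTTPS: {url!r}")

    return problems
```

An empty list means the entry passes these checks; any other result names the fields to fix.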
Optional fields
| Field | Description |
|---|---|
| `storage_type` | `"direct"` (default), `"google_drive"`, or `"dropbox"`. |
| `folder_id` | Google Drive folder ID (for the `"google_drive"` storage type). |
| `post_process` | `"extract_zips"` or `"extract_multipart_zips"` (for nested archives). |
| `extracted_folder` | Folder name inside the archive after extraction. |
| `rename_to` | Rename extracted folder to this name. |
| `license` | License information. |
| `source` | Organization providing the dataset. |
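
The enumerated values in the table above can be checked in the same spirit. This is a rough sketch under stated assumptions: `check_optional_fields` is a hypothetical helper, the allowed values are copied from the table, and treating a `folder_id` without `"google_drive"` storage as a problem is an assumption about how strictly the fields pair up.

```python
# Allowed values are copied from the table above.
STORAGE_TYPES = ("direct", "google_drive", "dropbox")
POST_PROCESS_STEPS = ("extract_zips", "extract_multipart_zips")


def check_optional_fields(entry: dict) -> list[str]:
    """Return a list of problems with the optional fields (hypothetical helper)."""
    problems = []

    # "direct" is the documented default when storage_type is omitted.
    storage_type = entry.get("storage_type", "direct")
    if storage_type not in STORAGE_TYPES:
        problems.append(f"unknown storage_type: {storage_type!r}")

    if "post_process" in entry and entry["post_process"] not in POST_PROCESS_STEPS:
        problems.append(f"unknown post_process: {entry['post_process']!r}")

    # Assumption: folder_id is only meaningful for Google Drive storage.
    if "folder_id" in entry and storage_type != "google_drive":
        problems.append("folder_id is set but storage_type is not 'google_drive'")

    return problems
```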
Add a Google Drive dataset
For datasets hosted on Google Drive, set `storage_type` and provide the
folder or file details:

    {
      "id": "my-gdrive-dataset",
      "name": "My Google Drive Dataset",
      "description": "4K images from Google Drive",
      "type": "folder",
      "url": "https://drive.google.com/drive/folders/FOLDER_ID",
      "storage_type": "google_drive",
      "folder_id": "FOLDER_ID",
      "post_process": "extract_zips",
      "size_mb": 1300,
      "image_count": 80,
      "resolution": "4K",
      "format": "PNG",
      "license": "CC BY-NC-ND 4.0",
      "source": "Research Group"
    }

Verify the dataset
Fetch the new dataset:

    just fetch my-dataset

Check that images are in place:

    ls data/datasets/my-dataset/

You can also verify all downloaded datasets:

    python3 scripts/fetch_dataset.py --show-downloaded

Image format recommendations
- Use lossless source images (PNG, TIFF) for unbiased format comparison. JPEG sources have pre-existing compression artifacts that skew quality metrics.
- Include at least 50 images for statistical significance in analysis.
- Consistent resolution within a dataset makes comparison cleaner, though the `resolution` encoder parameter can normalize images before encoding.
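
The first two recommendations can be checked mechanically once a dataset is on disk. The sketch below is illustrative only: `summarize_dataset` is a hypothetical helper, the extension lists are an assumption about which files count as images, and it does not inspect resolution.

```python
from collections import Counter
from pathlib import Path

# Assumed extension groups; adjust to the formats your dataset actually uses.
LOSSLESS = {".png", ".tif", ".tiff"}
LOSSY = {".jpg", ".jpeg"}
MIN_IMAGES = 50  # Minimum image count recommended above.


def summarize_dataset(folder: str) -> dict:
    """Count image files by extension and flag departures from the recommendations."""
    counts = Counter(
        path.suffix.lower()
        for path in Path(folder).rglob("*")
        if path.is_file() and path.suffix.lower() in LOSSLESS | LOSSY
    )
    total = sum(counts.values())

    warnings = []
    if total < MIN_IMAGES:
        warnings.append(f"only {total} images; at least {MIN_IMAGES} recommended")
    lossy = sum(counts[ext] for ext in LOSSY)
    if lossy:
        warnings.append(f"{lossy} JPEG source images may skew quality metrics")

    return {"total": total, "by_extension": dict(counts), "warnings": warnings}
```

For example, `summarize_dataset("data/datasets/my-dataset")` reports on the dataset fetched in the previous section.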
Schema validation
The configuration is validated against `config/datasets.schema.json`.
IDEs with JSON Schema support (including VS Code in the dev container)
will provide autocompletion and inline validation when you edit the file.
See also
- Datasets reference: properties and licensing for built-in datasets
- Configuration reference: full `datasets.json` schema
- Fetch datasets: download commands and troubleshooting