Add a custom dataset

Adding a dataset only requires editing config/datasets.json. No Python code changes are needed.

Open config/datasets.json and add an entry to the datasets array:

```json
{
  "id": "my-dataset",
  "name": "My Custom Dataset",
  "description": "A collection of high-resolution test images",
  "type": "zip",
  "url": "https://example.com/my-dataset.zip",
  "size_mb": 200,
  "image_count": 50,
  "resolution": "2K",
  "format": "PNG",
  "extracted_folder": "my-dataset-images",
  "license": "CC BY 4.0",
  "source": "My Organization"
}
```
Required fields:

| Field | Description |
| --- | --- |
| `id` | Unique identifier (lowercase letters, digits, and hyphens only; must match `^[a-z0-9-]+$`). Used in CLI commands. |
| `name` | Human-readable name. |
| `description` | Brief description. |
| `type` | Archive format: `zip`, `tar`, `tar.gz`, `tgz`, or `folder` (Google Drive folders). |
| `url` | Download URL (HTTP/HTTPS). |
| `size_mb` | Approximate download size in MB. |
| `image_count` | Number of images. |
| `resolution` | Resolution description (e.g., "2K", "4K"). |
| `format` | Image format (e.g., "PNG", "TIFF"). |
Optional fields:

| Field | Description |
| --- | --- |
| `storage_type` | `"direct"` (default), `"google_drive"`, or `"dropbox"`. |
| `folder_id` | Google Drive folder ID (for the `"google_drive"` storage type). |
| `post_process` | `"extract_zips"` or `"extract_multipart_zips"` (for nested archives). |
| `extracted_folder` | Folder name inside the archive after extraction. |
| `rename_to` | Rename the extracted folder to this name. |
| `license` | License information. |
| `source` | Organization providing the dataset. |
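The required fields and the `id` pattern can be spot-checked before committing the file. A minimal sketch, mirroring the tables above (the field lists here are copied from this page; `config/datasets.schema.json` remains the authoritative definition):

```python
import re

# Required fields, per the table above; the JSON Schema file is authoritative.
REQUIRED_FIELDS = {
    "id", "name", "description", "type",
    "url", "size_mb", "image_count", "resolution", "format",
}
ID_PATTERN = re.compile(r"^[a-z0-9-]+$")
ARCHIVE_TYPES = {"zip", "tar", "tar.gz", "tgz", "folder"}

def check_entry(entry: dict) -> list[str]:
    """Return a list of problems with a dataset entry (empty list = looks valid)."""
    problems = [f"missing field: {f}" for f in sorted(REQUIRED_FIELDS - entry.keys())]
    if not ID_PATTERN.match(entry.get("id", "")):
        problems.append("id must match ^[a-z0-9-]+$")
    if entry.get("type") not in ARCHIVE_TYPES:
        problems.append("unknown archive type")
    return problems
```

This only catches the obvious mistakes; for full validation, rely on the schema (see below).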

For datasets hosted on Google Drive, set storage_type and provide the folder or file details:

```json
{
  "id": "my-gdrive-dataset",
  "name": "My Google Drive Dataset",
  "description": "4K images from Google Drive",
  "type": "folder",
  "url": "https://drive.google.com/drive/folders/FOLDER_ID",
  "storage_type": "google_drive",
  "folder_id": "FOLDER_ID",
  "post_process": "extract_zips",
  "size_mb": 1300,
  "image_count": 80,
  "resolution": "4K",
  "format": "PNG",
  "license": "CC BY-NC-ND 4.0",
  "source": "Research Group"
}
```
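As the example shows, `folder_id` is the last path segment of the Drive folder URL. A small helper to pull it out (hypothetical; not part of the project's scripts):

```python
from urllib.parse import urlparse

def drive_folder_id(url: str) -> str:
    """Extract the folder ID from a Google Drive folder URL.

    Hypothetical helper for filling in the folder_id field by hand.
    """
    path = urlparse(url).path  # query strings like ?usp=sharing are dropped
    prefix = "/drive/folders/"
    if prefix not in path:
        raise ValueError(f"not a Drive folder URL: {url}")
    # The ID is the path segment immediately after /drive/folders/.
    return path.split(prefix, 1)[1].split("/")[0]
```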

Fetch the new dataset:

```sh
just fetch my-dataset
```

Check that images are in place:

```sh
ls data/datasets/my-dataset/
```

You can also verify all downloaded datasets:

```sh
python3 scripts/fetch_dataset.py --show-downloaded
```
Guidelines for source images:

- Use lossless source images (PNG, TIFF) for unbiased format comparison. JPEG sources carry pre-existing compression artifacts that skew quality metrics.
- Include at least 50 images for statistical significance in analysis.
- Keep resolution consistent within a dataset for cleaner comparisons, though the `resolution` encoder parameter can normalize images before encoding.
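These guidelines can be spot-checked after fetching. A sketch that scans a dataset directory such as `data/datasets/my-dataset/` (the extensions and threshold below restate the guidelines above; they are not enforced by the project):

```python
from pathlib import Path

LOSSLESS_EXTS = {".png", ".tif", ".tiff"}  # formats without lossy artifacts
LOSSY_EXTS = {".jpg", ".jpeg"}

def audit_images(dataset_dir: str, min_images: int = 50) -> list[str]:
    """Report guideline violations for a downloaded dataset directory."""
    files = [p for p in Path(dataset_dir).rglob("*") if p.is_file()]
    warnings = []
    lossy = [p for p in files if p.suffix.lower() in LOSSY_EXTS]
    if lossy:
        warnings.append(f"{len(lossy)} JPEG source image(s) (pre-existing artifacts)")
    count = sum(1 for p in files if p.suffix.lower() in LOSSLESS_EXTS)
    if count < min_images:
        warnings.append(f"only {count} lossless images (< {min_images})")
    return warnings
```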

The configuration is validated against config/datasets.schema.json. IDEs with JSON Schema support (including VS Code in the dev container) will provide autocompletion and inline validation when you edit the file.