There are indeed several tricky spots here; let's go through them one by one.
1) Converting the corpus into `Dataset`s (recommended approach)
What to store in Arrow vs keep external
A datasets.Dataset is an Arrow table, which gives fast reads and convenient column operations, but it doesn’t require you to store the heavy bytes inside Arrow. In fact, the “Image” feature typically expects paths or embedded bytes, which would either force you to (a) extract images to files or (b) duplicate LMDB bytes into Arrow. (Hugging Face)
Recommended: store metadata only in Arrow and keep LMDB as the image store. At training time, decode by file_name using a transform.
Shape of the dataset(s)
You can represent your multi-source corpus in either of these two equivalent ways:
- Option A (common): a `DatasetDict` with one `Dataset` per source
- Option B: one merged `Dataset` with a `source` column
Option A is usually nicer when each source has its own policy (sampling frequency, caps, epoch splits), because you can keep per-source stats and debug each source in isolation.
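A minimal sketch of the two shapes; `doc_ds` and `scene_ds` are placeholders for the per-source metadata datasets built in the next subsection:

```python
from datasets import DatasetDict, concatenate_datasets

# Option A: keep one Dataset per source
corpus = DatasetDict({"doc": doc_ds, "scene_text": scene_ds})

# Option B: one merged Dataset; each source already carries a "source" column
merged = concatenate_datasets([doc_ds, scene_ds])
```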
Flatten the JSON grouped by "W_H"
Your idea to de-group and create explicit columns is correct. Suggested columns (a superset of what you proposed; a construction sketch follows the list):
- `file_name` (LMDB key)
- `label`, `type`
- `width`, `height` (parsed from `"W_H"`)
- `resolution` (string, optional)
- `bucket_id` (int; exact `(W, H)` or binned)
- `is_doc` (per-source constant, but store it per row so transforms can branch)
- `source` (string)
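A minimal construction sketch, assuming each per-source JSON maps a `"W_H"` string such as `"1024_768"` to a list of records with `file_name`/`label`/`type`; adjust the parsing to your actual layout:

```python
import json

from datasets import Dataset

def build_metadata_dataset(json_path: str, source: str, is_doc: bool) -> Dataset:
    with open(json_path) as f:
        grouped = json.load(f)  # assumed: {"1024_768": [{"file_name": ..., "label": ..., "type": ...}, ...]}

    rows = []
    for w_h, records in grouped.items():
        width, height = (int(x) for x in w_h.split("_"))
        for rec in records:
            rows.append({
                "file_name": rec["file_name"],  # LMDB key
                "label": rec.get("label"),
                "type": rec.get("type"),
                "width": width,
                "height": height,
                "resolution": w_h,
                "is_doc": is_doc,
                "source": source,
            })

    # Stable integer bucket id per exact (width, height); swap in a binning rule for coarser buckets.
    bucket_of = {wh: i for i, wh in enumerate(sorted({(r["width"], r["height"]) for r in rows}))}
    for r in rows:
        r["bucket_id"] = bucket_of[(r["width"], r["height"])]

    return Dataset.from_list(rows)

# build_metadata_dataset("doc_meta.json", source="doc", is_doc=True).save_to_disk("metadata/doc")
```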
Why not store image as a Datasets “Image” feature here?
HF’s Image feature decodes from paths or stored bytes, and decode=False can return {path, bytes} instead of PIL.Image. This is useful when your data is already file-based or already stored as bytes in Arrow, but it doesn’t directly match “bytes live in LMDB keyed by file_name” without copying. (Hugging Face)
Training-time decode/augment via set_transform / with_transform
HF explicitly supports on-the-fly transforms for images (set_transform) and recommends using map() for one-time preprocessing and set_transform for per-epoch augmentations. (Hugging Face)
So the “HF datasets” part becomes:
- offline: build metadata datasets and `save_to_disk()`
- training: `load_from_disk()`, then attach a transform that:
  - reads LMDB bytes by `file_name`
  - applies doc/scene augmentation based on `is_doc`
  - runs your image processor and returns tensors plus pass-through metadata (see the sketch below)
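A minimal sketch of such a transform, assuming a single LMDB environment keyed by `file_name` with Pillow-decodable bytes; `doc_augment`, `scene_augment`, and `image_processor` are placeholders for your own pipeline:

```python
import io

import lmdb
from PIL import Image
from datasets import load_from_disk

# With multi-worker DataLoaders you may want to open the environment lazily per worker.
env = lmdb.open("images.lmdb", readonly=True, lock=False, readahead=False)

def lmdb_decode_transform(batch):
    pixel_values = []
    with env.begin(write=False) as txn:
        for file_name, is_doc in zip(batch["file_name"], batch["is_doc"]):
            raw = txn.get(file_name.encode("utf-8"))
            img = Image.open(io.BytesIO(raw)).convert("RGB")
            img = doc_augment(img) if is_doc else scene_augment(img)  # placeholder augmentations
            pixel_values.append(image_processor(img))                 # placeholder processor
    batch["pixel_values"] = pixel_values
    return batch  # untouched columns (label, bucket_id, ...) pass through

ds = load_from_disk("metadata/doc")
ds.set_transform(lmdb_decode_transform)  # applied lazily at __getitem__ time
```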
2) Can HF Datasets implement your sampling strategy?
What HF can do directly (partial)
HF provides `interleave_datasets()` with:
- `probabilities` (stochastic mixing)
- stopping strategies: `first_exhausted` (undersampling, the default) and `all_exhausted` (oversampling) (Hugging Face)
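For reference, the stochastic-mixing version looks roughly like this; the 1:5 ratio is only honored in expectation, not exactly:

```python
from datasets import interleave_datasets, load_from_disk

doc_ds = load_from_disk("metadata/doc")
scene_ds = load_from_disk("metadata/scene_text")

mixed = interleave_datasets(
    [doc_ds, scene_ds],
    probabilities=[1 / 6, 5 / 6],       # ~1:5 mixing in expectation
    seed=42,
    stopping_strategy="all_exhausted",  # oversample until every source has been seen
)
```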
This is helpful for rough mixing, but it does not naturally express:
- “exactly 10 epochs with explicit epoch semantics”
- “source A is exactly 1× per epoch and source B is exactly 5× per epoch”
- “bucket by resolution and form batches within buckets”
- “DDP-safe, deterministic, disjoint work per rank”
What you should do for strict semantics (recommended)
Use HF Datasets for indexable metadata storage and PyTorch for policy:
- HF `Dataset`: metadata table, deterministic indexing
- Training transform: LMDB decode + augmentation (`set_transform` / `with_transform`) (Hugging Face)
- Custom PyTorch `BatchSampler` (sketched after this list): the control plane for:
  - per-source quotas per epoch (1×, 5×, caps)
  - optional epoch-specific subset selection
  - bucketing (`bucket_id`) → batches
  - deterministic shuffles via `set_epoch(epoch)`
  - DDP partitioning at batch level
This division matches what HF is best at (dataset storage + column ops) and what PyTorch is best at (sampling/batching control).
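A condensed sketch of that control plane, assuming a merged metadata dataset and pre-built per-source index lists; bucketing and the DDP batch slice are sketched separately in sections 3) and 4):

```python
import random

from torch.utils.data import Sampler

class QuotaBatchSampler(Sampler):
    def __init__(self, source_indices, repeats, batch_size, seed=0):
        # source_indices: dict source_name -> list of dataset indices
        # repeats: dict source_name -> passes per epoch, e.g. {"doc": 1, "scene_text": 5}
        self.source_indices, self.repeats = source_indices, repeats
        self.batch_size, self.seed = batch_size, seed
        self.epoch = 0

    def set_epoch(self, epoch):
        self.epoch = epoch  # call once per epoch, like DistributedSampler

    def __iter__(self):
        rng = random.Random(self.seed + self.epoch)  # deterministic per (seed, epoch)
        pool = []
        for name, idxs in self.source_indices.items():
            for _ in range(self.repeats[name]):      # exact 1x / 5x quota per epoch
                pass_idxs = list(idxs)
                rng.shuffle(pass_idxs)
                pool.extend(pass_idxs)
        rng.shuffle(pool)                            # mix sources deterministically
        for i in range(0, len(pool) - self.batch_size + 1, self.batch_size):
            yield pool[i:i + self.batch_size]

    def __len__(self):
        total = sum(len(v) * self.repeats[k] for k, v in self.source_indices.items())
        return total // self.batch_size
```

Hand it to the loader as `DataLoader(ds, batch_sampler=sampler, collate_fn=...)` and call `sampler.set_epoch(epoch)` at the top of every epoch.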
3) DDP sharding: memory-mapped Arrow is not sharding
Arrow memory mapping improves I/O and avoids loading everything into RAM, but it doesn’t prevent two ranks from reading the same indices. You still need an explicit sharding mechanism (sampler or dataset sharding).
HF provides Dataset.shard(num_shards, index) as a deterministic way to split a dataset into N pieces. (Hugging Face)
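In a custom loop, the per-rank split is a one-liner (contiguous shards keep Arrow reads sequential); a sketch:

```python
import torch.distributed as dist
from datasets import load_from_disk

ds = load_from_disk("metadata/merged")
ds_rank = ds.shard(num_shards=dist.get_world_size(), index=dist.get_rank(), contiguous=True)
```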
HF’s Trainer/Accelerate will do distributed sharding for you, but if you’re building a custom loop you must ensure your own per-rank split. (Hugging Face Forums)
Separately, with PyTorch’s DistributedSampler, you must call sampler.set_epoch(epoch) each epoch to get different shuffles across epochs; otherwise, you can repeat the same order. (PyTorch Forums)
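The standard pattern, shown only to make the `set_epoch` requirement concrete:

```python
from torch.utils.data import DataLoader, DistributedSampler

sampler = DistributedSampler(ds, shuffle=True, seed=42)
loader = DataLoader(ds, batch_size=32, sampler=sampler)

for epoch in range(10):
    sampler.set_epoch(epoch)  # re-seeds the shuffle; skipping this repeats epoch 0's order
    for batch in loader:
        ...
```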
Practical recommendation for your case: partition at the batch level (global deterministic batch schedule, then rank takes every world_size-th batch). This makes “disjoint work + identical step counts” easy to reason about.
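Under that scheme the rank split is just a strided slice of a shared, deterministic batch list, for example:

```python
def batches_for_rank(global_batches, rank, world_size):
    # All ranks build the same global_batches (same seed + epoch), then each keeps
    # every world_size-th batch; the ragged tail is dropped so step counts match.
    usable = (len(global_batches) // world_size) * world_size
    return global_batches[rank:usable:world_size]
```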
4) Resolution grouping (“bucketing”) with HF + PyTorch
HF itself doesn’t have a built-in “bucketed batch sampler” for your exact needs. The clean approach is:
- store `bucket_id` as a metadata column
- in your sampler, build a `bucket_id -> indices` map per source
- shuffle within each bucket per epoch
- emit batches that are bucket-pure (or near-pure if you allow spillover)
HF docs explicitly note that if you want efficient batching and transforms, using a BatchSampler is the right tool on the PyTorch side. (Hugging Face)
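A sketch of that bucketing step, assuming `bucket_ids` is the per-sample bucket-id list read from the metadata column (read it before attaching the LMDB transform):

```python
import random
from collections import defaultdict

def bucketed_batches(bucket_ids, batch_size, rng):
    by_bucket = defaultdict(list)
    for idx, b in enumerate(bucket_ids):  # bucket_id -> dataset indices
        by_bucket[b].append(idx)
    batches = []
    for idxs in by_bucket.values():
        rng.shuffle(idxs)                 # shuffle within each bucket per epoch
        batches += [idxs[i:i + batch_size] for i in range(0, len(idxs), batch_size)]
    rng.shuffle(batches)                  # mix buckets across the epoch
    return batches                        # every batch is bucket-pure

# batches = bucketed_batches(ds["bucket_id"], batch_size=32, rng=random.Random(epoch))
```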
5) Are predefined per-epoch split files necessary/common?
They’re not “standard” unless you’re doing something like:
- curriculum learning (difficulty schedule)
- strict replay / auditability requirements
- staged inclusion (e.g., progressively adding noisy sources)
- hard constraints from upstream labeling processes
If you don’t have one of those reasons, they mainly add complexity.
Simpler alternative: add a deterministic split_id per sample (e.g., hash of file_name) and define epoch subsets as rules like “epoch 0 uses split_id in [0..k)”, epoch 1 uses [k..2k), etc. This avoids managing 10 external JSON lists and keeps the logic local and reproducible.
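A sketch of that rule, using a stable hash (Python's built-in `hash()` is salted per process, so use `hashlib`):

```python
import hashlib

NUM_SPLITS = 10  # k

def split_id(file_name: str) -> int:
    digest = hashlib.md5(file_name.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_SPLITS

def in_epoch_subset(file_name: str, epoch: int) -> bool:
    # Simplest rule: epoch e uses split e (mod NUM_SPLITS); widen this to a
    # range of splits if each epoch should cover more data.
    return split_id(file_name) == epoch % NUM_SPLITS
```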
Summary recommendation
- Convert each source to a metadata-only HF `Dataset` (`file_name`, `label`, `type`, `is_doc`, `width`, `height`, `bucket_id`, `source`, ...) and `save_to_disk()`.
- Use `set_transform` / `with_transform` to decode from LMDB, augment, and run the image processor on the fly. (Hugging Face)
- Implement strict mixing (1×/5×), epoch semantics, bucketing, and DDP disjointness in a custom PyTorch `BatchSampler`, not in `interleave_datasets`. (Hugging Face)
- Do not rely on Arrow memory mapping for sharding; use `Dataset.shard`, `DistributedSampler`, or batch-level partitioning. (Hugging Face)