Is huggingface dataset suitable for ddp training?

You’re seeing two different “sharding layers” at once:

  • Logical shards: what dataset_info.splits.*.shard_lengths is describing.
  • Physical shards: the actual .arrow files created by save_to_disk, controlled by max_shard_size / num_shards.

They are not the same thing, and HF Datasets does not keep them in sync anymore. That is why “7 shards in shard_lengths” and “~67 files on disk” both exist and both are “correct” in their own sense.

I’ll walk through what your script is doing and how HF Datasets behaves internally.


1. What your script actually does

You run:

import h5py
from datasets import Dataset, DatasetDict  # DatasetDict is used further down

def h5_generator(h5_path):
    # Stream one HDF5 group at a time.
    with h5py.File(h5_path, "r") as f:
        grp_names = sorted(f.keys())
        for g in grp_names:
            grp = f[g]
            yield {
                "act": grp["act"][:],  # load the full array for this group
                "wav": grp["wav"][:],
            }

ds = Dataset.from_generator(
    h5_generator,
    gen_kwargs={"h5_path": DATA_PATH/h5_path},
    writer_batch_size=500,
    cache_dir=str(DATA_PATH)
)

Key points:

  1. Dataset.from_generator(...) builds a temporary Arrow dataset in cache_dir.

    • By default, HF Datasets flushes the writer to disk every 1000 rows; you changed that to 500 via writer_batch_size=500. (Hugging Face Forums)
    • A writer batch is not automatically a shard, but here each 500-row flush already exceeds the builder’s ~500 MB shard-size threshold, so every flush ends up as its own logical shard of ~500 rows.
  2. You then do:

    dataset = DatasetDict(dataset_dict)
    dataset.save_to_disk(DATA_PATH/"hf_datasets")
    

    save_to_disk will re-write the dataset into a new directory, using its own sharding logic based on max_shard_size and/or num_shards. It does not preserve your original writer-batch boundaries. (Hugging Face)

So there are effectively two separate representations:

  • A cached representation in cache_dir (your 7×~5 GB files).
  • A final “saved to disk” representation in hf_datasets/ (your ~67×480 MB files).
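
You can see the first representation directly from Python right after from_generator; a quick check, assuming ds is the object returned above:

# Straight after from_generator, the dataset is backed by the cache files
# under cache_dir; nothing exists under hf_datasets/ yet.
print(ds.num_rows)      # 3239
print(ds.cache_files)   # the 7 big (~5 GB) Arrow files in cache_dir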

2. Why shard_lengths says 7 but you see ~67 .arrow files

2.1 What shard_lengths actually means now

dataset_info.splits["train"].shard_lengths is part of the legacy SplitInfo metadata that comes from TensorFlow Datasets semantics (number of examples per original shard). HF Datasets keeps it for backward compatibility, but it is effectively deprecated now:

  • In the datasets.splits code, you can see both shard_lengths and original_shard_lengths fields, with comments marking them as deprecated and kept only so dataset_infos stay compatible. (lovelace.cluster.earlham.edu)

So:

  • shard_lengths = [500, 500, 500, 500, 500, 500, 239] describes the original generator write batches used when constructing the dataset, not the current on-disk .arrow file layout after save_to_disk.

Hugging Face’s own docs emphasize that saving creates “arrow files” plus a dataset_info.json, but they don’t say that the shard_lengths field is updated to match the new file slicing. (Hugging Face)

In other words:

shard_lengths is logical/legacy metadata, not a faithful description of how many .arrow files are in your saved folder.

2.2 How save_to_disk decides how many .arrow files to create

Dataset.save_to_disk shards data based on size in bytes, not your earlier writer_batch_size.

Internally, it calls into a sharding routine that uses:

  • num_shards (if you pass it explicitly), or
  • max_shard_size (if you don’t), with a default of 500 MB (config.MAX_SHARD_SIZE = "500MB"). (GitHub)

The logic (simplified) is:

num_shards = floor(dataset_nbytes / max_shard_size) + 1  (if num_shards is None)

Given your num_bytes ≈ 33,296,374,180:

  • HF parses the default "500MB" as decimal megabytes: 500 * 10^6 = 500,000,000 bytes.
  • 33,296,374,180 / 500,000,000 ≈ 66.6
  • → num_shards = 66 + 1 = 67.

You report:

  • ~33 GB total,
  • shard files around ~480 MB each,
  • total ~67 files.

That’s exactly the 67 shards the formula predicts. The files come out a bit under 500 MB each because num_shards is rounded up and the rows are distributed evenly across the shards, so each file lands just below the target (33.3 GB / 67 ≈ 497 MB).
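
You can reproduce that estimate with plain arithmetic; a sketch using only the numbers from your dataset_info (the decimal-megabyte parsing of "500MB" is the assumption spelled out above):

dataset_nbytes = 33_296_374_180            # dataset_info.dataset_size
max_shard_size = 500 * 10**6               # default "500MB", read as decimal MB

num_shards = dataset_nbytes // max_shard_size + 1
print(num_shards)                          # 67
print(dataset_nbytes / num_shards / 1e6)   # ≈ 497 → your ~480–500 MB files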

So the picture is:

  • Generator build phase:

    • writer_batch_size=500 → HF creates 7 logical shards of 500 examples (last smaller).
    • These are stored under cache_dir as ~5 GB Arrow files (your 7 files).
  • save_to_disk phase:

    • Ignores those original shards.
    • Re-shards based on size, using ~500 MB target size.
    • Produces ~67 .arrow files of ~480 MB each under hf_datasets/....

dataset_info.splits["train"].shard_lengths still reflects the first phase (500-example shards), not the second.
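
You can watch this re-sharding happen on something tiny; a toy sketch (not your data) showing that save-time shards are chosen purely by byte size, no matter how the dataset was built:

from datasets import Dataset, load_from_disk

toy = Dataset.from_dict({"x": list(range(10_000))})   # ~80 KB of int64s
toy.save_to_disk("toy_ds", max_shard_size="10KB")     # force tiny shards

reloaded = load_from_disk("toy_ds")
print(len(reloaded.cache_files))   # several small .arrow files instead of one
print(reloaded.num_rows)           # still 10000 rows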


3. Why you see “7×5 GB in cache_dir” vs “67×480 MB in hf_datasets”

This comes from the two different uses of “shard” inside HF Datasets:

  1. Cache-level shards (generator / map / transforms)

    • Every call to from_generator or .map uses a cache writer.
    • This writer flushes data to disk every writer_batch_size examples. (Hugging Face Forums)
    • Each flush becomes a record batch in the cache’s Arrow files; the resulting files are tracked in dataset.cache_files.
    • That’s where your 7×~5 GB files in cache_dir come from: each 500-example writer batch of huge arrays exceeds the ~500 MB shard threshold, so every batch was written out as its own cache file.
  2. Final dataset shards (save_to_disk / push_to_hub)

    • save_to_disk re-chunks the dataset based on overall byte size, using max_shard_size (default 500 MB) or num_shards. (Hugging Face)
    • Each new shard is written as a separate .arrow file.
    • That’s your ~67 files of ~480 MB.

The dataset_info you’re inspecting is just copying the DatasetInfo object from the generator stage, including the old shard_lengths. That metadata is not regenerated to match the final arrow files.

If you want to see the actual physical files the final dataset is using, you should check:

from datasets import load_from_disk

ds = load_from_disk(DATA_PATH/"hf_datasets")["train"]
print(ds.cache_files)

ds.cache_files will list the .arrow files the split is backed by, one entry per file. (Hugging Face)

That list should have ~67 entries for the saved dataset, even though dataset_info.splits["train"].shard_lengths still has 7 numbers.
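
cache_files itself only gives you file paths; if you also want per-file row counts, you can read the Arrow metadata directly. A sketch, assuming the files are Arrow IPC stream files as save_to_disk writes them:

import pyarrow as pa

# Count rows per physical shard; paths come from ds.cache_files above.
for f in ds.cache_files:
    with pa.memory_map(f["filename"]) as source:
        num_rows = sum(batch.num_rows for batch in pa.ipc.open_stream(source))
    print(f["filename"], num_rows)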


4. How to control the number/size of saved shards

If you want the final saved dataset to have a particular sharding (not 67 × 480 MB), you can control it directly when saving:

  • Limit shard size by bytes:

    ds.save_to_disk("hf_datasets", max_shard_size="5GB")
    
  • Or specify exact number of shards:

    ds.save_to_disk("hf_datasets", num_shards=7)
    

These options are documented (or at least exemplified) in HF Datasets’ docs and helper code. (Hugging Face)

That way:

  • max_shard_size="5GB" would give you roughly the same 7×5 GB layout you see in your generator cache.
  • num_shards=7 forces exactly 7 files regardless of their resulting size.

Note: the shard_lengths field in dataset_info is still informational / legacy and may not equal the physical file count even then; to inspect the real structure, always look at cache_files and the actual .arrow files.
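
Since your script saves a DatasetDict rather than a single Dataset, the same options can be passed at that level too, with num_shards given per split. A sketch reusing the names from your script (the {"train": 7} mapping is just an example):

# Per-split control when saving the DatasetDict from your script.
dataset = DatasetDict(dataset_dict)

# Either cap the shard size...
dataset.save_to_disk(DATA_PATH / "hf_datasets", max_shard_size="5GB")

# ...or pin the number of files per split:
# dataset.save_to_disk(DATA_PATH / "hf_datasets", num_shards={"train": 7})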


5. Sanity check with your numbers

Let’s reconcile your numbers explicitly:

  • dataset_info.dataset_size = 33,296,374,180 bytes (~31 GiB).

  • shard_lengths = [500, 500, 500, 500, 500, 500, 239] → 3239 examples total.

  • cache_dir:

    • 7 cache files, each ~5 GB → matches the 7 generator “writer batches” of 500 rows (your writer_batch_size=500).
  • hf_datasets/:

    • 67 arrow files, each ~480 MB.
    • 67 × ~0.5 GB ≈ 33.3 GB → consistent with the default max_shard_size of 500 MB.

So the behavior is exactly:

  • Generator-level sharding controlled by writer_batch_size → 7 logical shards (what shard_lengths remembers).
  • Final save-level sharding controlled by max_shard_size (default 500 MB) → ~67 physical arrow files.

No data is lost or duplicated; it’s just two distinct sharding layers.
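
If you want to convince yourself of that, reload the saved dataset and compare row counts; a quick check reusing the paths from your script:

from datasets import load_from_disk

saved = load_from_disk(DATA_PATH / "hf_datasets")["train"]

# Same rows as the generator produced, just packed into ~67 files instead of 7.
print(saved.num_rows)             # 3239 == sum(shard_lengths)
print(len(saved.cache_files))     # ~67 physical .arrow files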


TL;DR

  • The shard_lengths in dataset_info comes from the generator phase and describes logical shard sizes (here: batches of 500 examples). It’s deprecated / legacy metadata and not kept in sync with final .arrow files. (lovelace.cluster.earlham.edu)
  • save_to_disk re-shards the dataset based on byte size, using a default max_shard_size of 500 MB, which is why you see ~67 arrow files of ~480 MB each. (GitHub)
  • The 7 big (~5 GB) files in cache_dir are from from_generator’s writer_batch_size=500, not from save_to_disk.
  • To understand the actual on-disk layout of the final dataset, ignore shard_lengths and inspect ds.cache_files and the .arrow files themselves.