You’re seeing two different “sharding layers” at once:
- Logical shards: what `dataset_info.splits.*.shard_lengths` is describing.
- Physical shards: the actual `.arrow` files created by `save_to_disk`, controlled by `max_shard_size` / `num_shards`.
They are not the same thing, and HF Datasets does not keep them in sync anymore. That is why “7 shards in shard_lengths” and “~67 files on disk” both exist and both are “correct” in their own sense.
I’ll walk through what your script is doing and how HF Datasets behaves internally.
1. What your script actually does
You run:
```python
import h5py
from datasets import Dataset  # DATA_PATH and h5_path come from the rest of your script

def h5_generator(h5_path):
    with h5py.File(h5_path, "r") as f:
        grp_names = sorted(f.keys())
        for g in grp_names:
            grp = f[g]
            yield {
                "act": grp["act"][:],
                "wav": grp["wav"][:],
            }

ds = Dataset.from_generator(
    h5_generator,
    gen_kwargs={"h5_path": DATA_PATH / h5_path},
    writer_batch_size=500,
    cache_dir=str(DATA_PATH),
)
```
Key points:
- `Dataset.from_generator(...)` builds a temporary Arrow dataset in `cache_dir`.
- By default, HF Datasets flushes rows to disk every 1000 examples; you changed that to 500 via `writer_batch_size=500`. (Hugging Face Forums)
- Each "writer batch" is a logical shard of ~500 rows.
- You then do `dataset = DatasetDict(dataset_dict)` followed by `dataset.save_to_disk(DATA_PATH / "hf_datasets")`. `save_to_disk` re-writes the dataset into a new directory, using its own sharding logic based on `max_shard_size` and/or `num_shards`. It does not preserve your original writer-batch boundaries. (Hugging Face)
So there are effectively two separate representations:
- A cached representation in `cache_dir` (your 7 × ~5 GB files).
- A final "saved to disk" representation in `hf_datasets/` (your ~67 × ~480 MB files); the sketch below shows a quick way to count both at once.
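If it helps to see the two layers side by side, here is a minimal sketch that just counts `.arrow` files in both locations. The paths are assumptions based on your script: the generator cache lives somewhere under `DATA_PATH` (your `cache_dir`), and the saved split under `DATA_PATH / "hf_datasets" / "train"`; adjust to your actual layout.

```python
from pathlib import Path

# Assumed layout: generator cache under DATA_PATH (your cache_dir),
# saved dataset under DATA_PATH / "hf_datasets" / "train".
cache_arrow = [p for p in Path(DATA_PATH).rglob("*.arrow")
               if "hf_datasets" not in p.parts]
saved_arrow = sorted((Path(DATA_PATH) / "hf_datasets" / "train").glob("*.arrow"))

print(len(cache_arrow), "cache .arrow files")   # expect ~7 large files
print(len(saved_arrow), "saved .arrow files")   # expect ~67 smaller files
```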
2. Why shard_lengths says 7 but you see ~67 .arrow files
2.1 What shard_lengths actually means now
dataset_info.splits["train"].shard_lengths is part of the legacy SplitInfo metadata that comes from TensorFlow Datasets semantics (number of examples per original shard). HF Datasets keeps it for backward compatibility, but it is effectively deprecated now:
- In the `datasets.splits` code, you can see both `shard_lengths` and `original_shard_lengths` fields, with comments marking them as deprecated and kept only so dataset_infos stay compatible. (lovelace.cluster.earlham.edu)
So:

`shard_lengths = [500, 500, 500, 500, 500, 500, 239]`

is describing the original generator write batches used when constructing the dataset, not the current on-disk `.arrow` file layout after `save_to_disk`.
Hugging Face’s own docs emphasize that saving creates "arrow files" plus a `dataset_info.json`, but they don’t say that the `shard_lengths` field is updated to match the new file slicing. (Hugging Face)
In other words:
`shard_lengths` is logical/legacy metadata, not a faithful description of how many `.arrow` files are in your saved folder.
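A minimal sketch of how to see both numbers from the saved dataset, assuming the `DatasetInfo` you inspected is the one attached to the loaded split (`ds.info`); adjust if you read `dataset_info.json` directly:

```python
from datasets import load_from_disk

ds = load_from_disk(DATA_PATH / "hf_datasets")["train"]

# Legacy metadata carried over from the generator phase.
print(ds.info.splits["train"].shard_lengths)   # e.g. [500, 500, 500, 500, 500, 500, 239]

# Physical files actually backing the saved dataset.
print(len(ds.cache_files))                     # ~67
```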
2.2 How save_to_disk decides how many .arrow files to create
`Dataset.save_to_disk` shards data based on size in bytes, not your earlier `writer_batch_size`.
Internally, it calls into a sharding routine that uses:
- `num_shards` (if you pass it explicitly), or
- `max_shard_size` (if you don’t), with a default of 500 MB (`config.MAX_SHARD_SIZE = "500MB"`). (GitHub)
The logic (simplified) is:
`num_shards = floor(dataset_nbytes / max_shard_size) + 1` (if `num_shards` is `None`)
Given your num_bytes ≈ 33,296,374,180:
- `"500MB"` is parsed as a decimal unit: 500 × 10^6 = 500,000,000 bytes (not 500 MiB).
- 33,296,374,180 / 500,000,000 ≈ 66.6
- → `num_shards` = 66 + 1 = 67 (worked through in the sketch below).
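As a quick check of that arithmetic, here is the computation spelled out; a sketch under the assumption that `"500MB"` is converted using decimal units (500 × 10^6 bytes):

```python
dataset_nbytes = 33_296_374_180      # your dataset_info.dataset_size
max_shard_size = 500 * 10**6         # assumed decimal parsing of the "500MB" default

num_shards = int(dataset_nbytes / max_shard_size) + 1
print(num_shards)                    # 67 -> matches the ~67 files you see on disk
```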
You report:
- ~33 GB total,
- shard files around ~480 MB each,
- total ~67 files.
That’s exactly what you expect for "dataset size divided by ~500 MB". The files showing up as ~480 MB rather than exactly 500 MB is mostly a units artifact (a 500,000,000-byte shard is about 477 MiB in most file listings), plus a bit of Arrow overhead and a smaller final shard.
So the picture is:
- Generator build phase:
  - `writer_batch_size=500` → HF creates 7 logical shards of 500 examples (last one smaller).
  - These are stored under `cache_dir` as ~5 GB Arrow files (your 7 files).
- `save_to_disk` phase:
  - Ignores those original shards.
  - Re-shards based on size, using a ~500 MB target size.
  - Produces ~67 `.arrow` files of ~480 MB each under `hf_datasets/...`.
dataset_info.splits["train"].shard_lengths still reflects the first phase (500-example shards), not the second.
3. Why you see “7×5 GB in cache_dir” vs “67×480 MB in hf_datasets”
This comes from the two different uses of “shard” inside HF Datasets:
- Cache-level shards (generator / map / transforms)
  - Every call to `from_generator` or `.map` uses a cache writer.
  - This writer flushes data to disk every `writer_batch_size` examples. (Hugging Face Forums)
  - Those flushes create intermediate Arrow tables, tracked in `dataset.cache_files`.
  - That’s where your 7 × ~5 GB files in `cache_dir` come from (500 examples per writer batch, but each example is a huge array, so each file is big).
- Final dataset shards (`save_to_disk` / `push_to_hub`)
  - `save_to_disk` re-chunks the dataset based on overall byte size, using `max_shard_size` (default 500 MB) or `num_shards`. (Hugging Face)
  - Each new shard is written as a separate `.arrow` file.
  - That’s your ~67 files of ~480 MB.
The dataset_info you’re inspecting is just copying the DatasetInfo object from the generator stage, including the old shard_lengths. That metadata is not regenerated to match the final arrow files.
If you want to see the actual physical files the final dataset is using, you should check:
```python
from datasets import load_from_disk

ds = load_from_disk(DATA_PATH / "hf_datasets")["train"]
print(ds.cache_files)
```
`ds.cache_files` lists the `.arrow` files backing the loaded dataset, one entry per physical shard file. (Hugging Face)
That list should have ~67 entries for the saved dataset, even though dataset_info.splits["train"].shard_lengths still has 7 numbers.
4. How to control the number/size of saved shards
If you want the final saved dataset to have a particular sharding (not 67 × 480 MB), you can control it directly when saving:
- Limit shard size by bytes: `ds.save_to_disk("hf_datasets", max_shard_size="5GB")`
- Or specify an exact number of shards: `ds.save_to_disk("hf_datasets", num_shards=7)`
These options are documented (or at least exemplified) in HF Datasets’ docs and helper code. (Hugging Face)
That way:
max_shard_size="5GB"would give you roughly the same 7×5 GB layout you see in your generator cache.num_shards=7forces exactly 7 files regardless of size, as long as they fit.
Note: the shard_lengths field in dataset_info is still informational / legacy and may not equal the physical file count even then; to inspect the real structure, always look at cache_files and the actual .arrow files.
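To double-check what either option actually produced, you can re-save and count the resulting files. A hedged sketch; the `hf_datasets_5gb` directory name is just an example:

```python
from datasets import load_from_disk

ds = load_from_disk(DATA_PATH / "hf_datasets")["train"]

# Re-save with a larger byte budget per shard.
ds.save_to_disk(DATA_PATH / "hf_datasets_5gb", max_shard_size="5GB")

# Count the physical shards the new copy is backed by.
resaved = load_from_disk(DATA_PATH / "hf_datasets_5gb")
print(len(resaved.cache_files))      # expect ~7 files of roughly 5 GB each
```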
5. Sanity check with your numbers
Let’s reconcile your numbers explicitly:
- `dataset_info.dataset_size = 33,296,374,180` bytes (~31 GiB).
- `shard_lengths = [500, 500, 500, 500, 500, 500, 239]` → 3239 examples total.
- `cache_dir`:
  - 7 cache files, each ~5 GB → matches the 7 generator "writer batches" of 500 rows (your `writer_batch_size=500`); the sketch after this list works through the per-example arithmetic.
- `hf_datasets/`:
  - 67 arrow files, each ~480 MB.
  - 33,296,374,180 / 500,000,000 ≈ 66.6 → 67 files, consistent with the default `max_shard_size = "500MB"`.
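And the per-example arithmetic behind the ~5 GB cache files (pure arithmetic, no HF calls; numbers taken from your dataset_info):

```python
dataset_size = 33_296_374_180
n_examples = 500 * 6 + 239                 # 3239, from shard_lengths
bytes_per_example = dataset_size / n_examples

print(bytes_per_example / 1e6)             # ~10.3 MB per example
print(500 * bytes_per_example / 1e9)       # ~5.1 GB per 500-example writer batch
print(500e6 / bytes_per_example)           # ~48-49 examples fit in a 500 MB shard
```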
So the behavior is exactly:
- Generator-level sharding controlled by `writer_batch_size` → 7 logical shards (what `shard_lengths` remembers).
- Final save-level sharding controlled by `max_shard_size` (default 500 MB) → ~67 physical arrow files.
No data is lost or duplicated; it’s just two distinct sharding layers.
TL;DR
- The `shard_lengths` in `dataset_info` comes from the generator phase and describes logical shard sizes (here: batches of 500 examples). It’s deprecated / legacy metadata and not kept in sync with the final `.arrow` files. (lovelace.cluster.earlham.edu)
- `save_to_disk` re-shards the dataset based on byte size, using a default `max_shard_size` of 500 MB, which is why you see ~67 arrow files of ~480 MB each. (GitHub)
- The 7 big (~5 GB) files in `cache_dir` are from `from_generator`’s `writer_batch_size=500`, not from `save_to_disk`.
- To understand the actual on-disk layout of the final dataset, ignore `shard_lengths` and inspect `ds.cache_files` and the `.arrow` files themselves.