Datasets¶
FLTest loads and partitions datasets with flwr-datasets
(backed by Hugging Face). Code: fltest/data/datasets.py.
Built-in datasets¶
| Name | Channels | Classes | Image column | Notes |
|---|---|---|---|---|
mnist |
1 | 10 | image |
handwritten digits |
fashion_mnist |
1 | 10 | image |
clothing; harder than MNIST, same shape |
cifar10 |
3 | 10 | img |
natural images (RGB) |
Use one with dataset: cifar10, or fuzz several with dataset: [mnist, fashion_mnist, cifar10].
Channels and class count are derived automatically — you never set them by hand.
Grayscale datasets are resized to 32×32 and normalized to mean/std 0.5; RGB datasets are
normalized per-channel to 0.5. (Defined in _TRANSFORMS.)
Partitioning (data distribution)¶
data_distribution |
Effect | Relevant knob |
|---|---|---|
iid |
uniform random split; every client sees all classes | — |
dirichlet |
label skew across clients | dirichlet_alpha (lower ⇒ more skewed) |
pathological |
each client gets only N classes | classes_per_partition |
Non-IID partitioning is how you stress robustness/privacy realistically (the project's Pitfall-2/3). Example:
dataset: cifar10
data_distribution: dirichlet
dirichlet_alpha: 0.1 # strongly non-IID
num_clients: 10
Use an existing dataset¶
Just name it:
dataset: fashion_mnist
data_distribution: pathological
classes_per_partition: 2
Attach a new dataset¶
Two small edits in fltest/data/datasets.py:
1. Register it in DATASET_CONFIG with (transform_key, image_column, channels, classes):
DATASET_CONFIG = {
"mnist": ("grayscale", "image", 1, 10),
"cifar10": ("rgb", "img", 3, 10),
# new: a 3-channel, 100-class HF dataset whose image column is "img"
"cifar100": ("rgb", "img", 3, 100),
}
transform_keyselects a transform in_TRANSFORMS("grayscale"or"rgb"). Add a new key there if your data needs a different transform.image_columnis the Hugging Face column holding the image (oftenimageorimg).channelsandclassesare surfaced to the model and metrics.
2. (Only if needed) add a transform in _TRANSFORMS, e.g. for 28×28 inputs without
resizing or for different normalization.
That's it — dataset: cifar100 now works, including fuzzing and all partitioners. The HF
dataset name passed to flwr-datasets is the key you used ("cifar100"); use the
fully-qualified HF id (e.g. "zalando-datasets/fashion_mnist") if the short name is
ambiguous.
Custom / local data¶
get_federated_dataset() returns {"c2data": {cid: hf_dataset}, "test_data": hf_dataset}
where each shard yields {"img": tensor, "label": tensor} after transform. To plug in data
that isn't on Hugging Face, build those dicts yourself (any object exposing
{"img","label"} batches works) and call build_dataloaders(...), or add a new partitioner
to PARTITIONERS.
Add a new partitioner¶
PARTITIONERS maps a name to a factory f(num_partitions, **kwargs) -> Partitioner:
PARTITIONERS = {
"iid": lambda n, **kw: IidPartitioner(num_partitions=n),
"my_skew": lambda n, alpha=0.3, **kw: DirichletPartitioner(
num_partitions=n, partition_by="label", alpha=alpha),
}
Then data_distribution: my_skew is usable from any config.