Datasets

FLTest loads and partitions datasets with flwr-datasets (backed by Hugging Face). Code: fltest/data/datasets.py.

Built-in datasets

Name	Channels	Classes	Image column	Notes
`mnist`	1	10	`image`	handwritten digits
`fashion_mnist`	1	10	`image`	clothing; harder than MNIST, same shape
`cifar10`	3	10	`img`	natural images (RGB)

Use one with `dataset: cifar10`, or fuzz several with `dataset: [mnist, fashion_mnist, cifar10]`. Channels and class count are derived automatically from the dataset name, so you never set them by hand.

Grayscale datasets are resized to 32×32 and normalized with mean 0.5 and std 0.5; RGB datasets are normalized per channel, also with mean 0.5 and std 0.5. (Defined in `_TRANSFORMS`.)
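
As a worked example of what mean/std 0.5 normalization does, here is a minimal sketch. It assumes pixel values have already been scaled to [0, 1] before normalization, which this page does not state; the `normalize` helper is hypothetical and is not part of `_TRANSFORMS`.

```python
def normalize(pixel: float, mean: float = 0.5, std: float = 0.5) -> float:
    """Apply (pixel - mean) / std to a single pixel value."""
    return (pixel - mean) / std


# Assumption: inputs are in [0, 1]. With mean = std = 0.5 that range
# is mapped onto [-1, 1], centred on zero.
low = normalize(0.0)
mid = normalize(0.5)
high = normalize(1.0)
```

Under that assumption, black maps to -1, mid-gray to 0, and white to 1.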

Partitioning (data distribution)

`data_distribution`	Effect	Relevant knob
`iid`	uniform random split; every client sees all classes	—
`dirichlet`	label skew across clients	`dirichlet_alpha` (lower ⇒ more skewed)
`pathological`	each client gets only N classes	`classes_per_partition`

Non-IID partitioning is how you stress robustness/privacy realistically (the project's Pitfall-2/3). Example:

dataset: cifar10
data_distribution: dirichlet
dirichlet_alpha: 0.1        # strongly non-IID
num_clients: 10
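
To build intuition for how `dirichlet_alpha` controls label skew, the following is a rough stdlib-only sketch of one common Dirichlet split: for each class, draw per-client proportions from a symmetric Dirichlet and cut that class's samples accordingly. It is not the flwr-datasets implementation, which may differ in detail; `dirichlet_proportions` and `dirichlet_split` are hypothetical names used only for illustration.

```python
import random
from collections import defaultdict


def dirichlet_proportions(alpha, k, rng):
    """Sample one point from a symmetric Dirichlet(alpha) over k clients."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    if total == 0.0:
        # Tiny alpha can underflow every draw to 0; give all mass to one client.
        draws[rng.randrange(k)] = 1.0
        total = 1.0
    return [d / total for d in draws]


def dirichlet_split(labels, num_clients, alpha, seed=0):
    """Return {client_id: [sample indices]} with label skew set by alpha."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    shards = {cid: [] for cid in range(num_clients)}
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        props = dirichlet_proportions(alpha, num_clients, rng)
        start, cumulative = 0, 0.0
        for cid in range(num_clients):
            cumulative += props[cid]
            if cid == num_clients - 1:
                # The last client takes the remainder so no sample is dropped.
                end = len(indices)
            else:
                cut = round(cumulative * len(indices))
                end = min(len(indices), max(start, cut))
            shards[cid].extend(indices[start:end])
            start = end
    return shards


# Toy data: 1000 samples, 10 balanced classes, split across 10 clients.
labels = [i % 10 for i in range(1000)]
shards = dirichlet_split(labels, num_clients=10, alpha=0.1)
```

With a small `alpha`, most of each class tends to land on a few clients; raising `alpha` pushes the per-class proportions toward uniform.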

Use an existing dataset

Just name it:

dataset: fashion_mnist
data_distribution: pathological
classes_per_partition: 2
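
To make `classes_per_partition` concrete, here is a simplified sketch of a pathological split in which each client is handed a fixed window of classes and the samples of each class are dealt round-robin to the clients that hold it. The class-assignment rule is an assumption chosen for brevity; the partitioner FLTest actually uses may assign classes differently. `pathological_split` is a hypothetical helper.

```python
from collections import defaultdict


def pathological_split(labels, num_clients, classes_per_partition):
    """Return {client_id: [sample indices]}; each client sees only N classes."""
    classes = sorted(set(labels))
    if classes_per_partition > len(classes):
        raise ValueError("classes_per_partition exceeds the number of classes")

    # Assumed rule: client c holds a window of N consecutive classes (wrapping).
    holders = defaultdict(list)
    for cid in range(num_clients):
        for j in range(classes_per_partition):
            cls = classes[(cid * classes_per_partition + j) % len(classes)]
            holders[cls].append(cid)

    shards = {cid: [] for cid in range(num_clients)}
    seen = defaultdict(int)
    for idx, label in enumerate(labels):
        owners = holders.get(label)
        if not owners:
            continue  # no client was assigned this class
        # Deal samples of a class round-robin among the clients holding it.
        shards[owners[seen[label] % len(owners)]].append(idx)
        seen[label] += 1
    return shards


labels = [i % 10 for i in range(200)]
shards = pathological_split(labels, num_clients=10, classes_per_partition=2)
classes_seen = {cid: {labels[i] for i in idx} for cid, idx in shards.items()}
```

In this toy run every client ends up with samples from exactly two classes, which is the kind of extreme label skew the `pathological` setting is meant to produce.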

Attach a new dataset

Two small edits in fltest/data/datasets.py:

1. Register it in DATASET_CONFIG with (transform_key, image_column, channels, classes):

DATASET_CONFIG = {
    "mnist": ("grayscale", "image", 1, 10),
    "cifar10": ("rgb", "img", 3, 10),
    # new: a 3-channel, 100-class HF dataset whose image column is "img"
    "cifar100": ("rgb", "img", 3, 100),
}

transform_key selects a transform in _TRANSFORMS ("grayscale" or "rgb"). Add a new key there if your data needs a different transform.
image_column is the Hugging Face column holding the image (often image or img).
channels and classes are surfaced to the model and metrics.

2. (Only if needed) add a transform in _TRANSFORMS, e.g. for 28×28 inputs without resizing or for different normalization.

That's it: `dataset: cifar100` now works, including fuzzing and all partitioners. The key you register (here `"cifar100"`) is also the Hugging Face dataset name passed to flwr-datasets, so use the fully qualified HF id as the key (e.g. `"zalando-datasets/fashion_mnist"`) if the short name is ambiguous.
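
As an illustration of how channels and class count can be derived from the registry rather than set by hand, here is a small sketch. `DatasetSpec` and `dataset_spec` are hypothetical names introduced for this example, and the real lookup in `fltest/data/datasets.py` may look different; only the tuple layout `(transform_key, image_column, channels, classes)` comes from the text above.

```python
from typing import NamedTuple


class DatasetSpec(NamedTuple):
    """Hypothetical named view over one DATASET_CONFIG entry."""
    transform_key: str
    image_column: str
    channels: int
    classes: int


# Same layout as the registry shown above.
DATASET_CONFIG = {
    "mnist": ("grayscale", "image", 1, 10),
    "cifar10": ("rgb", "img", 3, 10),
    "cifar100": ("rgb", "img", 3, 100),
}


def dataset_spec(name: str) -> DatasetSpec:
    """Look up a dataset by name and fail loudly on unknown names."""
    try:
        return DatasetSpec(*DATASET_CONFIG[name])
    except KeyError:
        known = ", ".join(sorted(DATASET_CONFIG))
        raise ValueError(f"unknown dataset {name!r}; known: {known}") from None


spec = dataset_spec("cifar100")
# spec.channels and spec.classes can now size the model's input and output.
```

Naming the fields makes it harder to swap `channels` and `classes` by accident when a new entry is added.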

Custom / local data

`get_federated_dataset()` returns `{"c2data": {cid: hf_dataset}, "test_data": hf_dataset}`, where each shard yields `{"img": tensor, "label": tensor}` after the transform is applied. To plug in data that isn't on Hugging Face, either build those dicts yourself (any object exposing `{"img", "label"}` batches works) and call `build_dataloaders(...)`, or add a new partitioner to `PARTITIONERS`.
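
The sketch below shows one way to assemble that dictionary shape from in-memory data. It makes several assumptions: `ListShard` and `make_federated` are hypothetical helpers, plain Python lists stand in for tensors, and a map-style object with `__len__` and `__getitem__` is assumed to be acceptable to `build_dataloaders`, whose signature is not shown on this page and is therefore not called here.

```python
class ListShard:
    """Hypothetical map-style shard yielding {"img", "label"} items."""

    def __init__(self, images, labels):
        if len(images) != len(labels):
            raise ValueError("images and labels must have the same length")
        self.images = images
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, i):
        # Real code would return tensors here; lists are stand-ins.
        return {"img": self.images[i], "label": self.labels[i]}


def make_federated(images, labels, num_clients, test_fraction=0.2):
    """Build {"c2data": {cid: shard}, "test_data": shard} from local data."""
    n_test = int(len(labels) * test_fraction)
    n_train = len(labels) - n_test
    train_x, train_y = images[:n_train], labels[:n_train]
    # Simple round-robin (IID-like) assignment of training samples.
    c2data = {
        cid: ListShard(train_x[cid::num_clients], train_y[cid::num_clients])
        for cid in range(num_clients)
    }
    test_data = ListShard(images[n_train:], labels[n_train:])
    return {"c2data": c2data, "test_data": test_data}


images = [[float(i)] for i in range(100)]  # toy 1-pixel "images"
labels = [i % 10 for i in range(100)]
fed = make_federated(images, labels, num_clients=4)
```

Whether the resulting dict can be handed to `build_dataloaders(...)` unchanged depends on that function's actual parameters, so check its definition in `fltest/data/datasets.py` first.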

Add a new partitioner

PARTITIONERS maps a name to a factory f(num_partitions, **kwargs) -> Partitioner:

PARTITIONERS = {
    "iid": lambda n, **kw: IidPartitioner(num_partitions=n),
    "my_skew": lambda n, alpha=0.3, **kw: DirichletPartitioner(
        num_partitions=n, partition_by="label", alpha=alpha),
}

Then data_distribution: my_skew is usable from any config.
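
The `**kw` catch-all in each factory suggests that extra config keys can be passed through and ignored by partitioners that do not need them. The sketch below illustrates that dispatch pattern under stated assumptions: `make_partitioner` is a hypothetical helper, the way FLTest forwards config keys to the factory is assumed rather than documented here, and plain dicts stand in for the real flwr-datasets partitioner objects so the example runs without extra dependencies.

```python
# Stand-in registry: dicts replace real Partitioner instances.
PARTITIONERS = {
    "iid": lambda n, **kw: {"kind": "iid", "num_partitions": n},
    "my_skew": lambda n, alpha=0.3, **kw: {
        "kind": "dirichlet",
        "num_partitions": n,
        "partition_by": "label",
        "alpha": alpha,
    },
}


def make_partitioner(config):
    """Resolve config["data_distribution"] to a partitioner via the registry."""
    name = config["data_distribution"]
    if name not in PARTITIONERS:
        known = ", ".join(sorted(PARTITIONERS))
        raise ValueError(f"unknown data_distribution {name!r}; known: {known}")
    # Assumption: all remaining keys are forwarded; **kw absorbs unused ones.
    extra = {
        key: value
        for key, value in config.items()
        if key not in ("data_distribution", "num_clients")
    }
    return PARTITIONERS[name](config["num_clients"], **extra)


skewed = make_partitioner(
    {"dataset": "cifar10", "data_distribution": "my_skew",
     "num_clients": 10, "alpha": 0.1}
)
default = make_partitioner({"data_distribution": "my_skew", "num_clients": 5})
uniform = make_partitioner({"data_distribution": "iid", "num_clients": 3, "alpha": 0.1})
```

Note that the `my_skew` factory reads a keyword named `alpha`, so a config key with a different name (such as `dirichlet_alpha`) would fall into `**kw` and be ignored unless the caller maps it.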