Differential testing
Idea: runs that differ only in some controlled way should agree. FLTest supports two
modes, cross_framework and determinism, selected by testing.differential.mode.
Code: fltest/testing/differential.py.
What is measured
A single scalar metric is read from each run's final dict, selected by
testing.differential.metric (default accuracy). This is typically the final-round
global-model test accuracy.
Mode 1: cross_framework (parity)
Same logical config, different FL framework ⇒ same final metric within tolerance. This surfaces framework-implementation divergences (the kind of bug the project's AppFL CUDA-on-CPU finding represents).
How it groups: runs are grouped by a logical key — every spec field except the
framework (dataset, data_distribution, model_name, num_clients, num_rounds,
client_epochs, client_lr, client_batch_size, optimizer, seed, and the
attack/defense set). Within each group of ≥2 frameworks:
Rule (parity): max(metric) − min(metric) ≤ tolerance ⇒ PASS.
runs:
  - {framework: reference}
  - {framework: flwr}
  - {framework: nvflare}
testing:
  differential: {mode: cross_framework, metric: accuracy, tolerance: 0.05}
[✓ PASS] differential: mnist/iid MLP c3 r2 :: ['reference', 'flwr', 'nvflare']
max|Δ|=0.0244 (tol=0.05)
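The grouping and parity rule can be illustrated with a minimal Python sketch. This is not the code in fltest/testing/differential.py: the run layout (`spec` and `final` dicts), the helper names `logical_key` and `parity_check`, and the accuracy values are all assumptions made for illustration, and the attack/defense set is left out of the key for brevity.

```python
from collections import defaultdict

# Logical-key fields as listed in the text; the attack/defense set is omitted here.
LOGICAL_FIELDS = (
    "dataset", "data_distribution", "model_name", "num_clients", "num_rounds",
    "client_epochs", "client_lr", "client_batch_size", "optimizer", "seed",
)


def logical_key(spec):
    """Identify a run by every listed field except its framework (hypothetical helper)."""
    return tuple(spec.get(field) for field in LOGICAL_FIELDS)


def parity_check(runs, metric="accuracy", tolerance=0.05):
    """Group runs by logical key and apply max(metric) - min(metric) <= tolerance."""
    groups = defaultdict(list)
    for run in runs:
        groups[logical_key(run["spec"])].append(run)

    results = {}
    for key, members in groups.items():
        frameworks = sorted({m["spec"]["framework"] for m in members})
        if len(frameworks) < 2:
            continue  # a single framework has nothing to be compared against
        values = [m["final"][metric] for m in members]
        spread = max(values) - min(values)
        results[key] = (spread <= tolerance, spread, frameworks)
    return results


# Made-up runs: same logical config, three frameworks, illustrative accuracies.
base = {"dataset": "mnist", "data_distribution": "iid", "model_name": "MLP",
        "num_clients": 3, "num_rounds": 2, "seed": 0}
runs = [
    {"spec": {**base, "framework": "reference"}, "final": {"accuracy": 0.9500}},
    {"spec": {**base, "framework": "flwr"}, "final": {"accuracy": 0.9300}},
    {"spec": {**base, "framework": "nvflare"}, "final": {"accuracy": 0.9256}},
]
results = parity_check(runs)
for passed, spread, frameworks in results.values():
    print("PASS" if passed else "FAIL", frameworks, f"max|Δ|={spread:.4f}")
```

Comparing the maximum against the minimum, rather than each framework against a fixed baseline, means no single backend is treated as ground truth: any outlier widens the spread.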
Parity, not bit-identity
Different frameworks use different RNG streams and execution paths, so exact equality should not be expected. The reference and Flower backends share FLTest's own weighted-mean aggregation, which keeps them close; the tolerance (default 0.05, as in the project slides) absorbs the residual difference. Choose a tolerance that matches your metric's natural variance.
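One way to get a feel for a metric's natural variance is to repeat the same configuration under a few seeds and look at how far the results spread. The sketch below is a heuristic, not something FLTest provides: `suggest_tolerance`, the `margin` factor, and the accuracy values are assumptions for illustration only.

```python
import statistics


def suggest_tolerance(values, margin=1.5):
    """Return (suggested tolerance, sample standard deviation) for repeated runs.

    The parity rule compares max - min, so the observed range across repeats
    is scaled by a safety margin to give a starting point for the tolerance.
    """
    observed_range = max(values) - min(values)
    return margin * observed_range, statistics.stdev(values)


# Made-up final accuracies from four repeats of one config with different seeds.
accuracies = [0.912, 0.905, 0.921, 0.909]
tolerance, stdev = suggest_tolerance(accuracies)
print(f"suggested tolerance={tolerance:.3f}, stdev={stdev:.4f}")
```

Seed-to-seed variation within one framework is only a baseline for noise; cross-framework differences can be larger, so treat the result as a lower bound to refine rather than a final setting.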
Mode 2: determinism
The same spec run twice must give an (almost) identical result. A failure points to hidden nondeterminism or state leakage between runs.
Rule: |metric(run1) − metric(run2)| ≤ tolerance ⇒ PASS, with both runs on device: cpu.
Use a tiny tolerance, e.g. 1e-4.
testing:
  differential: {mode: determinism, metric: accuracy, tolerance: 0.0001}
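The determinism rule can be sketched as a small Python function that runs the same spec twice and compares the chosen metric. The names `determinism_check`, `seeded_run`, and `leaky_run` are hypothetical stand-ins, not FLTest APIs; the two toy runners only simulate a well-behaved run and one that leaks state between calls.

```python
import random


def determinism_check(run_once, spec, metric="accuracy", tolerance=1e-4):
    """Run the same spec twice; pass if the metric differs by at most tolerance."""
    first = run_once(spec)[metric]
    second = run_once(spec)[metric]
    delta = abs(first - second)
    return delta <= tolerance, delta


def seeded_run(spec):
    """Toy run whose randomness is fully controlled by the spec's seed."""
    rng = random.Random(spec["seed"])
    return {"accuracy": 0.9 + 0.01 * rng.random()}


_history = []  # module-level state that survives between calls


def leaky_run(spec):
    """Toy run whose result depends on how many runs came before it."""
    _history.append(spec)
    return {"accuracy": 0.9 + 0.01 * len(_history)}


clean_ok, clean_delta = determinism_check(seeded_run, {"seed": 0})
leaky_ok, leaky_delta = determinism_check(leaky_run, {"seed": 0})
print("seeded:", "PASS" if clean_ok else "FAIL")
print("leaky: ", "PASS" if leaky_ok else "FAIL")
```

The leaky runner fails even though its spec and seed are identical on both calls, which is the kind of hidden state this mode is meant to expose.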
Run it
fltest diff <config.yaml>
Prints a PASS/FAIL table, writes reports/<name>_differential.json, and exits non-zero if
any check fails (CI-friendly).