Metrics: what FLTest measures
Every run produces a RunResult with per-round history and a final dict (last round).
These metrics are what differential and metamorphic tests read.
Always measured (every backend, every round)
Computed by evaluating the global model on the central test set after each round:
| Metric | Meaning |
|---|---|
| accuracy | top-1 accuracy of the global model on the test subset |
| loss | mean cross-entropy loss on the test subset |
| gm_weight_sum | sum of all global-model parameters; a cheap fingerprint to spot divergence/NaNs |
The test subset size is max_test_data_size. These three appear in final for every run.
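As a rough illustration of what these three numbers capture, here is a minimal pure-Python sketch. It is not FLTest's implementation: the function name `evaluate_round`, the flat-list representation of the weights, and the toy inputs are all assumptions made for the example.

```python
import math


def evaluate_round(logits, labels, weights):
    """Compute the three always-on metrics for one round (illustrative only).

    logits:  list of per-sample lists of raw class scores
    labels:  list of integer class ids
    weights: flat list of global-model parameters (assumed representation)
    """
    correct = 0
    total_loss = 0.0
    for row, label in zip(logits, labels):
        # Top-1 prediction is the index of the largest logit.
        predicted = max(range(len(row)), key=row.__getitem__)
        correct += int(predicted == label)
        # Cross-entropy via a numerically stable log-sum-exp.
        peak = max(row)
        log_sum_exp = peak + math.log(sum(math.exp(v - peak) for v in row))
        total_loss += log_sum_exp - row[label]
    n = len(labels)
    return {
        "accuracy": correct / n,
        "loss": total_loss / n,
        # A single NaN in any parameter makes the whole sum NaN.
        "gm_weight_sum": sum(weights),
    }


# Toy inputs: three samples, two classes, three parameters.
logits = [[2.0, 0.0], [0.0, 2.0], [2.0, 0.0]]
labels = [0, 1, 1]
metrics = evaluate_round(logits, labels, [0.5, -0.25, 1.0])
```

The weight sum is useful as a fingerprint because it changes whenever the parameters change and becomes NaN as soon as any parameter does, which makes a diverged run easy to spot without comparing full models.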
Produced by plugins (when configured)
| Metric | Produced by | Meaning |
|---|---|---|
| attack_success_rate | backdoor attack | fraction of a triggered test set predicted as the target label (excludes samples already of the target) |
| reconstruction_mse | dlg attack | pixel MSE between the reconstructed and true victim image (lower = better reconstruction = worse privacy) |
| reconstruction_psnr | dlg attack | peak signal-to-noise ratio of the reconstruction (higher = better reconstruction) |
| label_recovery | dlg attack | fraction of victim labels correctly recovered |
| per_client_acc_mean / per_client_acc_min | per_client listener | personalized accuracy of the final global model on each client's own data; min exposes representation disparity (project Pitfall-3) |
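The attack metrics above follow standard definitions, which can be sketched as below. This is a hedged illustration, not FLTest's code: the function names, the flat-list image representation, the pixel range of [0, 1] used for PSNR, and the choice to return 0 when no sample is eligible are all assumptions.

```python
import math


def attack_success_rate(predictions, true_labels, target_label):
    """Fraction of triggered samples classified as the target label,
    skipping samples whose true label is already the target."""
    eligible = [p for p, y in zip(predictions, true_labels) if y != target_label]
    if not eligible:
        return 0.0  # assumption: report 0 when no sample is eligible
    return sum(p == target_label for p in eligible) / len(eligible)


def reconstruction_quality(reconstructed, original, max_value=1.0):
    """Pixel MSE and PSNR (in dB) between two equally sized flat images.

    max_value is the assumed peak pixel value; 1.0 fits images scaled to [0, 1].
    """
    mse = sum((r - o) ** 2 for r, o in zip(reconstructed, original)) / len(original)
    psnr = math.inf if mse == 0 else 10.0 * math.log10(max_value ** 2 / mse)
    return mse, psnr


# Toy data: the second sample already has the target label 7, so it is excluded.
asr = attack_success_rate([7, 7, 3, 7], [1, 7, 3, 2], target_label=7)
mse, psnr = reconstruction_quality([0.0, 0.5, 1.0, 0.5], [0.0, 0.5, 0.9, 0.4])
```

Excluding samples that already carry the target label keeps the rate from being inflated by predictions that would be correct even without the trigger.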
Add per_client to metrics: to enable personalized evaluation. Attack metrics appear
automatically when the relevant attack is configured.
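To see why the minimum is reported alongside the mean, consider this small sketch. The helper `per_client_summary`, the client names, and the accuracy values are hypothetical and only illustrate the aggregation.

```python
def per_client_summary(client_accuracies):
    """Aggregate per-client accuracies into a mean and a minimum."""
    values = list(client_accuracies.values())
    return {
        "per_client_acc_mean": sum(values) / len(values),
        "per_client_acc_min": min(values),
    }


# Illustrative values only: one client is served much worse than the others.
summary = per_client_summary({"client_0": 0.91, "client_1": 0.88, "client_2": 0.42})
```

Here the mean stays above 0.7 while the minimum is 0.42, so only the minimum reveals that one client is poorly represented by the global model.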
Where metrics live
- result.history[round]: dict of metrics for that round.
- result.final: metrics from the last round (what tests assert on).
- result.extras: non-scalar detail (e.g. the DLG reconstruction summary).
- JSON report under reports/ contains all of the above.
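The relationship between these fields can be sketched with a stand-in object. `RunResultSketch` is hypothetical and is not FLTest's RunResult class; storing history as a list indexed by round and deriving final from it are assumptions, and the metric values are made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class RunResultSketch:
    """Hypothetical stand-in for RunResult, mirroring the fields listed above."""

    history: list  # assumption: one metrics dict per round, indexed by round
    extras: dict = field(default_factory=dict)  # non-scalar detail

    @property
    def final(self):
        # Metrics from the last round.
        return self.history[-1]


# Illustrative values only.
result = RunResultSketch(
    history=[
        {"accuracy": 0.31, "loss": 1.92, "gm_weight_sum": 12.5},
        {"accuracy": 0.58, "loss": 1.10, "gm_weight_sum": 13.1},
    ],
)

# Per-round curve for one metric, and the value a test would assert on.
accuracy_curve = [round_metrics["accuracy"] for round_metrics in result.history]
final_accuracy = result.final["accuracy"]
```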
Which metric do the tests use?
Both testers operate on a single scalar metric from final, chosen by the config:
- Differential (testing.differential.metric, default accuracy): compares that metric across frameworks. See Differential testing.
- Metamorphic (per-relation metric, default accuracy): tracks that metric as one input parameter is swept. See Metamorphic testing.
You can point either tester at any metric in final. For example, set a metamorphic relation's
metric: attack_success_rate to assert that ASR is non-decreasing as attack strength rises,
or metric: reconstruction_mse to assert that it rises as DP noise increases.
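The two kinds of check described in this section reduce to simple operations on one scalar per run. The sketch below shows the general idea only; the helper names, the tolerance parameter, the framework names, and the values are assumptions, and FLTest's actual comparison logic may differ.

```python
def is_non_decreasing(values, tolerance=0.0):
    """True if each value is at least the previous one, minus a tolerance."""
    return all(
        later >= earlier - tolerance
        for earlier, later in zip(values, values[1:])
    )


def max_spread(metric_by_framework):
    """Largest gap in one metric across frameworks (differential view)."""
    values = list(metric_by_framework.values())
    return max(values) - min(values)


# Illustrative values only: ASR read from `final` at increasing attack strength.
asr_sweep = [0.05, 0.40, 0.38, 0.90]
strict = is_non_decreasing(asr_sweep)                    # False: 0.38 < 0.40
tolerant = is_non_decreasing(asr_sweep, tolerance=0.05)  # True

# Illustrative values only: the same metric from two hypothetical frameworks.
spread = max_spread({"framework_a": 0.81, "framework_b": 0.79})
```

A small tolerance is a common design choice for monotonicity checks on training metrics, since run-to-run noise can produce tiny dips that do not indicate a real violation.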