Metrics — what FLTest measures

Every run produces a RunResult with a per-round history and a final dict holding the last round's metrics. Differential and metamorphic tests read these metrics.

Always measured (every backend, every round)

Computed by evaluating the global model on the central test set after each round:

| Metric | Meaning |
|---|---|
| accuracy | top-1 accuracy of the global model on the test subset |
| loss | mean cross-entropy loss on the test subset |
| gm_weight_sum | sum of all global-model parameters; a cheap fingerprint for spotting divergence or NaNs |

The test subset size is set by max_test_data_size. These three metrics appear in final for every run.
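
As a rough illustration of how these three values could be derived, the sketch below computes them in plain Python from predicted class probabilities, true labels, and a flattened parameter list. The function name evaluate_global_model and its input layout are assumptions made for this example, not FLTest's actual evaluation code.

```python
import math


def evaluate_global_model(probs, labels, params):
    """Hypothetical helper: compute the three always-measured metrics.

    probs  -- list of per-sample class-probability lists
    labels -- list of true class indices
    params -- flat list of all global-model parameters
    """
    n = len(labels)
    # Top-1 accuracy: the predicted class is the index of the largest probability.
    correct = sum(
        1
        for p, y in zip(probs, labels)
        if max(range(len(p)), key=p.__getitem__) == y
    )
    # Mean cross-entropy; the clamp avoids log(0) for zero probabilities.
    loss = -sum(math.log(max(p[y], 1e-12)) for p, y in zip(probs, labels)) / n
    return {
        "accuracy": correct / n,
        "loss": loss,
        # A single NaN parameter makes the whole sum NaN, which is why
        # this value works as a cheap health check.
        "gm_weight_sum": sum(params),
    }


metrics = evaluate_global_model(
    probs=[[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]],
    labels=[0, 1, 1],
    params=[0.5, -0.25, 1.0],
)
broken = evaluate_global_model([[0.5, 0.5]], [0], [1.0, float("nan")])
```

Checking math.isnan(broken["gm_weight_sum"]) is enough to flag the second model as corrupted without inspecting individual parameters.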

Produced by plugins (when configured)

| Metric | Produced by | Meaning |
|---|---|---|
| attack_success_rate | backdoor attack | fraction of a triggered test set predicted as the target label (excludes samples already of the target) |
| reconstruction_mse | dlg attack | pixel MSE between the reconstructed and true victim image (lower = better reconstruction = worse privacy) |
| reconstruction_psnr | dlg attack | peak signal-to-noise ratio of the reconstruction (higher = better reconstruction) |
| label_recovery | dlg attack | fraction of victim labels correctly recovered |
| per_client_acc_mean / per_client_acc_min | per_client listener | personalized accuracy of the final global model on each client's own data; min exposes representation disparity (project Pitfall-3) |

To enable personalized evaluation, add per_client under metrics: in the config. Attack metrics appear automatically when the relevant attack is configured.
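
The definitions in the table above can be made concrete with a short sketch. All function names here are hypothetical, and the PSNR helper assumes pixel values scaled to [0, 1] (peak = 1.0), which the documentation does not state; FLTest's own plugins may compute these differently.

```python
import math


def attack_success_rate(preds, true_labels, target):
    """Fraction of triggered samples predicted as `target`,
    skipping samples whose true label is already `target`."""
    kept = [p for p, y in zip(preds, true_labels) if y != target]
    if not kept:
        return 0.0  # assumption: report 0 when nothing is left to attack
    return sum(1 for p in kept if p == target) / len(kept)


def psnr_from_mse(mse, peak=1.0):
    """Standard PSNR in dB: 10 * log10(peak^2 / mse). `peak` is an assumption."""
    if mse == 0:
        return math.inf
    return 10.0 * math.log10(peak ** 2 / mse)


def per_client_summary(client_accs):
    """Mean and minimum of per-client accuracies."""
    return {
        "per_client_acc_mean": sum(client_accs) / len(client_accs),
        "per_client_acc_min": min(client_accs),
    }


asr = attack_success_rate(preds=[1, 1, 0, 1], true_labels=[0, 1, 2, 2], target=1)
psnr = psnr_from_mse(0.01)
summary = per_client_summary([0.9, 0.5, 0.7])
```

In this toy input the second sample is dropped from the ASR denominator because its true label is already the target. The per-client summary shows why the minimum matters: a healthy-looking mean can hide one client that the global model serves poorly.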

Where metrics live

  • result.history[round] — dict of metrics for that round.
  • result.final — metrics from the last round (what tests assert on).
  • result.extras — non-scalar detail (e.g. the DLG reconstruction summary).
  • The JSON report under reports/ contains all of the above.
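
A minimal sketch of reading these fields follows. The RunResult dataclass below is a hypothetical stand-in that only mirrors the three attribute names listed above; it assumes history is a zero-indexed list of dicts, and the extras key is invented for illustration. The real class may be structured differently.

```python
from dataclasses import dataclass, field


@dataclass
class RunResult:
    """Hypothetical stand-in with the three documented attributes."""
    history: list = field(default_factory=list)  # one metrics dict per round
    final: dict = field(default_factory=dict)    # metrics from the last round
    extras: dict = field(default_factory=dict)   # non-scalar detail


history = [
    {"accuracy": 0.41, "loss": 1.62, "gm_weight_sum": 12.5},
    {"accuracy": 0.58, "loss": 1.10, "gm_weight_sum": 12.9},
]
result = RunResult(
    history=history,
    final=history[-1],
    extras={"dlg_summary": {"note": "illustrative placeholder"}},
)

# Per-round curve for one metric, e.g. to plot or inspect convergence.
accuracy_curve = [round_metrics["accuracy"] for round_metrics in result.history]

# The value a test would assert on.
final_accuracy = result.final["accuracy"]
```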

Which metric do the tests use?

Both testers operate on a single scalar metric from final, chosen by the config:

  • Differential (testing.differential.metric, default accuracy): compares that metric across frameworks. See Differential testing.
  • Metamorphic (per-relation metric, default accuracy): tracks that metric as one input parameter is swept. See Metamorphic testing.
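
To make the differential case concrete, the sketch below pulls one scalar from each framework's final dict and reports the spread between the highest and lowest value. The helper name, the framework labels, and the tolerance are all assumptions for illustration; the actual comparison rule is described under Differential testing.

```python
def metric_spread(finals_by_framework, metric="accuracy"):
    """Hypothetical helper: collect one metric per framework and
    return (max - min, per-framework values)."""
    values = {name: final[metric] for name, final in finals_by_framework.items()}
    return max(values.values()) - min(values.values()), values


# Illustrative final dicts from two runs of the same experiment.
finals = {
    "framework_a": {"accuracy": 0.81, "loss": 0.62},
    "framework_b": {"accuracy": 0.79, "loss": 0.66},
}

spread, values = metric_spread(finals, metric="accuracy")
tolerance = 0.05  # assumed threshold, not an FLTest default
frameworks_agree = spread <= tolerance
```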

You can point either tester at any metric in final. For example, set a metamorphic relation's metric: attack_success_rate to assert that the attack success rate (ASR) is non-decreasing as attack strength rises, or metric: reconstruction_mse to assert that reconstruction error rises as DP noise increases.
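
A monotonicity assertion of this kind can be sketched in a few lines. The helper below and its tol parameter are hypothetical, and the swept values are made up for illustration; they are not FLTest output.

```python
def is_non_decreasing(values, tol=0.0):
    """True if each value is at least the previous one, allowing a
    drop of up to `tol` to absorb run-to-run noise."""
    return all(b >= a - tol for a, b in zip(values, values[1:]))


# final["attack_success_rate"] collected at increasing attack strengths.
asr_by_strength = [0.10, 0.40, 0.40, 0.90]

# final["reconstruction_mse"] collected at increasing DP noise levels,
# with one dip in the middle.
mse_by_noise = [0.10, 0.50, 0.30]

asr_ok = is_non_decreasing(asr_by_strength)
mse_strict_ok = is_non_decreasing(mse_by_noise)
mse_tolerant_ok = is_non_decreasing(mse_by_noise, tol=0.25)
```

The tolerance is a design choice: federated training is noisy, so a strict check can fail on small dips that do not contradict the relation being tested.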