# TIL: Confidence vs. Variability

Tracking Metrics over Epochs to Understand Labels Better.

Vincent Warmerdam koaning.io
2021-07-21

I asked a question on twitter on finding bad labels in training data that gave some interesting responses. One suggested idea introduces the idea of tracking “variability” and “confidence” while training a neural network. It’s described in this arxiv paper.

The confidence metric $$\mu_i$$ for a data-point is the mean model probability of the true label that we measure across epochs.

$\hat{\mu}_{i}=\frac{1}{E} \sum_{e=1}^{E} p_{\boldsymbol{\theta}^{(e)}}\left(y_{i}^{*} \mid \boldsymbol{x}_{i}\right)$

The “variability” metric measures the spread of the estimated confidence across training epochs using standard deviation.

$\hat{\sigma}_{i}=\sqrt{\frac{\sum_{e=1}^E [p_{\boldsymbol{\theta}^{(e)}}\left(y_{i}^{*} \mid \boldsymbol{x}_{i}\right) - \mu_i ]^2}{E}}$

For both of these measure you got to note that $$p_{\boldsymbol{\theta}^{(e)}}$$ is the probability given weights $$\theta$$ during epoch $$e$$. As we train our system, the weights update and so do the estimated probabilities.

## Output

So what do you get when you train a system this way? You’ll get data that can generate images like this one;

After your training run, you kept track of the confidence and the variability in the training loop. You can then wonder what it might mean to have a data-point with high confidence and low variability. You could argue that these data-points were perhaps the ones that were easy to distinguish and learn.

Similarly you could argue that there’s a region where data-points are hard to label but also that there’s a region with ambiguous data-points. These groups of data-points may deserve an extra look. There may be bad labels in there or classes that deserve a better definition.

It’s a neat idea and I gotta try it out some time.

EDIT

I tried it out and didn’t really like it. The problem is that this method depends a lot on the number of epochs that you pick. Got a lot of epochs? This’ll impact the confidence and variability. Got very few epochs? Same!

There’s simpler tools out there that don’t require you to worry about the epochs and those certainly seem preferable.