Common Sense reduced to Priors

The goal of this document is to summarise a lesson we’ve learned in the last year. We’ve done a lot of work on algorithmic bias (and open-sourced it). We reflected and concluded that constraints are an amazing idea that deserves to be used more often in machine learning.

This point drove us to write the following formula on a whiteboard:

\[ \text{model} = \text{data} \times \text{constraints} \]

After writing it down, we noticed that we’ve seen this before but in a different notation.

\[ p(\theta | D) \propto p(D | \theta) p(\theta) \]

It’s poetic: maybe … just maybe … priors can be interpreted as constraints that we wish to impose on models. It is knowledge that we have about how the model *should* work even if the data wants to push us in another direction.
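One way to make the analogy concrete, as a sketch rather than a formal result: taking logarithms turns the product into a sum, so the prior acts as an additive penalty on top of the log likelihood. The function \(g\) and the scale \(\sigma\) below are illustrative symbols that do not appear on the original whiteboard.

\[ \log p(\theta | D) = \log p(D | \theta) + \log p(\theta) + \text{const} \]

If we assume, purely for illustration, a Gaussian prior on some quantity \(g(\theta)\) that we would like to stay near zero, the penalty is quadratic:

\[ \log p(\theta) = -\frac{g(\theta)^2}{2 \sigma^2} + \text{const} \]

The smaller \(\sigma\) is, the closer this soft penalty gets to the hard constraint \(g(\theta) = 0\).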

So what we’d like to do in this blogpost is explore the idea of constraints a bit more. First by showcasing how our open source package deals with it, then by showing how a probabilistic approach might be able to use Bayes’ rule to go the extra mile.

We will use a dataset from scikit-lego. It contains traffic arrests in Toronto and it is our job to predict if somebody is released after they are arrested. It has attributes for skin color, gender, age, employment, citizenship, past interactions and date. We consider date, employment and citizenship to be proxies that go into the model while we keep gender, skin color and age separate as sensitive attributes that we want to remain fair on.

Here’s a preview of the dataset.

released | colour | year | age | sex | employed | citizen | checks |
---|---|---|---|---|---|---|---|
Yes | False | 2002 | True | False | Yes | Yes | 3 |
No | True | 1999 | True | False | Yes | Yes | 3 |
Yes | False | 2000 | True | False | Yes | Yes | 3 |
No | True | 2000 | False | False | Yes | Yes | 1 |
Yes | True | 1999 | False | True | Yes | Yes | 1 |

The dataset is interesting for many reasons. Not only is fairness a risk; there is also a balancing issue. The balancing issue can be dealt with by adding a `class_weight` parameter, while the fairness can be dealt with in many ways (exhibit A, exhibit B). A favorable method (we think so) is to apply a hard constraint. Our implementation of `EqualOpportunityClassifier` does this by running a logistic regression constrained by the distance to the decision boundary in two groups.

```
from sklearn.linear_model import LogisticRegression
from sklego.linear_model import EqualOpportunityClassifier

# Baseline: plain logistic regression that only corrects for class imbalance.
unfair_model = LogisticRegression(class_weight='balanced')

# Constrained model: logistic regression with an equal opportunity constraint.
fair_model = EqualOpportunityClassifier(
    covariance_threshold=0.9,  # strictness of threshold
    positive_target='Yes',     # name of the preferable label
    sensitive_cols=[0, 1, 2]   # columns in X that are considered sensitive
)

unfair_model.fit(X, y)
fair_model.fit(X, y)
```
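As a small aside on the `class_weight='balanced'` setting: scikit-learn documents it as weighting each class by `n_samples / (n_classes * count)`. The sketch below reproduces that arithmetic with NumPy on made-up labels; the helper name `balanced_class_weights` and the toy data are ours, not part of any library.

```python
import numpy as np


def balanced_class_weights(y):
    """Return {label: weight} with weight = n_samples / (n_classes * count)."""
    labels, counts = np.unique(y, return_counts=True)
    weights = len(y) / (len(labels) * counts)
    return dict(zip(labels.tolist(), weights.tolist()))


# Toy labels (made up): 'Yes' is four times as common as 'No'.
y_toy = np.array(['Yes'] * 8 + ['No'] * 2)
weights_toy = balanced_class_weights(y_toy)
```

The rare class ends up with the larger weight, so both classes contribute equally to the total loss.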

Logistic regression works by maximising the log likelihood, which is the same as minimising the negative log likelihood.

\[ \begin{array}{cl}{\operatorname{minimize}} & -\sum_{i=1}^{N} \log p\left(y_{i} | \mathbf{x}_{i}, \boldsymbol{\theta}\right) \\\end{array} \]

But what if we add constraints here? That’s what the `EqualOpportunityClassifier` does.

\[ \begin{array}{cl}{\operatorname{minimize}} & -\sum_{i=1}^{N} \log p\left(y_{i} | \mathbf{x}_{i}, \boldsymbol{\theta}\right) \\ {\text { subject to }} & {\frac{1}{POS} \sum_{i=1}^{POS}\left(\mathbf{z}_{i}-\overline{\mathbf{z}}\right) d_{\boldsymbol{\theta}}\left(\mathbf{x}_{i}\right) \leq \mathbf{c}} \\ {} & {\frac{1}{POS} \sum_{i=1}^{POS}\left(\mathbf{z}_{i}-\overline{\mathbf{z}}\right) d_{\boldsymbol{\theta}}\left(\mathbf{x}_{i}\right) \geq-\mathbf{c}}\end{array} \]

It minimises the log loss while constraining the correlation between the specified `sensitive_cols` and the distance to the decision boundary of the classifier for those examples that have a `y_true` of 1.
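To make the constrained quantity tangible, here is a minimal NumPy sketch of the left-hand side of the constraint for a single sensitive column. It is not the scikit-lego implementation. We assume that \(d_{\boldsymbol{\theta}}(\mathbf{x})\) is the linear predictor and that \(\overline{\mathbf{z}}\) is averaged over the positive examples only; the function name and the toy data are made up.

```python
import numpy as np


def boundary_covariance(X, z, y_true, theta, intercept=0.0):
    """Mean of (z_i - z_bar) * d_theta(x_i) over the examples with y_true == 1."""
    positive = y_true == 1
    # Linear predictor, used here as the (unnormalised) distance to the boundary.
    distance = intercept + X[positive] @ theta
    z_pos = z[positive]
    return float(np.mean((z_pos - z_pos.mean()) * distance))


# Toy data (made up): one feature, one binary sensitive attribute.
X_toy = np.array([[1.0], [2.0], [3.0], [4.0], [0.0]])
z_toy = np.array([0, 0, 1, 1, 1])
y_toy = np.array([1, 1, 1, 1, 0])

cov_unfair = boundary_covariance(X_toy, z_toy, y_toy, theta=np.array([1.0]))
cov_flat = boundary_covariance(X_toy, z_toy, y_toy, theta=np.array([0.0]))
```

The constraint asks for `abs(covariance) <= c`. In the toy data a weight of 1 pushes the sensitive group further from the boundary, while a weight of 0 trivially satisfies any threshold.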

The main difference between the two approaches is that in the Logistic Regression scheme we drop the sensitive columns while the other approach actively corrects for them. The table below shows the cross-validated summary of the mean test performance of both models.

model | eqo_color | eqo_age | eqo_sex | precision | recall |
---|---|---|---|---|---|
LR | 0.6986 | 0.7861 | 0.8309 | 0.9187 | 0.6345 |
EOC | 0.9740 | 0.9929 | 0.9892 | 0.8353 | 0.9893 |
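The post does not spell out how the `eqo_*` columns are computed. A plausible reading, and it is only an assumption, is a ratio-style equal opportunity score: compare the true positive rate inside and outside the sensitive group and report the smaller of the ratio and its inverse, so that 1 means both groups are treated alike. The function below is a hypothetical sketch of that idea on made-up data.

```python
import numpy as np


def equal_opportunity_ratio(y_true, y_pred, z, positive_target=1):
    """min(r, 1/r), where r is the ratio of true positive rates between the groups."""
    is_pos = y_true == positive_target
    tpr_in = np.mean(y_pred[is_pos & (z == 1)] == positive_target)
    tpr_out = np.mean(y_pred[is_pos & (z == 0)] == positive_target)
    if tpr_in == 0 or tpr_out == 0:
        return 0.0
    return float(min(tpr_in / tpr_out, tpr_out / tpr_in))


# Toy data (made up): six actual positives, four of them in the sensitive group.
y_true_toy = np.array([1, 1, 1, 1, 1, 1, 0, 0])
y_pred_toy = np.array([1, 1, 1, 0, 1, 0, 0, 1])
z_toy = np.array([1, 1, 1, 1, 0, 0, 0, 1])

score_toy = equal_opportunity_ratio(y_true_toy, y_pred_toy, z_toy)
```

Under this reading, the jump from roughly 0.70 to 0.97 on `eqo_color` would mean that the true positive rates of the two groups moved much closer together.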

You can also confirm the difference between the two models by looking at their coefficients.

model | intercept | employed | citizen | year | checks |
---|---|---|---|---|---|
LR | -1.0655 | 0.7913 | 0.7537 | -0.0101 | -0.5951 |
EOC | 0.5833 | 0.7710 | 0.6826 | -0.0196 | -0.5798 |

There are a few things to note at this stage;

- The more fair model seems to shift the intercept drastically but the other columns (which are normalised) less so.
- The fairer model is arguably still useful despite having a lower precision. It compensates with fairness but also with recall.
- Having a fairness constraint based on distance to the decision boundary is one description of fairness. There are many other measures of fairness and we’re only using a particular one because it offers us a convex algorithm that can give us guarantees on the training data. It would be nice to have the opportunity to define more flexible definitions of fairness.
- The current modelling approach allows us to make predictions but these predictions do not have uncertainty bounds.
- Linear models are fine, but sometimes we may want to apply constraints to hierarchical models.

This brings us back to the formula that we started with.

\[ \text{model} = \text{data} \times \text{constraints} \]

In our case the constraints we want concern fairness.

\[ p(\theta | D) \propto \underbrace{p(D | \theta)}_{\text{data}} \underbrace{p(\theta)}_{\text{fairness?}} \]

So can we come up with a prior for that?

To explore this idea we set out to reproduce our results from earlier in PyMC3. We started with an implementation of logistic regression but found that it did not match our earlier results. The results of the trace are listed below. We show the distribution of the weights as well as a distribution over the unfairness.

```
import pymc3 as pm

# X holds all rows; X_colour and X_non_colour hold the rows of each colour group.
with pm.Model() as unbalanced_model:
    intercept = pm.Normal('intercept', 0, 1)
    weights = pm.Normal('weights', 0, 1, shape=X.shape[1])
    p = pm.math.sigmoid(intercept + pm.math.dot(X, weights))
    # Distance to the decision boundary for each group.
    dist_colour = intercept + pm.math.dot(X_colour, weights)
    dist_non_colour = intercept + pm.math.dot(X_non_colour, weights)
    # Unfairness: difference in mean distance between the two groups.
    mu_diff = pm.Deterministic('mu_diff', dist_colour.mean() - dist_non_colour.mean())
    pm.Bernoulli('released', p, observed=df['released'])
    unbalanced_trace = pm.sample(tune=1000, draws=1000, chains=6)
```

You might notice that the results of the traceplot do not agree with the coefficients that we’ve found earlier. This is because our original logistic regression had a `balanced` setting which is not included in our PyMC3 approach. Luckily for us PyMC3 has a feature to address this: `pm.Potential`.

`pm.Potential`

The idea behind the potential is that you add a prior on a combination of parameters instead of just having it on a single one. For example, this is how you’d usually set parameters;

```
mu = pm.Normal('mu', 0, 1)
# HalfNormal only takes a scale; passing `0, 1` positionally would set both sigma and tau.
sigma = pm.HalfNormal('sigma', sigma=1)
```

By setting the `sigma` prior to be `HalfNormal` we prevent it from ever becoming negative. But what if we’d like to set another prior, namely that \(\mu \approx \sigma\)? This is what `pm.Potential` can be used for.

```
pm.Potential('balance', pm.Normal.dist(0, 0.1).logp(mu - sigma))
```

Adding a potential has an effect on the likelihood of a tracepoint.

This in turn will make the posterior look different.
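To get a feel for the size of this effect, the sketch below evaluates by hand the term that the `balance` potential adds to the log probability. It uses plain Python instead of PyMC3; the helper names are ours and the parameter values are made up.

```python
import math


def normal_logpdf(x, mu=0.0, sigma=1.0):
    """Log density of a Normal(mu, sigma) distribution at x."""
    return -0.5 * ((x - mu) / sigma) ** 2 - math.log(sigma * math.sqrt(2 * math.pi))


def balance_potential(mu, sigma):
    # Mirrors pm.Normal.dist(0, 0.1).logp(mu - sigma) for plain floats.
    return normal_logpdf(mu - sigma, 0.0, 0.1)


close = balance_potential(mu=1.0, sigma=1.0)  # mu equals sigma: no penalty
far = balance_potential(mu=1.0, sigma=0.5)    # mu and sigma differ by 0.5
```

A point where `mu` and `sigma` differ by 0.5 loses 12.5 units of log probability compared to a point where they agree, which is why the sampler spends most of its time near \(\mu \approx \sigma\).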

So we made a second version of the logistic regression.

```
# sample_weights holds one class-balancing weight per row of df.
with pm.Model() as balanced_model:
    intercept = pm.Normal('intercept', 0, 1)
    weights = pm.Normal('weights', 0, 1, shape=X.shape[1])
    p = pm.math.sigmoid(intercept + pm.math.dot(X, weights))
    dist_colour = intercept + pm.math.dot(X_colour, weights)
    dist_non_colour = intercept + pm.math.dot(X_non_colour, weights)
    mu_diff = pm.Deterministic('mu_diff', dist_colour.mean() - dist_non_colour.mean())
    # Weighted log likelihood, replacing the observed Bernoulli of the first model.
    pm.Potential('balance', sample_weights.values * pm.Bernoulli.dist(p).logp(df['released'].values))
    balanced_trace = pm.sample(tune=1000, draws=1000, chains=6)
```

These results were in line with our previous result again.
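What the `balance` potential computes can be written out in NumPy as a weighted Bernoulli log likelihood. The sketch below assumes 0/1 labels and uses the same inverse-frequency weights that appear later in this post (`len(df) / df['released'].value_counts()`); the function name and the toy numbers are made up.

```python
import numpy as np


def weighted_bernoulli_loglik(p, y, sample_weights):
    """Sum over rows of weight_i * log p(y_i), with y_i in {0, 1}."""
    per_row = y * np.log(p) + (1 - y) * np.log(1 - p)
    return float(np.sum(sample_weights * per_row))


# Toy data (made up): three positives, one negative, an uninformative model.
y_toy = np.array([1, 1, 1, 0])
p_toy = np.full(4, 0.5)

# Inverse-frequency weights: n_rows / class_count for the class of each row.
counts = np.bincount(y_toy)
w_toy = len(y_toy) / counts[y_toy]

total = weighted_bernoulli_loglik(p_toy, y_toy, w_toy)
positive_part = weighted_bernoulli_loglik(p_toy[y_toy == 1], y_toy[y_toy == 1], w_toy[y_toy == 1])
negative_part = weighted_bernoulli_loglik(p_toy[y_toy == 0], y_toy[y_toy == 0], w_toy[y_toy == 0])
```

With these weights the three positives and the single negative contribute the same amount to the total, which is the behaviour we wanted to copy from the `balanced` setting.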

But that `pm.Potential` can also be used for other things!

Suppose we have our original trace that generates our posterior.

Now also suppose that we have a function that describes our potential.

Then these two can be combined. Our prior can span beyond a single parameter! Our belief of how the model should behave before seeing data can influence the entire posterior. So we’ve *literally* come up with a prior that has a huge potential for fairness.

```
X_colour, X_non_colour = split_groups(X, key="colour")
...
dist_colour = intercept + pm.math.dot(X_colour, weights)
dist_non_colour = intercept + pm.math.dot(X_non_colour, weights)
mu_diff = pm.Deterministic('mu_diff', dist_colour.mean() - dist_non_colour.mean())
pm.Potential('dist', pm.Normal.dist(0, 0.01).logp(mu_diff))
```

Note the `0.01` value on the bottom line. This value can be interpreted as strictness for fairness. The lower it is, the less wiggle room the sampler has to explore areas that are not fair.
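Writing out the log density of the normal distribution shows why. Here \(\Delta\) stands for `mu_diff` and \(\sigma\) for the strictness value; both symbols are our shorthand.

\[ \log \mathcal{N}(\Delta | 0, \sigma^2) = -\frac{\Delta^2}{2 \sigma^2} - \log\left(\sigma \sqrt{2 \pi}\right) \]

With \(\sigma = 0.01\), a difference of \(\Delta = 0.02\) costs \(0.02^2 / (2 \cdot 0.01^2) = 2\) units of log probability relative to \(\Delta = 0\), while \(\Delta = 0.1\) already costs 50. Halving \(\sigma\) makes every such penalty four times larger.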

The results can be seen below.

```
with pm.Model() as dem_par_model:
    intercept = pm.Normal('intercept', 0, 1)
    weights = pm.Normal('weights', 0, 1, shape=X.shape[1])
    p = pm.math.sigmoid(intercept + pm.math.dot(X, weights))
    dist_colour = intercept + pm.math.dot(X_colour, weights)
    dist_non_colour = intercept + pm.math.dot(X_non_colour, weights)
    mu_diff = pm.Deterministic('mu_diff', dist_colour.mean() - dist_non_colour.mean())
    # Fairness potential: pull the difference between the groups towards zero.
    pm.Potential('dist', pm.Normal.dist(0, 0.01).logp(mu_diff))
    # Class-balanced log likelihood.
    pm.Potential('balance', sample_weights.values * pm.Bernoulli.dist(p).logp(df['released'].values))
    dem_par_trace = pm.sample(tune=1000, draws=1000, chains=6)
```

There’s still an issue. We’ve gotten a flexible approach here. Compared to the scikit-learn pipeline we can have more flexible definitions of fairness and we can have more flexible models (hierarchical models, non-linear models) but at the moment our model does not **guarantee** fairness.

But then Matthijs came up with a neat little hack.

We use our potential to push samples in a direction. This push *must* be continuous if we want gradients in our sampler method to be of any use to us. But after this push is done, we would like to make a hard cutoff on our fairness. So why don’t we just filter out the sampled points in our trace that we don’t like?

This way, we still get a distribution over the model’s parameters out but this distribution is guaranteed to never assign any probability mass in regions where we deem the predictions to be ‘unfair’.

```
import pymc3 as pm
from scipy.special import expit


def hard_constraint_model(df):
    def predict(trace, df):
        # Average the linear predictor over all kept draws, then map it to a probability.
        X = df[['year', 'employed', 'citizen', 'checks']].values
        return expit((trace['intercept'][:, None] + trace['weights'] @ X.T).mean(axis=0))

    X = df[['year', 'employed', 'citizen', 'checks']].values
    X_colour, X_non_colour = X[df['colour'] == 1], X[df['colour'] == 0]
    # Inverse-frequency weights that play the role of class_weight='balanced'.
    class_weights = len(df) / df['released'].value_counts()
    sample_weights = df['released'].map(class_weights)

    with pm.Model() as dem_par_model:
        intercept = pm.Normal('intercept', 0, 1)
        weights = pm.Normal('weights', 0, 1, shape=X.shape[1])
        p = pm.math.sigmoid(intercept + pm.math.dot(X, weights))
        dist_colour = intercept + pm.math.dot(X_colour, weights)
        dist_non_colour = intercept + pm.math.dot(X_non_colour, weights)
        mu_diff = pm.Deterministic('mu_diff', dist_colour.mean() - dist_non_colour.mean())
        # Soft push towards fairness, followed by the class-balanced likelihood.
        pm.Potential('dist', pm.Normal.dist(0, 0.01).logp(mu_diff))
        pm.Potential('balance', sample_weights.values * pm.Bernoulli.dist(p).logp(df['released'].values))
        dem_par_trace = pm.sample(tune=1000, draws=1000, chains=6)

    # trace_filter is not defined in this post; it applies the hard cutoff to the trace.
    return trace_filter(dem_par_trace, 0.16), predict
```
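The `trace_filter` helper is not shown in this post. A minimal sketch of what such a filter could look like is given below. It assumes that the trace has been converted to a plain dict of NumPy arrays, and that the threshold applies to the absolute value of `mu_diff`. Both are our assumptions, and the toy trace is made up.

```python
import numpy as np


def trace_filter(trace, threshold, key="mu_diff"):
    """Keep only the draws whose |trace[key]| is below threshold (hypothetical helper)."""
    keep = np.abs(np.asarray(trace[key])) < threshold
    # Apply the same boolean mask to every variable so that draws stay aligned.
    return {name: np.asarray(values)[keep] for name, values in trace.items()}


# Toy trace (made up): four draws of three variables.
toy_trace = {
    "mu_diff": np.array([0.05, -0.20, 0.10, 0.30]),
    "intercept": np.array([0.1, 0.2, 0.3, 0.4]),
    "weights": np.arange(8.0).reshape(4, 2),
}
filtered = trace_filter(toy_trace, 0.16)
```

Because the filtered result still supports `trace['intercept']` and `trace['weights']` lookups, a `predict` function like the one above can use it unchanged. If the potential is too weak, very few draws may survive the cutoff, so it is worth checking how many remain.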

Note that in the results, the fairness metric has a hard cutoff.

The approach that we propose here is relatively generic. You can make hierarchical models and you have more flexibility in your definition of fairness. You start with constraints which you need to translate into a potential after which you can apply a strict filter.

We’ve just come up with an approach where our potential represents fairness. Since we filter the trace afterwards we have an algorithm with properties we like. We don’t want to suggest this approach is perfect though, so here are some valid points of critique:

- This approach **does not guarantee fairness**. We’ve merely guaranteed fairness on a single definition and only on our training set. This approach has merit to it but we should not presume that it will be enough. The dataset that we see in production might drift away from what we see in training and there may certainly be facets of fairness that may be hard to capture with a potential.
- This approach will fail in settings where the model has so many hyperparameters that we can’t create a potential with domain knowledge. Neural networks for image detection may still be out of scope for this trick.
- Sampler-based approaches don’t get stuck in local optima as often as gradient approaches but even NUTS can get stuck. We also can’t say for sure when the trace has enough samples to be reliable. The training procedure is also very expensive.

We’re pretty excited about this way of thinking about models. The reason why is best described with an analogy in fashion and is summarised in this photo;

We think scikit-learn is an amazing tool. It sparked the familiar `.fit()`/`.predict()` interface that the ecosystem has grown accustomed to and it introduced a wonderful concept via its `Pipeline`-API. But all this greatness comes at a cost; people seem to be getting lazy.

Every problem gets reduced to something that can be put into a `.fit()`/`.predict()` pipeline. The clearest examples of this can be found on the Kaggle platform. Kaggle competitions are won by reducing a problem to a single metric, optimising it religiously and not worrying about the application. They’re not won by understanding the problem, modelling towards it or by wondering how the algorithm might have side-effects that you don’t want.

It is exactly on this axis that this approach gives us hope. Instead of calling `model.fit()` you get to enact `tailor.model()` because you’re forced to think in constraints. This means that we **actually** get to model again! We can add common sense as a friggin’ prior! How amazing is that!

To add a cherry on top; in our example we’re using fairness as a driving argument but the reason to be excited goes beyond that.

- Maybe you are predicting a timeseries but you want to prevent *at all times* that the prediction out of the model ever yields a negative value.
- Maybe you are building a recommender for Netflix and you want to add a prior that punishes popular content such that the viewers don’t get stuck in a filter bubble and enjoy more diverse content.
- Maybe you are a researcher in healthcare and want to insure yourself against Simpson’s paradox. One way of guarding yourself against this is by adding a prior that states that *in all cases* smoking is bad for your health. Even when an outlier suggests otherwise.

The act of thinking about constraints immediately makes you seriously consider the problem before modelling and that … that’s got a lot of potential.

This document is written by both myself and Matthijs Brouns and it is posted on two blogs. The code used here can be found in this github repository.