# Theoretical Dependence

The simplest regression might look something like this:

$$ y_i \approx \text{intercept} + \text{slope} \cdot x_i $$

You might assume at this point that the slope and intercept are unrelated to each other. In econometrics you're even taught that this assumption is necessary. If you're in that camp, I have to warn you about what you're about to read. I'm about to show you why, by and large, this independence assumption is just plain weird.

## Posterior

Let's generate some fake data with a known intercept (1.5) and slope (4.5).

```
import numpy as np
import matplotlib.pylab as plt

# fake data: true intercept 1.5, true slope 4.5, unit gaussian noise
n = 1000
xs = np.random.uniform(0, 2, (n,))
ys = 1.5 + 4.5 * xs + np.random.normal(0, 1, (n,))
```

## PyMC3

Let's now throw this data into PyMC3 and sample from the posterior.

```
import pymc3 as pm

with pm.Model() as model:
    # independent priors for both parameters
    intercept = pm.Normal("intercept", 0, 1)
    slope = pm.Normal("slope", 0, 1)
    values = pm.Normal("y", intercept + slope * xs, 1, observed=ys)
    trace = pm.sample(2000, chains=1)

# plot the posterior samples against each other
plt.scatter(trace.get_values("intercept"), trace.get_values("slope"), alpha=0.2)
plt.title("pymc3 results");
```

That's interesting; the `intercept` and `slope` variables aren't independent at all. They are negatively related!
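
You can quantify this by computing the correlation between the posterior samples; a quick check, reusing the `trace` from above:

```
# correlation between the posterior samples of intercept and slope;
# a negative value confirms what the scatter plot shows
print(np.corrcoef(
    trace.get_values("intercept"),
    trace.get_values("slope"),
)[0, 1])
```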

## Scikit-Learn

Don't trust this result? Let's repeatedly fit a plain regression on random subsets of the data with scikit-learn and see what comes out of that.

```
from sklearn.linear_model import LinearRegression

size_subset = 500
n_samples = 2000
samples = np.zeros((n_samples, 2))
for i in range(n_samples):
    # fit a plain linear regression on a random subset of the data
    idx = np.random.choice(np.arange(n), size=size_subset, replace=False)
    X = xs[idx].reshape(-1, 1)
    Y = ys[idx]
    sk_model = LinearRegression().fit(X, Y)
    samples[i, 0] = sk_model.intercept_
    samples[i, 1] = sk_model.coef_[0]

plt.scatter(samples[:, 0], samples[:, 1], alpha=0.2)
plt.title("sklearn subsets result");
```

## Why?

So what is going on here? We generated the data with two fixed, unrelated parameters, and we gave the model independent priors. Why are these posteriors suggesting that there is a relationship between the intercept and slope?

There are two arguments to make this intuitive.

### Argument One: Geometry

Consider two regression lines that pass through a single point. As far as that little point is concerned, both lines are equal. They both have the same fit. We're able to exchange a little bit of the intercept for a little bit of the slope. Granted, this is for a single point, but for a collection of points you can make the same argument: to keep a good fit near the bulk of the data, increasing the slope forces the line to pivot, which pushes the intercept down (at least when the \(x\) values are positive, as they are here). This is why there *must* be a negative correlation.
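
You can even see the trade-off directly: for any fixed slope \(s\), the least-squares intercept is \(\overline{y} - s\overline{x}\), which decreases linearly in \(s\). A minimal sketch, reusing the `xs` and `ys` from above:

```
# for each fixed slope, the best least-squares intercept is mean(ys - slope * xs);
# plotting one against the other makes the trade-off explicit
slopes = np.linspace(3.5, 5.5, 50)
best_intercepts = [np.mean(ys - s * xs) for s in slopes]
plt.plot(slopes, best_intercepts)
plt.xlabel("fixed slope")
plt.ylabel("best intercept");
```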

### Argument Two: A bit Causal

Consider a causal graph where both \(x_0\) and \(x_1\) point into \(y\).

The \(x_0\) node and the \(x_1\) node are independent, that is, unless \(y\) is given. Once \(y_i\) is known, we're back to a single point and the geometry argument kicks in. But also, logically, we could explain the point \(y_i\) in many ways; a lack of \(x_0\) can be compensated by an increase of \(x_1\), vice versa, or something in between. This is encoded exactly in the graphical structure: conditioning on a common effect makes its causes dependent, the classic "explaining away" effect.
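
Here's a minimal simulation of this effect, with assumed standard-normal inputs:

```
# x0 and x1 are independent by construction...
x0 = np.random.normal(0, 1, 100_000)
x1 = np.random.normal(0, 1, 100_000)
y = x0 + x1

print(np.corrcoef(x0, x1)[0, 1])  # close to 0

# ...but conditional on (roughly) knowing y they become negatively correlated
mask = np.abs(y) < 0.1
print(np.corrcoef(x0[mask], x1[mask])[0, 1])  # strongly negative
```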

## Conclusion

It actually took me a long time to come to grips with this. Upfront, linear regression does look like the addition of independent features. But since all the terms need to sum up to the observed value, it is only logical that they are related.

Assuming properties of your model upfront is best done via a prior, not by an independence assumption.
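
If you do want to encode a relationship between the parameters upfront, a multivariate normal prior is one way to do it. A sketch of what that might look like, with a made-up prior covariance:

```
# a sketch: a multivariate normal prior that explicitly encodes a
# negative relation between intercept and slope (covariance is made up)
with pm.Model() as correlated_model:
    cov = np.array([[1.0, -0.5], [-0.5, 1.0]])
    coefs = pm.MvNormal("coefs", mu=np.zeros(2), cov=cov, shape=2)
    pm.Normal("y", coefs[0] + coefs[1] * xs, 1, observed=ys)
```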

## Appendix

The interesting thing about this phenomenon is that it is so pronounced in the simplest example. It is far less pronounced in large regressions with many features, like:

$$ y_i \approx \text{intercept} + \beta_1 x_{1i} + \ldots + \beta_f x_{fi} $$

Here are some plots of the intercept value vs. the first estimated coefficient, \(\beta_1\), given \(f\) features:

You can clearly see the covariance decrease as \(f\) increases. The code for this can be found here.
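
If you want to reproduce the effect yourself, here's a minimal sketch of the experiment under assumed settings (uniform features with true coefficients of 1, and the same subset-resampling trick as before):

```
from sklearn.linear_model import LinearRegression

def intercept_beta1_corr(f, n=1000, n_samples=500, size_subset=500):
    # f independent features, all with true coefficient 1 and intercept 1.5
    X_full = np.random.uniform(0, 2, (n, f))
    y_full = 1.5 + X_full.sum(axis=1) + np.random.normal(0, 1, n)
    estimates = np.zeros((n_samples, 2))
    for i in range(n_samples):
        idx = np.random.choice(n, size=size_subset, replace=False)
        mod = LinearRegression().fit(X_full[idx], y_full[idx])
        estimates[i] = mod.intercept_, mod.coef_[0]
    # correlation between the estimated intercept and beta_1
    return np.corrcoef(estimates[:, 0], estimates[:, 1])[0, 1]

for f in [1, 2, 5, 10]:
    print(f, intercept_beta1_corr(f))
```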