Theoretical Dependence
The simplest regression might look something like this:

$$ y_i \approx \text{intercept} + \text{slope} \cdot x_i $$
You might assume that the slope and intercept are unrelated to each other. In econometrics you're even taught that this assumption is necessary. If that's you, consider this a warning: I'm about to show you why, by and large, this independence assumption is just plain weird.
Posterior
Let's generate some fake data that we can then toss into PyMC3.
import numpy as np
import matplotlib.pylab as plt

# fake data with a true intercept of 1.5 and a true slope of 4.5
n = 1000
xs = np.random.uniform(0, 2, (n,))
ys = 1.5 + 4.5 * xs + np.random.normal(0, 1, (n,))
PyMC3
Let's now throw this data into PyMC3.
import pymc3 as pm
with pm.Model() as model:
    intercept = pm.Normal("intercept", 0, 1)
    slope = pm.Normal("slope", 0, 1)
    values = pm.Normal("y", intercept + slope * xs, 1, observed=ys)
    trace = pm.sample(2000, chains=1)
plt.scatter(trace.get_values('intercept'), trace.get_values('slope'), alpha=0.2)
plt.title("pymc3 results");
That's interesting; the intercept and slope variables aren't independent at all. They are negatively correlated!
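As a quick check (continuing with the trace and imports from above), we can compute the posterior correlation directly:

corr = np.corrcoef(trace.get_values("intercept"), trace.get_values("slope"))[0, 1]
print(corr)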
Scikit-Learn
Don't trust this result? Let's take random subsets of the data, fit each one with scikit-learn, and see what comes out of that.
from sklearn.linear_model import LinearRegression

size_subset = 500
n_samples = 2000
samples = np.zeros((n_samples, 2))
for i in range(n_samples):
    # fit a plain linear regression on a random subset of the data
    idx = np.random.choice(np.arange(n), size=size_subset, replace=False)
    X = xs[idx].reshape(-1, 1)
    Y = ys[idx]
    sk_model = LinearRegression().fit(X, Y)
    samples[i, 0] = sk_model.intercept_
    samples[i, 1] = sk_model.coef_[0]
plt.scatter(samples[:, 0], samples[:, 1], alpha=0.2)
plt.title("sklearn subsets result");
Why?
So what is going on here? We generated the data with two independent parameters. Why are these posteriors suggesting that there is a relationship between the intercept and slope?
There are two arguments to make this intuitive.
Argument One: Geometry
Consider two regression lines that pass through a single point. As far as that little point is concerned, both lines are equally good; they have the same fit. We're able to exchange a little bit of the intercept for a little bit of the slope. Granted, this is for a single point, but for a collection of points you can make the same argument: you can trade intercept for slope. This is why there must be a negative correlation.
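To make the trade-off concrete, here's a minimal sketch (the point and the slopes are arbitrary, chosen just for illustration): every line through a fixed point \((x_0, y_0)\) must satisfy \(\text{intercept} = y_0 - \text{slope} \cdot x_0\), so for a positive \(x_0\), increasing the slope forces the intercept down.

import numpy as np
import matplotlib.pylab as plt

# every line through the fixed point (x0, y0) satisfies intercept = y0 - slope * x0
x0, y0 = 1.0, 6.0
grid = np.linspace(-2, 4, 100)
for slope in [3.5, 4.5, 5.5]:
    intercept = y0 - slope * x0  # a bigger slope forces a smaller intercept
    plt.plot(grid, intercept + slope * grid, label=f"slope={slope}, intercept={intercept}")
plt.scatter([x0], [y0], color="black", zorder=3)
plt.legend();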
Argument Two: A Bit Causal
Consider a causal graph where both \(x_0\) and \(x_1\) point into \(y\).
The \(x_0\) node and the \(x_1\) node are independent, that is, unless \(y\) is given. That is because, once \(y_i\) is known, we're back to a single point and the geometry argument kicks in. But also because, logically, we could explain the point \(y_i\) in many ways; a lack of \(x_0\) can be explained by an increase of \(x_1\), vice versa, or something in between. This is encoded exactly in the graphical structure.
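Here's a minimal simulation of that explaining-away effect (the data-generating process and the conditioning window are my own illustrative assumptions): \(x_0\) and \(x_1\) are drawn independently, yet once we condition on \(y\) landing in a narrow band, they become negatively correlated.

import numpy as np

np.random.seed(42)
m = 100_000
x0 = np.random.normal(0, 1, m)
x1 = np.random.normal(0, 1, m)
y = x0 + x1 + np.random.normal(0, 0.1, m)

# unconditionally, x0 and x1 are independent
print(np.corrcoef(x0, x1)[0, 1])

# condition on y by keeping only samples where y falls in a narrow band
mask = np.abs(y - 1.0) < 0.05
print(np.corrcoef(x0[mask], x1[mask])[0, 1])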
Conclusion
It actually took me a long time to come to grips with this. At first glance, linear regression does look like the addition of independent features. But since they all need to sum up to the same number, it is only logical that they are related.
Assuming properties of your model upfront is best done via a prior, not by an independence assumption.
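For instance, if you do want to encode a belief about how the intercept and slope relate, a joint prior can do that explicitly. Here is a minimal, hypothetical sketch (the covariance values are made up for illustration, and it reuses the xs and ys generated earlier) using a multivariate normal prior in PyMC3:

import numpy as np
import pymc3 as pm

with pm.Model() as correlated_model:
    # a joint prior that explicitly encodes a (made-up) negative correlation
    cov = np.array([[1.0, -0.5],
                    [-0.5, 1.0]])
    coefs = pm.MvNormal("coefs", mu=np.zeros(2), cov=cov, shape=2)
    intercept, slope = coefs[0], coefs[1]
    pm.Normal("y", intercept + slope * xs, 1, observed=ys)
    trace_corr = pm.sample(2000, chains=1)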
Appendix
The interesting thing about this phenomenon is that it is so pronounced in the simplest example. It is far less pronounced in large regressions with many features, like:

$$ y_i \approx \text{intercept} + \beta_1 x_{1i} + ... + \beta_f x_{fi} $$

Here are some plots of the intercept value vs. the first estimated coefficient, \(\beta_1\), given \(f\) features:
You can clearly see the covariance decrease as \(f\) increases. The code for this can be found here.
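In case that link is unavailable, here's a rough sketch of how such an experiment could be set up (the feature counts, priors, and sample sizes here are my own assumptions, not necessarily what the original code used):

import numpy as np
import pymc3 as pm
import matplotlib.pylab as plt

n = 1000
for f in [1, 2, 5]:
    # independent features and independently drawn true coefficients
    X = np.random.uniform(0, 2, (n, f))
    true_betas = np.random.uniform(1, 5, f)
    y = 1.5 + X @ true_betas + np.random.normal(0, 1, n)

    with pm.Model():
        intercept = pm.Normal("intercept", 0, 1)
        betas = pm.Normal("betas", 0, 1, shape=f)
        pm.Normal("y", intercept + pm.math.dot(X, betas), 1, observed=y)
        trace = pm.sample(1000, chains=1)

    # intercept vs. the first coefficient, for this value of f
    plt.scatter(trace.get_values("intercept"), trace.get_values("betas")[:, 0],
                alpha=0.2, label=f"f={f}")
plt.legend()
plt.title("intercept vs. beta_1 for different f");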