The Maths Behind Logistic Regression

I got it wrong 🙁

What are Your Odds?

Let’s have a look at this god-tier math puzzle. Only one person out of seven gets it right; the other six don’t. So, what are the odds of solving it correctly? 1 to 6. Generally, if \(p\) is the probability that someone will get it right, then their odds are \(p/(1-p)\).

However, it isn’t necessarily true that \(p=1/7\) for every person because some people are smarter, some have better education and so on. Hence, \(p\) also depends on the person attempting the puzzle. In a Bayesian framework, we capture this dependence with conditional probability.

Let \(Y\in \{1, 0\}\) be a random variable that corresponds to getting this puzzle right (or wrong). Also assume \(X\) is a random variable that corresponds to a person. If a person \(x\) solves the puzzle correctly, \(Y\) would be \(1\) for them, and \(0\) otherwise. We want to know the probability that \(Y=1\) given that the person attempting the puzzle is \(x\), i.e. \(X=x\). In other words, we want to estimate \(P(Y=1|X=x)\).

Note that when I say \(x\), I mean various relevant information about a person, such as IQ, age, education level etc. in a vectorized form. For instance, \(x=[110, 26, 100]\) might mean the person has IQ 110, is 26 years old and has completed the highest level of education. It is a very simplistic description of the feature space, but you won’t need to know more than that for now. Let’s get back to the conditional probability.

There are many ways to estimate \(P(Y=1|X=x)\). Logistic regression does it by estimating the log of the odds first and then calculating the probability from there.

The Logit Function

The log odds of a probability is called the logit function \(logit(p)\). Here is how this makes sense. In the above puzzle, if the odds are \(p/(1-p)\), then the log of the odds is \(\log (p/(1-p))=logit(p)\). If you knew just this information, you could still calculate \(p\) by the following formula
\[logit(p)=\log\left(\frac{p}{1-p}\right)\implies e^{logit(p)}=\frac{p}{1-p}\implies p=\frac{e^{logit(p)}}{1+e^{logit(p)}}\] The last function is the famous sigmoid function. So, if you can estimate \(logit(p)\), taking sigmoid of that quantity will get back \(p\).
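As a quick sanity check, here is a minimal Python sketch of the logit–sigmoid round trip, using the puzzle’s \(p=1/7\):

```python
import math

def logit(p):
    """Log odds of a probability p."""
    return math.log(p / (1 - p))

def sigmoid(z):
    """Inverse of logit: maps log odds back to a probability."""
    return 1 / (1 + math.exp(-z))

p = 1 / 7          # probability of solving the puzzle
z = logit(p)       # the log odds, i.e. log(1/6)
print(sigmoid(z))  # taking sigmoid of the log odds recovers p
```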

Why does the logistic regression estimate log odds instead of odds? I’ll answer it in the end.

Estimating the Log Odds

Let’s do something very simple: assume that the log odds is a linear function of a person’s IQ, age and education level. So, our model can be written as
\[logit(p)\approx w_0+w_1x_1+w_2x_2+w_3x_3=w^Tx\] where \(w=[w_0,\dots,w_3]\) is a vector of coefficients and \(x=[1, x_1,\dots, x_3]\) holds the person’s traits. We need the 1 at the beginning to multiply with the offset \(w_0\). The rightmost formula is the matrix form of the same equation. As we said before, the sigmoid of \(logit(p)\) is \(p\), so,
\[P(Y=1|X=x)=\sigma(w^Tx), P(Y=0|X=x)=1-\sigma(w^Tx)\] There is a very nice trick to combine these two equations \[P(Y=y|X=x)=\sigma(w^Tx)^y(1-\sigma(w^Tx))^{1-y}\] You should verify that this last equation is equivalent to the previous two equations for \(y=0,1\).
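You can also check the trick numerically. In this small sketch, \(s\) stands in for a hypothetical value of \(\sigma(w^Tx)\):

```python
def bernoulli_likelihood(y, s):
    """Combined form: s**y * (1 - s)**(1 - y)."""
    return s ** y * (1 - s) ** (1 - y)

s = 0.8  # a hypothetical value of sigma(w^T x)
assert bernoulli_likelihood(1, s) == s        # y = 1 case collapses to s
assert bernoulli_likelihood(0, s) == 1 - s    # y = 0 case collapses to 1 - s
```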

Now, our model has an unknown parameter \(w\). How can we find it? We obviously need data! For this part, we need to have \(n\) people take this challenge and record the outcomes. Let’s say that \(\mathfrak{X}=[x_1,x_2,\dots,x_n]\) is the set of people and \(\mathfrak{Y}=[y_1,\dots,y_n]\) are their respective outcomes. Note that here the \(x_i\)s are vectors, each of which contains 1 offset field and 3 personal traits. We’ll try to learn \(w\) from the data. This is why \(\mathfrak{X}\) and \(\mathfrak{Y}\) are called training data.
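To make the data layout concrete, here is a sketch with made-up numbers (the traits and outcomes are purely illustrative):

```python
import numpy as np

# Hypothetical traits for n = 4 people: [IQ, age, education level]
traits = np.array([
    [110, 26, 3],
    [ 95, 40, 2],
    [130, 19, 1],
    [105, 33, 3],
], dtype=float)

# Prepend a constant 1 to each row so that w_0 acts as the offset
X = np.hstack([np.ones((len(traits), 1)), traits])
y = np.array([1, 0, 1, 0])  # hypothetical outcomes

print(X.shape)  # each x_i has 1 offset field + 3 traits
```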

Maximum Likelihood Estimation

Our strategy will be to find \(w\) such that our estimate matches each outcome \(y_i\) as closely as possible. For instance, if person 3 passed the challenge, then \(y_3=1\). So, the \(w\) we select should make \(P(Y=1|X=x_3)\) as close to 1 as possible. This is called maximum likelihood estimation (MLE). In simple English, this means “tune” (please forgive me, statisticians!) your parameters such that the probability of observing the current dataset is maximized. Is that the only way to look for parameters? No, there are many more such strategies. In fact, there is a very similar strategy called maximum a posteriori estimation. MLE is widely used because of its nice theoretical properties.

So, we want \begin{align*}\max_{w}P(Y=\mathfrak Y | X=\mathfrak X, w)&=\max_{w}\prod_{i=1}^nP(Y=y_i|X=x_i, w)\\
&=\max_{w}\prod_{i=1}^n\sigma(w^Tx_i)^{y_i}(1-\sigma(w^Tx_i))^{1-y_i}
\end{align*} Note that by factoring the likelihood over the datapoints and optimizing for \(w\), we are making two very crucial assumptions, that is:

  1. Independence: The performance of each person does not affect the performance of any other person. If this were a group project, it would not be true.
  2. Identical Distribution: People’s traits come from identical distributions, so the same \(w\) will fit all of them. If the data came from two very different groups, such as kindergarten kids and college students, this might not be true.

We can take the log of the right hand side to convert all the multiplications into summations. This won’t affect the optimal choice of \(w\), since the same \(w\) would also maximize the log of the right hand side. Thus our objective is \[\max_w\sum_{i=1}^ny_i\log(\sigma(w^Tx_i))+(1-y_i)\log(1-\sigma(w^Tx_i))\]

The Loss Function

In machine learning, the standard way of writing an optimization problem is as a minimization problem. The objective is called the loss function. So, by flipping the sign of the objective, our problem becomes \[\min_wL(w),\quad L(w)=\sum_{i=1}^n-y_i\log(\sigma(w^Tx_i))-(1-y_i)\log(1-\sigma(w^Tx_i))\] This is the so-called negative log loss. The gradient of this function is \[\nabla_wL = -\sum_{i=1}^nx_i(y_i-\sigma(w^Tx_i))\] Now all that is left to do is gradient descent with an appropriate step size.
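Putting the pieces together, here is a minimal NumPy sketch of the loss, its gradient, and plain gradient descent. The tiny dataset, learning rate and step count are made up for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def loss(w, X, y):
    """Negative log loss L(w)."""
    s = sigmoid(X @ w)
    return -np.sum(y * np.log(s) + (1 - y) * np.log(1 - s))

def gradient(w, X, y):
    """Gradient of L: -sum_i x_i * (y_i - sigma(w^T x_i))."""
    return -X.T @ (y - sigmoid(X @ w))

def fit(X, y, lr=0.1, steps=2000):
    """Plain gradient descent with a fixed step size."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        w -= lr * gradient(w, X, y)
    return w

# Toy dataset: offset column plus one trait; the classes overlap,
# so the optimal w is finite
X = np.array([[1., 0.], [1., 1.], [1., 2.], [1., 3.], [1., 4.], [1., 5.]])
y = np.array([0., 0., 1., 0., 1., 1.])
w = fit(X, y)
print(loss(w, X, y) < loss(np.zeros(2), X, y))  # loss decreased -> True
```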

One lovely thing about this loss function is that it is convex, thus any local minimum is also a global minimum. Consequently, gradient descent (with a suitable step size) is guaranteed to find the optimal solution.

Exercise: Show that \(L(w)\) is convex. (Hint: A (twice-differentiable) convex function of an affine function is a convex function.)

Why Estimate Log Odds?

If you estimate the odds \(o(p)\) directly, then \(p=o(p)/(1+o(p))\). The graph of this function is asymmetric, unlike the sigmoid. You could still follow all the steps of MLE, but the resulting loss function would be non-convex. Thus, the gradient descent approach might get stuck in a local minimum.
