Deep Dive into Logistic Regression in Python

DEVI GUSKRA
6 min read · Mar 28, 2021

Logistic Regression is a supervised classification algorithm used to predict categories. In logistic regression the output of the estimator is discrete, or categorical. The idea of logistic regression starts from the intuition of linear regression. For example, as in Figure-1, suppose there are two classes, class 0 and class 1, and we need to estimate the class using an independent variable X. We can start with the idea of simple linear regression, where we fit a line between the samples' independent variable and dependent variable (the class). With the simple linear regression (solid yellow line) we can estimate a probability: the lower the probability, the more likely the sample is class 0, and the higher the probability, the more likely it is class 1.

Now the question is what threshold probability we need to set to assign the class of a sample. By default we can consider 0.5 as the threshold probability and estimate the class accordingly. But there is a problem with the default threshold of 0.5: suppose there are extreme values (not outliers) of X; by common sense we can say that these also belong to class 1.

Figure-1: Linear Regression applied to a Classification Problem

Now, when you build a linear regression with the extreme values included, the fitted line looks like the yellow dotted line, and the default threshold of 0.5 is no longer appropriate for estimating the class. This happens because linear regression is fit by least squares, and least squares is not appropriate for this problem. To tackle this problem, logistic regression uses Maximum Likelihood Estimation (MLE).

In MLE, the goal is to maximize the likelihood.

  • In most Data Science optimizations, the goal is to find minima using calculus (minimize sum of squared errors in linear regression, and so on) or numerical techniques like Gradient Descent (minimize deviance in logistic regression, and so on)
  • Maximum Likelihood => Minimum of Negative Log-Likelihood.
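The equivalence in the bullet above can be checked numerically. A minimal sketch with made-up Bernoulli data (7 successes out of 10): the probability that maximizes the likelihood is the same one that minimizes the negative log-likelihood.

```python
import numpy as np

# Illustrative Bernoulli sample: 7 successes, 3 failures
y = np.array([1] * 7 + [0] * 3)
p_grid = np.linspace(0.01, 0.99, 99)   # candidate values of p

# Likelihood and negative log-likelihood over the grid
likelihood = np.array([np.prod(p ** y * (1 - p) ** (1 - y)) for p in p_grid])
neg_log_lik = np.array([-np.sum(y * np.log(p) + (1 - y) * np.log(1 - p)) for p in p_grid])

# Both criteria pick out the same p (the sample proportion, ~0.7)
print(p_grid[np.argmax(likelihood)])
print(p_grid[np.argmin(neg_log_lik)])
```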

Example Problem:

An auto club mails a flier to its members offering to send more information regarding a supplemental health insurance plan if the member returns a brief enclosed form. Can a model be built to predict if a member will return the form or not?

URL: GitHub (for code)

Figure-2 shows the scatter plot between the independent variable (Age) and the dependent variable (yes or no). Linear regression is not appropriate here because we might get probabilities greater than 1 or less than 0. In order to limit the values to the 0 to 1 range, we transform the output of the linear regression using the sigmoid function. From the sigmoid function we can then calculate probabilities, and from this output we find the best threshold probability using MLE.
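The squashing behaviour described above can be sketched in a few lines of Python (the function name `sigmoid` is just a local helper):

```python
import numpy as np

def sigmoid(z):
    """Sigmoid (logistic) function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Raw linear-regression outputs can be any real number...
z = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
p = sigmoid(z)
# ...but the sigmoid output always lies strictly between 0 and 1
print(p)   # sigmoid(0) = 0.5
```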

So, the equation of the transformed linear regression is

p = 1 / (1 + e^(-z)) ----- (1)

Where,

z = β0 + β1x1 + β2x2 + … + βkxk (also known as the systematic or structural component, or the linear predictor).

This is a logistic model. The sigmoid function is also known as the inverse link function, which links the response with the systematic component. p is the probability that a club member falls into group 1 (returns the form; success; P(Y=1|X)).

If you expand equation (1):

p = e^z / (1 + e^z) ----- (2)

The odds ratio is the probability of an event occurring divided by the probability that it will not occur. The logistic model can be transformed into an odds ratio:

odds = p / (1 - p) ----- (3)

If we substitute equation (2) into equation (3) and solve, you get

p / (1 - p) = e^z = e^(β0 + β1x1 + β2x2 + … + βkxk) ----- (4)

The log of the odds ratio is called the logit, and the transformed model

ln(p / (1 - p)) = β0 + β1x1 + β2x2 + … + βkxk

is linear in the βs. That is the reason this algorithm is called logistic regression.
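The derivation above can be verified numerically: taking the logit of the sigmoid output recovers the linear predictor exactly. The coefficients below are made up for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical coefficients, one independent variable
beta0, beta1 = -2.0, 0.5
x = np.linspace(-5, 5, 11)

p = sigmoid(beta0 + beta1 * x)   # equation (1)
logit = np.log(p / (1 - p))      # log of the odds ratio

# The logit is linear in the betas
print(np.allclose(logit, beta0 + beta1 * x))   # True
```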

Logistic Regression in Python

As mentioned above, logistic regression has two steps.

Step-1: Fit the transformed linear regression and compute the probability of each data point.

Step-2: Find the best threshold probability (odds ratio) using MLE.

Now let's build logistic regression in Python. You can download the dataset from our GitHub link. Let's start by importing the necessary libraries:
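A sketch of the setup; the `mtcars.csv` filename is an assumption (the actual file is under the GitHub link), so a small illustrative subset of the data is inlined here to keep the snippet self-contained:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression

# data = pd.read_csv("mtcars.csv")   # full 32-row dataset from the GitHub link

# Illustrative subset: wt = weight (1000 lbs), qsec = quarter-mile time (s),
# am = transmission type (0 = automatic, 1 = manual)
data = pd.DataFrame({
    "wt":   [2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.570, 3.190, 5.250, 1.615],
    "qsec": [16.46, 17.02, 18.61, 19.44, 17.02, 20.22, 14.60, 20.00, 17.98, 18.52],
    "am":   [1, 1, 1, 0, 0, 0, 1, 0, 0, 1],
})
print(data.head())
```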

This data is about automobiles: different types of cars and their specifications. For the sake of simplicity, let us consider only wt, qsec, and am. In this example we take two independent features (wt, qsec) and predict the dependent variable am, the transmission type (0 = automatic, 1 = manual).

Figure-3: Scatter Plot between wt and qsec

The figure above shows the scatter plot between wt and qsec, colored by the label or output. The data looks linearly separable; let's see how well logistic regression can separate the classes. First, split the data into independent and dependent variables.
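The split might look like the following (shown here on the inline subset; with the full 32-row dataset the shapes come out as (32, 2) and (32,)):

```python
import pandas as pd

# Illustrative subset of mtcars (full CSV is on the GitHub link)
data = pd.DataFrame({
    "wt":   [2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.570, 3.190, 5.250, 1.615],
    "qsec": [16.46, 17.02, 18.61, 19.44, 17.02, 20.22, 14.60, 20.00, 17.98, 18.52],
    "am":   [1, 1, 1, 0, 0, 0, 1, 0, 0, 1],
})

X = data[["wt", "qsec"]].values   # independent variables
y = data["am"].values             # dependent variable (class labels)
print(X.shape, y.shape)
```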

Out: ((32, 2), (32,))

Now import logistic regression model from sklearn and train the model from the above data.
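Training then reduces to a couple of lines; the solver and other hyperparameters are left at scikit-learn defaults, and the inline subset again stands in for the full dataset:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Illustrative subset of mtcars (full CSV is on the GitHub link)
data = pd.DataFrame({
    "wt":   [2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.570, 3.190, 5.250, 1.615],
    "qsec": [16.46, 17.02, 18.61, 19.44, 17.02, 20.22, 14.60, 20.00, 17.98, 18.52],
    "am":   [1, 1, 1, 0, 0, 0, 1, 0, 0, 1],
})
X = data[["wt", "qsec"]].values
y = data["am"].values

model = LogisticRegression()   # default L2-regularized solver
model.fit(X, y)
print(model.score(X, y))       # training accuracy
```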

Logistic regression has trained successfully. What we did is simply build the transformed linear regression (linear regression passed through the sigmoid function) and get a probability for each data point. You can get the probabilities using the following syntax in Python, and then visualize the probability of each data point using matplotlib.
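A sketch of that step: `predict_proba` returns one column per class, and column 1 is P(am = 1). The inline subset stands in for the full dataset, so the printed numbers will differ from the output shown below.

```python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression

# Illustrative subset of mtcars (full CSV is on the GitHub link)
data = pd.DataFrame({
    "wt":   [2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.570, 3.190, 5.250, 1.615],
    "qsec": [16.46, 17.02, 18.61, 19.44, 17.02, 20.22, 14.60, 20.00, 17.98, 18.52],
    "am":   [1, 1, 1, 0, 0, 0, 1, 0, 0, 1],
})
X = data[["wt", "qsec"]].values
y = data["am"].values
model = LogisticRegression().fit(X, y)

proba = model.predict_proba(X)[:, 1]   # P(class = 1) for each car
print(proba)

plt.scatter(range(len(proba)), proba, c=y, cmap="coolwarm")
plt.xlabel("sample index")
plt.ylabel("P(class = 1)")
plt.show()
```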

Out: [0.8138488 0.62844367 0.70587441 0.14634078 0.30573648 0.05604451 0.3992924 0.1144186 0.02357085 0.16761907 0.1224557 0.07223128 0.13407676 0.0971771 0.00327584 0.00238895 0.00367752 0.65373927 0.93147451 0.77590941 0.41927653 0.28516772 0.27301396 0.3124652 0.14147834 0.8341205 0.92213607 0.97901646 0.79631523 0.84616163 0.58652099 0.44669103]

The scatter plot above displays the probability of each data point from logistic regression. Now we need to estimate the best threshold probability, above which a point is assigned class 1 and otherwise class 0. We can find this using Maximum Likelihood Estimation.

Maximum Likelihood Estimate (MLE)

Now let's see how to find the best odds ratio using the maximum likelihood estimate. MLE here simply means calculating the log loss (categorical cross-entropy) for different threshold probabilities and finding the probability where the log loss is minimum.

In the plot above, the x-axis is the threshold probability and the y-axis is the log loss, and you can see that the best threshold probability is around p = 0.42. We can find the index position of the best probability, the probability itself, and the corresponding log loss with the following code:
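One way to sketch the whole search: score the hard 0/1 predictions at each candidate threshold with a clipped log loss and take the argmin. The grid spacing and the clipping constant `eps` are choices of this sketch, not fixed by the article, and the inline subset stands in for the full dataset (so the exact minimum will differ from p = 0.42).

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Illustrative subset of mtcars (full CSV is on the GitHub link)
data = pd.DataFrame({
    "wt":   [2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.570, 3.190, 5.250, 1.615],
    "qsec": [16.46, 17.02, 18.61, 19.44, 17.02, 20.22, 14.60, 20.00, 17.98, 18.52],
    "am":   [1, 1, 1, 0, 0, 0, 1, 0, 0, 1],
})
X = data[["wt", "qsec"]].values
y = data["am"].values
model = LogisticRegression().fit(X, y)
proba = model.predict_proba(X)[:, 1]

thresholds = np.linspace(0.01, 0.99, 99)   # candidate threshold probabilities
eps = 1e-6                                 # keep log() finite for hard 0/1 predictions
losses = []
for t in thresholds:
    hard = np.clip((proba >= t).astype(float), eps, 1 - eps)
    losses.append(-np.mean(y * np.log(hard) + (1 - y) * np.log(1 - hard)))
losses = np.array(losses)

best_idx = np.argmin(losses)
best_p = thresholds[best_idx]
print(best_idx, best_p, losses[best_idx])
```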

Now we have successfully predicted the values using logistic regression.

Model Equation

Let’s now calculate the equation of the logistic regression. As we know, the equation of logistic regression is:

ln(p / (1 - p)) = β0 + β1x1 + β2x2

for two independent variables and one dependent variable. Let's now find the equation of the model we fitted.
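A sketch of the lookup; `best_p = 0.42` is taken from the MLE search above, and the first printed value is the log-odds of that threshold, followed by the fitted coefficients and intercept. With the inline subset the coefficient values will differ from the output shown below.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Illustrative subset of mtcars (full CSV is on the GitHub link)
data = pd.DataFrame({
    "wt":   [2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.570, 3.190, 5.250, 1.615],
    "qsec": [16.46, 17.02, 18.61, 19.44, 17.02, 20.22, 14.60, 20.00, 17.98, 18.52],
    "am":   [1, 1, 1, 0, 0, 0, 1, 0, 0, 1],
})
X = data[["wt", "qsec"]].values
y = data["am"].values
model = LogisticRegression().fit(X, y)

best_p = 0.42                                # threshold from the MLE search
logit_best = np.log(best_p / (1 - best_p))   # log-odds of the threshold
print(logit_best, model.coef_, model.intercept_)
```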

Out: (-0.3141153295199363, array([[-2.38172384, -0.61130325]]), array([17.77738323]))

which means,

-0.314 = 17.77 - 2.3817 x1 - 0.611 x2

x2 = (-0.314 - 17.77 + 2.3817 x1) / (-0.611)

Now, using the above equation, let's visualize the logistic regression decision line with the best threshold value found from MLE.
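The decision line can be drawn by solving the boundary equation for x2, exactly as above; the inline subset and `best_p = 0.42` are assumptions of this sketch:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression

# Illustrative subset of mtcars (full CSV is on the GitHub link)
data = pd.DataFrame({
    "wt":   [2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.570, 3.190, 5.250, 1.615],
    "qsec": [16.46, 17.02, 18.61, 19.44, 17.02, 20.22, 14.60, 20.00, 17.98, 18.52],
    "am":   [1, 1, 1, 0, 0, 0, 1, 0, 0, 1],
})
X = data[["wt", "qsec"]].values
y = data["am"].values
model = LogisticRegression().fit(X, y)

best_p = 0.42
logit_best = np.log(best_p / (1 - best_p))
(w1, w2), b0 = model.coef_[0], model.intercept_[0]

# Boundary: b0 + w1*x1 + w2*x2 = logit_best  ->  solve for x2
x1 = np.linspace(data["wt"].min(), data["wt"].max(), 100)
x2 = (logit_best - b0 - w1 * x1) / w2

plt.scatter(data["wt"], data["qsec"], c=y, cmap="coolwarm")
plt.plot(x1, x2, "y--", label="decision line")
plt.xlabel("wt")
plt.ylabel("qsec")
plt.legend()
plt.show()
```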

This is how we can build logistic regression and use maximum likelihood estimation to find the best threshold probability in Python.

To download the code please click our GitHub link.


DEVI GUSKRA

A professional machine learning engineer who loves to apply mathematics concepts to life.