LOGISTIC REGRESSION
This article delves into the principles of logistic regression, including its types and implementations.
What is Regression?
Regression predicts a continuous output variable from one or more independent input variables.
Difference Between Linear Regression & Logistic Regression
Linear regression is a supervised machine learning approach that models the linear relationship between a dependent variable and one or more independent features by fitting a linear equation to the observed data. The target variable here is continuous.
- Simple Linear Regression occurs when there is just one independent feature.
- Multiple Linear Regression occurs when there are multiple features.
- Univariate Linear Regression is when there is only one dependent variable.
- Multivariate Regression is when there are multiple dependent variables.
Logistic regression is a supervised machine learning technique used in classification problems to predict whether an instance belongs to a given class or not. It is a statistical procedure that models the relationship between two data variables. The data here is categorical or binary, meaning the output can only be represented as 0 or 1, yes or no, true or false, etc.
Example of logistic regression: email spam detection, i.e. whether an e-mail is spam or not.
In logistic regression, the concept of independent and dependent variables is similar to that of linear regression.
- Dependent variables are the target variables that need to be predicted. The dependent variable can only have one outcome or value.
- Independent variables help predict the dependent variables. An independent variable can take more than one value.
Example of Logistic Regression
Let’s take an example of logistic regression. Say a student studies a certain number of hours a day, and we have to predict whether they will pass or fail their examination, where 0 is the probability of failure and 1 is the probability of success (a pass).
This example shows that students who studied fewer hours are more likely to fail, while students putting in more than 7 hours of study are more likely to pass.
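The pass/fail example above can be sketched numerically. The hours and labels below are made-up illustration data, and the model is fitted with plain gradient descent rather than any particular library:

```python
import math

# Hypothetical data: hours studied per day and pass (1) / fail (0) outcomes.
hours = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
passed = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Fit intercept b0 and coefficient b1 by gradient descent on the log loss.
b0, b1 = 0.0, 0.0
lr = 0.05
for _ in range(5000):
    grad0 = grad1 = 0.0
    for x, y in zip(hours, passed):
        error = sigmoid(b0 + b1 * x) - y   # predicted probability minus label
        grad0 += error
        grad1 += error * x
    b0 -= lr * grad0 / len(hours)
    b1 -= lr * grad1 / len(hours)

# Few study hours -> low pass probability; many hours -> high probability.
print(sigmoid(b0 + b1 * 2))   # well below 0.5
print(sigmoid(b0 + b1 * 9))   # well above 0.5
```

After training, the fitted coefficient b1 is positive, matching the intuition that more study hours raise the odds of passing.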
Sigmoid function
The sigmoid function is a mathematical function that maps predicted values to probabilities. It converts any real value into a value between 0 and 1; the output can never fall outside this range, producing a curve shaped like an "S". This S-shaped curve is also known as the sigmoid or logistic function.
In logistic regression the prediction is given by the formula:

y = 1 / (1 + e^-(β0 + β1x))

where:
y → predicted probability
β0 → intercept (where y falls when x = 0)
β1 → coefficient (how much, and in which direction, a change in x affects y)
x → independent variable
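A minimal sketch of the sigmoid itself, showing that it squashes any real input into the open interval (0, 1):

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

print(sigmoid(0))     # 0.5 — the midpoint of the S-curve
print(sigmoid(-6))    # close to 0
print(sigmoid(6))     # close to 1
```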
Threshold Value
The threshold value determines whether a predicted probability is classed as 0 or 1. Any probability above the threshold (commonly 0.5) is mapped to 1, and any probability below it is mapped to 0.
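The thresholding step can be sketched as a one-line rule; the 0.5 cut-off below is the common default, not a fixed requirement:

```python
def classify(probability, threshold=0.5):
    """Map a predicted probability to class 1 or class 0."""
    return 1 if probability >= threshold else 0

print(classify(0.92))  # 1 — above the threshold
print(classify(0.13))  # 0 — below the threshold
```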
Types of Logistic Regression
- Binomial : only two possible categories of the dependent variable, e.g. true or false.
- Multinomial : three or more possible unordered categories of the dependent variable, e.g. cat, dog, sheep.
- Ordinal : three or more possible ordered categories of the dependent variable, e.g. low, medium, high.
Assumptions of Logistic Regression
- Each observation is independent of the others, meaning there is no correlation between inputs, e.g. when students choose a stream to study (Science, Commerce or Humanities), the three options have nothing to do with each other.
- The dependent variable must be binary, meaning it can take only two values, e.g. for a field trip we need to see whether each child is going on the trip or not.
- The relationship between the independent variables and the log odds of the dependent variable should be linear.
- There should be no outliers. Outliers are observations that fall far outside the typical range of the other data points in a data set. These anomalies can come from errors in data collection, human error, equipment malfunction or data transmission issues.
- The sample size is sufficiently large. For observational studies with a large population involving logistic regression analysis, a minimum sample size of 500, or at least 10 observations per variable, is required.
How Logistic Regression Works
To understand the working of logistic regression we will proceed with an example. Say there is a data set containing information about various users obtained from social media sites. A hair care brand has launched a new product, and the company wants to check how many users from the data set are likely to purchase it. For this problem we will build a machine learning model using the logistic regression algorithm. The steps are as follows:
- Data pre-processing : In this step we pre-process the data so that it is usable in the code, and extract the dependent and independent variables from the given data set.
- Fitting logistic regression to the training set : We train the model on a training set. To fit the model to the training set we import the logistic regression class of the library.
- Predicting the test result : With the model trained on the training set, we predict the results for the test-set data.
- Testing the accuracy of the result (creation of a confusion matrix) : We create a confusion matrix to check the accuracy of the classification.
- Visualizing the test set result : To visualize the test-set results we use matplotlib.
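The steps above can be sketched end-to-end with scikit-learn. The synthetic data below stands in for the social-media user data set described in the example, which is an assumption made purely for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

# Step 1: data pre-processing — synthetic features stand in for the user data.
X, y = make_classification(n_samples=200, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,
                                                    random_state=0)

# Step 2: fit logistic regression to the training set.
model = LogisticRegression()
model.fit(X_train, y_train)

# Step 3: predict the test-set results.
y_pred = model.predict(X_test)

# Step 4: confusion matrix to check the accuracy of the classification.
cm = confusion_matrix(y_test, y_pred)
print(cm)

# Step 5: visualisation with matplotlib would plot the decision boundary;
# it is omitted here to keep the sketch self-contained.
```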
Logistic Regression Values Evaluation
We can evaluate the Logistic regression model using the following metrics:
- Accuracy
- Precision
- Recall (sensitivity or true positive rate)
- F1 score
- Area under the receiver operating characteristic curve (AUC-ROC)
- Area under the precision-recall curve (AUC-PR)
NOTE:
- TRUE POSITIVE - the model correctly predicts the positive class.
- TRUE NEGATIVE - the model correctly predicts the negative class.
- FALSE POSITIVE - the model predicts positive, but the actual class is negative.
- FALSE NEGATIVE - the model predicts negative, but the actual class is positive.
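The metrics above can be computed directly from the four confusion-matrix counts; the counts below are made-up numbers for illustration:

```python
# Hypothetical confusion-matrix counts.
tp, tn, fp, fn = 8, 7, 2, 3

accuracy  = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of predicted positives, how many were right
recall    = tp / (tp + fn)   # of actual positives, how many were found
f1        = 2 * precision * recall / (precision + recall)

print(round(accuracy, 3))   # 0.75
print(round(precision, 3))  # 0.8
print(round(recall, 3))     # 0.727
print(round(f1, 3))         # 0.762
```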