Why is logistic regression used?
Business Problem: A doctor wants to predict the likelihood of successful treatment of a new patient condition based on various patient attributes such as blood pressure, hemoglobin level, blood sugar level, the drug given to the patient, the treatments administered, and so on.

Statisticians initially used the logistic function to describe the properties of population growth. The sigmoid function and the logit function are variations of the logistic function; the logit function is the inverse of the standard logistic function. In effect, the sigmoid is an S-shaped curve capable of taking any real number and mapping it into a value between 0 and 1, but never precisely at those limits. It's represented by the equation σ(x) = 1 / (1 + e^(−x)). If the predicted value is a large negative value, the output is considered close to zero.

On the other hand, if the predicted value is a large positive value, the output is considered close to one. Logistic regression is represented similarly to linear regression, which is defined using the equation of a straight line.
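The S-shaped mapping described above can be sketched in a few lines of Python (a naive version; for very large inputs `math.exp` can overflow, which a production implementation guards against):

```python
import math

def sigmoid(x):
    """Map any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

# Large negative inputs come out close to 0, large positive inputs close to 1,
# and the curve never reaches either limit exactly.
print(sigmoid(-10))  # ~0.0000454
print(sigmoid(0))    # 0.5
print(sigmoid(10))   # ~0.9999546
```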

A notable difference from linear regression is that the output is a binary value (0 or 1) rather than a numerical value. The dependent variable generally follows a Bernoulli distribution. The values of the coefficients are estimated using maximum likelihood estimation (MLE), gradient descent, or stochastic gradient descent.
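As a rough illustration of how gradient descent can estimate the coefficients, here is a toy batch solver on made-up one-feature data (a sketch only, not the optimized routines a statistics package would use):

```python
import math

def fit_logistic_gd(X, y, lr=0.5, epochs=2000):
    """Estimate logistic regression coefficients by batch gradient
    descent on the average log-loss (illustrative toy solver)."""
    n = len(X)
    d = len(X[0])
    w = [0.0] * d
    b = 0.0
    for _ in range(epochs):
        gw = [0.0] * d
        gb = 0.0
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi)) + b
            p = 1.0 / (1.0 + math.exp(-z))
            err = p - yi          # derivative of log-loss with respect to z
            for j in range(d):
                gw[j] += err * xi[j]
            gb += err
        w = [wj - lr * g / n for wj, g in zip(w, gw)]
        b -= lr * gb / n
    return w, b

# Toy data: the label is 1 whenever the single feature is positive.
X = [[-2.0], [-1.0], [-0.5], [0.5], [1.0], [2.0]]
y = [0, 0, 0, 1, 1, 1]
w, b = fit_logistic_gd(X, y)
print(w, b)  # the learned weight should come out clearly positive
```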

As with other classification algorithms like k-nearest neighbors, a confusion matrix is used to evaluate the accuracy of the logistic regression algorithm. Logistic regression is part of a larger family of generalized linear models (GLMs).
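A minimal confusion matrix for binary predictions can be computed by hand (the labels below are hypothetical; libraries such as scikit-learn provide an equivalent `confusion_matrix` function):

```python
def confusion_matrix(y_true, y_pred):
    """2x2 confusion matrix for binary labels, laid out as
    [[TN, FP],
     [FN, TP]]."""
    cm = [[0, 0], [0, 0]]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    return cm

# Hypothetical true labels vs. a classifier's predictions.
y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]
cm = confusion_matrix(y_true, y_pred)
print(cm)  # [[3, 1], [1, 3]]
accuracy = (cm[0][0] + cm[1][1]) / len(y_true)
print(accuracy)  # 0.75
```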

Just like evaluating the performance of a classifier, it's equally important to know why the model classified an observation in a particular way.

In other words, we need the classifier's decision to be interpretable. Although interpretability isn't easy to define, its primary intent is that humans should know why an algorithm made a particular decision. In the case of logistic regression, it can be combined with statistical tests like the Wald test or the likelihood ratio test for interpretability.

Logistic regression is applied to predict a categorical dependent variable. In other words, it's used when the prediction is categorical, for example, yes or no, true or false, 0 or 1. The predicted output of logistic regression is one of those two categories, and there's no middle ground. Logistic regression analysis is valuable for predicting the likelihood of an event. It helps determine the probabilities between any two classes.

In essence, logistic regression helps solve probability and classification problems. In other words, you can expect only classification and probability outcomes from logistic regression. A logistic regression model can also help classify data for extract, transform, and load (ETL) operations. Logistic regression shouldn't be used if the number of observations is less than the number of features.

Otherwise, it may lead to overfitting. While logistic regression predicts a categorical variable from one or more independent variables, linear regression predicts a continuous variable. In other words, logistic regression provides a discrete output, whereas linear regression offers a continuous output.

Since the outcome is continuous in linear regression, there are infinite possible values for the outcome. But for logistic regression, the number of possible outcome values is limited. In linear regression, the dependent and independent variables should be linearly related. While linear regression is estimated using the ordinary least squares method, logistic regression is estimated using the maximum likelihood estimation approach.

Both logistic and linear regression are supervised machine learning algorithms and the two main types of regression analysis. While logistic regression is used to solve classification problems, linear regression is primarily used for regression problems. Going back to the example of time spent studying, linear regression and logistic regression can predict different things.

Logistic regression can help predict whether the student passed an exam or not. In contrast, linear regression can predict the student's score.
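The contrast can be sketched on toy data (the hours, scores, and pass labels below are made up): ordinary least squares fits a line that predicts the continuous exam score, while a logistic model fit on the pass/fail column would instead output a probability of passing. Here is the linear half in closed form:

```python
hours = [1, 2, 3, 4, 5]
score = [52, 58, 61, 70, 74]   # continuous outcome: linear regression territory
passed = [0, 0, 1, 1, 1]       # binary outcome: what logistic regression would predict

# Simple linear regression via ordinary least squares (closed form for one feature).
n = len(hours)
mx = sum(hours) / n
my = sum(score) / n
slope = (sum((x - mx) * (y - my) for x, y in zip(hours, score))
         / sum((x - mx) ** 2 for x in hours))
intercept = my - slope * mx

print(slope, intercept)        # 5.6 and 46.2 for this toy data
print(intercept + slope * 6)   # predicted score after 6 hours of study, ~79.8
```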

The importance of this is that a large odds ratio (OR) can represent a small probability, and vice versa. Let us go back to our example to make this point clear. The reference group (older individuals receiving the new treatment) showed a chance of death approximately equal to 0. The mean chance of death in the group of younger individuals that received the new treatment is 1.

Similarly, the mean chance of death of an older individual receiving the standard treatment is 3. Finally, younger individuals receiving the standard treatment have a chance of death equal to 5. Therefore, as demonstrated, a large OR only means that the chance in a particular group is much greater than that of the reference group.

But if the chance in the reference group is small, even a large OR can still indicate a small probability. Now it is time to think about what to do when the explanatory variables are not binomial, as before. When an explanatory variable is multinomial, we must build n−1 binary variables, called dummy variables, for it, where n indicates the number of levels of the variable. A dummy variable is simply a variable that takes the value one if the subject presents the specified category and zero otherwise.

Usually, statistical software does this automatically, and the reader does not have to worry about it. While the interpretation of outputs from multinomial explanatory variables is straightforward and follows that of binomial explanatory variables, the interpretation of continuous variables, on the other hand, is a bit more complex.
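The dummy-variable construction can be sketched directly. Here `make_dummies` is an illustrative helper (not a library function), with "North" arbitrarily chosen as the reference level, so rows in the reference category get all-zero dummies:

```python
def make_dummies(values, reference):
    """Build n-1 dummy variables for a categorical column, using
    `reference` as the baseline level (all dummies zero)."""
    levels = [lvl for lvl in sorted(set(values)) if lvl != reference]
    rows = [[1 if v == lvl else 0 for lvl in levels] for v in values]
    return levels, rows

regions = ["North", "South", "East", "North", "East"]
names, dummies = make_dummies(regions, reference="North")
print(names)    # ['East', 'South']
print(dummies)  # [[0, 0], [0, 1], [1, 0], [0, 0], [1, 0]]
```

Statistical software performs this encoding behind the scenes; the sketch just makes explicit what the n−1 columns look like.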

Results from the multivariate logistic regression model containing all explanatory variables (full model), using AGE as a continuous variable, are shown below. The observed effect occurs because the older the individual (in years), the smaller the chance of death.

Take extra care when interpreting logistic regression results that use continuous explanatory variables. A major problem when building a logistic model is selecting which variables to include.

Including every available variable increases the risk of two situations. The first is spurious associations: remember that you are working with samples, and spurious results can occur. The second is that a model with more variables has less statistical power.

So, if there is an association between one explanatory variable and the occurrence of an event, the researcher can miss this effect, because saturated models (those that contain all possible explanatory variables) are not sensitive enough to detect it.

So the researcher must be very cautious when selecting the variables to include in the model. We can start a regression using either a full (saturated) model or a null (empty) model, which starts only with the intercept term. In the first case, variables are dropped one by one, preferably dropping the least significant one first. This is the preferred strategy simply because it is easier to handle, while the second requires all candidate variables to be tested at each step in order to select the best one to include.
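The backward (drop-one-at-a-time) strategy can be sketched as a loop. The `fit_and_get_pvalues` callable stands in for refitting the model with a real statistics library; here it is faked with fixed p-values for illustration, whereas in practice every p-value changes each time the model is refit:

```python
def backward_eliminate(variables, fit_and_get_pvalues, threshold=0.05):
    """Drop the least significant variable one round at a time until
    every remaining p-value falls below `threshold`."""
    vars_left = list(variables)
    while vars_left:
        pvals = fit_and_get_pvalues(vars_left)   # refit the model each round
        worst = max(vars_left, key=lambda v: pvals[v])
        if pvals[worst] <= threshold:
            break                                # all survivors are significant
        vars_left.remove(worst)
    return vars_left

# Toy stand-in for a real model fit: fixed p-values per variable.
fake_pvalues = {"age": 0.01, "treatment": 0.03, "region": 0.40, "bmi": 0.20}
selected = backward_eliminate(fake_pvalues,
                              lambda vs: {v: fake_pvalues[v] for v in vs})
print(selected)  # ['age', 'treatment']
```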

On the other hand, if too many variables are included at once in a full model, significant variables could be dropped due to low statistical power, as mentioned above. However, if we have a limited sample size relative to the number of candidate variables, a pre-selection should be performed instead. There is no reason to worry about a rigorous p-value criterion at this stage, because this is just a pre-selection strategy and no inference will be derived from this step.

This relaxed p-value criterion allows reducing the initial number of variables in the model while reducing the risk of missing important variables [4, 5]. There is some debate about the appropriate strategy for variable selection [6], and this is just one of them.

It is easy and intuitive. More elaborate methods are available, but whatever the method, it is very important that researchers are aware of the procedure applied and do not just press some buttons in software. There are some explanatory variables for which the reference level is almost automatically determined.

On the other hand, some variables have no clear reference level but present ordered levels; the reference level will usually be one of the endpoints or, less frequently, the central level. However, some variables have neither ordered levels nor a clear reference level. This can occur with geographic region, and then the question appears: which region should I use as the reference?

The answer is that there is no answer… However, reference level selection can change the model estimation in some cases. It is important to remember that all results and significant effects presented are relative to the reference level. The results, showing just the region variable, are shown below (Table 6).

More broadly, regression analysis is used for: Forecasting trends and future values. For example, how much will the stock price of Lufthansa be in 6 months from now? Determining the strength of different predictors, or, in other words, assessing how much of an impact the independent variable(s) have on a dependent variable.

For example, if a soft drinks company is sponsoring a football match, they might want to determine whether the ads displayed during the match accounted for any increase in sales.

What is logistic regression? In logistic regression the dependent variable is dichotomous, but the independent variables can fall into any of the following categories:

Continuous, such as temperature in degrees Celsius or weight in grams. Temperature in degrees Celsius would be classified as interval data: it has equal intervals but no true zero.

On the other hand, weight in grams would be classified as ratio data; it has equal intervals and a true zero. In other words, if something weighs zero grams, it truly weighs nothing.

Discrete, ordinal, that is, data which can be placed into some kind of order on a scale.

For example, if you are asked to state how happy you are on a scale of 1 to 5, the points on the scale represent ordinal data: a score of 1 indicates a lower degree of happiness than a score of 5, but there is no way of determining the numerical distance between the points on the scale. Ordinal data is the kind of data you might get from a customer satisfaction survey.

Discrete, nominal, that is, data which fits into named groups that do not represent any kind of order or scale.

So, in order to determine whether logistic regression is the correct type of analysis to use, ask yourself the following: Is the dependent variable dichotomous?

In other words, does it fit into one of two set categories? Are the independent variables either interval, ratio, or ordinal? See the examples above for a reminder of what these terms mean. Remember: the independent variables are those which may impact, or be used to predict, the outcome.

Logistic regression assumptions: The dependent variable is binary or dichotomous, i.e., it fits into one of two clear-cut categories. There should be no, or very little, multicollinearity between the predictor variables; in other words, the predictor variables (the independent variables) should be independent of each other.

This means that there should not be a high correlation between the independent variables. Logistic regression also requires fairly large sample sizes: the larger the sample size, the more reliable and powerful you can expect the results of your analysis to be.
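A quick way to screen for multicollinearity is to compute pairwise correlations between predictors; absolute values close to 1 are a warning sign. A sketch with made-up blood-pressure-like predictors:

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Two hypothetical predictors; the second is nearly a copy of the first,
# so their correlation is close to 1 and they are highly collinear.
bp  = [120, 130, 125, 140, 135]
bp2 = [118, 131, 126, 139, 136]
print(round(pearson(bp, bp2), 3))  # ~0.986: keep only one of these predictors
```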



