Regression is a statistical process for estimating the relationships between a dependent variable and one or more independent variables, also known as predictors or covariates. Regression analysis serves two conceptually distinct purposes: prediction and forecasting, where it overlaps with machine learning, and inference about the relationships between the independent and dependent variables.
Categorical variables are variables that take a limited number of distinct values and represent different groups or categories. They are also known as qualitative variables or factors.
When the dependent variable is categorical (binary in the example here), logistic regression is commonly used. It models the probability of the outcome as a function of the independent variables, and its coefficients are estimated by Maximum Likelihood Estimation (MLE).
Example: We aim to predict whether a candidate will be admitted based on GRE, GPA and rank. The dataset (binary.csv) is read from the current working directory, which getwd() reports, and the model is built in R.
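A minimal sketch of the loading step, assuming binary.csv sits in the working directory. The fallback branch is purely an assumption added so the snippet runs without the file: it simulates a stand-in data frame with the same four columns and is not the real admissions data.

```r
# Assumption: binary.csv is in the current working directory (check with getwd()).
path <- "binary.csv"

if (file.exists(path)) {
  df <- read.csv(path)
} else {
  # Hypothetical stand-in so the sketch runs without the file.
  # Simulated values only; NOT the real admissions data.
  set.seed(1)
  n <- 400
  df <- data.frame(
    admit = rbinom(n, 1, 0.3),
    gre   = as.integer(sample(seq(220, 800, by = 20), n, replace = TRUE)),
    gpa   = round(runif(n, 2.2, 4), 2),
    rank  = sample(1:4, n, replace = TRUE)
  )
}

# Inspect column names, types and the first few values
str(df)
```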
Output:
'data.frame': 400 obs. of 4 variables:
$ admit: int 0 1 1 1 0 1 1 0 1 0 ...
$ gre : int 380 660 800 640 520 760 560 400 540 700 ...
$ gpa : num 3.61 3.67 4 3.19 2.93 3 2.98 3.08 3.39 3.92 ...
$ rank : int 3 3 1 4 4 2 1 2 3 2 ...
Looking at the structure of the dataset, we observe that it contains 4 variables. The admit variable indicates whether a candidate is admitted (1) or not (0). The gre, gpa and rank variables represent the candidate’s GRE score, GPA and the rank of their previous college, respectively.
We use admit as the dependent variable and gre, gpa and rank as independent variables. Although admit and rank are stored as numeric values, they represent categorical data, so we convert them into factors using the as.factor() function.
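The conversion and the cross-tabulation can be sketched as follows. The six-row data frame is a hypothetical toy example with the same column types, not the tutorial's data; xtabs() is one way to count admissions per rank.

```r
# Hypothetical toy data with integer-coded categories, as in the tutorial's data frame
df <- data.frame(
  admit = c(0L, 1L, 1L, 0L, 1L, 0L),
  rank  = c(3L, 3L, 1L, 4L, 2L, 2L)
)

# Convert the integer codes into factors so R treats them as categories
df$admit <- as.factor(df$admit)
df$rank  <- as.factor(df$rank)

# Cross-tabulate admission outcome against rank
xtabs(~ admit + rank, data = df)
```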
Output:
rank
admit 1 2 3 4
0 28 97 93 55
1 33 54 28 12
The dataset is divided into training and test sets: the training set is used to build the model and the test set to evaluate its performance. We assign roughly 60% of the rows to training and 40% to testing. The split uses random sampling with the sample() function, and set.seed() makes it reproducible. In the run shown here, 249 rows land in the training set and 151 in the test set.
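One common way to make such a split is sketched below. The exact call and seed used for the results in this article are not shown, so the seed value and the one-column stand-in data frame are assumptions. Because each row is labelled independently, the proportions come out only approximately 60/40.

```r
set.seed(1234)  # assumed seed; any fixed value makes the split reproducible

# Hypothetical stand-in for the 400-row data frame
df <- data.frame(id = 1:400)

# Draw a label (1 = train, 2 = test) for every row with probabilities 0.6 / 0.4
ind <- sample(2, nrow(df), replace = TRUE, prob = c(0.6, 0.4))

train <- df[ind == 1, , drop = FALSE]
test  <- df[ind == 2, , drop = FALSE]

# Sizes of the two subsets
c(train = nrow(train), test = nrow(test))
```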
Now build a logistic regression model for our data. The glm() function fits a generalized linear model; setting family = "binomial" turns it into a logistic regression. The glm() function has the following syntax.
Syntax:
glm(formula, family = gaussian, data, weights, subset, na.action, start = NULL, etastart, mustart, offset,
control = list(…), model = TRUE, method = "glm.fit", x = FALSE, y = TRUE, singular.ok = TRUE, contrasts = NULL, …)
| Parameter | Description |
| --- | --- |
| formula | a symbolic description of the model to be fitted. |
| family | a description of the error distribution and link function to be used in the model. |
| data | an optional data frame. |
| weights | an optional vector of ‘prior weights’ to be used in the fitting process. Should be NULL or a numeric vector. |
| subset | an optional vector specifying a subset of observations to be used in the fitting process. |
| na.action | a function which indicates what should happen when the data contain NAs. |
| start | starting values for the parameters in the linear predictor. |
| etastart | starting values for the linear predictor. |
| mustart | starting values for the vector of means. |
| offset | this can be used to specify an a priori known component to be included in the linear predictor during fitting. |
| control | a list of parameters for controlling the fitting process. |
| model | a logical value indicating whether model frame should be included as a component of the returned value. |
| method | the method to be used in fitting the model. |
| x, y | logical values indicating whether the model matrix (x) and the response vector (y) used in the fitting process should be returned as components of the returned value. |
| singular.ok | logical; if FALSE a singular fit is an error. |
| contrasts | an optional list. |
| ... | arguments to be used to form the default control argument if it is not supplied directly. |
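The call echoed in the summary output is glm(formula = admit ~ gre + gpa + rank, family = "binomial", data = train). A runnable sketch of that fit follows. The training data here is simulated (an assumption, so the block runs on its own) and the object name mymodel is hypothetical, so the numbers it prints will differ from the article's output.

```r
# Simulated stand-in for the training set; NOT the real admissions data
set.seed(42)
n <- 250
train <- data.frame(
  gre  = sample(seq(300, 800, by = 20), n, replace = TRUE),
  gpa  = round(runif(n, 2.3, 4), 2),
  rank = factor(sample(1:4, n, replace = TRUE))
)
# Generate a binary outcome from an assumed logistic relationship
eta <- -4 + 0.002 * train$gre + train$gpa - 0.5 * (as.integer(train$rank) - 1)
train$admit <- factor(rbinom(n, 1, plogis(eta)))

# Logistic regression: binomial family with the default logit link
mymodel <- glm(admit ~ gre + gpa + rank, family = "binomial", data = train)
summary(mymodel)
```

Because rank is a factor, R expands it into indicator variables (rank2, rank3, rank4) with rank 1 as the reference level.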
Output:
Call:
glm(formula = admit ~ gre + gpa + rank, family = "binomial",
data = train)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.6576 -0.8724 -0.6184 1.0683 2.1035
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -4.972329 1.518865 -3.274 0.00106 **
gre 0.001449 0.001405 1.031 0.30270
gpa 1.233117 0.450550 2.737 0.00620 **
rank2 -0.784080 0.406376 -1.929 0.05368 .
rank3 -1.203013 0.426614 -2.820 0.00480 **
rank4 -1.699652 0.536974 -3.165 0.00155 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 312.66 on 248 degrees of freedom
Residual deviance: 283.38 on 243 degrees of freedom
AIC: 295.38
Number of Fisher Scoring iterations: 4
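To see how these coefficients turn into a prediction, the sketch below works through the first row of the dataset (gre = 380, gpa = 3.61, rank = 3) by hand. It only uses the estimates printed above; the variable names are illustrative.

```r
# Coefficients copied from the summary output above
b <- c(intercept = -4.972329, gre = 0.001449, gpa = 1.233117,
       rank2 = -0.784080, rank3 = -1.203013, rank4 = -1.699652)

# First row of the dataset: gre = 380, gpa = 3.61, rank = 3
# (rank 1 is the reference level, so only the rank3 term applies)
eta <- b[["intercept"]] + b[["gre"]] * 380 + b[["gpa"]] * 3.61 + b[["rank3"]]

# The inverse logit turns the log-odds into a probability
p <- plogis(eta)
round(c(log_odds = eta, probability = p), 3)
```

The log-odds come out at about -1.17, which corresponds to an admission probability of about 0.24 under the full model.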
From the summary of the model, gre is not statistically significant (p ≈ 0.30), so we can drop it and refit the model with gpa and rank only.
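A sketch of the refit, again on simulated stand-in data (an assumption) with hypothetical object names. update() removes gre from the formula, and anova() with test = "Chisq" adds a likelihood-ratio comparison of the two nested models, which is an optional extra check rather than a step the article itself performs.

```r
# Simulated stand-in for the training set; NOT the real admissions data
set.seed(42)
n <- 250
train <- data.frame(
  gre  = sample(seq(300, 800, by = 20), n, replace = TRUE),
  gpa  = round(runif(n, 2.3, 4), 2),
  rank = factor(sample(1:4, n, replace = TRUE))
)
eta <- -4 + 0.002 * train$gre + train$gpa - 0.5 * (as.integer(train$rank) - 1)
train$admit <- factor(rbinom(n, 1, plogis(eta)))

full <- glm(admit ~ gre + gpa + rank, family = "binomial", data = train)

# Drop gre and refit on the same data
reduced <- update(full, . ~ . - gre)

# Likelihood-ratio test: does removing gre significantly worsen the fit?
anova(reduced, full, test = "Chisq")
```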
Now, let's try to analyze our regression model by making some predictions.
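Predicted probabilities come from predict() with type = "response". The sketch below assumes a reduced model fitted on simulated stand-in data; the names mymodel and p1 are hypothetical.

```r
# Simulated stand-in for the training set; NOT the real admissions data
set.seed(42)
n <- 250
train <- data.frame(
  gpa  = round(runif(n, 2.3, 4), 2),
  rank = factor(sample(1:4, n, replace = TRUE))
)
eta <- -3 + train$gpa - 0.5 * (as.integer(train$rank) - 1)
train$admit <- factor(rbinom(n, 1, plogis(eta)))

mymodel <- glm(admit ~ gpa + rank, family = "binomial", data = train)

# type = "response" returns probabilities; the default returns log-odds
p1 <- predict(mymodel, newdata = train, type = "response")

# First few predicted probabilities, named by row
head(p1)
```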
Output:
2: 0.319544695019424, 3: 0.721170188065219, 4: 0.125758071031047, 5: 0.0904838192878458, 6: 0.221311604416545, 9: 0.239946189417743
Output:
admit gre gpa rank
1 0 380 3.61 3
7 1 560 2.98 1
8 0 400 3.08 2
10 0 700 3.92 2
12 0 440 3.22 1
13 1 760 4.00 1
Then we summarize the results in a confusion matrix, which compares the number of true/false positives and negatives. We form the confusion matrix on the test data (151 observations).
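Building the matrix amounts to thresholding the predicted probabilities and cross-tabulating them against the actual labels. In the sketch below, the 0.5 cutoff and the six toy values are assumptions chosen for illustration.

```r
# Hypothetical predicted probabilities and true labels for six candidates
p      <- c(0.32, 0.72, 0.13, 0.09, 0.61, 0.24)
actual <- c(0, 1, 1, 0, 0, 0)

# Classify as admitted (1) when the probability exceeds the assumed 0.5 cutoff
pred <- ifelse(p > 0.5, 1, 0)

# Rows are predictions, columns are actual outcomes
tab <- table(Prediction = pred, Actual = actual)
tab
```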
Output:
Actual
Prediction 0 1
0 98 38
1 6 9
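The model produces 98 true negatives (0’s) and 9 true positives (1’s), along with 38 false negatives (admitted candidates predicted as 0) and 6 false positives (non-admitted candidates predicted as 1). Now, let's calculate the misclassification error on the test data, which is 1 - accuracy, that is, the share of off-diagonal counts: (38 + 6) / 151.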
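The calculation can be reproduced directly from the counts in the confusion matrix above; the object names are illustrative.

```r
# Confusion matrix counts copied from the output above
# (matrix() fills column by column: Actual 0 first, then Actual 1)
tab <- matrix(c(98, 6, 38, 9), nrow = 2,
              dimnames = list(Prediction = c("0", "1"), Actual = c("0", "1")))

# Accuracy is the share of counts on the diagonal (correct predictions)
accuracy <- sum(diag(tab)) / sum(tab)

# Misclassification error is the complement
err <- 1 - accuracy
err
```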
Output:
0.291390728476821
The misclassification error on the test dataset is approximately 29.14%, indicating the proportion of incorrect predictions made by the model.