Vectorized Logistic Regression
22 hours ago
The underlying math behind any Artificial Neural Network (ANN) algorithm can be overwhelming to understand. Moreover, the matrix and vector operations used to represent feed-forward and back-propagation computations during batch training of the model can add to the comprehension overload. While succinct matrix and vector notations make sense, peeling through such notations down to subtle working details of such matrix operations would bring more clarity. I realized that the best way to understand such subtle details is to consider a bare minimum network model. I could not find a better algorithm than Logistic Regression to explore what goes under the hood because it has all the bells and whistles of ANN, such as multidimensional inputs, the network weights, the bias, forward propagation operations, activations that apply non-linear function, loss function, and gradients-based back-propagation. My intent for this blog is to share my notes and findings of the matrix and vector operations that are core to Logistic Regression model.
Brief Synopsis of Logistic Regression
Despite its name, Logistic Regression is a classification algorithm and not a regression algorithm. Typically it is used for binary classification to predict the probability of an instance belonging to one of two classes, for example, predicting if an email is spam or not. As such, in Logistic Regression, the dependent or target variable is considered a categorical variable. For example, an email being spam is represented as 1 and not spam as 0. The primary goal of the Logistic Regression model is to establish a relationship between the input variables (features) and the probability of the target variable. For example, given the characteristics of an email as a set of input features, a Logistic Regression model would find a relationship between such features and the probability of the email being spam. If ‘Y’ represents the output class, such as an email being spam, ‘X’ represents the input features, the probability can be designated as π = Pr( Y = 1 | X, βi), where βi represents the logistic regression parameters that include model weights ‘wi’ and a bias parameter ‘b’. Effectively, a Logistic Regression predicts the probability of Y = 1 given the input features and the model parameters. Specifically, the probability π is modeled as an S-Shaped logistic function called the Sigmoid function, given by π = e^z/(1 + e^z) or equivalently by π = 1/(1 + e^-z), where z = βi . X. The sigmoid function allows for a smooth curve bounded between 0 and 1, making it suitable for estimating probabilities. Essentially, a Logistic Regression model applies the sigmoid function on a linear combination of the input features to predict a probability between 0 and 1. A common approach to determining an instance’s output class is thresholding the predicted probability. For example, if the predicted probability is greater than or equal to 0.5, the instance is classified as belonging to class 1; otherwise, it is classified as class 0.
A Logistic Regression model is trained by fitting the model to the training data and then minimizing a loss function to adjust the model parameters. A loss function estimates the difference between the predicted and actual probabilities of the output class. The most common loss function used in training a Logistic Regression model is the Log Loss function, also known as Binary Cross Entropy Loss function. The formula for the Log Loss function is as follows:
L = — ( y * ln(p) + (1 — y) * ln(1 — p) )
Where:
- L represents the Log Loss.
- y is the ground-truth binary label (0 or 1).
- p is the predicted probability of the output class.
A Logistic Regression model adjusts its parameters by minimizing the loss function using techniques such as gradient descent. Given a batch of input features and their ground-truth class labels, training of the model is carried out in several iterations, called epochs. In each epoch, the model carries forward propagation operations to estimate losses and backward propagation operations to minimize the loss function and adjust the parameters. All such operations in an epoch employ matrix and vector computations as illustrated in the next sections.
Matrix and Vector Notations
Please note that I used LaTeX scripts to create the mathematical equations and matrix/vector representations embedded as images in this blog. If anyone is interested in the LaTeX scripts, don’t hesitate to contact me; I will be happy to share.
As shown in the schematic diagram above, a binary Logistic Regression classifier is used as an example to keep the illustrations simple. As shown below, a matrix X represents the ‘m’ number of input instances. Each input instance comprises an ’n’ number of features and is represented as a column, an input features vector, within the matrix X, making it a (n x m) sized matrix. The super-script (i) represents the ordinal number of the input vector in the matrix X. The sub-script ‘j’ represents the ordinal index of the feature within an input vector. The matrix Y of size (1 x m) captures the ground-truth labels corresponding to each input vector in the matrix X. The model weights are represented by a column vector W of size (n x 1) comprising ’n’ weight parameters corresponding to each feature in the input vector. While there is only one bias parameter ‘b’, for illustrating matrix/vector operations, a matrix B of size (1 x m) comprising ‘m’ number of the same bias b parameter is considered.
Forward Propagation
The first step in the forward propagation operation is to compute a linear combination of model parameters and input features. The notation for such matrix operation is shown below where a new matrix Z is evaluated:
Note the use of the transpose of weight matrix W. The above operation in the matrix expanded representation is as follows:
The above matrix operation results in the computation of matrix Z of size (1 x m) as shown below:
The next step is to derive activations by applying the sigmoid function on the computed linear combinations for each input as shown in the following matrix operation. This results in an activation matrix A of size (1 x m).
Backward Propagation
Backward propagation or back-propagation is a technique to compute the contributions of each parameter to the overall error or loss caused by incorrect predictions at the end of each epoch. The individual loss contributions are evaluated by computing the gradients of the loss function with respect to (w.r.t) each model parameter. A gradient or derivative of a function is the rate of change or the slope of that function w.r.t a parameter considering other parameters as constants. When evaluated for a specific parameter value or point, the sign of the gradient indicates in which direction the function increases, and the gradient magnitude indicates the steepness of the slope. The log loss function as shown below is a bowl-shaped convex function with one global minimum point. As such, in most cases, the gradient of the log loss function w.r.t a parameter points in the opposite direction to the global minima. Once gradients are evaluated, each parameter value is updated using the parameter’s gradient, typically by using a technique called gradient descent.
The gradient for each parameter is computed using the chain rule. The chain rule enables the computation of derivatives of functions that are composed of other functions. In the case of Logistic Regression, the log loss L is a function of activation ‘a’ and ground-truth label ‘y’, while ‘a’ itself is a sigmoid function of ‘z’ and ‘z’ is a linear function of weights ‘w’ and bias ‘b’ implying that the loss function L is a function composed of other functions as shown below.
Using the chain rule of partial derivatives, the gradients of weight and bias parameters can be computed as follows:
Derivation of Gradients for Single Input Instance
Before we review the matrix and vector representations that come into play as part of updating the parameters in one shot, we will first derive the gradients using a single input instance to understand the basis for such representations better.
Assuming that ‘a’ and ‘z’ represent computed values for a single input instance with the ground-truth label ‘y’, the gradient of the loss function w.r.t ‘a’ can be derived as follows. Note that this gradient is the first quantity required to evaluate the chain rule to derive parameter gradients later.
Given the gradient of loss function w.r.t ‘a’, the gradient of loss function w.r.t ‘z’ can be derived using the following chain rule:
The above chain rule implies that the gradient of ‘a’ w.r.t ‘z’ must also be derived. Note that ‘a’ is computed by applying the sigmoid function on ‘z’. Therefore, the gradient of ‘a’ w.r.t ‘z’ can be derived by using the sigmoid function expression as follows:
The above derivation is expressed in terms of ‘e’, and it appears that additional computations are needed to evaluate the gradient of ‘a’ w.r.t ‘z’. We know that ‘a’ gets computed as part of forward propagation. Therefore to eliminate any additional computations, the above derivative can be fully expressed in terms of ‘a’ instead as follows:
Plugging in the above terms expressed in ‘a’, the gradient of ‘a’ w.r.t ‘z’ is as follows:
Now that we have the gradient of loss function w.r.t ‘a’ and the gradient of ‘a’ w.r.t ‘z’, the gradient of loss function w.r.t ‘z’ can be evaluated as follows:
We came a long way in evaluating the gradient of loss function w.r.t ‘z’. We still need to evaluate the gradients of loss function w.r.t model parameters. We know that ‘z’ is a linear combination of model parameters and features of an input instance ‘x’ as shown below:
Using the chain rule the gradient of loss function w.r.t weight parameter ‘wi’ gets evaluated as shown below:
Similarly, the gradient of the loss function w.r.t ‘b’ gets evaluated as follows:
Matrix and Vector Representation of Parameter Updates using Gradients
Now that we understand gradient formulas for model parameters derived using a single input instance, we can represent the formulas in matrix and vector forms accounting for the entire training batch. We will first vectorize gradients of the loss function w.r.t ‘z’ given by the following expression:
The vector form of the above for the entire ‘m’ number of instances is:
Similarly, the gradients of the loss function w.r.t each weight ‘wi’ can be vectorized. The gradient of the loss function w.r.t weight ‘wi’ for a single instance is given by:
The vector form of the above for all weights across all ‘m’ input instances is evaluated as the mean of ‘m’ gradients as follows:
Similarly, the resultant gradient of loss function w.r.t ‘b’ across all ‘m’ input instances is computed as a mean of the individual instance gradients as follows:
Given the model weights gradient vector and the overall gradient for bias, the model parameters get updated as follows. The parameter updates as shown below are based on the technique called gradient descent where a learning rate is used. A learning rate is a hyper-parameter used in optimization techniques such as gradient descent to control the step size of adjustments made at each epoch to the model parameters based on computed gradients. Effectively, a learning rate acts as a scaling factor, influencing the speed and convergence of the optimization algorithm.
Conclusion
As evident from the matrix and vector representations illustrated in this blog, Logistic Regression enables a bare minimum network model to understand the subtle working details of such matrix and vector operations. Most machine-learning libraries encapsulate such nitty-gritty mathematical details but instead expose well-defined programming interfaces at a higher level, such as forward or backward propagation. While understanding all such subtle details may not be required to develop models using such libraries, such details do shed light on the mathematical intuitions behind such algorithms. However, such understanding will certainly help carry forward the underlying mathematical intuitions to other models such as ANN, Recurrent Neural Networks (RNN), Convolutional Neural Networks (CNN), and Generative Adversarial Networks (GAN).
This post originally appeared on TechToday.