Loss functions
MSE
The mean squared error (MSE) loss function measures the average squared difference between the ground truth values (\(y\)) and the predicted values (\(\hat{y}\)). It is differentiable everywhere and yields a gradient that is linear in the prediction error.
It is defined as
\[
\mathrm{MSE} = \frac{1}{N} \sum_{i=1}^{N} \left( y_i - \hat{y}_i \right)^2
\]
where \(N\) is the number of samples.
Its derivative with respect to the prediction for sample \(i\) is
\[
\frac{\partial\,\mathrm{MSE}}{\partial \hat{y}_i} = \frac{2}{N} \left( \hat{y}_i - y_i \right)
\]
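As a quick sanity check, the MSE and its per-prediction gradient can be sketched in plain Python (the function names are illustrative, not from any particular library):

```python
def mse(y, y_hat):
    """Mean squared error averaged over N samples."""
    n = len(y)
    return sum((yi - yhi) ** 2 for yi, yhi in zip(y, y_hat)) / n

def mse_grad(y, y_hat):
    """Gradient w.r.t. each prediction: (2/N) * (y_hat_i - y_i)."""
    n = len(y)
    return [2.0 / n * (yhi - yi) for yi, yhi in zip(y, y_hat)]

y = [1.0, 2.0, 3.0]
y_hat = [1.5, 2.0, 2.0]
print(mse(y, y_hat))       # (0.25 + 0 + 1) / 3 ≈ 0.4167
print(mse_grad(y, y_hat))  # [1/3, 0, -2/3]
```

Note how the gradient grows linearly with the error \(\hat{y}_i - y_i\): the overshoot of \(0.5\) on the first sample yields a gradient half the magnitude of the undershoot of \(1.0\) on the third.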
BCE
The binary cross entropy (BCE) loss function measures the average discrepancy between the ground truth labels (\(y\)) and the predicted probabilities (\(\hat{y}\)) for binary classification. It is differentiable for \(\hat{y} \in (0,1)\) and yields a gradient that increases sharply as predictions become confidently incorrect.
It is defined as
\[
\mathrm{BCE} = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log \hat{y}_i + (1 - y_i) \log (1 - \hat{y}_i) \right]
\]
Its derivative with respect to the prediction for sample \(i\) is
\[
\frac{\partial\,\mathrm{BCE}}{\partial \hat{y}_i} = \frac{1}{N} \cdot \frac{\hat{y}_i - y_i}{\hat{y}_i \left( 1 - \hat{y}_i \right)}
\]
Here is the step-by-step derivation. Only the \(i\)-th term of the sum depends on \(\hat{y}_i\), and differentiating the two logarithms gives
\[
\frac{\partial\,\mathrm{BCE}}{\partial \hat{y}_i}
= -\frac{1}{N} \left( \frac{y_i}{\hat{y}_i} - \frac{1 - y_i}{1 - \hat{y}_i} \right)
= \frac{1}{N} \cdot \frac{-y_i \left( 1 - \hat{y}_i \right) + \left( 1 - y_i \right) \hat{y}_i}{\hat{y}_i \left( 1 - \hat{y}_i \right)}
= \frac{1}{N} \cdot \frac{\hat{y}_i - y_i}{\hat{y}_i \left( 1 - \hat{y}_i \right)}
\]
Note
In practice, this direct per-sample derivative is rarely applied on its own. It is typically combined with a sigmoid activation applied to the model output \(z_i\), so that \(\hat{y}_i = \sigma(z_i)\), producing a single, simplified gradient:
\[
\frac{\partial\,\mathrm{BCE}}{\partial z_i} = \frac{1}{N} \left( \hat{y}_i - y_i \right)
\]
This combined form improves both computational efficiency and numerical stability during training.
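The simplified gradient can be verified numerically with a finite difference on a logit. A minimal pure-Python sketch (function names are illustrative):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bce(y, y_hat):
    """Binary cross entropy averaged over N samples."""
    n = len(y)
    return -sum(yi * math.log(p) + (1 - yi) * math.log(1 - p)
                for yi, p in zip(y, y_hat)) / n

def bce_logit_grad(y, z):
    """Combined BCE + sigmoid gradient w.r.t. each logit: (sigmoid(z_i) - y_i) / N."""
    n = len(y)
    return [(sigmoid(zi) - yi) / n for yi, zi in zip(y, z)]

y = [1.0, 0.0]
z = [0.3, -1.2]

# Finite-difference check on the first logit:
eps = 1e-6
z_plus = [z[0] + eps, z[1]]
num = (bce(y, [sigmoid(v) for v in z_plus]) - bce(y, [sigmoid(v) for v in z])) / eps
print(num, bce_logit_grad(y, z)[0])  # both ≈ (sigmoid(0.3) - 1) / 2
```

The numerical and analytical gradients agree to several decimal places, confirming that the \(\hat{y}_i (1 - \hat{y}_i)\) denominator from the raw BCE derivative cancels exactly against the sigmoid's derivative.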
CCE
The categorical cross entropy (CCE) loss function measures the average discrepancy between the multi-class ground truth labels (\(y\)), which are typically one-hot encoded, and the predicted probability distribution (\(\hat{y}\)). It is differentiable for each \(\hat{y}_{i,k} \in (0,1)\) and yields a gradient that increases sharply as predictions become confidently incorrect.
It is defined as
\[
\mathrm{CCE} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{k=1}^{K} y_{i,k} \log \hat{y}_{i,k}
\]
where:
- \(N\) is the number of samples
- \(K\) is the number of classes
- \(y_{i,k}\) is the ground truth (1 if sample \(i\) belongs to class \(k\), else 0)
- \(\hat{y}_{i,k}\) is the predicted probability that sample \(i\) belongs to class \(k\)
Its derivative with respect to the prediction for sample \(i\) and class \(k\) is
\[
\frac{\partial\,\mathrm{CCE}}{\partial \hat{y}_{i,k}} = -\frac{1}{N} \cdot \frac{y_{i,k}}{\hat{y}_{i,k}}
\]
Here is the step-by-step derivation. Only the term for sample \(i\) and class \(k\) depends on \(\hat{y}_{i,k}\), so
\[
\frac{\partial\,\mathrm{CCE}}{\partial \hat{y}_{i,k}}
= -\frac{1}{N} \cdot \frac{\partial}{\partial \hat{y}_{i,k}} \left( y_{i,k} \log \hat{y}_{i,k} \right)
= -\frac{1}{N} \cdot \frac{y_{i,k}}{\hat{y}_{i,k}}
\]
Note that the derivative with respect to the multi-class prediction vector \(\hat{y}_i\) is a \(K\)-dimensional vector, not a scalar. For sample \(i\)
\[
\nabla_{\hat{y}_i} \mathrm{CCE} = -\frac{1}{N} \left( \frac{y_{i,1}}{\hat{y}_{i,1}}, \frac{y_{i,2}}{\hat{y}_{i,2}}, \dots, \frac{y_{i,K}}{\hat{y}_{i,K}} \right)
\]
Equivalently, in compact vector notation:
\[
\nabla_{\hat{y}_i} \mathrm{CCE} = -\frac{1}{N} \cdot \frac{y_i}{\hat{y}_i}
\]
where the division is element-wise.
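Because the labels are one-hot, this gradient vector is zero everywhere except at the true class. A small pure-Python sketch illustrating this (names are illustrative):

```python
import math

def cce(y, y_hat):
    """Categorical cross entropy averaged over N samples.
    y and y_hat are N lists of K values each."""
    n = len(y)
    return -sum(yk * math.log(pk)
                for yi, pi in zip(y, y_hat)
                for yk, pk in zip(yi, pi)) / n

def cce_grad(y, y_hat):
    """Element-wise gradient: -y_{i,k} / (N * y_hat_{i,k})."""
    n = len(y)
    return [[-yk / (n * pk) for yk, pk in zip(yi, pi)]
            for yi, pi in zip(y, y_hat)]

y = [[0.0, 1.0, 0.0]]        # one-hot label for a single sample
y_hat = [[0.2, 0.5, 0.3]]    # predicted probability distribution
print(cce(y, y_hat))         # -log(0.5) ≈ 0.6931
print(cce_grad(y, y_hat)[0]) # [0.0, -2.0, -0.0]
```

Only the entry for the true class is nonzero, and its magnitude \(1/\hat{y}_{i,k}\) blows up as the predicted probability of the correct class approaches zero, which is the "sharply increasing gradient for confidently wrong predictions" described above.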
Note
In practice, this direct per-sample derivative is rarely applied on its own. It is typically combined with a softmax activation applied to the model outputs \(z_i\), so that \(\hat{y}_i = \mathrm{softmax}(z_i)\), producing a single, simplified gradient:
\[
\frac{\partial\,\mathrm{CCE}}{\partial z_{i,k}} = \frac{1}{N} \left( \hat{y}_{i,k} - y_{i,k} \right)
\]
Analogously to the BCE + sigmoid case, this compact form improves both computational efficiency and numerical stability during training.