In this notebook, we will explore examples that illustrate how to implement a neural network for multiclass classification tasks. Some of these approaches are conceptually wrong, but they highlight the functional differences between the models and motivate the need for the softmax activation and the categorical cross-entropy (CCE) loss.
Setup¶
import numpy as np
import matplotlib.pyplot as plt
# set random seed for reproducibility
np.random.seed(42)
from nnfs.layers import Dense
from nnfs.model import Sequential
from nnfs.losses import BCE, MSE, CCE
from nnfs.optimizers import SGD
from nnfs.activations import Sigmoid, Softmax
from nnfs.datasets.data_generators import generate_concentric_circles
from nnfs.utils import shuffle, class_to_onehot
Concentric circles (continuous)¶
Here is a slightly more complex version of the concentric circles dataset from the previous notebook. In this version, there are more than two labels, so we cannot use the same model we used for binary classification.
We will first approach this as a regression task.
Data generation¶
# generate data
X_data, y_true = generate_concentric_circles(500, 3)
# plot data
plt.scatter(X_data[:,0], X_data[:,1], c=y_true)
plt.axis('square')
None
Model specification¶
We could approach this from a regression point of view: treat the labels as if they were an arbitrary continuous quantity we want to predict. We will use sigmoid layers for non-linearity, but the last layer will output a scalar directly from a Dense layer. We will also use mean squared error (MSE) as the loss.
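Treating the task as regression means minimizing the mean squared error between the scalar output and the numeric label. As a reference, here is a minimal NumPy sketch of the MSE computation (a sketch of the standard formula, not necessarily how nnfs implements its MSE class):

```python
import numpy as np

def mse(y_pred, y_true):
    # mean of the squared residuals over all samples
    return np.mean((y_pred - y_true) ** 2)

# (0.25 + 0.25 + 0.0) / 3 -> ~0.1667
print(mse(np.array([0.5, 1.5, 2.0]), np.array([0.0, 1.0, 2.0])))
```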
# define the model
list_layers = [Dense(2, 12), Sigmoid(),
               Dense(12, 2), Sigmoid(),
               Dense(2, 1)]
loss = MSE()
optimizer = SGD(0.01)
model = Sequential(list_layers, loss, optimizer)
model.summary()
Model layers:
* Dense_0 | Dimensions: 2 x 12 | Parameters: 36
* Sigmoid_1
* Dense_2 | Dimensions: 12 x 2 | Parameters: 26
* Sigmoid_3
* Dense_4 | Dimensions: 2 x 1 | Parameters: 3
--------------------
Total parameters: 65
Training¶
# fit the model to the data
history = model.fit(X_data, y_true, 50000, debug_flag=True)
Epoch 1 - Loss: 0.8731996385897649
Epoch 5000 - Loss: 0.6252142270192054
Epoch 10000 - Loss: 0.5752668906491387
Epoch 15000 - Loss: 0.555172372151669
Epoch 20000 - Loss: 0.541803708519793
Epoch 25000 - Loss: 0.4440078600019571
Epoch 30000 - Loss: 0.343741508693696
Epoch 35000 - Loss: 0.07582738508980377
Epoch 40000 - Loss: 0.06537875879583159
Epoch 45000 - Loss: 0.06370496202578878
Epoch 50000 - Loss: 0.06280455843686646
# visualize loss during training
plt.plot(history['loss'])
plt.xlabel("Epoch")
plt.ylabel("Loss")
None
Evaluation¶
As we can see, the model has learned to approximate the category, but since the labels are treated as continuous scalars, the predictions form a continuous gradient between categories.
# produce predictions
y_pred = model.forward(X_data)
# graph predictions
plt.scatter(X_data[:, 0], X_data[:, 1], c=y_pred)
plt.axis('square')
plt.colorbar()
ax = plt.gca()
None
Since the examples in this notebook consist of 2D coordinates, we can easily plot the decision boundary of the model.
def plot_decision_boundary(title, form='square', s=100, discretize=False, categorical=False):
    # create a grid to evaluate the model
    xmin, xmax = ax.get_xlim()
    ymin, ymax = ax.get_ylim()
    xx, yy = np.meshgrid(np.linspace(xmin, xmax, 200),
                         np.linspace(ymin, ymax, 200))
    grid = np.c_[xx.ravel(), yy.ravel()]
    # evaluate the model on the grid
    z = model.forward(grid)
    if categorical:
        z = np.argmax(z, axis=1)
    elif discretize:
        z = z.astype(int)
    z = z.reshape(xx.shape)
    # plot model predictions
    y_pred = model.forward(X_data)
    if categorical:
        y_pred = np.argmax(y_pred, axis=1)
    elif discretize:
        y_pred = y_pred.astype(int)
    plt.scatter(X_data[:,0], X_data[:,1], c=y_pred, s=s, edgecolor='k', cmap='coolwarm')
    # plot decision boundary
    plt.contourf(xx, yy, z, alpha=0.5, cmap='coolwarm')
    plt.axis(form)
    plt.xlabel('x1')
    plt.ylabel('x2')
    plt.title(title)
    plt.show()
To assign a category, we will discretize each predicted value by casting it to an integer (astype(int), which truncates toward zero).
plot_decision_boundary('Concentric circles (continuous)', s=35, discretize=True)
This is clearly not the best way of approaching a multiclass classification task.
Concentric circles (categorical)¶
Next, we will try to make use of our binary classification model, extended to several classes.
Data generation¶
# we will use the same data as the previous example
plt.scatter(X_data[:,0], X_data[:,1], c=y_true)
plt.axis('square')
None
When working with multiclass classification tasks, we typically convert the categorical data to one-hot encoded vectors, as shown below.
# description of one-hot encoded labels
one_hot = class_to_onehot(y_true, 3)
for i in [0, 500, 1000]:
    print(f"Class: {y_true[i]}\t one_hot: {one_hot[i]}")
print(one_hot.shape)
Class: [0.]	 one_hot: [1. 0. 0.]
Class: [1.]	 one_hot: [0. 1. 0.]
Class: [2.]	 one_hot: [0. 0. 1.]
(1500, 3)
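For reference, one-hot encoding can be sketched in a few lines of plain NumPy (a hypothetical to_onehot helper, not the nnfs class_to_onehot implementation itself):

```python
import numpy as np

def to_onehot(labels, n_classes):
    # one row per sample, with a 1 in the column of that sample's class
    labels = np.asarray(labels, dtype=int).ravel()
    onehot = np.zeros((labels.shape[0], n_classes))
    onehot[np.arange(labels.shape[0]), labels] = 1.0
    return onehot

print(to_onehot([0, 1, 2], 3))
# [[1. 0. 0.]
#  [0. 1. 0.]
#  [0. 0. 1.]]
```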
Model specification¶
Instead of a single output neuron like in the previous notebook, we will have three. Conceptually, this means the model produces three independent predictions rather than one categorical one. These predictions are not mutually exclusive: nothing prevents the model from assigning all three labels to the same data point.
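Under this setup, the BCE loss is applied element-wise to each sigmoid output and averaged, as if each class were a separate yes/no question. A minimal NumPy sketch of that idea (a hypothetical multi_output_bce, not nnfs's own BCE class; the clipping is a standard numerical-stability assumption):

```python
import numpy as np

def multi_output_bce(y_pred, y_true, eps=1e-12):
    # element-wise binary cross-entropy, averaged over samples and outputs;
    # each output is treated as an independent Bernoulli prediction
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

# a confident, correct prediction gives a small loss...
print(multi_output_bce(np.array([[0.99, 0.01, 0.01]]), np.array([[1., 0., 0.]])))
# ...but nothing couples the outputs: saying "yes" to every class is possible
print(multi_output_bce(np.array([[0.99, 0.99, 0.99]]), np.array([[1., 0., 0.]])))
```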
# define the model
list_layers = [Dense(2, 16), Sigmoid(),
               Dense(16, 3), Sigmoid(),
               ]
loss = BCE()
optimizer = SGD(0.001)
model = Sequential(list_layers, loss, optimizer)
Training¶
# fit the model to the data
history = model.fit(X_data, one_hot, 25000, debug_flag=True)
Epoch 1 - Loss: 0.6523356617440061
Epoch 2500 - Loss: 0.05850089134887879
Epoch 5000 - Loss: 0.037182944409727595
Epoch 7500 - Loss: 0.02985472171180443
Epoch 10000 - Loss: 0.026134567325523963
Epoch 12500 - Loss: 0.023826411212067713
Epoch 15000 - Loss: 0.02221872740482869
Epoch 17500 - Loss: 0.02100862713661891
Epoch 20000 - Loss: 0.020045883965823897
Epoch 22500 - Loss: 0.019247377678987404
Epoch 25000 - Loss: 0.018561210586293182
# visualize loss during training
plt.plot(history['loss'])
plt.xlabel("Epoch")
plt.ylabel("Loss")
None
Evaluation¶
Since the predictions are three independent probabilities, we will have to process them to generate a final categorical assignment. A common choice is to take the most confident prediction as the categorical prediction. This is performed by the argmax function.
# produce predictions
y_pred = model.forward(X_data)
y_pred
array([[9.99849463e-01, 6.73475071e-04, 4.53036907e-10],
[9.99934434e-01, 3.35999590e-04, 3.57851105e-10],
[9.99943419e-01, 2.97204875e-04, 3.46879529e-10],
...,
[4.38627685e-15, 1.31412144e-04, 9.99428027e-01],
[1.36684797e-13, 8.65530829e-08, 9.99999229e-01],
[5.40676721e-14, 1.96785164e-07, 9.99999911e-01]], shape=(1500, 3))
# extract the most confident categorical prediction
np.argmax(y_pred, axis=1)
array([0, 0, 0, ..., 2, 2, 2], shape=(1500,))
# graph predictions
plt.scatter(X_data[:, 0], X_data[:, 1], c=np.argmax(y_pred, axis=1))
plt.axis('square')
None
plot_decision_boundary('Concentric circles (categorical)', s=35, categorical=True)
This looks much better! Although this naive approach works well for small toy problems like this one, in practice, it is rarely used.
Nine circles of hell¶
In order to explore some of the improvements that we can introduce in the training procedure, as well as in the model architecture, let us first construct a baseline using the previously explained naive approach.
Here, we will generate a dataset of 9 concentric circles, each with their own label (from 0 to 8).
Data generation¶
# generate data
X_data, y_true = generate_concentric_circles(500, 9)
# shuffle data (this will be very important later!)
X_data, y_true = shuffle([X_data, y_true])
print(X_data.shape, y_true.shape)
plt.scatter(X_data[:,0], X_data[:,1], c=y_true, cmap="coolwarm")
plt.axis('square')
None
(4500, 2) (4500, 1)
# description of one-hot encoded labels
one_hot = class_to_onehot(y_true, 9)
for i in range(10):
    print(f"Class: {y_true[i]}\t one_hot: {one_hot[i]}")
print(one_hot.shape)
Class: [5.]	 one_hot: [0. 0. 0. 0. 0. 1. 0. 0. 0.]
Class: [1.]	 one_hot: [0. 1. 0. 0. 0. 0. 0. 0. 0.]
Class: [7.]	 one_hot: [0. 0. 0. 0. 0. 0. 0. 1. 0.]
Class: [2.]	 one_hot: [0. 0. 1. 0. 0. 0. 0. 0. 0.]
Class: [7.]	 one_hot: [0. 0. 0. 0. 0. 0. 0. 1. 0.]
Class: [1.]	 one_hot: [0. 1. 0. 0. 0. 0. 0. 0. 0.]
Class: [6.]	 one_hot: [0. 0. 0. 0. 0. 0. 1. 0. 0.]
Class: [0.]	 one_hot: [1. 0. 0. 0. 0. 0. 0. 0. 0.]
Class: [8.]	 one_hot: [0. 0. 0. 0. 0. 0. 0. 0. 1.]
Class: [8.]	 one_hot: [0. 0. 0. 0. 0. 0. 0. 0. 1.]
(4500, 9)
Model specification¶
The only changes with respect to the previous model are doubling the number of neurons in the hidden layer and using 9 output neurons, in order to generate 9 per-class predictions for each data point.
# define the model
list_layers = [Dense(2, 32), Sigmoid(),
               Dense(32, 9), Sigmoid(),
               ]
loss = BCE()
optimizer = SGD(0.001)
model = Sequential(list_layers, loss, optimizer)
model.summary()
Model layers:
* Dense_0 | Dimensions: 2 x 32 | Parameters: 96
* Sigmoid_1
* Dense_2 | Dimensions: 32 x 9 | Parameters: 297
* Sigmoid_3
--------------------
Total parameters: 393
Training¶
# fit the model to the data
history = model.fit(X_data, one_hot, 10000, debug_flag=True)
Epoch 1 - Loss: 1.8462235301217356
Epoch 1000 - Loss: 0.19281578716736733
Epoch 2000 - Loss: 0.179232032974487
Epoch 3000 - Loss: 0.154472274545469
Epoch 4000 - Loss: 0.1536412117016907
Epoch 5000 - Loss: 0.14053347594180882
Epoch 6000 - Loss: 0.17628894202977632
Epoch 7000 - Loss: 0.13647826608178823
Epoch 8000 - Loss: 0.13931739606063548
Epoch 9000 - Loss: 0.12815559170115878
Epoch 10000 - Loss: 0.13194109693590955
Evaluation¶
# produce predictions
y_pred = model.forward(X_data)
# extract the most confident categorical prediction
np.argmax(y_pred, axis=1)
array([5, 1, 6, ..., 1, 0, 3], shape=(4500,))
# graph predictions
plt.scatter(X_data[:, 0], X_data[:, 1], c=np.argmax(y_pred, axis=1), cmap='coolwarm')
plt.axis('square')
ax = plt.gca()
None
plot_decision_boundary('Decision boundary of nine circles', s=35, categorical=True)
This does not look that bad. Can we do better?
Training improvements¶
Yes, we can. In fact, all the previous examples in the notebooks have been naive, in the sense that they have overlooked two very common improvements in model training: momentum and batching.
In simple terms:
momentum: if, during successive training steps, a given parameter of the network consistently changes in the same direction (either increasing or decreasing), momentum accelerates this change. This can greatly increase learning speed, especially in situations where we need to use a low base learning rate (lr).

batching: instead of computing a single update using the entire dataset, we split the data into batches of N samples and perform one training step per batch. While the gradient computed from each mini-batch is a noisier estimate of the true gradient, the increased number of updates per epoch often compensates for this, resulting in improved training performance. This is where the stochastic in stochastic gradient descent (SGD) comes from.
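In pseudo-NumPy, the two ideas can be sketched as follows (this shows one common momentum formulation; the exact update inside nnfs's SGD may differ):

```python
import numpy as np

def sgd_momentum_step(param, grad, velocity, lr=0.001, momentum=0.9):
    # keep an exponentially decaying "velocity" of past gradients...
    velocity = momentum * velocity - lr * grad
    # ...and move the parameter along that smoothed direction
    return param + velocity, velocity

# if the gradient keeps pointing the same way, the effective step grows
w, v = 0.0, 0.0
for _ in range(3):
    w, v = sgd_momentum_step(w, 1.0, v, lr=0.1, momentum=0.9)
print(w)  # steps of -0.1, -0.19, -0.271 -> w is about -0.561

# batching: one parameter update per slice of the (shuffled) data
X = np.arange(10).reshape(5, 2)
batches = [X[start:start + 2] for start in range(0, X.shape[0], 2)]
print(len(batches))  # 3 mini-batches of at most 2 samples each
```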
Below, we will compare the performance (expressed as loss vs. training steps) of the baseline "default" model shown above against the momentum and batching improvements. The model architecture is kept constant across tests.
def test_model(X_data, one_hot, momentum, batch_size=-1):
    # define model
    list_layers = [Dense(2, 32), Sigmoid(),
                   Dense(32, 9), Sigmoid(),]
    loss = BCE()
    optimizer = SGD(lr=0.001, momentum=momentum)
    model = Sequential(list_layers, loss, optimizer)
    # fit the model to the data
    history = model.fit(X_data, one_hot, 10000, debug_flag=False, batch_size=batch_size)
    return history, model
# store the training history and model for each test
dict_histories = {}
dict_histories['Default'] = (history, model)
dict_histories['Momentum'] = test_model(X_data, one_hot, momentum=0.35, batch_size=-1)
dict_histories['Batching'] = test_model(X_data, one_hot, momentum=0.0, batch_size=450)
dict_histories['Momentum + Batching'] = test_model(X_data, one_hot, momentum=0.75, batch_size=450)
# compare the loss vs epoch plot for each training strategy
for name, (history, model) in dict_histories.items():
    plt.plot(history['loss'].mean(axis=1), label=name)
plt.legend()
plt.ylim([0.075, 0.35])
plt.xlabel("Epoch")
plt.ylabel("Loss")
None
Besides the training loss curves, we also want to inspect the final result: how well does the trained model reproduce the training data? For this, we will compute a scoring metric known as accuracy, which can be simply thought of as the ratio of correctly classified samples.
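Accuracy reduces to a one-liner in NumPy: the fraction of samples whose argmax prediction matches the true class. A sketch, assuming integer class labels rather than the column vectors used in this notebook:

```python
import numpy as np

def accuracy(scores, labels):
    # scores: (n_samples, n_classes) predictions; labels: (n_samples,) ints
    return np.mean(np.argmax(scores, axis=1) == labels)

scores = np.array([[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]])
print(accuracy(scores, np.array([0, 1, 1])))  # 2 of 3 correct -> ~0.667
```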
fig, axarr = plt.subplots(2, 2, figsize=(7,7))
for i, (name, (history, model)) in enumerate(dict_histories.items()):
    # select the correct axis
    ax = axarr.flatten()[i]
    # compute predictions and accuracy
    y_pred = model.forward(X_data)
    accuracy = sum(y_true == np.argmax(y_pred, axis=1).reshape(-1, 1)) / y_true.shape[0]
    accuracy = accuracy.item()
    # graph predictions
    ax.scatter(X_data[:, 0], X_data[:, 1], c=np.argmax(y_pred, axis=1), cmap='coolwarm')
    ax.set_title(f"{name} : {accuracy:.2f}")
    ax.set_xticks([])
    ax.set_yticks([])
    ax.axis('equal')
plt.tight_layout()
Model improvements¶
Up until now, we have been modelling the multiclass classification problem as a set of independent per-class predictions. Although this is not conceptually correct (in these examples, the categories are mutually exclusive; only one is correct), we have shown that it works.
On the other hand, training this model is relatively unstable. We had to use a small learning rate (0.001) to prevent gradients from dying or exploding (try increasing the learning rate!), and even then, the loss/epoch curve is jittery. The issue stems from how the model treats the predictions: since they are independent, raising or lowering one does not affect the others, yet each one affects the loss computation. This is the price we pay for a poor choice of neural network architecture.
Note: The multi-neuron sigmoid output would be conceptually valid for a multilabel classification task, where each sample can belong to multiple labels.
Luckily for us, there are alternatives. The problems highlighted here motivated the development of the softmax activation layer and its corresponding loss function, the categorical cross-entropy (CCE), a multiclass variant of binary cross-entropy. Combined, these two correctly represent a multiclass classification problem, where the probabilities must sum to 1 (and thus, increasing one must decrease the others).
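A minimal NumPy sketch of the pair (mirroring the idea, not necessarily the nnfs implementation; the row-wise max subtraction is a standard numerical-stability trick, and eps guards the logarithm):

```python
import numpy as np

def softmax(z):
    # subtract the row-wise max for numerical stability; each row sums to 1
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def cce(probs, onehot, eps=1e-12):
    # categorical cross-entropy: -log of the probability of the true class
    return -np.mean(np.sum(onehot * np.log(probs + eps), axis=1))

z = np.array([[2.0, 1.0, 0.1]])
p = softmax(z)
print(p.sum())                               # 1.0: a proper distribution
print(cce(p, np.array([[1.0, 0.0, 0.0]])))   # small when the true class dominates
print(cce(p, np.array([[0.0, 0.0, 1.0]])))   # larger when it does not
```

Because the outputs are coupled through the normalizing denominator, increasing the score of one class necessarily lowers the probability of the others, which is exactly the behavior the independent-sigmoid model lacked.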
# define the model
list_layers = [Dense(2, 32), Sigmoid(),
               Dense(32, 9), Softmax(),
               ]
loss = CCE()
optimizer = SGD(0.01, momentum=0.9)
model = Sequential(list_layers, loss, optimizer)
model.summary()
Model layers:
* Dense_0 | Dimensions: 2 x 32 | Parameters: 96
* Sigmoid_1
* Dense_2 | Dimensions: 32 x 9 | Parameters: 297
* Softmax_3
--------------------
Total parameters: 393
# fit the model to the data
history_softmax = model.fit(X_data, one_hot, 5000, batch_size=450, debug_flag=True)
Epoch 1 - Loss: 2.2977707259796505
Epoch 500 - Loss: 1.1347379975163048
Epoch 1000 - Loss: 0.8713382981687708
Epoch 1500 - Loss: 0.7409672716719414
Epoch 2000 - Loss: 0.6571598310857691
Epoch 2500 - Loss: 0.596364809471667
Epoch 3000 - Loss: 0.5493789666165066
Epoch 3500 - Loss: 0.5113401743042799
Epoch 4000 - Loss: 0.48008742446965613
Epoch 4500 - Loss: 0.4532290265608863
Epoch 5000 - Loss: 0.43047442829646254
# visualize loss during training
plt.plot(history_softmax['loss'].mean(axis=1))
plt.xlabel("Epoch")
plt.ylabel("Loss")
None
Notice how, even with a higher learning rate, the loss/epoch curve is smooth and continuous. Also, the slope at the end suggests that we could keep training the model to further improve its performance.
# produce predictions
y_pred = model.forward(X_data)
# graph predictions
plt.scatter(X_data[:, 0], X_data[:, 1], c=np.argmax(y_pred, axis=1), cmap='coolwarm')
accuracy = sum(y_true == np.argmax(y_pred, axis=1).reshape(-1, 1)) / y_true.shape[0]
accuracy = accuracy.item()
plt.title(f"CCE + Softmax : {accuracy:.2f}")
plt.axis('square')
ax = plt.gca()
None
plot_decision_boundary('Decision boundary of nine circles (improved)', s=35, categorical=True)