In this notebook, we will explore examples that illustrate how to implement a neural network for multiclass classification tasks. Some of these approaches are conceptually wrong, but they highlight the functional differences between the models and motivate the need for the softmax activation and the categorical cross-entropy (CCE) loss.
Setup¶
import numpy as np
import matplotlib.pyplot as plt
# set random seed for reproducibility
np.random.seed(42)
from nnfs.layers import Dense
from nnfs.model import Sequential
from nnfs.losses import BCE, MSE, CCE
from nnfs.optimizers import SGD
from nnfs.activations import Sigmoid, Softmax
from nnfs.datasets.data_generators import generate_concentric_circles
from nnfs.utils import shuffle, class_to_onehot
Concentric circles (continuous)¶
Here is a slightly more complex version of the concentric circles dataset from the previous notebook. In this version, there are more than two labels, so we cannot use the same model we used for binary classification.
We will first approach this as a regression task.
Data generation¶
# generate data
X_data, y_true = generate_concentric_circles(500, 3)
# plot data
plt.scatter(X_data[:,0], X_data[:,1], c=y_true)
plt.axis('square')
None
Model specification¶
We could approach this from a regression point of view: treat the labels as if they were an arbitrary continuous quantity we want to predict. We will use sigmoid layers for non-linearity, but the last layer will output a scalar directly from a Dense layer. We will also use mean squared error (MSE) as the loss.
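Treating the task as regression means minimizing the mean squared error between the scalar output and the numeric label. As a reference, here is a minimal NumPy sketch of the MSE computation (a sketch of the standard formula, not necessarily how nnfs implements its MSE class):

```python
import numpy as np

def mse(y_pred, y_true):
    # mean of the squared residuals over all samples
    return np.mean((y_pred - y_true) ** 2)

# (0.25 + 0.25 + 0.0) / 3 -> ~0.1667
print(mse(np.array([0.5, 1.5, 2.0]), np.array([0.0, 1.0, 2.0])))
```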
# define the model
list_layers = [Dense(2, 12), Sigmoid(),
               Dense(12, 2), Sigmoid(),
               Dense(2, 1)]
loss = MSE()
optimizer = SGD(0.01)
model = Sequential(list_layers, loss, optimizer)
model.summary()
Model layers:
* Dense_0 | Dimensions: 2 x 12 | Parameters: 36
* Sigmoid_1
* Dense_2 | Dimensions: 12 x 2 | Parameters: 26
* Sigmoid_3
* Dense_4 | Dimensions: 2 x 1 | Parameters: 3
--------------------
Total parameters: 65
Training¶
# fit the model to the data
history = model.fit(X_data, y_true, 50000, debug_flag=True)
Epoch 1 - Loss: 0.8731996385897649
Epoch 5000 - Loss: 0.6252142270192054
Epoch 10000 - Loss: 0.5752668906491387
Epoch 15000 - Loss: 0.555172372151669
Epoch 20000 - Loss: 0.541803708519793
Epoch 25000 - Loss: 0.4440078600019571
Epoch 30000 - Loss: 0.343741508693696
Epoch 35000 - Loss: 0.07582738508980377
Epoch 40000 - Loss: 0.06537875879583159
Epoch 45000 - Loss: 0.06370496202578878
Epoch 50000 - Loss: 0.06280455843686646
# visualize loss during training
plt.plot(history['loss'])
plt.xlabel("Epoch")
plt.ylabel("Loss")
None
Evaluation¶
As we can see, the model has learned to approximate the category, but since the labels are treated as continuous scalars, the predictions form a continuous gradient between categories.
# produce predictions
y_pred = model.forward(X_data)
# graph predictions
plt.scatter(X_data[:, 0], X_data[:, 1], c=y_pred)
plt.axis('square')
plt.colorbar()
ax = plt.gca()
None
Since the examples in this notebook consist of 2D coordinates, we can easily plot the decision boundary of the model.
def plot_decision_boundary(title, form='square', s=100, discretize=False, categorical=False):
    # create a grid to evaluate the model
    xmin, xmax = ax.get_xlim()
    ymin, ymax = ax.get_ylim()
    xx, yy = np.meshgrid(np.linspace(xmin, xmax, 200),
                         np.linspace(ymin, ymax, 200))
    grid = np.c_[xx.ravel(), yy.ravel()]
    # evaluate the model on the grid
    z = model.forward(grid)
    if categorical:
        z = np.argmax(z, axis=1)
    elif discretize:
        z = z.astype(int)
    z = z.reshape(xx.shape)
    # plot model predictions
    y_pred = model.forward(X_data)
    if categorical:
        y_pred = np.argmax(y_pred, axis=1)
    elif discretize:
        y_pred = y_pred.astype(int)
    plt.scatter(X_data[:,0], X_data[:,1], c=y_pred, s=s, edgecolor='k', cmap='coolwarm')
    # plot decision boundary
    plt.contourf(xx, yy, z, alpha=0.5, cmap='coolwarm')
    plt.axis(form)
    plt.xlabel('x1')
    plt.ylabel('x2')
    plt.title(title)
    plt.show()
To assign a category, we will discretize each predicted value by casting it to an integer (astype(int), which truncates toward zero).
plot_decision_boundary('Concentric circles (continuous)', s=35, discretize=True)
This is clearly not the best way of approaching a multiclass classification task.
Concentric circles (categorical)¶
Next, we will try to make use of our binary classification model, extended to several classes.
Data generation¶
# we will use the same data as the previous example
plt.scatter(X_data[:,0], X_data[:,1], c=y_true)
plt.axis('square')
None
When working with multiclass classification tasks, we typically convert the categorical data to one-hot encoded vectors, as shown below.
# description of one-hot encoded labels
one_hot = class_to_onehot(y_true, 3)
for i in [0, 500, 1000]:
    print(f"Class: {y_true[i]}\t one_hot: {one_hot[i]}")
print(one_hot.shape)
Class: [0.]	 one_hot: [1. 0. 0.]
Class: [1.]	 one_hot: [0. 1. 0.]
Class: [2.]	 one_hot: [0. 0. 1.]
(1500, 3)
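For reference, one-hot encoding can be sketched in a few lines of plain NumPy (a hypothetical to_onehot helper, not the nnfs class_to_onehot implementation itself):

```python
import numpy as np

def to_onehot(labels, n_classes):
    # one row per sample, with a 1 in the column of that sample's class
    labels = np.asarray(labels, dtype=int).ravel()
    onehot = np.zeros((labels.shape[0], n_classes))
    onehot[np.arange(labels.shape[0]), labels] = 1.0
    return onehot

print(to_onehot([0, 1, 2], 3))
# [[1. 0. 0.]
#  [0. 1. 0.]
#  [0. 0. 1.]]
```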
Model specification¶
Instead of a single output neuron like in the previous notebook, we will have three. Conceptually, this means the model produces three independent predictions rather than one categorical one. These predictions are not mutually exclusive: nothing prevents the model from assigning all three labels to the same data point.
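Under this setup, the BCE loss is applied element-wise to each sigmoid output and averaged, as if each class were a separate yes/no question. A minimal NumPy sketch of that idea (a hypothetical multi_output_bce, not nnfs's own BCE class; the clipping is a standard numerical-stability assumption):

```python
import numpy as np

def multi_output_bce(y_pred, y_true, eps=1e-12):
    # element-wise binary cross-entropy, averaged over samples and outputs;
    # each output is treated as an independent Bernoulli prediction
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

# a confident, correct prediction gives a small loss...
print(multi_output_bce(np.array([[0.99, 0.01, 0.01]]), np.array([[1., 0., 0.]])))
# ...but nothing couples the outputs: saying "yes" to every class is possible
print(multi_output_bce(np.array([[0.99, 0.99, 0.99]]), np.array([[1., 0., 0.]])))
```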
# define the model
list_layers = [Dense(2, 16), Sigmoid(),
               Dense(16, 3), Sigmoid(),
               ]
loss = BCE()
optimizer = SGD(0.001)
model = Sequential(list_layers, loss, optimizer)
Training¶
# fit the model to the data
history = model.fit(X_data, one_hot, 25000, debug_flag=True)
Epoch 1 - Loss: 0.6523356617440061
Epoch 2500 - Loss: 0.05850089134887879
Epoch 5000 - Loss: 0.037182944409727595
Epoch 7500 - Loss: 0.02985472171180443
Epoch 10000 - Loss: 0.026134567325523963
Epoch 12500 - Loss: 0.023826411212067713
Epoch 15000 - Loss: 0.02221872740482869
Epoch 17500 - Loss: 0.02100862713661891
Epoch 20000 - Loss: 0.020045883965823897
Epoch 22500 - Loss: 0.019247377678987404
Epoch 25000 - Loss: 0.018561210586293182
# visualize loss during training
plt.plot(history['loss'])
plt.xlabel("Epoch")
plt.ylabel("Loss")
None
Evaluation¶
Since the predictions are three independent probabilities, we will have to process them to generate a final categorical assignment. A common choice is to take the most confident prediction as the categorical prediction. This is performed by the argmax function.
# produce predictions
y_pred = model.forward(X_data)
y_pred
array([[9.99849463e-01, 6.73475071e-04, 4.53036907e-10],
[9.99934434e-01, 3.35999590e-04, 3.57851105e-10],
[9.99943419e-01, 2.97204875e-04, 3.46879529e-10],
...,
[4.38627685e-15, 1.31412144e-04, 9.99428027e-01],
[1.36684797e-13, 8.65530829e-08, 9.99999229e-01],
[5.40676721e-14, 1.96785164e-07, 9.99999911e-01]], shape=(1500, 3))
# extract the most confident categorical prediction
np.argmax(y_pred, axis=1)
array([0, 0, 0, ..., 2, 2, 2], shape=(1500,))
# graph predictions
plt.scatter(X_data[:, 0], X_data[:, 1], c=np.argmax(y_pred, axis=1))
plt.axis('square')
None
plot_decision_boundary('Concentric circles (categorical)', s=35, categorical=True)
This looks much better! Although this naive approach works well for small toy problems like this one, in practice, it is rarely used.
Nine circles of hell¶
In order to explore some of the improvements that we can introduce in the training procedure, as well as in the model architecture, let us first construct a baseline using the previously explained naive approach.
Here, we will generate a dataset of 9 concentric circles, each with their own label (from 0 to 8).
Data generation¶
# generate data
X_data, y_true = generate_concentric_circles(500, 9)
# shuffle data (this will be very important later!)
X_data, y_true = shuffle([X_data, y_true])
print(X_data.shape, y_true.shape)
plt.scatter(X_data[:,0], X_data[:,1], c=y_true, cmap="coolwarm")
plt.axis('square')
None
(4500, 2) (4500, 1)
# description of one-hot encoded labels
one_hot = class_to_onehot(y_true, 9)
for i in range(10):
    print(f"Class: {y_true[i]}\t one_hot: {one_hot[i]}")
print(one_hot.shape)
Class: [5.]	 one_hot: [0. 0. 0. 0. 0. 1. 0. 0. 0.]
Class: [1.]	 one_hot: [0. 1. 0. 0. 0. 0. 0. 0. 0.]
Class: [7.]	 one_hot: [0. 0. 0. 0. 0. 0. 0. 1. 0.]
Class: [2.]	 one_hot: [0. 0. 1. 0. 0. 0. 0. 0. 0.]
Class: [7.]	 one_hot: [0. 0. 0. 0. 0. 0. 0. 1. 0.]
Class: [1.]	 one_hot: [0. 1. 0. 0. 0. 0. 0. 0. 0.]
Class: [6.]	 one_hot: [0. 0. 0. 0. 0. 0. 1. 0. 0.]
Class: [0.]	 one_hot: [1. 0. 0. 0. 0. 0. 0. 0. 0.]
Class: [8.]	 one_hot: [0. 0. 0. 0. 0. 0. 0. 0. 1.]
Class: [8.]	 one_hot: [0. 0. 0. 0. 0. 0. 0. 0. 1.]
(4500, 9)
Model specification¶
The only changes with respect to the previous model are doubling the number of neurons in the hidden layer and using 9 output neurons, in order to generate 9 per-class predictions for each data point.
# define the model
list_layers = [Dense(2, 32), Sigmoid(),
               Dense(32, 9), Sigmoid(),
               ]
loss = BCE()
optimizer = SGD(0.001)
model = Sequential(list_layers, loss, optimizer)
model.summary()
Model layers:
* Dense_0 | Dimensions: 2 x 32 | Parameters: 96
* Sigmoid_1
* Dense_2 | Dimensions: 32 x 9 | Parameters: 297
* Sigmoid_3
--------------------
Total parameters: 393
Training¶
# fit the model to the data
history = model.fit(X_data, one_hot, 10000, debug_flag=True)
Epoch 1 - Loss: 1.8462235301217356
Epoch 1000 - Loss: 0.19281578716736733
Epoch 2000 - Loss: 0.179232032974487
Epoch 3000 - Loss: 0.154472274545469
Epoch 4000 - Loss: 0.1536412117016907
Epoch 5000 - Loss: 0.14053347594180882
Epoch 6000 - Loss: 0.17628894202977632
Epoch 7000 - Loss: 0.13647826608178823
Epoch 8000 - Loss: 0.13931739606063548
Epoch 9000 - Loss: 0.12815559170115878
Epoch 10000 - Loss: 0.13194109693590955
Evaluation¶
# produce predictions
y_pred = model.forward(X_data)
# extract the most confident categorical prediction
np.argmax(y_pred, axis=1)
array([5, 1, 6, ..., 1, 0, 3], shape=(4500,))
# graph predictions
plt.scatter(X_data[:, 0], X_data[:, 1], c=np.argmax(y_pred, axis=1), cmap='coolwarm')
plt.axis('square')
ax = plt.gca()
None
plot_decision_boundary('Decision boundary of nine circles', s=35, categorical=True)
This does not look that bad. Can we do better?
Training improvements¶
Yes, we can. In fact, all the previous examples in the notebooks have been naive, in the sense that they have overlooked two very common improvements in model training: momentum and batching.
In simple terms:
momentum: if, during successive training steps, a given parameter of the network consistently changes in the same direction (either increasing or decreasing), momentum accelerates this change. This can greatly increase learning speed, especially in situations where we need to use a low base learning rate (lr).

batching: instead of computing a single update using the entire dataset, we split the data into batches of N samples and perform one training step per batch. While the gradient computed from each mini-batch is a noisier estimate of the true gradient, the increased number of updates per epoch often compensates for this, resulting in improved training performance. This is where the stochastic in stochastic gradient descent (SGD) comes from.
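In pseudo-NumPy, the two ideas can be sketched as follows (this shows one common momentum formulation; the exact update inside nnfs's SGD may differ):

```python
import numpy as np

def sgd_momentum_step(param, grad, velocity, lr=0.001, momentum=0.9):
    # keep an exponentially decaying "velocity" of past gradients...
    velocity = momentum * velocity - lr * grad
    # ...and move the parameter along that smoothed direction
    return param + velocity, velocity

# if the gradient keeps pointing the same way, the effective step grows
w, v = 0.0, 0.0
for _ in range(3):
    w, v = sgd_momentum_step(w, 1.0, v, lr=0.1, momentum=0.9)
print(w)  # steps of -0.1, -0.19, -0.271 -> w is about -0.561

# batching: one parameter update per slice of the (shuffled) data
X = np.arange(10).reshape(5, 2)
batches = [X[start:start + 2] for start in range(0, X.shape[0], 2)]
print(len(batches))  # 3 mini-batches of at most 2 samples each
```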
Below, we will compare the performance (expressed as loss vs. training steps) of the baseline "default" model shown above against the momentum and batching improvements. The model architecture is kept constant across tests.
def test_model(X_data, one_hot, momentum, batch_size=-1):
    # define model
    list_layers = [Dense(2, 32), Sigmoid(),
                   Dense(32, 9), Sigmoid(),]
    loss = BCE()
    optimizer = SGD(lr=0.001, momentum=momentum)
    model = Sequential(list_layers, loss, optimizer)
    # fit the model to the data
    history = model.fit(X_data, one_hot, 10000, debug_flag=False, batch_size=batch_size)
    return history, model
# store the training history and model for each test
dict_histories = {}
dict_histories['Default'] = (history, model)
dict_histories['Momentum'] = test_model(X_data, one_hot, momentum=0.35, batch_size=-1)
dict_histories['Batching'] = test_model(X_data, one_hot, momentum=0.0, batch_size=450)
dict_histories['Momentum + Batching'] = test_model(X_data, one_hot, momentum=0.75, batch_size=450)
# compare the loss vs epoch plot for each training strategy
for name, (history, model) in dict_histories.items():
    plt.plot(history['loss'].mean(axis=1), label=name)
plt.legend()
plt.ylim([0.075, 0.35])
plt.xlabel("Epoch")
plt.ylabel("Loss")
None
Besides the training loss curves, we also want to inspect the final result: how well does the trained model reproduce the training data? For this, we will compute a scoring metric known as accuracy, which can be simply thought of as the ratio of correctly classified samples.
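Accuracy reduces to a one-liner in NumPy: the fraction of samples whose argmax prediction matches the true class. A sketch, assuming integer class labels rather than the column vectors used in this notebook:

```python
import numpy as np

def accuracy(scores, labels):
    # scores: (n_samples, n_classes) predictions; labels: (n_samples,) ints
    return np.mean(np.argmax(scores, axis=1) == labels)

scores = np.array([[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]])
print(accuracy(scores, np.array([0, 1, 1])))  # 2 of 3 correct -> ~0.667
```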
fig, axarr = plt.subplots(2, 2, figsize=(7,7))
for i, (name, (history, model)) in enumerate(dict_histories.items()):
    # select the correct axis
    ax = axarr.flatten()[i]
    # compute predictions and accuracy
    y_pred = model.forward(X_data)
    accuracy = sum(y_true == np.argmax(y_pred, axis=1).reshape(-1, 1)) / y_true.shape[0]
    accuracy = accuracy.item()
    # graph predictions
    ax.scatter(X_data[:, 0], X_data[:, 1], c=np.argmax(y_pred, axis=1), cmap='coolwarm')
    ax.set_title(f"{name} : {accuracy:.2f}")
    ax.set_xticks([])
    ax.set_yticks([])
    ax.axis('equal')
plt.tight_layout()
Model improvements¶
Up until now, we have been modelling the multiclass classification problem as a set of independent per-class predictions. Although this is not conceptually correct (in these examples, the categories are mutually exclusive; only one is correct), we have shown that it works.
On the other hand, training this model is relatively unstable. We had to use a small learning rate (0.001) to prevent gradients from dying or exploding (try increasing the learning rate!), and even then, the loss/epoch curve is jittery. The issue stems from how the model treats the predictions: since they are independent, raising or lowering one does not affect the others, yet each one affects the loss computation. This is the price we pay for a poor choice of neural network architecture.
Note: The multi-neuron sigmoid output would be conceptually valid for a multilabel classification task, where each sample can belong to multiple labels.
Luckily for us, there are alternatives. The problems highlighted here motivated the development of the softmax activation layer and its corresponding loss function, the categorical cross-entropy (CCE), a multiclass variant of binary cross-entropy. Combined, these two correctly represent a multiclass classification problem, where the probabilities must sum to 1 (and thus, increasing one must decrease the others).
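A minimal NumPy sketch of the pair (mirroring the idea, not necessarily the nnfs implementation; the row-wise max subtraction is a standard numerical-stability trick, and eps guards the logarithm):

```python
import numpy as np

def softmax(z):
    # subtract the row-wise max for numerical stability; each row sums to 1
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def cce(probs, onehot, eps=1e-12):
    # categorical cross-entropy: -log of the probability of the true class
    return -np.mean(np.sum(onehot * np.log(probs + eps), axis=1))

z = np.array([[2.0, 1.0, 0.1]])
p = softmax(z)
print(p.sum())                               # 1.0: a proper distribution
print(cce(p, np.array([[1.0, 0.0, 0.0]])))   # small when the true class dominates
print(cce(p, np.array([[0.0, 0.0, 1.0]])))   # larger when it does not
```

Because the outputs are coupled through the normalizing denominator, increasing the score of one class necessarily lowers the probability of the others, which is exactly the behavior the independent-sigmoid model lacked.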
# define the model
list_layers = [Dense(2, 32), Sigmoid(),
               Dense(32, 9), Softmax(),
               ]
loss = CCE()
optimizer = SGD(0.01, momentum=0.9)
model = Sequential(list_layers, loss, optimizer)
model.summary()
Model layers:
* Dense_0 | Dimensions: 2 x 32 | Parameters: 96
* Sigmoid_1
* Dense_2 | Dimensions: 32 x 9 | Parameters: 297
* Softmax_3
--------------------
Total parameters: 393
# fit the model to the data
history_softmax = model.fit(X_data, one_hot, 5000, batch_size=450, debug_flag=True)
Epoch 1 - Loss: 2.2977707259796505
Epoch 500 - Loss: 1.1347379975163048
Epoch 1000 - Loss: 0.8713382981687708
Epoch 1500 - Loss: 0.7409672716719414
Epoch 2000 - Loss: 0.6571598310857691
Epoch 2500 - Loss: 0.596364809471667
Epoch 3000 - Loss: 0.5493789666165066
Epoch 3500 - Loss: 0.5113401743042799
Epoch 4000 - Loss: 0.48008742446965613
Epoch 4500 - Loss: 0.4532290265608863
Epoch 5000 - Loss: 0.43047442829646254
# visualize loss during training
plt.plot(history_softmax['loss'].mean(axis=1))
plt.xlabel("Epoch")
plt.ylabel("Loss")
None
Notice how, even with a higher learning rate, the loss/epoch curve is smooth and continuous. Also, the slope at the end suggests that we could keep training the model to further improve its performance.
# produce predictions
y_pred = model.forward(X_data)
# graph predictions
plt.scatter(X_data[:, 0], X_data[:, 1], c=np.argmax(y_pred, axis=1), cmap='coolwarm')
accuracy = sum(y_true == np.argmax(y_pred, axis=1).reshape(-1, 1)) / y_true.shape[0]
accuracy = accuracy.item()
plt.title(f"CCE + Softmax : {accuracy:.2f}")
plt.axis('square')
ax = plt.gca()
None
plot_decision_boundary('Decision boundary of nine circles (improved)', s=35, categorical=True)