Fashion MNIST Classification Using Deep Learning


1. Problem definition and dataset

This project aims to develop a deep learning model that can classify grayscale images of fashion items into one of ten predefined categories. The task uses the Fashion MNIST (Modified National Institute of Standards and Technology) dataset. By accurately identifying the category of each fashion item from the visual features in its 28x28 pixel image, the goal is to build a robust multiclass classifier. Framing the problem as multiclass classification informs the approaches for data preprocessing, model architecture, and evaluation, ensuring the model can handle multiple output classes and produce an appropriate probability distribution for each input image.

1.1 Dataset Description

The Fashion MNIST dataset comprises 70,000 grayscale images, each measuring 28x28 pixels, categorized into 10 distinct fashion items (Xiao, Rasul, & Vollgraf, 2017). The dataset is divided into 60,000 images for training and 10,000 images for testing (Shi, Chiu, & Xu, 2023). Each image is labelled with one of the following 10 categories: T-shirt/top, Trouser, Pullover, Dress, Coat, Sandal, Shirt, Sneaker, Bag, and Ankle boot (EITCA Academy, 2023). Understanding the dataset's structure is important for effective model training. The balanced distribution across categories ensures that no single class dominates, allowing accuracy to be a reliable metric for performance evaluation.

1.2 Example Images from Each Class

To clarify the data, we present example images from each of the ten classes in the Fashion MNIST dataset. These examples highlight the visual differences between categories and emphasize the challenge of distinguishing between similar-looking items, such as shirts and T-shirts/tops.

import matplotlib.pyplot as plt
import numpy as np
from tensorflow.keras.datasets import fashion_mnist

# Load the Fashion MNIST dataset
(X_train, y_train), (X_test, y_test) = fashion_mnist.load_data()

# Class names corresponding to labels in the Fashion MNIST dataset
class_names = ['T-shirt/top', 'Trouser', 'Pullover', 'Dress', 'Coat', 
               'Sandal', 'Shirt', 'Sneaker', 'Bag', 'Ankle boot']

# Plot example images from each class in two rows
plt.figure(figsize=(12, 6))
for i in range(10):
    # Find the first occurrence of each class in the training set
    index = np.where(y_train == i)[0][0]
    plt.subplot(2, 5, i + 1)
    plt.imshow(X_train[index], cmap='gray')
    plt.title(class_names[i])
    plt.axis('off')
plt.tight_layout()
plt.show()


2. Success metrics

To evaluate the model’s performance on the Fashion MNIST dataset, I employed multiple metrics, each providing a different perspective on the model's effectiveness.

Accuracy: Accuracy is the proportion of correctly classified instances among the total instances. It is calculated as:

\text{Accuracy} = \frac{\text{Number of Correct Predictions}}{\text{Total Number of Predictions}}

Given that the Fashion MNIST dataset is well-balanced across its 10 classes, accuracy is an effective metric for assessing the model's overall performance in correctly classifying instances across all categories. This metric provides a broad view of the model’s effectiveness in a multiclass classification context (Bishop, 2006).

Precision: Precision quantifies the proportion of true positive predictions out of all positive predictions made by the model. It is calculated using the following formula:

\text{Precision} = \frac{\text{True Positives (TP)}}{\text{True Positives (TP)} + \text{False Positives (FP)}}

Precision is particularly important when the cost of false positives is high, as it reflects the model’s ability to avoid misclassifications into a specific class. In the context of Fashion MNIST, precision helps in understanding how well the model distinguishes between similar fashion items, reducing incorrect assignments (Bishop, 2006).

Recall: Recall, also known as sensitivity, is the proportion of correctly predicted positive instances out of all actual positive instances in a dataset. It is calculated as:

\text{Recall} = \frac{\text{True Positives (TP)}}{\text{True Positives (TP)} + \text{False Negatives (FN)}}

Recall is important for ensuring that the model captures as many true instances as possible, minimizing false negatives. For the Fashion MNIST dataset, recall provides insight into the model's ability to correctly identify instances across all ten classes, especially for categories that may be harder to distinguish (Bishop, 2006).
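To make the three formulas above concrete, the sketch below computes accuracy plus one-vs-rest precision and recall for a single class with NumPy. The labels here are toy values chosen for illustration, not the model's actual predictions:

```python
import numpy as np

def accuracy(y_true, y_pred):
    # Fraction of correct predictions over all predictions
    return np.mean(y_true == y_pred)

def precision_recall_per_class(y_true, y_pred, cls):
    # One-vs-rest counts for a single class
    tp = np.sum((y_pred == cls) & (y_true == cls))
    fp = np.sum((y_pred == cls) & (y_true != cls))
    fn = np.sum((y_pred != cls) & (y_true == cls))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Toy example with 3 classes
y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 0])

print(accuracy(y_true, y_pred))  # 4 of 6 correct ≈ 0.667
for c in range(3):
    print(c, precision_recall_per_class(y_true, y_pred, c))
```

Per-class precision and recall values can then be macro-averaged over all ten Fashion MNIST classes to obtain a single multiclass summary.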

AUC (Area Under the Curve): AUC assesses the model’s capability to differentiate between classes over a range of threshold values. Although AUC is typically used in binary classification, it can be extended to multiclass problems using the one-vs-rest approach. A higher AUC indicates better performance in distinguishing between classes (Khaniki, Golkarieh, & Manthouri, 2024). The AUC is calculated as the area under the receiver operating characteristic (ROC) curve:

\text{AUC} = \int_{0}^{1} \text{TPR}(\text{FPR}) \, d(\text{FPR})

Here, TPR is the true positive rate, and FPR is the false positive rate. AUC provides an overall measure of the model's ability to discriminate between different classes (Fawcett, 2006; Alberti, Grima, & Vella, 2018).
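For intuition, the integral above is equivalent to the probability that a randomly chosen positive instance receives a higher score than a randomly chosen negative one, which can be computed from ranks (the Mann-Whitney U statistic). A small sketch on toy scores (illustrative values only; tied scores are not handled here):

```python
import numpy as np

def binary_auc(y_true, scores):
    # AUC via the rank-sum (Mann-Whitney U) formulation:
    # the probability that a random positive outranks a random negative
    order = np.argsort(scores)
    ranks = np.empty_like(order, dtype=float)
    ranks[order] = np.arange(1, len(scores) + 1)  # rank 1 = lowest score
    n_pos = np.sum(y_true == 1)
    n_neg = len(y_true) - n_pos
    rank_sum = np.sum(ranks[y_true == 1])
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

# Toy binary example: positives are mostly scored higher
y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])
print(binary_auc(y_true, scores))  # 3 of 4 positive/negative pairs ranked correctly -> 0.75
```

For the ten-class case, a one-vs-rest AUC can be obtained by computing this binary AUC for each class against the rest and averaging.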

Using these metrics provides a thorough insight into the model’s overall performance. Accuracy provides a general measure of correctness, while precision and recall offer insights into the model's performance in specific class contexts. AUC offers an extra dimension of assessment by evaluating the model’s capability to differentiate between classes at various threshold levels. Together, these metrics ensure a robust evaluation of the model’s capabilities on the Fashion MNIST dataset.

3. Evaluation protocol

The evaluation protocol involves splitting the training data into 80% for training and 20% for validation, with the test set kept separate. The test set, containing 10,000 images, is not used during the training process and serves as an independent dataset for final evaluation. Hold-out validation is selected because the dataset is large enough to allocate a significant portion for validation, while still having ample data for training. This approach is both computationally efficient and straightforward, which makes it ideal for rapid experimentation and iterative model development (Goodfellow et al., 2016). By using hold-out validation, we ensure that the model’s performance is continually evaluated on unseen data, providing a reliable measure of its generalization capability. This helps prevent overfitting, as the model’s effectiveness is constantly tested against data it has not encountered during training.

4. Data preparation

Effective data preparation is crucial for ensuring that the model can learn efficiently and achieve high performance. In this project, we take the following steps to prepare the Fashion MNIST dataset for input into the neural network.

4.1 Data Loading and Normalization

We utilized TensorFlow's built-in utilities to load the Fashion MNIST dataset, which consists of 70,000 grayscale images. Each pixel's value ranges from 0 to 255. For effective model training, these pixel values were normalized to a range of [0, 1] by dividing each pixel value by 255. This normalization is critical, as it ensures that all input features are on a similar scale, which helps in accelerating the convergence of gradient-based optimizers and improves training stability.

4.2 Reshaping Data for Neural Network Input

Initially, the images have a shape of (28, 28) due to their grayscale nature. To ensure compatibility with our neural network architecture, the images were reshaped to include a channel dimension, resulting in a shape of (28, 28, 1). Although we are employing fully connected layers, this step is important as it aligns the data format with potential future use of convolutional layers, which are common in image processing tasks.

4.3 One-Hot Encoding of Labels

The dataset contains ten distinct classes. To facilitate the use of categorical cross-entropy as the loss function, the integer labels were converted into one-hot encoded vectors. This encoding transforms each label into a binary vector where the index corresponding to the class is marked with a 1, and all other indices are 0. This transformation is essential for the model to output a probability distribution across all classes and for calculating the loss during training.
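For illustration, the mapping from integer labels to one-hot vectors can be expressed in one line of NumPy (the project code below uses Keras's to_categorical, which produces the same result):

```python
import numpy as np

labels = np.array([0, 3, 9])   # integer class labels
one_hot = np.eye(10)[labels]   # row i of the 10x10 identity = one-hot vector for class i

print(one_hot[1])  # label 3 -> [0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]
```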

4.4 Splitting the Data

The dataset was divided into three sets: training, validation, and test sets. The training set consisted of 80% of the original training data and was used for learning the patterns from the data. The validation set, comprising the remaining 20% of the original training data, was used to tune hyperparameters and monitor the model's performance, allowing early detection of overfitting. The test set, containing 10,000 images, remained unseen during the training process and was used to evaluate the final model's performance, providing an unbiased estimate of its generalization capability.

4.5 Code Implementation

The following code performs these operations:

import os

# Set TensorFlow logging to only show errors
# (must be set before TensorFlow is imported to take effect)
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'

import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
from tensorflow.keras.datasets import fashion_mnist
from tensorflow.keras.metrics import Precision, Recall, AUC
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Flatten, Dropout, BatchNormalization
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau, LearningRateScheduler
from tensorflow.keras.optimizers import Adam


# Load the Fashion MNIST dataset
(X_train, y_train), (X_test, y_test) = fashion_mnist.load_data()

#------- Deep Learning with Python, Chollet, 2018, page 302 START ------#
# Normalize the pixel values to be between 0 and 1
X_train = X_train.astype('float32') / 255.0
X_test = X_test.astype('float32') / 255.0
#------- Deep Learning with Python, Chollet, 2018, page 302 END ------#

# Reshape the data to include the channel dimension
X_train = X_train.reshape(-1, 28, 28, 1)
X_test = X_test.reshape(-1, 28, 28, 1)

# One-hot encode the labels
one_hot_y_train = to_categorical(y_train, 10)
one_hot_y_test = to_categorical(y_test, 10)

# Display the new shapes of the data
print("Training data shape after preprocessing:", X_train.shape, one_hot_y_train.shape)
print("Test data shape after preprocessing:", X_test.shape, one_hot_y_test.shape)

# Split the training data into training and validation sets
train_size = int(0.8 * X_train.shape[0])

X_train_final = X_train[:train_size]
one_hot_y_train_final = one_hot_y_train[:train_size]

X_val = X_train[train_size:]
one_hot_y_val = one_hot_y_train[train_size:]

# Display the shapes of the splits
print("Training set shape:", X_train_final.shape, one_hot_y_train_final.shape)
print("Validation set shape:", X_val.shape, one_hot_y_val.shape)
print("Test set shape:", X_test.shape, one_hot_y_test.shape)

5. Gaining statistical power

5.1 Implementation

To establish a meaningful baseline for the Fashion MNIST classification task, we implemented two models: a dumb baseline and a minimal neural network model.

Dumb Baseline: We established the dumb baseline by randomly shuffling the test labels and then measuring the accuracy of a random classifier. This method represents the simplest model, showing the accuracy that can be achieved through random guessing without any learned insights from the data.

Minimal Neural Network Model: We designed a minimal neural network model with a single hidden layer containing 256 neurons to balance model complexity and computational efficiency, providing enough capacity to learn complex patterns without overfitting. The batch size of 256 was chosen to optimize training efficiency and hardware utilization. The hidden layer used the ReLU (Rectified Linear Unit) activation function, which allows the network to capture non-linear relationships by passing only positive values through. For the output layer, we employed the softmax activation function to convert the network's output into a probability distribution over the ten classes, making it ideal for multiclass classification tasks.

5.2 Outcome

Dumb Baseline Accuracy: The random baseline model achieved an accuracy of approximately 10.06%. This result is expected, as the Fashion MNIST dataset consists of 10 classes, and random guessing would statistically yield a 10% accuracy rate.

Minimal Model Accuracy: The minimal neural network model significantly outperformed the dumb baseline, achieving a training accuracy of approximately 88.47%. This strong performance demonstrates the model's ability to learn from the data and effectively classify the images into the correct categories.

5.3 Analysis

The comparison between the dumb baseline and the minimal neural network model underscores the statistical power of the dataset. The dumb baseline, with its 10.06% accuracy, establishes the lowest possible benchmark for classification accuracy. In contrast, the minimal model's 88.47% accuracy clearly demonstrates that the dataset contains sufficient information for effective learning and classification using a neural network.

The substantial improvement from the dumb baseline to the minimal model highlights the feasibility of the classification task. The reasonable accuracy achieved by the minimal model suggests that the Fashion MNIST dataset is informative and that the relationship between input images and their corresponding labels can be effectively learned. The close alignment between training and validation accuracy also indicates that the model generalizes well, even without the application of advanced regularization techniques. This baseline performance provides a critical reference point for further exploration with more complex models.

Dumb baseline model (Deep Learning with Python, Chollet, 2018, p. 83):

# DLWP 3.21 page 83
import copy
y_test_copy = copy.copy(y_test)
np.random.shuffle(y_test_copy)
hits_array = np.array(y_test_copy) == np.array(y_test)
float(np.sum(hits_array)) / len(y_test)

Minimal Neural Network Model:

# Define the minimal baseline model
minimal_model = Sequential([
    Flatten(input_shape=(28, 28, 1)),  # Flatten the input image
    Dense(256, activation='relu'),     # Single hidden layer with 256 units
    Dense(10, activation='softmax')    # Output layer with softmax activation for multiclass classification
])

# Compile the minimal model
minimal_model.compile(optimizer='adam', 
                      loss='categorical_crossentropy', 
                      metrics=['accuracy', 
                       Precision(name='precision'), 
                       Recall(name='recall'), 
                       AUC(name='auc')])

# Train the minimal model on the training split, monitoring the validation split
history = minimal_model.fit(X_train_final, one_hot_y_train_final,
                            epochs=20,
                            batch_size=256,
                            validation_data=(X_val, one_hot_y_val))

# Evaluate the model on the test set
results = minimal_model.evaluate(X_test, one_hot_y_test)
print(f"Test Accuracy: {results[1]:.4f}")

6. Scaling up

The scaling-up process aimed to determine whether increasing model complexity could improve performance on the Fashion MNIST dataset. Despite the risk of overfitting, exploring deeper architectures can reveal insights into the model's capacity to learn complex patterns.

6.1 Training Process

The scaled-up model was trained over 100 epochs with a batch size of 256. Early in the training, the model showed significant improvements, with the training accuracy starting at 72.86% and AUC at 96.06%. By the final epoch, the model achieved a peak accuracy of 99.21%, precision of 99.23%, recall of 99.20%, and an AUC of 99.97%. However, the validation loss began to increase after 9 epochs, signalling the onset of overfitting, while the training loss continued to decrease. This indicates that while the model was able to fit the training data very well, its performance on unseen data was beginning to deteriorate, a classic sign of overfitting.

6.2 Overfitting Identification

To address overfitting, training was halted after 9 epochs; at this point, the validation accuracy was 89.48%, with the validation precision at 91.10% and the validation recall at 88.17%. These metrics indicate that the model effectively identified relevant instances (high precision) and captured most of the true instances (high recall). However, the gap between training and validation metrics highlighted the model's limited ability to generalise beyond the training data.

6.3 Final Model Evaluation

The final scaled-up model was evaluated on the test set, achieving a test accuracy of 88.19%, precision of 90.11%, recall of 86.61%, and an AUC of 99.16%. This suggests several key insights:

Precision: The high precision indicates that the model is quite effective at minimizing false positives. This is crucial in applications where misclassifying an item into an incorrect category could lead to significant downstream errors or costs.

Recall: The recall value, slightly lower than precision, suggests that while the model is good at identifying true instances, there is still room for improvement in capturing all relevant instances. A lower recall might lead to missing out on correctly identifying certain classes, especially those that are visually similar, such as T-shirts and shirts.

AUC: The high AUC indicates that the model is very good at distinguishing between classes overall, even across different decision thresholds. This is a strong indicator of the model’s robustness and its ability to handle diverse scenarios with varying degrees of class separation.

Accuracy: Although accuracy is a general measure of performance, it doesn't capture the nuances of precision and recall. The slight drop in accuracy from the minimal model to the scaled-up model suggests that adding complexity did not necessarily translate to better overall performance.

6.4 Introducing Regularization and Tuning

This analysis underscores an important insight: increased model complexity does not inherently guarantee better performance across all metrics. The minimal model's higher accuracy compared to the scaled-up model highlights its balanced performance. However, the scaled-up model's strong precision and AUC suggest it could be more reliable in correctly classifying items when fewer errors are tolerated.

To enhance the scaled-up model's generalization capabilities, further strategies, such as regularization techniques (e.g., dropout, L2 regularization) and hyperparameter tuning, should be explored. These methods can help reduce overfitting and potentially improve both recall and accuracy, aligning the model's performance across all key metrics. By carefully managing complexity and focusing on regularization, the goal is to create a model that not only captures complex patterns but also maintains robustness and reliability across diverse data sets and classification tasks.

The following code defines the large model architecture and trains it to observe overfitting:

# Model architecture
def build_large_model():
    model = Sequential([
        Flatten(input_shape=(28, 28, 1)),  # Flatten the input image
        Dense(1280, activation='relu'),
        Dense(640, activation='relu'),
        Dense(320, activation='relu'),
        Dense(160, activation='relu'),
        Dense(10, activation='softmax')
    ])
    return model


large_model = build_large_model()
# Compile the model with the default Adam optimizer
large_model.compile(optimizer="adam",
              loss='categorical_crossentropy',
              metrics=['accuracy', 
                       Precision(name='precision'), 
                       Recall(name='recall'), 
                       AUC(name='auc')])

# Train the model
history = large_model.fit(
    X_train_final, one_hot_y_train_final,
    epochs=100,
    batch_size=256,
    validation_data=(X_val, one_hot_y_val),
    shuffle=True
)

Plotting function for the training results:

import matplotlib.pyplot as plt

def plot_history(history):
    # Set up the subplots: 2 rows, 3 columns
    fig, axs = plt.subplots(2, 3, figsize=(18, 12))  # 2 rows, 3 columns

    # Plot accuracy
    epochs = range(1, len(history.history['accuracy']) + 1)
    axs[0, 0].plot(epochs, history.history['accuracy'], color='blue', linestyle='-', linewidth=2, marker='o')
    axs[0, 0].plot(epochs, history.history['val_accuracy'], color='orange', linestyle='-', linewidth=2, marker='o')
    axs[0, 0].set_title('Model Accuracy', fontsize=16)
    axs[0, 0].set_ylabel('Accuracy', fontsize=14)
    axs[0, 0].set_xlabel('Epoch', fontsize=14)
    axs[0, 0].legend(['Training Accuracy', 'Validation Accuracy'], loc='lower right', fontsize=12)
    axs[0, 0].grid(True)
    axs[0, 0].tick_params(axis='both', which='major', labelsize=12)

    # Plot loss
    axs[0, 1].plot(epochs, history.history['loss'], color='red', linestyle='-', linewidth=2, marker='o')
    axs[0, 1].plot(epochs, history.history['val_loss'], color='green', linestyle='-', linewidth=2, marker='o')
    axs[0, 1].set_title('Model Loss', fontsize=16)
    axs[0, 1].set_ylabel('Loss', fontsize=14)
    axs[0, 1].set_xlabel('Epoch', fontsize=14)
    axs[0, 1].legend(['Training Loss', 'Validation Loss'], loc='upper right', fontsize=12)
    axs[0, 1].grid(True)
    axs[0, 1].tick_params(axis='both', which='major', labelsize=12)

    # Plot precision
    axs[0, 2].plot(epochs, history.history['precision'], color='purple', linestyle='-', linewidth=2, marker='o')
    axs[0, 2].plot(epochs, history.history['val_precision'], color='brown', linestyle='-', linewidth=2, marker='o')
    axs[0, 2].set_title('Model Precision', fontsize=16)
    axs[0, 2].set_ylabel('Precision', fontsize=14)
    axs[0, 2].set_xlabel('Epoch', fontsize=14)
    axs[0, 2].legend(['Training Precision', 'Validation Precision'], loc='lower right', fontsize=12)
    axs[0, 2].grid(True)
    axs[0, 2].tick_params(axis='both', which='major', labelsize=12)

    # Plot recall
    axs[1, 0].plot(epochs, history.history['recall'], color='teal', linestyle='-', linewidth=2, marker='o')
    axs[1, 0].plot(epochs, history.history['val_recall'], color='magenta', linestyle='-', linewidth=2, marker='o')
    axs[1, 0].set_title('Model Recall', fontsize=16)
    axs[1, 0].set_ylabel('Recall', fontsize=14)
    axs[1, 0].set_xlabel('Epoch', fontsize=14)
    axs[1, 0].legend(['Training Recall', 'Validation Recall'], loc='lower right', fontsize=12)
    axs[1, 0].grid(True)
    axs[1, 0].tick_params(axis='both', which='major', labelsize=12)

    # Plot AUC
    axs[1, 1].plot(epochs, history.history['auc'], color='cyan', linestyle='-', linewidth=2, marker='o')
    axs[1, 1].plot(epochs, history.history['val_auc'], color='darkorange', linestyle='-', linewidth=2, marker='o')
    axs[1, 1].set_title('Model AUC', fontsize=16)
    axs[1, 1].set_ylabel('AUC', fontsize=14)
    axs[1, 1].set_xlabel('Epoch', fontsize=14)
    axs[1, 1].legend(['Training AUC', 'Validation AUC'], loc='lower right', fontsize=12)
    axs[1, 1].grid(True)
    axs[1, 1].tick_params(axis='both', which='major', labelsize=12)

    # Hide the empty subplot (bottom right)
    axs[1, 2].axis('off')

    # Adjust layout
    plt.tight_layout()
    plt.show()

Plot the large model's performance:

plot_history(history)

Code for Final Large Model Compilation, Training, and Evaluation:

final_large_model = build_large_model()
# Compile the model with the default Adam optimizer
final_large_model.compile(optimizer="adam",
              loss='categorical_crossentropy',
              metrics=['accuracy', 
                       Precision(name='precision'), 
                       Recall(name='recall'), 
                       AUC(name='auc')])

# Train the model
history = final_large_model.fit(
    X_train, one_hot_y_train,
    epochs=9,
    batch_size=256,
    shuffle=True
)

final_large_model_results = final_large_model.evaluate(X_test, one_hot_y_test)
# Print each result with a description
print(f"Test Loss: {final_large_model_results[0]:.4f}")
print(f"Test Accuracy: {final_large_model_results[1]:.4f}")
print(f"Test Precision: {final_large_model_results[2]:.4f}")
print(f"Test Recall: {final_large_model_results[3]:.4f}")
print(f"Test AUC: {final_large_model_results[4]:.4f}")

7. Regularisation and tuning

To address the overfitting observed in the scaled-up model (as discussed in the previous section), we implemented targeted regularization techniques and tuning strategies to enhance the model's generalization capabilities. This approach involved introducing dropout layers, applying L2 regularization, and implementing learning rate scheduling.

7.1 The techniques to optimize the model

Dropout Layers:

We implemented dropout at various points in the network, progressively increasing the dropout rates in the deeper layers. This approach involved randomly deactivating a fraction of the neurons during training to prevent the model from becoming overly reliant on specific neurons. By doing so, dropout encourages the network to learn more distributed representations, which contributes to developing a more robust and generalizable model (Srivastava et al., 2014).
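Conceptually, dropout multiplies the activations by a random binary mask during training; Keras uses "inverted" dropout, rescaling the surviving units by 1/(1 - rate) so that the expected activation is unchanged and no rescaling is needed at inference time. A minimal NumPy sketch of this idea (an illustration, not the Keras implementation itself):

```python
import numpy as np

def dropout(activations, rate, rng):
    # Inverted dropout: zero out a fraction `rate` of the units and
    # rescale the survivors so the expected activation is preserved
    keep_prob = 1.0 - rate
    mask = rng.random(activations.shape) < keep_prob
    return activations * mask / keep_prob

rng = np.random.default_rng(0)
a = np.ones((4, 8))
dropped = dropout(a, rate=0.5, rng=rng)
# Surviving units are scaled up to 2.0; roughly half are zeroed
print(dropped)
```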

L2 Regularization:

To discourage the model from forming overly complex hypotheses, we applied L2 regularization across the dense layers. This technique adds a penalty to the loss function proportional to the square of the weights, which helps in simplifying the model. By minimizing overfitting, L2 regularization enhances the model's ability to generalize to new, unseen data.
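The penalty is simply the regularization coefficient times the sum of squared weights, added to the data loss. A toy sketch of how the total loss is assembled (illustrative weight values; Keras applies this per layer through the kernel_regularizer argument used in the code below):

```python
import numpy as np

def l2_penalty(weights, lam):
    # lam * sum of squared weights, summed over all weight matrices
    return lam * sum(np.sum(w ** 2) for w in weights)

# Toy example: two small weight matrices and a data loss
weights = [np.array([[1.0, -2.0]]), np.array([[0.5]])]
data_loss = 0.8
lam = 0.0001  # matches the l2(0.0001) used in the optimized model
total_loss = data_loss + l2_penalty(weights, lam)
print(total_loss)  # 0.8 + 0.0001 * (1 + 4 + 0.25) = 0.800525
```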

Learning Rate Scheduling:

We utilized a learning rate scheduler to gradually decrease the learning rate as training progressed. Beginning with a higher learning rate allowed for rapid initial progress, while the gradual reduction enabled finer adjustments during later stages of training. We implemented a more aggressive learning rate decay after 20 epochs to improve the model's ability to converge effectively and avoid overshooting during optimization.

By integrating these regularization techniques and tuning strategies, we aimed to reduce overfitting and achieve a closer alignment between training and validation metrics. This combination was expected to improve the model's generalization capabilities, ensuring it performs effectively on unseen data while maintaining stability and convergence throughout the training process.

7.2 Train Process and Outcome Analysis

The training process of the optimized model involved several key stages, each marked by specific changes in performance metrics.

Early Epochs: In the early epochs (1–10), the model demonstrated rapid improvements in both training and validation performance. Accuracy started at 43.04% and increased to 86.55%, while validation accuracy rose from 74.48% to 87.83%. The corresponding precision increased from 72.14% to 90.61%, and recall jumped from 19.41% to 85.42%, showing the model's growing ability to identify relevant items. AUC, which measures the model’s ability to distinguish between classes, improved from 85.38% to 99.13%. The loss values also reflected steady improvement, with training loss decreasing from 1.9286 to 0.7095, and validation loss reducing from 1.0757 to 0.6553. These trends suggest that the initial high learning rate and the regularization techniques effectively boosted the model's learning without overfitting.

Mid Training: During the middle phase (epochs 11–20), the model's performance continued improving but at a slower pace, as the learning rate scheduler began reducing the learning rate. By epoch 20, the model achieved a training accuracy of 90.12% and validation accuracy of 88.83%. Precision and recall reached 92.19% and 87.85%, respectively, while AUC remained high at 99.38%. The loss values also followed a downward trend, with training loss decreasing to 0.5722 and validation loss dropping to 0.5985. This stage showed the model's increased ability to generalize to unseen data, as evidenced by the decreasing validation loss and the stability of precision, recall, and AUC metrics.

Later Epochs: In the later epochs (21–40), the model's accuracy continued to improve, stabilizing at around 92.40% in training and 89.72% in validation. Precision and recall were consistently high, at approximately 93.9% and 91.6%, respectively, while AUC remained robust at 99.6%. Training loss decreased further to 0.4752, with validation loss stabilizing at around 0.5687. The aggressive learning rate decay at this stage, coupled with early stopping mechanisms, helped prevent overfitting while fine-tuning the model’s performance across all metrics.

Epoch 50 (Lowest Validation Loss): At epoch 50, the model achieved an accuracy of 92.84%, with a loss of 0.4757. The validation accuracy was 89.88%, and the validation loss was 0.5678, showing a good balance between performance on the training and validation sets. The validation precision at this epoch was 90.97%, and the validation recall was 89.20%, which confirmed the model’s ability to make accurate predictions while minimizing false positives and negatives. The validation AUC of 99.23% further reflected the model’s high capacity to distinguish between the classes across varying classification thresholds.

Final Epochs (51–80): After the lowest validation loss at epoch 50, the model continued training until epoch 80. While validation accuracy and loss remained relatively stable, showing no significant improvement beyond this point, the early stopping mechanism triggered at epoch 80 to avoid unnecessary overfitting. At epoch 80, the model maintained an accuracy of 93.05% and a validation accuracy of 89.87%. The final metrics, including the validation precision of 90.98% and the validation recall of 89.22%, confirmed the model's reliability in classification. Although the validation loss plateaued, the consistent performance across metrics indicated robust generalization. The learning rate decay and regularization techniques played a key role in achieving this balance between accuracy and loss stabilization.

Outcome Analysis

The use of dropout layers, L2 regularization, and learning rate scheduling effectively mitigated the overfitting issue observed in earlier models. The validation loss decreased through most of training, and the validation accuracy closely tracked the training accuracy, with only a small gap (around 1-2%) indicating well-generalized learning. The high AUC values (above 99%) demonstrated that the model was effective at distinguishing between the classes. Despite the improvements, a slight increase in validation loss was observed after epoch 33, but the model remained robust with no signs of dramatic overfitting. The precision and recall metrics also showed a balanced trade-off, indicating that the model was not biased towards false positives or negatives. The regularization techniques combined with learning rate scheduling helped ensure the model not only converged to an optimal solution but also generalized effectively to new, unseen data, achieving strong performance across multiple metrics.

Code for Optimized Model Architecture, Compilation, and Training with Regularization and Learning Rate Scheduling

# Imports assumed from earlier sections, repeated here so the listing is self-contained
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Flatten, Dense, Dropout
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.metrics import Precision, Recall, AUC
from tensorflow.keras.callbacks import LearningRateScheduler, EarlyStopping, ReduceLROnPlateau

def build_optimized_model():
    model = Sequential([
        Flatten(input_shape=(28, 28, 1)),
        Dense(2048, activation='relu', kernel_regularizer=tf.keras.regularizers.l2(0.0001)),
        Dropout(0.2),
        Dense(1024, activation='relu', kernel_regularizer=tf.keras.regularizers.l2(0.0001)),
        Dropout(0.3),
        Dense(512, activation='relu', kernel_regularizer=tf.keras.regularizers.l2(0.0001)),
        Dropout(0.3),
        Dense(256, activation='relu', kernel_regularizer=tf.keras.regularizers.l2(0.00005)),
        Dropout(0.4),
        Dense(128, activation='relu', kernel_regularizer=tf.keras.regularizers.l2(0.00005)),
        Dropout(0.5),
        Dense(10, activation='softmax')
    ])
    return model
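As a sanity check on the scale of this architecture, the trainable parameter count of the dense stack can be computed by hand: each Dense layer contributes inputs × outputs weights plus outputs biases. The sketch below assumes the layer widths defined in `build_optimized_model` above.

```python
# Layer widths: flattened 28x28 input, the five hidden Dense layers, and the output
layer_widths = [784, 2048, 1024, 512, 256, 128, 10]

# Each Dense layer has n_in * n_out weights plus n_out biases
total_params = sum(n_in * n_out + n_out
                   for n_in, n_out in zip(layer_widths, layer_widths[1:]))

print(total_params)  # → 4396170 (about 4.4M trainable parameters)
```

The large first layer (784 × 2048) accounts for over a third of these parameters, which is part of why the stronger regularization in this model matters.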

# Compile the model with Adam optimizer
optimized_model = build_optimized_model()
optimized_model.compile(optimizer=Adam(learning_rate=0.00005),
                        loss='categorical_crossentropy',
                        metrics=['accuracy', 
                                 Precision(name='precision'), 
                                 Recall(name='recall'), 
                                 AUC(name='auc')])

# Learning Rate Schedule - More aggressive decay
def scheduler(epoch, lr):
    if epoch < 20:
        return lr
    else:
        return float(lr * 0.85)  # More aggressive decay

lr_scheduler = LearningRateScheduler(scheduler)

# Set up early stopping, learning rate reduction, and learning rate scheduler
early_stopping = EarlyStopping(monitor='val_loss', patience=30, restore_best_weights=True)
reduce_lr = ReduceLROnPlateau(monitor='val_loss', factor=0.2, patience=10, min_lr=1e-6)


# Train the model
history = optimized_model.fit(X_train_final, one_hot_y_train_final,
                    epochs=100,
                    batch_size=256,
                    validation_data=(X_val, one_hot_y_val),
                    callbacks=[early_stopping, reduce_lr, lr_scheduler],
                    shuffle=True,
                    verbose=2)
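Ignoring any interaction with ReduceLROnPlateau, the effective learning rate produced by the scheduler above can be traced offline. This standalone sketch replays the same decay rule, starting from the compiled rate of 0.00005:

```python
# Replay the scheduler's decay rule: hold the rate for the first 20 epochs,
# then multiply by 0.85 at the start of every subsequent epoch.
def scheduler(epoch, lr):
    if epoch < 20:
        return lr
    return float(lr * 0.85)

lr = 5e-5  # initial Adam learning rate from the compile step
trace = []
for epoch in range(60):
    lr = scheduler(epoch, lr)  # Keras passes the current rate each epoch
    trace.append(lr)

print(trace[19])  # → 5e-05 (unchanged through epoch 19)
print(trace[20])  # → 4.25e-05 (first decay step)
```

By around epoch 50 the rate has decayed below 1e-6, ReduceLROnPlateau's `min_lr` floor, suggesting that in this setup the scheduler, rather than the plateau callback, governs the late-training learning rate.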

Plot the optimized model performance

plot_history(history)


8. Evaluation

In this final step, we rigorously evaluated the optimized model on the unseen test set to measure its generalization capability and overall performance. Our evaluation aimed to confirm whether the improvements made through regularization and tuning effectively enhanced the model’s ability to generalize to new data.

8.1 Implementation

To enhance model performance, dropout, L2 regularization, and learning rate scheduling were incorporated. The model was first trained with a held-out validation set for up to 100 epochs; early stopping halted training at epoch 80 after the validation loss plateaued, with the lowest validation loss of 0.5678 reached at epoch 50. That epoch was therefore selected as the optimal training length, and the final model was retrained on the combined training and validation data for 50 epochs to prevent overfitting and unnecessary training. The retrained model was then rigorously tested on an unseen test set, ensuring an accurate estimate of real-world performance.

8.2 Outcome

The model achieved a test accuracy of 90.66%, closely aligning with the validation accuracy observed during training. This consistency indicates that the model successfully learned to generalize beyond the training data. The evaluation also yielded a test precision of 91.24%, a recall of 90.20%, and an AUC of 98.88%. These metrics collectively suggest that the model is both accurate and reliable across multiple dimensions.

Precision (91.24%): The high precision indicates that when the model assigns an item to a class, that assignment is usually correct. The model made relatively few false positive predictions, which is crucial in scenarios where incorrect classifications carry significant consequences.

Recall (90.20%): The high recall shows that the model can successfully identify a large proportion of true positives, capturing most of the relevant items across all classes. This is especially important in applications where missing a relevant item (false negatives) could be problematic.

AUC (98.88%): The high AUC score further demonstrates the model's strong ability to distinguish between different classes. This metric, which is indicative of the model's performance across various classification thresholds, suggests that the model is robust and capable of handling diverse scenarios effectively.
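The reported precision and recall come from Keras's thresholded metrics, which count a class as predicted only when its probability exceeds 0.5; this is why precision and recall can differ even on single-label data (a sample with no probability above 0.5 hurts recall but not precision). The toy NumPy sketch below uses hypothetical probabilities, not the actual model outputs:

```python
import numpy as np

# Hypothetical one-hot labels and softmax outputs for 3 samples, 3 classes
y_true = np.array([[1, 0, 0],
                   [0, 1, 0],
                   [0, 0, 1]], dtype=float)
y_prob = np.array([[0.9, 0.05, 0.05],   # confident, correct
                   [0.4, 0.4, 0.2],     # nothing above 0.5: a miss, but no false positive
                   [0.1, 0.6, 0.3]])    # confident, wrong

pred = y_prob > 0.5                       # thresholded predictions, as in keras.metrics
tp = np.sum(pred & (y_true == 1))         # 1
fp = np.sum(pred & (y_true == 0))         # 1
fn = np.sum(~pred & (y_true == 1))        # 2

precision = tp / (tp + fp)                # 0.5
recall = tp / (tp + fn)                   # ~0.333
```

Here precision exceeds recall, mirroring the pattern in the test results above, because low-confidence samples generate misses without generating false positives.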

Despite these positive outcomes, the test loss was recorded at 0.5277, which is higher than the loss observed in the earlier scaled-up model from Section 6. This increase in test loss, despite achieving a high accuracy, can be attributed to the application of regularization techniques such as L2 regularization, which penalizes larger weights. While this results in higher loss values due to the added penalty terms, it ultimately improves the model’s generalization by preventing overfitting.
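To make the loss inflation concrete, the minimal sketch below (with hypothetical numbers) shows how Keras's `l2(λ)` regularizer adds a λ·Σw² penalty per layer on top of the cross-entropy objective, so the reported loss rises even when classification quality does not degrade:

```python
import numpy as np

# Hypothetical data-fit term (cross-entropy on a batch)
cross_entropy = 0.30

# Hypothetical weight matrix the size of the first Dense layer (784 -> 2048)
weights = np.random.default_rng(0).normal(scale=0.05, size=(784, 2048))

l2_lambda = 0.0001                         # as in the first Dense layer above
penalty = l2_lambda * np.sum(weights ** 2)  # Keras L2: lambda * sum of squared weights

total_loss = cross_entropy + penalty        # the value the training loop reports
```

Even with small weights, a layer this wide contributes a noticeable penalty, which is consistent with the regularized model reporting a higher loss than the unregularized one at similar accuracy.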

8.3 Analysis

The alignment between validation and test metrics confirms that the regularization strategies, including dropout and L2 regularization, were effective in preventing overfitting. The slightly higher test loss, when compared with the robust precision, recall, and AUC scores, underscores the delicate balance between overfitting and maintaining model accuracy.

The high precision and recall metrics suggest that the model performs well not only in correctly identifying classes but also in capturing a broad range of relevant items without excessive false positives or false negatives. The AUC score supports this, indicating that the model can reliably distinguish between different classes. This combination of high precision, recall, and AUC demonstrates that the model is well-suited for real-world applications, where both accuracy and reliability across varying conditions are critical.

Overall, these results confirm that the combination of regularization and tuning strategies has led to a robust, high-performing model capable of generalizing well to new data. This evaluation supports the use of these techniques in developing deep learning models for complex multiclass classification tasks.

Final Optimized Model Training and Evaluation

final_optimized_model = build_optimized_model()

final_optimized_model.compile(optimizer=Adam(learning_rate=0.0001),
                        loss='categorical_crossentropy',
                        metrics=['accuracy', 
                                 Precision(name='precision'), 
                                 Recall(name='recall'), 
                                 AUC(name='auc')])


# Train the model on the combined training and validation data for the 50
# epochs identified above. The val_loss-based callbacks (early stopping,
# ReduceLROnPlateau) are omitted because no validation data is passed here.
history = final_optimized_model.fit(X_train, one_hot_y_train,
                              epochs=50,
                              batch_size=256,
                              callbacks=[lr_scheduler],
                              shuffle=True,
                              verbose=2)

# Evaluate the model on the test set
final_optimized_model_results = final_optimized_model.evaluate(X_test, one_hot_y_test, verbose=2)
# Print each result with a description
print(f"Test Loss: {final_optimized_model_results[0]:.4f}")
print(f"Test Accuracy: {final_optimized_model_results[1]:.4f}")
print(f"Test Precision: {final_optimized_model_results[2]:.4f}")
print(f"Test Recall: {final_optimized_model_results[3]:.4f}")
print(f"Test AUC: {final_optimized_model_results[4]:.4f}")

9. Conclusion

9.1 Summary of Results

This project successfully demonstrated the use of deep learning techniques to classify the Fashion MNIST dataset. The final model achieved a notable test accuracy of 90.66%, showcasing the ability to generalize effectively to unseen data. A systematic approach, involving baseline model establishment, architecture scaling, and regularization, proved essential in developing a robust classification model.

9.2 Addressing Challenges and Limitations

The main challenges were overfitting during scaling and distinguishing between visually similar categories. These were addressed through dropout, L2 regularization, and learning rate scheduling. However, exploring more sophisticated techniques like convolutional neural networks (CNNs) and data augmentation could further enhance the model's ability to capture subtle differences, thereby improving accuracy.
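As a sketch of the CNN direction suggested above (an illustrative starting point, not part of the evaluated model), a minimal convolutional architecture for the 28x28 grayscale inputs might look like this:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, Dropout

def build_cnn_sketch():
    """Hypothetical small CNN for Fashion MNIST; untrained and untuned."""
    return Sequential([
        # Convolutions learn local spatial features that dense layers cannot
        Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)),
        MaxPooling2D((2, 2)),
        Conv2D(64, (3, 3), activation='relu'),
        MaxPooling2D((2, 2)),
        Flatten(),
        Dense(128, activation='relu'),
        Dropout(0.5),
        Dense(10, activation='softmax'),
    ])
```

Such a model typically uses far fewer parameters than the dense stack in Section 7 while capturing the local edge and texture patterns that help separate visually similar classes such as shirts and T-shirts/tops.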

9.3 Future Directions

While the current model demonstrated strong performance, future work should focus on refining the model architecture to handle more complex and nuanced datasets. Implementing CNNs could significantly enhance the model's ability to recognize intricate patterns and differences between similar classes. Additionally, exploring techniques like data augmentation and ensemble learning could further boost the model's robustness and accuracy, making it more applicable to real-world scenarios where data variability is higher.

9.4 Final Remarks

This project highlights the importance of a systematic approach to model development in deep learning, emphasizing the need to carefully balance complexity and generalization. The successful classification of the Fashion MNIST dataset using the techniques employed here demonstrates the effectiveness of these methods. The insights gained from this project provide a strong foundation for further exploration and refinement, with potential applications extending beyond the fashion industry to other image classification tasks.

References

  1. Xiao, H., Rasul, K., & Vollgraf, R. (2017). Fashion-MNIST: A novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747.

  2. Shi, C., Chiu, A. K., & Xu, H. (2023). Evaluating designs for hyperparameter tuning in deep neural networks. The New England Journal of Statistics in Data Science, 1, 334–341. doi.org/10.51387/23…

  3. EITCA Academy. (2023, August 2). What is the difference between the Fashion-MNIST dataset and the classic MNIST dataset? Artificial Intelligence, EITC/AI/GCML Google Cloud Machine Learning. eitca.org/artificial-…

  4. Bishop, C. M. (2006). Pattern recognition and machine learning. Springer.

  5. Khaniki, M. A. L., Golkarieh, A., & Manthouri, M. (2024). Brain tumor classification using Vision Transformer with selective cross-attention mechanism and feature calibration. arXiv. arxiv.org/pdf/2406.17…

  6. Fawcett, T. (2006). An introduction to ROC analysis. Pattern Recognition Letters, 27(8), 861-874. doi.org/10.1016/j.p…

  7. Alberti, G., Grima, R., & Vella, N. C. (2018). The use of geographic information system and 1860s cadastral data to model agricultural suitability before heavy mechanization: A case study from Malta. PLOS ONE, 13(2), e0192039. doi.org/10.1371/jou…

  8. Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep learning. MIT Press.

  9. Chollet, F. (2018). Deep learning with Python (1st ed., p. 83). Manning Publications.

  10. Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. (2014). Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(56), 1929-1958.