Deep Learning for Handwritten Digit Recognition: A PyTorch Approach

12 min read · Jan 30, 2024

Handwritten digit recognition stands as a crucial task within computer vision, finding utility across various domains such as optical character recognition (OCR) and automated form processing. In this piece, we delve into constructing and training deep learning models for recognizing handwritten digits, employing the well-known MNIST dataset. We will cover the development of basic linear and non-linear models alongside a convolutional neural network (CNN) model, all implemented using PyTorch.

Why PyTorch?

PyTorch has emerged as a leading deep learning framework due to its flexibility, ease of use, and dynamic computation graph. Unlike frameworks that build a static computation graph before any data flows through it, PyTorch builds the graph on the fly as the code runs, so models can be written and debugged like ordinary Python programs.

One of the key advantages of PyTorch is its Pythonic syntax, which makes it easy to write and understand code. This makes PyTorch a preferred choice for researchers and practitioners alike, enabling rapid prototyping and experimentation.

PyTorch also provides seamless integration with popular Python libraries such as NumPy, making it easy to work with multidimensional arrays and perform various data manipulations.

Moreover, PyTorch offers extensive support for GPU acceleration, allowing for efficient training of deep neural networks on powerful hardware. Its automatic differentiation capabilities simplify the process of computing gradients, enabling faster experimentation and model iteration.
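To make the NumPy integration and automatic differentiation mentioned above concrete, here is a minimal sketch. The array values and variable names are illustrative assumptions and are not part of the MNIST pipeline built later.

```python
import numpy as np
import torch

# NumPy interop: from_numpy wraps the array without copying it
array = np.arange(6, dtype=np.float32).reshape(2, 3)
tensor = torch.from_numpy(array)
back_to_numpy = tensor.numpy()

# Automatic differentiation: y = sum(x^2), so dy/dx = 2x
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = (x ** 2).sum()
y.backward()  # fills x.grad with the gradient of y with respect to x

print(tensor.shape)  # torch.Size([2, 3])
print(x.grad)        # tensor([2., 4., 6.])
```

The same backward() call is what later computes the gradients of the loss with respect to every weight in the network.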

Understanding the MNIST Dataset:

The MNIST dataset is a benchmark dataset for image classification tasks, consisting of 28x28 grayscale images of handwritten digits (0–9). With 60,000 training images and 10,000 test images, MNIST provides a standardized dataset for evaluating the performance of machine learning models in handwritten digit recognition tasks.

Preparing the Data for Training the Model

Before feeding the data to any machine learning model, we have to convert the data into Tensors. In PyTorch, a tensor is a multi-dimensional array used to store and manipulate data efficiently. Tensors are similar to NumPy arrays but come with additional features optimized for deep learning tasks, such as GPU acceleration and automatic differentiation. Let’s import the MNIST dataset using torchvision’s datasets library.

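import torch
from torch import nn
import torchvision
from torchvision import datasets
from torchvision import transforms
from torch.utils.data import DataLoader
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Use the GPU when one is available; models and batches are moved to this device later
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")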

Don’t worry about the imports we haven’t discussed yet; we will cover each one when it is actually used. For now, the important one is torchvision, PyTorch’s computer vision library. We have also imported ‘transforms’ from torchvision, a module that applies transformations to the data, such as converting images into tensors, resizing, cropping, normalizing, and augmenting with techniques like rotation and flipping. These transformations prepare the raw data for training. Since the MNIST dataset is already grayscale, we only apply ToTensor(), which converts the raw PIL images into float tensors with pixel values scaled to the range [0, 1].

transform = transforms.Compose([transforms.ToTensor()])
train_data = datasets.MNIST(root="data", train= True, transform= transform, target_transform= None, download= True)
test_data = datasets.MNIST(root="data", train= False, transform= transform, target_transform= None, download= True)

Inspecting the shape of an image shows that it is laid out as [channels, height, width], which is the layout PyTorch uses by default. In contrast, a PIL image converted to a NumPy array is laid out as [height, width, channels].

image, label = train_data[0]
image.shape  # torch.Size([1, 28, 28])
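The difference between the two layouts can be sketched on a stand-in tensor. The random tensor below is an assumption standing in for one MNIST sample; permute and squeeze are the standard tensor methods for reordering and dropping dimensions.

```python
import torch

# Stand-in for one MNIST sample in PyTorch's layout: [channels, height, width]
image_chw = torch.rand(1, 28, 28)

# Move the channel dimension last to get [height, width, channels]
image_hwc = image_chw.permute(1, 2, 0)

# Drop the size-1 channel dimension to get a plain [height, width] image
image_hw = image_chw.squeeze(0)

print(image_chw.shape)  # torch.Size([1, 28, 28])
print(image_hwc.shape)  # torch.Size([28, 28, 1])
print(image_hw.shape)   # torch.Size([28, 28])
```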

We can also use the classes attribute to get the types of classes present in the MNIST dataset.

class_names = train_data.classes
class_names

Visualizing the MNIST Dataset

‘torch.randint()’ returns a tensor of random integers drawn from the range [low, high). Below, the range runs from 0 to the length of the training data, which is 60,000, so every draw is a valid dataset index.

Matplotlib expects a grayscale image as [height, width] (or a color image as [height, width, channels]), not PyTorch’s [channels, height, width]. We therefore remove the single channel dimension with squeeze(), which converts the tensor from [1, 28, 28] to [28, 28] so that matplotlib can display the image correctly.

figure = plt.figure(figsize=(10,7))
rows,cols = 4,4
for i in range(1, rows*cols+1):
  random_index = torch.randint(0, len(train_data), size=[1]).item()
  image, label = train_data[random_index]
  figure.add_subplot(rows,cols,i)
  plt.imshow(image.squeeze())
  plt.title(class_names[label])
  plt.axis(False)

Preparing the Data into Batches

DataLoader in PyTorch streamlines dataset handling for training and evaluation by providing batching, shuffling, parallel loading through worker processes, and iteration over the dataset. It is imported from ‘torch.utils.data’.

Efficient Batch Processing: DataLoader divides the dataset into batches, enabling models to process multiple samples simultaneously during training. Batching improves computational efficiency, particularly on hardware accelerators like GPUs, leading to faster training times.
Shuffling and Randomization: DataLoader supports shuffling the dataset, so the model does not learn patterns based on the order of the samples. This randomization improves the robustness of the model by introducing variability during training.
Data Transformation: The transforms themselves (normalization, resizing, augmentation) are attached to the Dataset, not the DataLoader. They are applied to each sample on the fly as the DataLoader fetches it, so preprocessing and augmentation happen inside the data loading pipeline.

BATCH_SIZE = 32
train_data_loader = DataLoader(dataset=train_data, batch_size=BATCH_SIZE, shuffle=True)
test_data_loader = DataLoader(dataset=test_data, batch_size=BATCH_SIZE, shuffle=False)

len(train_data_loader), len(test_data_loader)
# (1875, 313)

This has split the data into batches: 1875 batches of training images and 313 batches of test images. In the training loop covered in a later section, the model therefore processes one batch of 32 images at a time during training and evaluation, which improves efficiency and memory usage.
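The batch counts follow directly from the dataset sizes. A small sketch of the arithmetic, assuming the DataLoader keeps the final, smaller batch (its default behavior with drop_last=False):

```python
import math

BATCH_SIZE = 32
num_train_images = 60_000
num_test_images = 10_000

# The last partial batch is kept, so the batch count is rounded up
train_batches = math.ceil(num_train_images / BATCH_SIZE)
test_batches = math.ceil(num_test_images / BATCH_SIZE)

# Size of the final test batch, which is smaller than BATCH_SIZE
last_test_batch_size = num_test_images - (test_batches - 1) * BATCH_SIZE

print(train_batches, test_batches, last_test_batch_size)  # 1875 313 16
```

The training set divides evenly into batches of 32, while the last test batch holds only 16 images.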

Building a Simple Linear Model

PyTorch gives us a Pythonic, object-oriented way to build models. The code below is a simple feed-forward network. Let’s break it down to understand it better.

First, our class inherits from nn.Module, which gives it access to everything ‘nn.Module’ provides. The constructor takes three parameters, input_shape, output_shape and hidden_units, all of which are used to build the model. nn is imported from the ‘torch’ library.

We have defined one attribute called ‘layer_stacked’, which is a Sequential container, and the first layer inside it is a Flatten layer. The Flatten layer reshapes the input tensor by collapsing all dimensions except the batch dimension into a single dimension, so each [1, 28, 28] image becomes a vector of 784 values. The output of the Flatten layer is the input to the first Linear layer.

The ‘nn.Linear’ layer expects two parameters: the size of the input tensor and the size of the output tensor. In our model, the input size is determined by the output of the Flatten layer, which flattens each input image into a 1D tensor. The output size of the first nn.Linear layer is ‘hidden_units’, which means our model has a single hidden layer with ‘hidden_units’ neurons. This layer projects the 784 flattened input features into a space of ‘hidden_units’ dimensions (only 4 here, so a much smaller space than the input). The last Linear layer can also be called the ‘classifier layer’ because it maps the hidden features to one score per class.

The forward method in PyTorch defines the computation that happens when input data is passed through the model to produce an output prediction. It specifies how the input flows through the layers of the model. In short, the tensor ‘x’ passed to the forward method goes through every layer in the ‘Sequential’ container, in order.

class MNISTV0(nn.Module):
  def __init__(self, input_shape:int, output_shape:int, hidden_units:int):
    super().__init__()
    self.layer_stacked = nn.Sequential(
        nn.Flatten(),
        nn.Linear(in_features=input_shape, out_features=hidden_units),
        nn.Linear(in_features=hidden_units, out_features=output_shape)
    )

  # Forward Pass
  def forward(self, x:torch.Tensor):
    return self.layer_stacked(x)

# Creating the Instance of the baseline model
model_v0 = MNISTV0(input_shape=28*28, output_shape=len(class_names), hidden_units=4).to(device)
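To see how the shapes change from layer to layer, the same three layers can be applied by hand to a fake batch. This is only an illustrative sketch: the random batch and the standalone layer variables are assumptions, not part of the model above.

```python
import torch
from torch import nn

# A fake batch of 32 grayscale 28x28 images: [batch, channels, height, width]
batch = torch.rand(32, 1, 28, 28)

flatten = nn.Flatten()
hidden = nn.Linear(in_features=28 * 28, out_features=4)
classifier = nn.Linear(in_features=4, out_features=10)

flat = flatten(batch)            # every dimension except the batch is collapsed
hidden_out = hidden(flat)        # 784 features projected down to 4 hidden units
logits = classifier(hidden_out)  # one raw score per digit class

print(flat.shape)        # torch.Size([32, 784])
print(hidden_out.shape)  # torch.Size([32, 4])
print(logits.shape)      # torch.Size([32, 10])
```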

Since the dataset has multiple classes, cross-entropy is the natural choice for our model’s loss function. For the optimizer we use Adam with its default learning rate of 0.001.

model_loss = nn.CrossEntropyLoss()
model_optimizer = torch.optim.Adam(params = model_v0.parameters())
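It helps to know what nn.CrossEntropyLoss expects: raw logits and integer class indices, with no softmax applied beforehand. The sketch below, on made-up logits and labels, checks the built-in loss against a manual computation (log-softmax followed by the mean negative log-probability of the true class).

```python
import torch
from torch import nn
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(4, 10)          # raw model outputs: 4 samples, 10 classes
labels = torch.tensor([3, 0, 7, 9])  # integer class indices, one per sample

# Built-in loss: takes the logits directly
loss = nn.CrossEntropyLoss()(logits, labels)

# Manual version: log-softmax, then pick the log-probability of the true class
log_probs = F.log_softmax(logits, dim=1)
manual_loss = -log_probs[torch.arange(4), labels].mean()

print(torch.allclose(loss, manual_loss))  # True
```

This is why the models in this article end with a plain Linear layer and no softmax.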

It’s now time to build the training and testing steps for our linear model. We have wrapped each in a function so that the same training and testing code can be reused for different models. The training function takes the parameters ‘model’, ‘no_of_epochs’, ‘data_loader’, ‘model_loss’, ‘model_acc’, ‘model_optimizer’ and ‘device’.

In PyTorch, training a model essentially follows these five steps:

(i) Forward Pass: Input data is passed through the model to obtain the model’s predictions or logits.

(ii) Calculate the Loss: The predicted logits are compared with the ground truth labels using a loss function, which quantifies the difference between the predicted output and the true target.

(iii) Optimizer Zero Grad: PyTorch accumulates gradients across calls to backward(), so before backpropagation it’s essential to zero out the gradients left over from the previous iteration.

(iv) Backpropagation: The gradients of the loss with respect to the model parameters (weights and biases) are computed by propagating the loss backward through the network. This step only computes the gradients; it does not change the parameters.

(v) Optimizer Step: The optimizer (e.g., SGD, Adam, etc.) uses the computed gradients to update the model’s parameters according to its optimization algorithm. This step involves applying the optimizer’s update rule to adjust the parameters in the direction that minimizes the loss.
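The five steps can be sketched on a toy problem before applying them to MNIST. Everything below (the random data, the single Linear layer, the learning rate of 0.01 and the 50 iterations) is an assumption chosen only to keep the example small.

```python
import torch
from torch import nn

torch.manual_seed(0)
X = torch.randn(64, 8)          # 64 toy samples with 8 features each
y = torch.randint(0, 3, (64,))  # 3 toy classes

model = nn.Linear(8, 3)
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

losses = []
for step in range(50):
    logits = model(X)          # (i)   forward pass
    loss = loss_fn(logits, y)  # (ii)  calculate the loss
    optimizer.zero_grad()      # (iii) clear gradients from the previous step
    loss.backward()            # (iv)  compute gradients via backpropagation
    optimizer.step()           # (v)   update the parameters
    losses.append(loss.item())

print(f"first loss {losses[0]:.4f}, last loss {losses[-1]:.4f}")
```

The loss at the end should be lower than at the start, which is the signal that the five steps are wired together correctly.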

from tqdm.auto import tqdm

def train_step(model: torch.nn.Module, no_of_epochs: int, data_loader: torch.utils.data.DataLoader, model_loss: torch.nn.Module, model_acc, model_optimizer: torch.optim.Optimizer, device: torch.device = device):
  """
    Training Step for the Neural Model
  """
  for epoch in tqdm(range(no_of_epochs)):
    print(f"Epoch {epoch} ------------------------- ")
    # Training Mode:
    model.train()
    train_loss, train_acc = 0, 0
    for batch, (X, y) in enumerate(data_loader):
      X, y = X.to(device), y.to(device)
      # Forward Pass
      y_logits = model(X)

      # Calculate the Loss
      loss = model_loss(y_logits, y)
      # .item() stores a plain float, so the autograd graph of each batch is not kept alive
      train_loss += loss.item()

      # Calculate the training accuracy
      train_acc += model_acc(y, y_logits.argmax(dim=1))
      # Optimizer zero grad
      model_optimizer.zero_grad()

      # Loss Backward
      loss.backward()

      # Optimizer step
      model_optimizer.step()

      if batch % 400 == 0:
        print(f"Looked Through {batch * len(X)} / {len(data_loader.dataset)} samples")

    # Average the training loss & training accuracy over all batches
    train_loss /= len(data_loader)
    train_acc /= len(data_loader)
    print(f"Training Loss {train_loss:.4f} Training Accuracy {train_acc:.2f}")

Similarly, testing follows these simple steps:

from tqdm.auto import tqdm

def test_step(model: torch.nn.Module, data_loader: torch.utils.data.DataLoader, model_loss: torch.nn.Module, model_acc, device: torch.device = device):
  """
    Testing Step for the Neural Model
  """
  # Evaluation Mode:
  model.eval()
  # inference_mode disables gradient tracking, which is not needed for testing
  with torch.inference_mode():
    test_loss, test_acc = 0, 0
    for X_test, y_test in tqdm(data_loader):
      # Forward Pass
      X_test, y_test = X_test.to(device), y_test.to(device)
      y_logits = model(X_test)

      # Calculate the Loss
      test_loss += model_loss(y_logits, y_test).item()

      # Calculate the test accuracy
      test_acc += model_acc(y_test, y_logits.argmax(dim=1))
    # Average the test loss & test accuracy over all batches
    test_loss /= len(data_loader)
    test_acc /= len(data_loader)
    print(f"Test Loss {test_loss:.2f} Test Accuracy {test_acc:.2f}")

Let’s define the accuracy function used to calculate the accuracy percentage of the model. It takes two inputs, ‘y_true’ and ‘y_pred’. As the training and testing loops show, the model_acc function receives the true labels and the labels our model predicted. We count the positions where the two agree and divide by the number of labels to get the accuracy as a percentage.

def model_accuracy(y_true, y_pred):
  acc = torch.eq(y_true,y_pred).sum().item()
  return (acc/len(y_true))*100
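A quick worked example of the accuracy function on made-up values. The logits and labels below are illustrative assumptions; the function body repeats the definition above so that the snippet runs on its own.

```python
import torch

def model_accuracy(y_true, y_pred):
    # Count matching positions and convert the fraction to a percentage
    correct = torch.eq(y_true, y_pred).sum().item()
    return (correct / len(y_true)) * 100

# Four samples, three classes: each row holds the raw scores for one sample
logits = torch.tensor([[2.0, 0.1, 0.3],
                       [0.2, 1.5, 0.1],
                       [0.9, 0.2, 0.4],
                       [0.1, 0.3, 2.2]])
y_true = torch.tensor([0, 1, 2, 2])

y_pred = logits.argmax(dim=1)  # index of the highest score per row
accuracy = model_accuracy(y_true, y_pred)

print(y_pred)    # tensor([0, 1, 0, 2])
print(accuracy)  # 75.0
```

Three of the four predictions match the labels, so the function returns 75.0.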

Let’s train the model and test the result of the simple feed forward network.

train_step(model_v0, 5, train_data_loader, model_loss, model_accuracy, model_optimizer)

test_step(model_v0,test_data_loader, model_loss, model_accuracy)

After training for 5 epochs, the model achieved a reasonably good accuracy of around 85%. We could tune the hyperparameters further, change the optimizer, try different learning rates, or adjust the loss function to improve the accuracy. Instead, we are going to try a non-linear model to see whether the accuracy increases or decreases.

Building a Non-Linear Model

class MNISTV1(nn.Module):
  def __init__(self, input_shape:int, output_shape:int, hidden_units:int):
    super().__init__()
    self.layer_stacked = nn.Sequential(
        nn.Flatten(),
        nn.Linear(in_features=input_shape, out_features=hidden_units),
        nn.ReLU(),
        nn.Linear(in_features=hidden_units, out_features=output_shape),
    )
  # Forward Pass
  def forward(self, x:torch.Tensor):
    return self.layer_stacked(x)

# Creating the instance of the non-linear model
model_v1 = MNISTV1(input_shape=28*28, output_shape=len(class_names), hidden_units=4).to(device)

As you can see, this model is the same as the linear model except for a non-linear activation function, ReLU(), placed between the two Linear layers. The model can now learn non-linear relationships in the data and pick out patterns that the linear model cannot represent.
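Why the activation matters can be checked numerically: two Linear layers stacked without an activation collapse into a single affine map, while inserting ReLU breaks that equivalence. The sketch below uses freshly initialized layers and random inputs, all of which are assumptions for illustration and not the trained models.

```python
import torch
from torch import nn

torch.manual_seed(0)
x = torch.randn(5, 784)
first = nn.Linear(784, 4)
second = nn.Linear(4, 10)

# Two Linear layers applied back to back, as in the linear model
stacked_out = second(first(x))

# The same computation as one affine map: W = W2 W1, b = W2 b1 + b2
W = second.weight @ first.weight
b = second.weight @ first.bias + second.bias
single_out = x @ W.T + b

# With ReLU in between, the result no longer matches the single affine map
with_relu = second(torch.relu(first(x)))

print(torch.allclose(stacked_out, single_out, atol=1e-4))
print(torch.allclose(stacked_out, with_relu, atol=1e-4))
```

The first check should report that the outputs match and the second that they differ, which is the extra expressive power the ReLU adds.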

Let’s now define the loss, optimizer functions for the model

model_loss = nn.CrossEntropyLoss()
model_optimizer = torch.optim.Adam(params = model_v1.parameters())

After training the model for 5 epochs, the training accuracy is roughly equal to that of the linear model, and on testing we get around 85.02% accuracy.

Figure: Training accuracy of the non-linear model

Figure: Testing accuracy of the non-linear model

Building the CNN Model

In PyTorch, the classifier layer needs one number in advance: the in_features of the Linear layer that comes right after Flatten, which must be known before the model is built. The good news is that a simple trick lets us find this number while building the model.

Test CNN Model

class MNISTV2_Test(nn.Module):
  def __init__(self, input_shape:int, output_shape:int, hidden_units:int):
    super().__init__()
    self.conv_2d_layer_1 = nn.Sequential(
        nn.Conv2d(in_channels=input_shape, out_channels=hidden_units, kernel_size=2,stride=1,padding=1),
        nn.ReLU(),
        nn.Conv2d(in_channels=hidden_units, out_channels=hidden_units*4, kernel_size=2,stride=1,padding=1),
        nn.ReLU(),
        nn.MaxPool2d(kernel_size=2)
    )
    self.classifier_layer = nn.Sequential(
        nn.Flatten(),
        nn.Linear(in_features=hidden_units*0,out_features=output_shape)
    )
  # Forward Pass
  def forward(self, x:torch.Tensor):
    x = self.conv_2d_layer_1(x)
    print(f"Shape of X {x.shape}")
    return self.classifier_layer(x)

# Creating the instance of the test CNN model
model_v2_test = MNISTV2_Test(input_shape=1, output_shape=len(class_names), hidden_units=16).to(device)

To start, we set this value to ‘0’ (in_features=hidden_units*0) and print the shape of X after it has passed through the convolutional block.

Let’s create a random image and pass it through the test model.

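test_image = torch.randn(size=(1,28,28))
# Use the test model here. The forward pass prints the shape of X; the placeholder
# Linear layer is then expected to fail with a shape mismatch, which is fine for this check.
model_v2_test(test_image.unsqueeze(0).to(device))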

The trick is to multiply together every dimension of X except the batch dimension, which is exactly what the Flatten layer does: it collapses [channels, height, width] into a single dimension. The printed shape is [1, 64, 15, 15]: the height and width are 15 each, and since the last convolutional layer outputs hidden_units*4 channels, that factor has to be included as well. So, in total, in_features will be ‘hidden_units*4*15*15’.
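The 15x15 figure can also be derived by hand from the standard output-size formula for convolution and pooling layers, floor((size + 2*padding - kernel_size) / stride) + 1. The helper name below is hypothetical, and the sketch assumes dilation of 1 and that MaxPool2d uses its default stride equal to its kernel size.

```python
def conv_output_size(size, kernel_size, stride=1, padding=0):
    # Output length along one spatial dimension (dilation assumed to be 1)
    return (size + 2 * padding - kernel_size) // stride + 1

hidden_units = 16
size = 28                                                           # MNIST images are 28x28
size = conv_output_size(size, kernel_size=2, stride=1, padding=1)  # first Conv2d  -> 29
size = conv_output_size(size, kernel_size=2, stride=1, padding=1)  # second Conv2d -> 30
size = conv_output_size(size, kernel_size=2, stride=2)              # MaxPool2d(2)  -> 15

in_features = hidden_units * 4 * size * size
print(size, in_features)  # 15 14400
```

Both approaches agree: the Linear layer after Flatten receives 64 * 15 * 15 = 14400 features.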

class MNISTV2(nn.Module):
  def __init__(self, input_shape:int, output_shape:int, hidden_units:int):
    super().__init__()
    self.conv_2d_layer_1 = nn.Sequential(
        nn.Conv2d(in_channels=input_shape, out_channels=hidden_units, kernel_size=2,stride=1,padding=1),
        nn.ReLU(),
        nn.Conv2d(in_channels=hidden_units, out_channels=hidden_units*4, kernel_size=2,stride=1,padding=1),
        nn.ReLU(),
        nn.MaxPool2d(kernel_size=2)
    )
    self.classifier_layer = nn.Sequential(
        nn.Flatten(),
        nn.Linear(in_features=hidden_units*4*15*15,out_features=output_shape)
    )
  # Forward Pass
  def forward(self, x:torch.Tensor):
    x = self.conv_2d_layer_1(x)
    #print(f"Shape of X {x.shape}")
    return self.classifier_layer(x)

# Creating the instance of the CNN model
model_v2 = MNISTV2(input_shape=1, output_shape=len(class_names), hidden_units=16).to(device)

Let’s define the loss & optimizer function.

model_loss = nn.CrossEntropyLoss()
model_optimizer = torch.optim.Adam(params = model_v2.parameters())

The Training accuracy looks impressive for our Model

train_step(model_v2, 5, train_data_loader, model_loss, model_accuracy, model_optimizer)

The testing accuracy reached nearly 99%, which suggests that the model has learned the patterns in the data well and generalizes to images it has not seen.

Comparing the Accuracy of Different Models

def eval_model(model: torch.nn.Module, data_loader: torch.utils.data.DataLoader, model_loss: torch.nn.Module, model_acc, device: torch.device = device):
  # Evaluation Mode:
  model.eval()
  with torch.inference_mode():
    test_loss, test_acc = 0, 0
    for X_test, y_test in tqdm(data_loader):
      X_test, y_test = X_test.to(device), y_test.to(device)
      # Forward Pass
      y_logits = model(X_test)

      # Accumulate the loss (kept as a tensor so .item() can be called below)
      test_loss += model_loss(y_logits, y_test)

      # Accumulate the test accuracy
      test_acc += model_acc(y_test, y_logits.argmax(dim=1))
    # Average the test loss & test accuracy over all batches
    test_loss /= len(data_loader)
    test_acc /= len(data_loader)
    # The loss is reported scaled by 100; the accuracy is already a percentage
    return {'model': model.__class__.__name__, 'Loss': (test_loss*100).item(), 'Accuracy': test_acc}

model_v0_results= eval_model(model_v0,test_data_loader, model_loss, model_accuracy)
model_v1_results= eval_model(model_v1,test_data_loader, model_loss, model_accuracy)
model_v2_results= eval_model(model_v2,test_data_loader, model_loss, model_accuracy)

The function above evaluates all the models we have built so far. Below are the results for all three models.

model_results_df = pd.DataFrame([model_v0_results,model_v1_results,model_v2_results])
model_results_df.set_index('model')['Accuracy'].plot(kind='barh', color='g')
plt.xlabel('Accuracy %')
plt.ylabel('Models')

Making Predictions using our Best Model

test_images_per_batch, test_labels_per_batch = next(iter(test_data_loader))

figure = plt.figure(figsize=(16,8))
rows, cols = 4,4

# Evaluation mode and inference_mode: no gradients are needed for predictions
model_v2.eval()
with torch.inference_mode():
  for i in range(1, rows*cols+1):
    random_index = torch.randint(0, len(test_images_per_batch), size=[1]).item()
    image, label = test_images_per_batch[random_index], test_labels_per_batch[random_index].item()
    # Add a batch dimension, run the model, and take the highest-scoring class
    y_logits = model_v2(image.unsqueeze(0).to(device))
    test_prediction_label = y_logits.argmax(dim=1).item()
    figure.add_subplot(rows,cols,i)
    title = f"Predicted {class_names[test_prediction_label]} | Actual {class_names[label]}"
    plt.imshow(image.squeeze())
    # Green title for a correct prediction, red for a wrong one
    if test_prediction_label == label:
      plt.title(title, color="g")
    else:
      plt.title(title, color="r")
    plt.axis(False)

The figure above compares our model’s predictions with the actual labels. The model predicted each digit shown correctly. It also achieved high accuracy in both training and testing, which suggests that the model has low bias and low variance.
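The prediction loop takes the argmax of the raw logits. If a confidence value is wanted alongside the predicted digit, softmax turns the logits into probabilities without changing which class is highest. A minimal sketch on made-up logits (the values are assumptions, not model output):

```python
import torch

# Made-up logits for one image and three classes
logits = torch.tensor([[1.0, 3.0, 0.5]])

# Softmax converts the logits into probabilities that sum to 1
probs = torch.softmax(logits, dim=1)

pred_from_logits = logits.argmax(dim=1).item()
pred_from_probs = probs.argmax(dim=1).item()
confidence = probs.max(dim=1).values.item()

print(pred_from_logits, pred_from_probs)  # 1 1
print(f"confidence {confidence:.2f}")
```

Because softmax preserves the ordering of the scores, the predicted class is the same either way.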

Resources

Learn Pytorch

Dataset Resource

https://github.com/Aftabgazali/CNN_On_MNIST_DATASET

Written by Aftab Gazali