Practical-2a (Classification using Deep Neural Network - OCR Letter Recognition)

Problem Statement: Build a multiclass classifier with a deep neural network, using the OCR letter recognition dataset as the example.

Note

The dataset is available in the Datasets directory.


Prerequisites

  1. Install packages using pip: pip install tensorflow keras numpy pandas matplotlib seaborn scikit-learn (tensorflow requires Python 3.9 - 3.12)
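  2. Download letter+recognition.zip and unzip it into a letter+recognition folder next to the Jupyter notebook, so that the data file sits at ./letter+recognition/letter-recognition.data (the path used in step 2 of the code).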

Steps

  1. Import Libraries
  2. Load Dataset
  3. Exploratory Data Analysis (EDA)
  4. Visualize Class Distribution
  5. Encode Labels and Separate Features
  6. Split into Training and Testing Sets
  7. Feature Scaling (Standardization)
  8. One-Hot Encode Labels
  9. Build the Deep Neural Network Model
  10. Compile the Model
  11. Train the Model
  12. Evaluate the Model on Test Data
  13. Plot Training vs Validation Accuracy
  14. Plot Training vs Validation Loss
  15. Confusion Matrix and Classification Report

Code

1. Import Libraries:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.metrics import confusion_matrix, classification_report
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, Dense, Dropout
from tensorflow.keras.utils import to_categorical

2. Load Dataset:

# Dataset has no header row — define column names manually based on UCI documentation
col_names = ['letter', 'x-box', 'y-box', 'width', 'high', 'onpix',
             'x-bar', 'y-bar', 'x2bar', 'y2bar', 'xybar',
             'x2ybr', 'xy2br', 'x-ege', 'xegvy', 'y-ege', 'yegvx']

data = pd.read_csv('./letter+recognition/letter-recognition.data', header=None, names=col_names)
print("Shape:", data.shape)
print(data.head())

3. Exploratory Data Analysis (EDA):

print("Data Types:\n", data.dtypes)
print("\nMissing Values:\n", data.isnull().sum())
print("\nStatistical Summary:\n", data.describe())

4. Visualize Class Distribution:

plt.figure(figsize=(14, 4))
data['letter'].value_counts().sort_index().plot(kind='bar')
plt.title("Number of Samples per Letter Class")
plt.xlabel("Letter")
plt.ylabel("Count")
plt.tight_layout()
plt.show()

5. Encode Labels and Separate Features:

label_encoder = LabelEncoder()
data['letter'] = label_encoder.fit_transform(data['letter'])  # A=0, B=1, ..., Z=25

X = data.drop('letter', axis=1).values   # 16 numeric features
y = data['letter'].values                 # class index 0-25
num_classes = len(label_encoder.classes_)
print("Classes:", label_encoder.classes_)
print("Number of classes:", num_classes)
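
To make the letter-to-index mapping concrete, here is a minimal hand-rolled sketch of what the encoding step does. It is an illustration, not scikit-learn's implementation; the `letters` list is a hypothetical sample, and the only behaviour assumed from `LabelEncoder` is that it stores its classes in sorted order.

```python
# Hand-rolled illustration of the letter <-> index mapping (not sklearn's code).
letters = ["T", "A", "Z", "B", "A"]        # hypothetical sample of labels

classes = sorted(set(letters))             # LabelEncoder also keeps classes sorted
to_index = {c: i for i, c in enumerate(classes)}

encoded = [to_index[c] for c in letters]   # letters -> integer class indices
decoded = [classes[i] for i in encoded]    # integer class indices -> letters

print(classes)
print(encoded)
print(decoded)
```

With the fitted encoder from this step, `label_encoder.inverse_transform(...)` performs the same decoding, which is handy for turning predicted indices back into letters.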

6. Split into Training and Testing Sets:

# 80% train, 20% test; stratify=y keeps each letter's share the same in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)
print("Train samples:", X_train.shape[0])
print("Test samples: ", X_test.shape[0])
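
The effect of `stratify=y` can be sketched in plain Python: hold out the same fraction of every class rather than a fraction of the whole dataset. The helper `stratified_split` and the toy labels below are hypothetical and only illustrate the idea; scikit-learn's own algorithm differs in detail.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, test_idx), holding out test_size of each class."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)                      # shuffle within the class
        n_test = round(len(indices) * test_size)  # per-class test quota
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx

labels = ["A"] * 50 + ["B"] * 30 + ["C"] * 20     # hypothetical, imbalanced labels
train_idx, test_idx = stratified_split(labels)
test_counts = Counter(labels[i] for i in test_idx)
print(test_counts)
```

Each class contributes 20% of its samples to the test set, so the class proportions in the train and test sets match those of the full data.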

7. Feature Scaling (Standardization):

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)  # learn mean/std from train, then scale
X_test  = scaler.transform(X_test)       # apply same mean/std to test (no leakage)
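
What the scaler computes can be reproduced with NumPy on a toy array. The sketch below uses hypothetical `*_demo` arrays and assumes only that standardization means subtracting the training mean and dividing by the training (population) standard deviation.

```python
import numpy as np

X_train_demo = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])  # toy training data
X_test_demo = np.array([[4.0, 40.0]])                             # toy test data

mean = X_train_demo.mean(axis=0)   # per-feature mean, from training data only
std = X_train_demo.std(axis=0)     # per-feature population std (ddof=0)

X_train_scaled = (X_train_demo - mean) / std
X_test_scaled = (X_test_demo - mean) / std   # reuse the training statistics

print(X_train_scaled.mean(axis=0), X_train_scaled.std(axis=0))
print(X_test_scaled)
```

The test row is scaled with the training statistics, so its values need not have zero mean or unit variance; that is expected and is what keeps test information out of the fit.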

8. One-Hot Encode Labels:

# e.g. class 2 of 26 -> [0, 0, 1, 0, ..., 0]
y_train_cat = to_categorical(y_train, num_classes)
y_test_cat  = to_categorical(y_test,  num_classes)
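
As a quick illustration of the encoding, the same layout can be built with NumPy by indexing rows of an identity matrix. The `y_demo` values are hypothetical; this is a sketch of the idea rather than how `to_categorical` is implemented.

```python
import numpy as np

y_demo = np.array([2, 0, 25])                  # hypothetical class indices
one_hot = np.eye(26, dtype="float32")[y_demo]  # row i of the identity = one-hot of class i

recovered = one_hot.argmax(axis=1)             # argmax undoes the encoding

print(one_hot.shape)
print(one_hot[0])
print(recovered)
```

The `argmax` round trip is the same operation used later to turn the softmax output back into a class index.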

9. Build the Deep Neural Network Model:

model = Sequential()

model.add(Input(shape=(X_train.shape[1],)))    # input: 16 features
model.add(Dense(256, activation='relu'))        # hidden layer 1: 256 neurons
model.add(Dropout(0.3))                         # randomly zero 30% of activations during training to reduce overfitting
model.add(Dense(128, activation='relu'))        # hidden layer 2: 128 neurons
model.add(Dropout(0.3))
model.add(Dense(64, activation='relu'))         # hidden layer 3: 64 neurons
model.add(Dense(num_classes, activation='softmax'))  # output: probability for each of 26 letters

model.summary()
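
The parameter counts printed by `model.summary()` can be checked by hand: each Dense layer has `inputs * units` weights plus `units` biases, and Dropout layers add none. The sketch below assumes the 16 input features and 26 classes described above.

```python
# Layer widths: 16 inputs -> 256 -> 128 -> 64 -> 26 outputs
layer_sizes = [16, 256, 128, 64, 26]

# Dense layer parameters = weights (n_in * n_out) + biases (n_out)
params_per_layer = [
    n_in * n_out + n_out
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
]
total_params = sum(params_per_layer)

print(params_per_layer)  # [4352, 32896, 8256, 1690]
print(total_params)      # 47194
```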

10. Compile the Model:

# categorical_crossentropy: standard loss for multi-class one-hot classification
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
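
For intuition about the loss, here is a small NumPy sketch of categorical cross-entropy on hypothetical probabilities. The function and the `eps` clipping are assumptions added for illustration (the clip only guards against `log(0)`); this is not Keras's own code.

```python
import numpy as np

def categorical_crossentropy(y_true, y_prob, eps=1e-7):
    """Per-sample loss: -sum over classes of y_true * log(y_prob)."""
    y_prob = np.clip(y_prob, eps, 1.0 - eps)   # avoid log(0)
    return -np.sum(y_true * np.log(y_prob), axis=1)

# Two hypothetical samples over three classes
y_true_demo = np.array([[0.0, 1.0, 0.0],
                        [1.0, 0.0, 0.0]])
y_prob_demo = np.array([[0.1, 0.8, 0.1],    # confident and correct
                        [0.3, 0.4, 0.3]])   # unsure, true class gets only 0.3

losses = categorical_crossentropy(y_true_demo, y_prob_demo)
mean_loss = losses.mean()

print(losses)
print(mean_loss)
```

Because the targets are one-hot, each sample's loss reduces to the negative log of the probability assigned to the true class, so the confident correct prediction is penalized far less than the unsure one.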

11. Train the Model:

history = model.fit(
    X_train, y_train_cat,
    epochs=50,
    batch_size=32,
    validation_split=0.2   # hold out the last 20% of the training data to track validation loss and accuracy each epoch
)
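
The progress bar shows the number of batches per epoch, which follows from the split sizes. The arithmetic below assumes the file has 20,000 rows (check this against the `data.shape` output from step 2) and uses the 80/20 test split, `validation_split=0.2`, and `batch_size=32` from the steps above.

```python
import math

n_rows = 20000            # assumption: row count of the data file; verify with data.shape
test_size = 0.2           # from train_test_split
validation_split = 0.2    # from model.fit
batch_size = 32           # from model.fit

n_test = math.ceil(n_rows * test_size)        # rows held out for testing
n_train_full = n_rows - n_test                # rows passed to model.fit
n_val = int(n_train_full * validation_split)  # rows held out for validation
n_fit = n_train_full - n_val                  # rows actually used for weight updates
steps_per_epoch = math.ceil(n_fit / batch_size)

print(n_train_full, n_val, n_fit, steps_per_epoch)
```

Under these assumptions the network is fitted on 12,800 rows, validated on 3,200, and each epoch runs 400 batches.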

12. Evaluate the Model on Test Data:

loss, accuracy = model.evaluate(X_test, y_test_cat)
print(f"Test Loss:     {loss:.4f}")
print(f"Test Accuracy: {accuracy*100:.2f}%")

13. Plot Training vs Validation Accuracy:

plt.plot(history.history['accuracy'], label='Training Accuracy')
plt.plot(history.history['val_accuracy'], label='Validation Accuracy')
plt.title('Model Accuracy Over Epochs')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.legend()
plt.grid(True)
plt.show()

14. Plot Training vs Validation Loss:

plt.plot(history.history['loss'], label='Training Loss')
plt.plot(history.history['val_loss'], label='Validation Loss')
plt.title('Model Loss Over Epochs')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend()
plt.grid(True)
plt.show()
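
Besides reading the curves by eye, the epoch with the lowest validation loss can be picked out directly from the history dictionary. The numbers below are hypothetical placeholders shaped like `history.history`; a real run will produce different values.

```python
# Hypothetical values shaped like history.history; real runs will differ.
demo_history = {
    "val_loss":     [0.90, 0.55, 0.40, 0.38, 0.41, 0.45],
    "val_accuracy": [0.74, 0.84, 0.88, 0.89, 0.89, 0.88],
}

val_loss = demo_history["val_loss"]
best_epoch = min(range(len(val_loss)), key=val_loss.__getitem__)  # 0-based index

print("Best epoch (1-based):", best_epoch + 1)
print("Validation loss there:", val_loss[best_epoch])
```

If validation loss starts rising after that epoch while training loss keeps falling, the model is likely beginning to overfit.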

15. Confusion Matrix and Classification Report:

y_pred = np.argmax(model.predict(X_test), axis=1)  # predicted class index

cm = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(16, 14))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=label_encoder.classes_,
            yticklabels=label_encoder.classes_)
plt.title('Confusion Matrix')
plt.ylabel('Actual')
plt.xlabel('Predicted')
plt.tight_layout()
plt.show()

print("\nClassification Report:\n")
print(classification_report(y_test, y_pred, target_names=label_encoder.classes_))
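
A 26x26 heatmap can be hard to scan, so it may help to pull a few numbers out of the matrix directly. The sketch below uses a hypothetical 3-class matrix (`cm_demo`, `class_names`) with rows as actual classes and columns as predictions, the same orientation `confusion_matrix` returns.

```python
import numpy as np

# Hypothetical confusion matrix: rows = actual class, columns = predicted class
cm_demo = np.array([[8, 1, 1],
                    [2, 7, 1],
                    [0, 0, 10]])
class_names = np.array(["A", "B", "C"])

# Recall per class: correct predictions / all samples of that class
recall = cm_demo.diagonal() / cm_demo.sum(axis=1)

# Most frequent mistake: largest off-diagonal cell
off_diag = cm_demo.copy()
np.fill_diagonal(off_diag, 0)
actual, predicted = np.unravel_index(off_diag.argmax(), off_diag.shape)

print(dict(zip(class_names, recall)))
print("Most confused:", class_names[actual], "predicted as", class_names[predicted])
```

Applied to `cm` and `label_encoder.classes_` from this step, the same lines would show which letter pairs the network mixes up most often.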

Miscellaneous