220 lines
6.1 KiB
Markdown
220 lines
6.1 KiB
Markdown
# Practical-2a (Classification using Deep Neural Network - OCR Letter Recognition)
|
||
|
||
Problem Statement: Multiclass classification using Deep Neural Networks: Example: Use the OCR letter recognition dataset.
|
||
|
||
> [!NOTE]
|
||
> Dataset available in [Datasets](../Datasets/letter+recognition.zip) directory.
|
||
|
||
---
|
||
|
||
## Pre-requisities
|
||
|
||
1. Install packages using `pip`: `pip install tensorflow keras numpy pandas matplotlib seaborn scikit-learn` (`tensorflow` requires Python 3.9 - 3.12)
|
||
2. Download and unzip the `letter+recognition.zip` dataset in the same directory as the Jupyter notebook.
|
||
|
||
## Steps
|
||
|
||
1. Import Libraries
|
||
2. Load Dataset
|
||
3. Exploratory Data Analysis (EDA)
|
||
4. Visualize Class Distribution
|
||
5. Encode Labels and Separate Features
|
||
6. Split into Training and Testing Sets
|
||
7. Feature Scaling (Standardization)
|
||
8. One-Hot Encode Labels
|
||
9. Build the Deep Neural Network Model
|
||
10. Compile the Model
|
||
11. Train the Model
|
||
12. Evaluate the Model on Test Data
|
||
13. Plot Training vs Validation Accuracy
|
||
14. Plot Training vs Validation Loss
|
||
15. Confusion Matrix and Classification Report
|
||
|
||
---
|
||
|
||
## Code
|
||
|
||
### 1. Import Libraries:
|
||
|
||
```python3
|
||
import numpy as np
|
||
import pandas as pd
|
||
import matplotlib.pyplot as plt
|
||
import seaborn as sns
|
||
from sklearn.model_selection import train_test_split
|
||
from sklearn.preprocessing import LabelEncoder, StandardScaler
|
||
from sklearn.metrics import confusion_matrix, classification_report
|
||
from tensorflow.keras.models import Sequential
|
||
from tensorflow.keras.layers import Input, Dense, Dropout
|
||
from tensorflow.keras.utils import to_categorical
|
||
```
|
||
|
||
### 2. Load Dataset:
|
||
|
||
```python3
|
||
# Dataset has no header row — define column names manually based on UCI documentation
|
||
col_names = ['letter', 'x-box', 'y-box', 'width', 'high', 'onpix',
|
||
'x-bar', 'y-bar', 'x2bar', 'y2bar', 'xybar',
|
||
'x2ybr', 'xy2br', 'x-ege', 'xegvy', 'y-ege', 'yegvx']
|
||
|
||
data = pd.read_csv('./letter+recognition/letter-recognition.data', header=None, names=col_names)
|
||
print("Shape:", data.shape)
|
||
print(data.head())
|
||
```
|
||
|
||
### 3. Exploratory Data Analysis (EDA):
|
||
|
||
```python3
|
||
print("Data Types:\n", data.dtypes)
|
||
print("\nMissing Values:\n", data.isnull().sum())
|
||
print("\nStatistical Summary:\n", data.describe())
|
||
```
|
||
|
||
### 4. Visualize Class Distribution:
|
||
|
||
```python3
|
||
plt.figure(figsize=(14, 4))
|
||
data['letter'].value_counts().sort_index().plot(kind='bar')
|
||
plt.title("Number of Samples per Letter Class")
|
||
plt.xlabel("Letter")
|
||
plt.ylabel("Count")
|
||
plt.tight_layout()
|
||
plt.show()
|
||
```
|
||
|
||
### 5. Encode Labels and Separate Features:
|
||
|
||
```python3
|
||
label_encoder = LabelEncoder()
|
||
data['letter'] = label_encoder.fit_transform(data['letter']) # A=0, B=1, ..., Z=25
|
||
|
||
X = data.drop('letter', axis=1).values # 16 numeric features
|
||
y = data['letter'].values # class index 0–25
|
||
num_classes = len(label_encoder.classes_)
|
||
print("Classes:", label_encoder.classes_)
|
||
print("Number of classes:", num_classes)
|
||
```
|
||
|
||
### 6. Split into Training and Testing Sets:
|
||
|
||
```python3
|
||
# 80% train, 20% test; stratify ensures balanced class distribution in both sets
|
||
X_train, X_test, y_train, y_test = train_test_split(
|
||
X, y, test_size=0.2, random_state=42, stratify=y)
|
||
print("Train samples:", X_train.shape[0])
|
||
print("Test samples: ", X_test.shape[0])
|
||
```
|
||
|
||
### 7. Feature Scaling (Standardization):
|
||
|
||
```python3
|
||
scaler = StandardScaler()
|
||
X_train = scaler.fit_transform(X_train) # learn mean/std from train, then scale
|
||
X_test = scaler.transform(X_test) # apply same mean/std to test (no leakage)
|
||
```
|
||
|
||
### 8. One-Hot Encode Labels:
|
||
|
||
```python3
|
||
# e.g. class 2 of 26 -> [0, 0, 1, 0, ..., 0]
|
||
y_train_cat = to_categorical(y_train, num_classes)
|
||
y_test_cat = to_categorical(y_test, num_classes)
|
||
```
|
||
|
||
### 9. Build the Deep Neural Network Model:
|
||
|
||
```python3
|
||
model = Sequential()
|
||
|
||
model.add(Input(shape=(X_train.shape[1],))) # input: 16 features
|
||
model.add(Dense(256, activation='relu')) # hidden layer 1: 256 neurons
|
||
model.add(Dropout(0.3)) # drop 30% neurons to reduce overfitting
|
||
model.add(Dense(128, activation='relu')) # hidden layer 2: 128 neurons
|
||
model.add(Dropout(0.3))
|
||
model.add(Dense(64, activation='relu')) # hidden layer 3: 64 neurons
|
||
model.add(Dense(num_classes, activation='softmax')) # output: probability for each of 26 letters
|
||
|
||
model.summary()
|
||
```
|
||
|
||
### 10. Compile the Model:
|
||
|
||
```python3
|
||
# categorical_crossentropy: standard loss for multi-class one-hot classification
|
||
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
|
||
```
|
||
|
||
### 11. Train the Model:
|
||
|
||
```python3
|
||
history = model.fit(
|
||
X_train, y_train_cat,
|
||
epochs=50,
|
||
batch_size=32,
|
||
validation_split=0.2 # use 20% of training data to monitor val loss each epoch
|
||
)
|
||
```
|
||
|
||
### 12. Evaluate the Model on Test Data:
|
||
|
||
```python3
|
||
loss, accuracy = model.evaluate(X_test, y_test_cat)
|
||
print(f"Test Loss: {loss:.4f}")
|
||
print(f"Test Accuracy: {accuracy*100:.2f}%")
|
||
```
|
||
|
||
### 13. Plot Training vs Validation Accuracy:
|
||
|
||
```python3
|
||
plt.plot(history.history['accuracy'], label='Training Accuracy')
|
||
plt.plot(history.history['val_accuracy'], label='Validation Accuracy')
|
||
plt.title('Model Accuracy Over Epochs')
|
||
plt.xlabel('Epoch')
|
||
plt.ylabel('Accuracy')
|
||
plt.legend()
|
||
plt.grid(True)
|
||
plt.show()
|
||
```
|
||
|
||
### 14. Plot Training vs Validation Loss:
|
||
|
||
```python3
|
||
plt.plot(history.history['loss'], label='Training Loss')
|
||
plt.plot(history.history['val_loss'], label='Validation Loss')
|
||
plt.title('Model Loss Over Epochs')
|
||
plt.xlabel('Epoch')
|
||
plt.ylabel('Loss')
|
||
plt.legend()
|
||
plt.grid(True)
|
||
plt.show()
|
||
```
|
||
|
||
### 15. Confusion Matrix and Classification Report:
|
||
|
||
```python3
|
||
y_pred = np.argmax(model.predict(X_test), axis=1) # predicted class index
|
||
|
||
cm = confusion_matrix(y_test, y_pred)
|
||
plt.figure(figsize=(16, 14))
|
||
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
|
||
xticklabels=label_encoder.classes_,
|
||
yticklabels=label_encoder.classes_)
|
||
plt.title('Confusion Matrix')
|
||
plt.ylabel('Actual')
|
||
plt.xlabel('Predicted')
|
||
plt.tight_layout()
|
||
plt.show()
|
||
|
||
print("\nClassification Report:\n")
|
||
print(classification_report(y_test, y_pred, target_names=label_encoder.classes_))
|
||
```
|
||
|
||
---
|
||
|
||
## Miscellaneous
|
||
|
||
- [Dataset source](https://archive.ics.uci.edu/ml/datasets/letter%2Brecognition)
|
||
|
||
---
|
||
|