diff --git a/Codes/Code-2a.md b/Codes/Code-2a.md new file mode 100644 index 0000000..d7eed4b --- /dev/null +++ b/Codes/Code-2a.md @@ -0,0 +1,219 @@ +# Practical-2a (Classification using Deep Neural Network - OCR Letter Recognition) + +Problem Statement: Multiclass classification using Deep Neural Networks: Example: Use the OCR letter recognition dataset. + +> [!NOTE] +> Dataset available in [Datasets](../Datasets/letter+recognition.zip) directory. + +--- + +## Pre-requisities + +1. Install packages using `pip`: `pip install tensorflow keras numpy pandas matplotlib seaborn scikit-learn` (`tensorflow` requires Python 3.9 - 3.12) +2. Download and unzip the `letter+recognition.zip` dataset in the same directory as the Jupyter notebook. + +## Steps + +1. Import Libraries +2. Load Dataset +3. Exploratory Data Analysis (EDA) +4. Visualize Class Distribution +5. Encode Labels and Separate Features +6. Split into Training and Testing Sets +7. Feature Scaling (Standardization) +8. One-Hot Encode Labels +9. Build the Deep Neural Network Model +10. Compile the Model +11. Train the Model +12. Evaluate the Model on Test Data +13. Plot Training vs Validation Accuracy +14. Plot Training vs Validation Loss +15. Confusion Matrix and Classification Report + +--- + +## Code + +### 1. Import Libraries: + +```python3 +import numpy as np +import pandas as pd +import matplotlib.pyplot as plt +import seaborn as sns +from sklearn.model_selection import train_test_split +from sklearn.preprocessing import LabelEncoder, StandardScaler +from sklearn.metrics import confusion_matrix, classification_report +from tensorflow.keras.models import Sequential +from tensorflow.keras.layers import Input, Dense, Dropout +from tensorflow.keras.utils import to_categorical +``` + +### 2. Load Dataset: + +```python3 +# Dataset has no header row — define column names manually based on UCI documentation +col_names = ['letter', 'x-box', 'y-box', 'width', 'high', 'onpix', + 'x-bar', 'y-bar', 'x2bar', 'y2bar', 'xybar', + 'x2ybr', 'xy2br', 'x-ege', 'xegvy', 'y-ege', 'yegvx'] + +data = pd.read_csv('./letter+recognition/letter-recognition.data', header=None, names=col_names) +print("Shape:", data.shape) +print(data.head()) +``` + +### 3. Exploratory Data Analysis (EDA): + +```python3 +print("Data Types:\n", data.dtypes) +print("\nMissing Values:\n", data.isnull().sum()) +print("\nStatistical Summary:\n", data.describe()) +``` + +### 4. Visualize Class Distribution: + +```python3 +plt.figure(figsize=(14, 4)) +data['letter'].value_counts().sort_index().plot(kind='bar') +plt.title("Number of Samples per Letter Class") +plt.xlabel("Letter") +plt.ylabel("Count") +plt.tight_layout() +plt.show() +``` + +### 5. Encode Labels and Separate Features: + +```python3 +label_encoder = LabelEncoder() +data['letter'] = label_encoder.fit_transform(data['letter']) # A=0, B=1, ..., Z=25 + +X = data.drop('letter', axis=1).values # 16 numeric features +y = data['letter'].values # class index 0–25 +num_classes = len(label_encoder.classes_) +print("Classes:", label_encoder.classes_) +print("Number of classes:", num_classes) +``` + +### 6. Split into Training and Testing Sets: + +```python3 +# 80% train, 20% test; stratify ensures balanced class distribution in both sets +X_train, X_test, y_train, y_test = train_test_split( + X, y, test_size=0.2, random_state=42, stratify=y) +print("Train samples:", X_train.shape[0]) +print("Test samples: ", X_test.shape[0]) +``` + +### 7. Feature Scaling (Standardization): + +```python3 +scaler = StandardScaler() +X_train = scaler.fit_transform(X_train) # learn mean/std from train, then scale +X_test = scaler.transform(X_test) # apply same mean/std to test (no leakage) +``` + +### 8. One-Hot Encode Labels: + +```python3 +# e.g. class 2 of 26 -> [0, 0, 1, 0, ..., 0] +y_train_cat = to_categorical(y_train, num_classes) +y_test_cat = to_categorical(y_test, num_classes) +``` + +### 9. Build the Deep Neural Network Model: + +```python3 +model = Sequential() + +model.add(Input(shape=(X_train.shape[1],))) # input: 16 features +model.add(Dense(256, activation='relu')) # hidden layer 1: 256 neurons +model.add(Dropout(0.3)) # drop 30% neurons to reduce overfitting +model.add(Dense(128, activation='relu')) # hidden layer 2: 128 neurons +model.add(Dropout(0.3)) +model.add(Dense(64, activation='relu')) # hidden layer 3: 64 neurons +model.add(Dense(num_classes, activation='softmax')) # output: probability for each of 26 letters + +model.summary() +``` + +### 10. Compile the Model: + +```python3 +# categorical_crossentropy: standard loss for multi-class one-hot classification +model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy']) +``` + +### 11. Train the Model: + +```python3 +history = model.fit( + X_train, y_train_cat, + epochs=50, + batch_size=32, + validation_split=0.2 # use 20% of training data to monitor val loss each epoch +) +``` + +### 12. Evaluate the Model on Test Data: + +```python3 +loss, accuracy = model.evaluate(X_test, y_test_cat) +print(f"Test Loss: {loss:.4f}") +print(f"Test Accuracy: {accuracy*100:.2f}%") +``` + +### 13. Plot Training vs Validation Accuracy: + +```python3 +plt.plot(history.history['accuracy'], label='Training Accuracy') +plt.plot(history.history['val_accuracy'], label='Validation Accuracy') +plt.title('Model Accuracy Over Epochs') +plt.xlabel('Epoch') +plt.ylabel('Accuracy') +plt.legend() +plt.grid(True) +plt.show() +``` + +### 14. Plot Training vs Validation Loss: + +```python3 +plt.plot(history.history['loss'], label='Training Loss') +plt.plot(history.history['val_loss'], label='Validation Loss') +plt.title('Model Loss Over Epochs') +plt.xlabel('Epoch') +plt.ylabel('Loss') +plt.legend() +plt.grid(True) +plt.show() +``` + +### 15. Confusion Matrix and Classification Report: + +```python3 +y_pred = np.argmax(model.predict(X_test), axis=1) # predicted class index + +cm = confusion_matrix(y_test, y_pred) +plt.figure(figsize=(16, 14)) +sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', + xticklabels=label_encoder.classes_, + yticklabels=label_encoder.classes_) +plt.title('Confusion Matrix') +plt.ylabel('Actual') +plt.xlabel('Predicted') +plt.tight_layout() +plt.show() + +print("\nClassification Report:\n") +print(classification_report(y_test, y_pred, target_names=label_encoder.classes_)) +``` + +--- + +## Miscellaneous + +- [Dataset source](https://archive.ics.uci.edu/ml/datasets/letter%2Brecognition) + +--- +