# Practical-3a (Convolutional Neural Network - Plant Diseases)

Problem Statement: Convolutional Neural Network (CNN): Use any dataset of plant disease and design a plant disease detection system using CNN.

> [!NOTE]
> Download dataset directly from [source](https://www.kaggle.com/datasets/vipoooool/new-plant-diseases-dataset/data).
> It is not included in the `/Datasets` directory because of its large size.
> The specific dataset matters little here: any image dataset works as long as its directory contains `train` and `valid` sub-directories, each with one folder per class.
> Refer to the dataset above to understand the required directory structure.

---

## Prerequisites

1. Install packages using `pip`: `pip install tensorflow keras numpy opencv-python matplotlib seaborn scikit-learn` (at the time of writing, `tensorflow` requires Python 3.9 - 3.12; check the release notes for the version you install)
2. Download and unzip the dataset in the same directory as the Jupyter notebook.
3. Ensure your unzipped dataset has the required directory structure:

```shell
New Plant Diseases Dataset(Augmented)/
├── train
│   ├── Apple___Apple_scab
│   ├── Apple___Black_rot
│   ├── Apple___Cedar_apple_rust
├── valid
│   ├── Apple___Apple_scab
│   ├── Apple___Black_rot
│   ├── Apple___Cedar_apple_rust
```
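A quick sanity check before loading can catch a misplaced or partially unzipped dataset. The sketch below assumes the directory name shown above; `check_dataset_layout` is a hypothetical helper written for illustration, not part of the notebook.

```python
import os

def check_dataset_layout(root):
    """Return {split: [class folder names]} for the 'train' and 'valid' splits.

    Raises FileNotFoundError if either split directory is missing.
    """
    layout = {}
    for split in ("train", "valid"):
        split_dir = os.path.join(root, split)
        if not os.path.isdir(split_dir):
            raise FileNotFoundError(f"Missing sub-directory: {split_dir}")
        # Keep only directories, so stray files such as .DS_Store are ignored
        layout[split] = sorted(
            name for name in os.listdir(split_dir)
            if os.path.isdir(os.path.join(split_dir, name))
        )
    return layout

root = "./New Plant Diseases Dataset(Augmented)"  # assumed location, next to the notebook
if os.path.isdir(root):
    layout = check_dataset_layout(root)
    for split, classes in layout.items():
        print(f"{split}: {len(classes)} class folders")
    # Class folders that exist in only one of the two splits
    only_in_one = set(layout["train"]) ^ set(layout["valid"])
    if only_in_one:
        print("Classes present in only one split:", sorted(only_in_one))
else:
    print("Dataset directory not found:", root)
```

If the check reports a missing sub-directory, the archive was most likely unzipped one level too deep or too shallow.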

## Steps

1. Import Libraries
2. Load Dataset
3. Exploratory Data Analysis (EDA)
4. Split into Training and Testing Sets
5. Build the CNN Model
6. Compile the Model
7. Train the Model
8. Evaluate the Model on Test Data
9. Plot Training vs Validation Accuracy
10. Plot Training vs Validation Loss
11. Confusion Matrix and Classification Report

---

## Code

### 1. Import Libraries:

```python3
import os
import numpy as np
import cv2
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, Conv2D, MaxPooling2D, Flatten, Dense, Dropout
from tensorflow.keras.utils import to_categorical
```

### 2. Load Dataset:

```python3
data = []
labels = []

# Path to dataset folder containing one subfolder per disease class
path = './New Plant Diseases Dataset(Augmented)/train/'
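# Keep only sub-directories (skips stray files such as .DS_Store); sort for consistent label ordering
categories = sorted(d for d in os.listdir(path) if os.path.isdir(os.path.join(path, d)))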

# Map each category name to a numeric index
label_dict = {category: idx for idx, category in enumerate(categories)}
print("Classes found:", len(categories))

max_per_class = 200  # cap images per class to avoid RAM overflow on large datasets

for category in categories:
    folder = os.path.join(path, category)
    count = 0
    for img_name in os.listdir(folder):
        if count >= max_per_class:
            break
        img_path = os.path.join(folder, img_name)
        img_array = cv2.imread(img_path)
        if img_array is not None:                          # skip unreadable files
            img_array = cv2.resize(img_array, (64, 64))   # resize to fixed 64x64 pixels
            data.append(img_array)
            labels.append(label_dict[category])
            count += 1

# float32 halves memory use compared with the default float64
data = np.array(data, dtype=np.float32) / 255.0   # normalize pixel values from [0,255] to [0,1]
labels = np.array(labels)
print("Dataset shape:", data.shape)
print("Labels shape:", labels.shape)
```

### 3. Exploratory Data Analysis (EDA):

```python3
print("Total images:", len(data))
print("Image shape:", data[0].shape)
print("Number of classes:", len(categories))

# Class distribution bar chart
class_counts = {cat: int((labels == idx).sum()) for cat, idx in label_dict.items()}
plt.figure(figsize=(14, 5))
plt.bar(class_counts.keys(), class_counts.values())
plt.xticks(rotation=90)
plt.title("Number of Images per Disease Class")
plt.xlabel("Class")
plt.ylabel("Count")
plt.tight_layout()
plt.show()

# Sample images from first 5 classes
plt.figure(figsize=(15, 3))
for i, category in enumerate(categories[:5]):
    idx = np.where(labels == label_dict[category])[0][0]  # index of first image in class
    plt.subplot(1, 5, i + 1)
    plt.imshow(cv2.cvtColor((data[idx] * 255).astype(np.uint8), cv2.COLOR_BGR2RGB))
    plt.title(category[:15], fontsize=8)
    plt.axis('off')
plt.suptitle("Sample Images per Class")
plt.show()
```

### 4. Split into Training and Testing Sets:

```python3
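# stratify=labels keeps the class proportions the same in the train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    data, labels, test_size=0.2, random_state=42, stratify=labels)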

num_classes = len(categories)
# One-hot encode labels: e.g. class 2 of 5 → [0, 0, 1, 0, 0]
y_train = to_categorical(y_train, num_classes)
y_test  = to_categorical(y_test,  num_classes)
print("Train samples:", X_train.shape[0])
print("Test samples: ", X_test.shape[0])
```
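To make the one-hot step concrete, here is a minimal NumPy sketch of the same transformation. `one_hot` is a hypothetical stand-in written for illustration; the notebook itself uses `to_categorical`.

```python
import numpy as np

def one_hot(class_indices, num_classes):
    """Return a (n_samples, num_classes) array with a single 1.0 per row."""
    class_indices = np.asarray(class_indices, dtype=int)
    # Row i of the identity matrix is the one-hot vector for class i
    return np.eye(num_classes, dtype=np.float32)[class_indices]

encoded = one_hot([2, 0, 4], num_classes=5)
print(encoded)                     # first row is [0, 0, 1, 0, 0]: class 2 of 5

# argmax along axis 1 recovers the original class indices
print(np.argmax(encoded, axis=1))  # [2 0 4]
```

The `argmax` round trip is the same trick used later to turn one-hot test labels and softmax outputs back into class indices.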

### 5. Build the CNN Model:

```python3
model = Sequential()

model.add(Input(shape=(64, 64, 3)))               # input: 64x64 3-channel image (BGR order, as loaded by cv2)
model.add(Conv2D(32, (3, 3), activation='relu'))  # 32 filters, detect basic features
model.add(MaxPooling2D(2, 2))                      # downsample by 2x

model.add(Conv2D(64, (3, 3), activation='relu'))  # 64 filters, detect complex features
model.add(MaxPooling2D(2, 2))

model.add(Flatten())                              # convert 2D feature maps to 1D vector

model.add(Dense(128, activation='relu'))          # fully connected layer
model.add(Dropout(0.5))                           # randomly drop 50% neurons to reduce overfitting

model.add(Dense(num_classes, activation='softmax'))  # output: probability for each class

model.summary()
```
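The shapes and parameter counts that `model.summary()` prints can be worked out by hand, which helps when changing the image size or layer widths. The sketch below assumes the Keras defaults used above (no padding, convolution stride 1) and assumes 38 classes purely for illustration; use the count printed in step 2 for your dataset.

```python
def output_size(size, window, stride=1):
    """Spatial size after an unpadded ('valid') convolution or pooling window."""
    return (size - window) // stride + 1

size, channels = 64, 3
params = []

# Conv2D(32, (3, 3)): kernel_h * kernel_w * in_channels * filters weights + 1 bias per filter
params.append(3 * 3 * channels * 32 + 32)
size, channels = output_size(size, 3), 32      # 62 x 62 x 32
size = output_size(size, 2, stride=2)          # 31 x 31 x 32 after MaxPooling2D

# Conv2D(64, (3, 3))
params.append(3 * 3 * channels * 64 + 64)
size, channels = output_size(size, 3), 64      # 29 x 29 x 64
size = output_size(size, 2, stride=2)          # 14 x 14 x 64 after MaxPooling2D

flat = size * size * channels                  # Flatten: 12544 values
params.append(flat * 128 + 128)                # Dense(128)

num_classes = 38                               # assumption for illustration only
params.append(128 * num_classes + num_classes) # Dense(num_classes)

print("Flattened features:", flat)
print("Parameters per layer:", params)
print("Total parameters:", sum(params))
```

Almost all of the parameters sit in the first `Dense` layer, so adding a third convolution and pooling stage before `Flatten` would shrink the model considerably.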

### 6. Compile the Model:

```python3
# categorical_crossentropy: standard loss for multi-class classification with one-hot labels
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
```
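As a rough illustration of what this loss measures, the sketch below computes categorical cross-entropy by hand with NumPy. The helper name and the clipping constant `eps` are assumptions made for this example, not Keras internals.

```python
import numpy as np

def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean cross-entropy between one-hot targets and predicted probabilities."""
    y_pred = np.clip(y_pred, eps, 1.0 - eps)   # avoid log(0)
    # Only the probability assigned to the true class contributes to each sample's loss
    per_sample = -np.sum(y_true * np.log(y_pred), axis=1)
    return float(np.mean(per_sample))

y_true = np.array([[0, 0, 1], [1, 0, 0]], dtype=float)
confident = np.array([[0.05, 0.05, 0.90], [0.80, 0.10, 0.10]])
unsure = np.array([[0.30, 0.30, 0.40], [0.40, 0.30, 0.30]])

loss_confident = categorical_crossentropy(y_true, confident)
loss_unsure = categorical_crossentropy(y_true, unsure)
print(f"Confident predictions: {loss_confident:.4f}")
print(f"Unsure predictions:    {loss_unsure:.4f}")
```

Both sets of predictions pick the correct class, yet the confident ones get a much lower loss, which is why loss can keep improving after accuracy has plateaued.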

### 7. Train the Model:

```python3
history = model.fit(X_train, y_train, epochs=10, batch_size=32, validation_split=0.2)
```

### 8. Evaluate the Model on Test Data:

```python3
loss, accuracy = model.evaluate(X_test, y_test)
print(f"Test Loss: {loss:.4f}")
print(f"Test Accuracy: {accuracy*100:.2f}%")
```

### 9. Plot Training vs Validation Accuracy:

```python3
plt.plot(history.history['accuracy'], label='Training Accuracy')
plt.plot(history.history['val_accuracy'], label='Validation Accuracy')
plt.title('CNN Model Accuracy Over Epochs')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.legend()
plt.grid(True)
plt.show()
```

### 10. Plot Training vs Validation Loss:

```python3
plt.plot(history.history['loss'], label='Training Loss')
plt.plot(history.history['val_loss'], label='Validation Loss')
plt.title('CNN Model Loss Over Epochs')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend()
plt.grid(True)
plt.show()
```

### 11. Confusion Matrix and Classification Report:

```python3
y_pred = np.argmax(model.predict(X_test), axis=1)  # predicted class index
y_true = np.argmax(y_test, axis=1)                  # actual class index (from one-hot)

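# Pass labels explicitly so rows/columns always line up with the tick labels below
cm = confusion_matrix(y_true, y_pred, labels=np.arange(len(categories)))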
plt.figure(figsize=(14, 12))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=categories, yticklabels=categories)
plt.title('Confusion Matrix')
plt.ylabel('Actual')
plt.xlabel('Predicted')
plt.xticks(rotation=90)
plt.tight_layout()
plt.show()

print("\nClassification Report:\n")
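# Explicit labels avoid a length mismatch with target_names if a class is missing from the test set
print(classification_report(y_true, y_pred, labels=np.arange(len(categories)),
                            target_names=categories, zero_division=0))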
```
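The precision and recall columns in the classification report can be read straight off the confusion matrix. The sketch below uses a small made-up 3-class matrix (not results from this model) and follows the scikit-learn convention of rows as actual classes and columns as predicted classes; `per_class_metrics` is a hypothetical helper.

```python
import numpy as np

def per_class_metrics(cm):
    """Return (precision, recall) arrays from a confusion matrix (rows = actual)."""
    cm = np.asarray(cm, dtype=float)
    true_pos = np.diag(cm)
    predicted = cm.sum(axis=0)   # column sums: how often each class was predicted
    actual = cm.sum(axis=1)      # row sums: how many samples truly belong to each class
    # Classes that were never predicted (or never present) get 0 instead of a division error
    precision = np.divide(true_pos, predicted, out=np.zeros_like(true_pos), where=predicted > 0)
    recall = np.divide(true_pos, actual, out=np.zeros_like(true_pos), where=actual > 0)
    return precision, recall

# Made-up example matrix for illustration
toy_cm = [[8, 2, 0],
          [1, 7, 2],
          [0, 1, 9]]

precision, recall = per_class_metrics(toy_cm)
for i, (p, r) in enumerate(zip(precision, recall)):
    print(f"class {i}: precision={p:.2f} recall={r:.2f}")
```

A class with high recall but low precision is one the model predicts too eagerly, which shows up as a dense column in the heatmap.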

---

## Miscellaneous

- [Dataset source](https://www.kaggle.com/datasets/vipoooool/new-plant-diseases-dataset)

---