# Practical-2 (Spam Email Detection)
Problem Statement: Classify emails using binary classification. Spam detection has two states: a) Normal State (Not Spam), b) Abnormal State (Spam). Use K-Nearest Neighbors (KNN) and Support Vector Machine (SVM) for classification, and analyze their performance.
> [!NOTE]
> The dataset is available in the [Datasets](../Datasets/emails.csv) directory.
---
## Steps
1. Import libraries
2. Load dataset
3. Data splitting (training and testing)
4. KNN
5. SVM
6. Plotting
---
## Code
### 1. Import libraries:
```python3
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns
```
### 2. Load dataset:
```python3
df = pd.read_csv("emails.csv", encoding="ISO-8859-1") # Adjust path if needed
# Drop unnecessary columns if present
if "Email No." in df.columns:
    df = df.drop(columns=["Email No."])
# Ensure label is integer
df["Prediction"] = df["Prediction"].astype(int)
# Features & target
X = df.drop(columns=["Prediction"])
y = df["Prediction"]
# Print basic info
print(df.columns)
print(df.head(5))
```
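Before training, it can help to sanity-check the loaded frame. The sketch below is not part of the original practical. It uses a tiny hand-made frame in place of `emails.csv`: the `Email No.` and `Prediction` columns follow the code above, while the word columns `the` and `free` and all values are hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for emails.csv; only the "Email No." and "Prediction"
# column names are taken from the loading code, everything else is made up.
df_demo = pd.DataFrame({
    "Email No.": ["Email 1", "Email 2", "Email 3", "Email 4"],
    "the": [0, 8, 0, 5],
    "free": [3, 0, 2, 0],
    "Prediction": [1, 0, 1, 0],
})

# Total number of missing cells across the whole frame
missing_total = int(df_demo.isna().sum().sum())

# Share of each class (0 = not spam, 1 = spam)
class_share = df_demo["Prediction"].value_counts(normalize=True).sort_index()

# Feature columns that are not numeric and would break KNN/SVM if left in
non_numeric = (
    df_demo.drop(columns=["Prediction"])
    .select_dtypes(exclude="number")
    .columns.tolist()
)

print("Missing values:", missing_total)
print("Class share:\n", class_share)
print("Non-numeric feature columns:", non_numeric)
```

Running the same three checks on the real `df` shows whether the classes are imbalanced and whether any text column still needs to be dropped.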
### 3. Data splitting (training and testing):
```python3
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
```
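The `stratify=y` argument keeps the spam/not-spam ratio the same in both splits. A minimal sketch of how to confirm this, using synthetic labels (700 not spam, 300 spam) rather than the real dataset; the names `X_demo`, `y_demo`, and the 70/30 ratio are assumptions made only for illustration:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels: 700 not spam (0) and 300 spam (1); ratio chosen for illustration
y_demo = np.array([0] * 700 + [1] * 300)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder single-column features

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# With 0/1 labels, the mean is the fraction of spam in each split
train_spam_ratio = y_tr.mean()
test_spam_ratio = y_te.mean()

print("Train size:", len(y_tr), "spam ratio:", round(train_spam_ratio, 3))
print("Test size:", len(y_te), "spam ratio:", round(test_spam_ratio, 3))
```

The same check on the real split is `y_train.mean()` and `y_test.mean()`; the two values should be nearly identical.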
### 4. KNN:
```python3
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
y_pred_knn = knn.predict(X_test)
print("\n--- KNN Performance ---")
print("Accuracy:", accuracy_score(y_test, y_pred_knn))
print("Classification Report:\n", classification_report(y_test, y_pred_knn))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred_knn))
```
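The value `n_neighbors=5` above is scikit-learn's default rather than a tuned choice. One possible way to pick `k` is cross-validation on the training data, sketched below on synthetic data from `make_classification`. The candidate values of `k`, the fold count, and the data are all assumptions, not results from the email dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary classification data standing in for X_train / y_train
X_demo, y_demo = make_classification(
    n_samples=300, n_features=20, n_informative=5, random_state=42
)

# Mean 5-fold cross-validated accuracy for each candidate k
cv_scores = {}
for k in (1, 3, 5, 7, 9):
    knn_k = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn_k, X_demo, y_demo, cv=5, scoring="accuracy")
    cv_scores[k] = scores.mean()

# Pick the k with the highest mean accuracy
best_k = max(cv_scores, key=cv_scores.get)

for k, score in cv_scores.items():
    print(f"k={k}: mean CV accuracy = {score:.3f}")
print("Best k:", best_k)
```

When applying this to the practical, pass `X_train` and `y_train` so the test set stays untouched until the final evaluation.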
### 5. SVM:
```python3
svm = SVC(kernel='linear', random_state=42) # Linear kernel: fits a linear decision boundary between the two classes
svm.fit(X_train, y_train)
y_pred_svm = svm.predict(X_test)
print("\n--- SVM Performance ---")
print("Accuracy:", accuracy_score(y_test, y_pred_svm))
print("Classification Report:\n", classification_report(y_test, y_pred_svm))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred_svm))
```
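Both KNN and SVM are sensitive to feature scale, and the code above trains on unscaled features. A minimal sketch of one optional variation, wrapping the classifier in a pipeline with `StandardScaler`, is shown below on synthetic data. Whether scaling helps on the email dataset is not something the original practical measures, so treat this as an experiment to try rather than a required step.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic binary classification data standing in for the email features
X_demo, y_demo = make_classification(
    n_samples=300, n_features=20, n_informative=5, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# The scaler is fitted on the training split only, then reused at predict time
svm_scaled = make_pipeline(StandardScaler(), SVC(kernel="linear", random_state=42))
svm_scaled.fit(X_tr, y_tr)

y_pred_scaled = svm_scaled.predict(X_te)
scaled_accuracy = svm_scaled.score(X_te, y_te)
print("Accuracy with scaling:", round(scaled_accuracy, 3))
```

Using a pipeline avoids leaking test-set statistics into the scaler, because `fit` only ever sees the training split.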
### 6. Plotting:
```python3
fig, ax = plt.subplots(1, 2, figsize=(12, 5))
sns.heatmap(confusion_matrix(y_test, y_pred_knn), annot=True, fmt="d", cmap="Blues", ax=ax[0])
ax[0].set_title("KNN Confusion Matrix")
ax[0].set_xlabel("Predicted")
ax[0].set_ylabel("Actual")
sns.heatmap(confusion_matrix(y_test, y_pred_svm), annot=True, fmt="d", cmap="Greens", ax=ax[1])
ax[1].set_title("SVM Confusion Matrix")
ax[1].set_xlabel("Predicted")
ax[1].set_ylabel("Actual")
plt.show()
```
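To analyze the two models side by side, the metrics can be collected into one table. The sketch below uses small hand-written label lists for two hypothetical models (`model_a`, `model_b`) instead of real predictions. The `summarize` helper is also an illustrative name, not part of the original code.

```python
import pandas as pd
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Hand-written labels for illustration only (1 = spam, 0 = not spam)
y_true_demo = [0, 0, 0, 0, 1, 1, 1, 1]
pred_a = [0, 0, 0, 1, 1, 1, 1, 0]  # hypothetical model A: one false positive, one miss
pred_b = [0, 0, 0, 0, 1, 1, 0, 0]  # hypothetical model B: no false positives, two misses


def summarize(y_true, y_pred):
    """Return the main binary classification metrics as a dict."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
        "f1": f1_score(y_true, y_pred),
    }


# One row per model, one column per metric
summary = pd.DataFrame({
    "model_a": summarize(y_true_demo, pred_a),
    "model_b": summarize(y_true_demo, pred_b),
}).T

print(summary.round(3))
```

In this toy case both models have the same accuracy but differ in precision and recall, which is why accuracy alone is not enough for spam detection. For the practical, call `summarize(y_test, y_pred_knn)` and `summarize(y_test, y_pred_svm)` instead.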
---
## Miscellaneous
- [Dataset source](https://www.kaggle.com/datasets/balaka18/email-spam-classification-dataset-csv)
---