Compare commits
25 Commits
1ba1b48ff8..main
| Author | SHA1 | Date | |
|---|---|---|---|
| | 74f3731e77 | | |
| | e7cae4d012 | | |
| | 7acb4c9f4c | | |
| | 63dd320c66 | | |
| | 8619784c68 | | |
| | 124f6b7878 | | |
| | d1004db0af | | |
| | c708ed57ad | | |
| | c313cb0ec9 | | |
| | a588629812 | | |
| | c0c22a12e7 | | |
| | 95f1dcc828 | | |
| | ff7638bd70 | | |
| | 1432c59bc4 | | |
| | bb9a370a98 | | |
| | c4c460a81f | | |
| | a0d06838c2 | | |
| | 14d70779e9 | | |
| | 8cf306ce2a | | |
| | 0e97128ff2 | | |
| | 5df648ac33 | | |
| | e5347feffc | | |
| | 9c83c1aab2 | | |
| | fc3b508b39 | | |
| | ce5c95856d | | |
+189
@@ -0,0 +1,189 @@
|
||||
# Practical-1 (Uber)
|
||||
|
||||
Problem Statement: Predict the price of the Uber ride from a given pickup point to the agreed drop-off location.
|
||||
Perform the following tasks:
|
||||
1. Pre-process the dataset.
|
||||
2. Identify outliers.
|
||||
3. Check the correlation.
|
||||
4. Implement linear regression and random forest regression models.
|
||||
5. Evaluate the models and compare their respective scores like R2, RMSE, etc.
|
||||
|
||||
> [!NOTE]
|
||||
> Dataset available in [Datasets](../Datasets/uber.csv) directory.
|
||||
|
||||
---
|
||||
|
||||
## Steps
|
||||
|
||||
0. Importing Libraries
1. Data Loading and Pre-processing
2. Outlier Detection and Removal
3. Feature Engineering (Distance Calculation)
4. Correlation Analysis
5. Model Training (Linear Regression & Random Forest)
6. Model Evaluation
7. Comparison
|
||||
|
||||
---
|
||||
|
||||
## Code
|
||||
|
||||
### 0. Importing Libraries:
|
||||
|
||||
```python3
|
||||
# Import necessary libraries
|
||||
import pandas as pd
|
||||
import numpy as np
|
||||
import matplotlib.pyplot as plt
|
||||
import seaborn as sns
|
||||
from sklearn.model_selection import train_test_split
|
||||
from sklearn.linear_model import LinearRegression
|
||||
from sklearn.ensemble import RandomForestRegressor
|
||||
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
|
||||
from math import radians, cos, sin, asin, sqrt
|
||||
```
|
||||
|
||||
### 1. Data Loading & Preprocessing:
|
||||
|
||||
```python3
|
||||
# Load the dataset
|
||||
df = pd.read_csv("uber.csv") # change to your local path if needed
|
||||
print("Initial Data Shape:", df.shape)
|
||||
print(df.head())
|
||||
|
||||
# Drop rows with missing values
|
||||
df.dropna(inplace=True)
|
||||
print("After dropping missing values:", df.shape)
|
||||
|
||||
# Convert pickup_datetime to a datetime object; unparseable values become NaT
df['pickup_datetime'] = pd.to_datetime(df['pickup_datetime'], errors='coerce')

# Drop rows whose timestamp could not be parsed, so the derived features contain no NaN
df.dropna(subset=['pickup_datetime'], inplace=True)
|
||||
|
||||
# Extract useful datetime features
|
||||
df['hour'] = df['pickup_datetime'].dt.hour
|
||||
df['day'] = df['pickup_datetime'].dt.day
|
||||
df['month'] = df['pickup_datetime'].dt.month
|
||||
df['year'] = df['pickup_datetime'].dt.year
|
||||
df['day_of_week'] = df['pickup_datetime'].dt.dayofweek
|
||||
|
||||
# Drop the raw datetime and key columns (not used as direct features)
|
||||
df.drop(['pickup_datetime', 'key'], axis=1, inplace=True, errors='ignore')
|
||||
|
||||
print("\nColumns after feature extraction:\n", df.columns)
|
||||
```
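
To make the extracted features concrete, here is a small standard-library sketch that derives the same five values from a single timestamp. The sample string and its `... UTC` layout are assumptions for illustration, not a row taken from `uber.csv`. `datetime.weekday()` follows the same Monday = 0 convention as pandas' `dt.dayofweek`.

```python
from datetime import datetime

def extract_time_features(timestamp):
    """Return hour, day, month, year and day_of_week for one timestamp string."""
    # Assumed layout: "YYYY-MM-DD HH:MM:SS UTC" (hypothetical sample format)
    dt = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S UTC")
    return {
        "hour": dt.hour,
        "day": dt.day,
        "month": dt.month,
        "year": dt.year,
        "day_of_week": dt.weekday(),  # Monday = 0, Sunday = 6
    }

features = extract_time_features("2015-05-07 19:52:06 UTC")
print(features)
```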
|
||||
|
||||
### 2. Outlier Detection & Removal:
|
||||
|
||||
```python3
|
||||
# Remove entries with unrealistic fares
|
||||
df = df[(df['fare_amount'] > 0) & (df['fare_amount'] < 100)]
|
||||
|
||||
# Remove unrealistic latitude and longitude values
|
||||
df = df[(df['pickup_latitude'] <= 90) & (df['pickup_latitude'] >= -90)]
|
||||
df = df[(df['dropoff_latitude'] <= 90) & (df['dropoff_latitude'] >= -90)]
|
||||
df = df[(df['pickup_longitude'] <= 180) & (df['pickup_longitude'] >= -180)]
|
||||
df = df[(df['dropoff_longitude'] <= 180) & (df['dropoff_longitude'] >= -180)]
|
||||
|
||||
print("Data shape after removing outliers:", df.shape)
|
||||
```
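
The filters above use fixed thresholds. A common alternative for the "identify outliers" task is the 1.5 × IQR rule, sketched below with the standard library only. The function name `iqr_bounds` and the fare values are made up for illustration. Note that `statistics.quantiles` may interpolate slightly differently from pandas' `quantile`, so the fences can differ a little between the two.

```python
from statistics import quantiles

def iqr_bounds(values, k=1.5):
    """Return (lower, upper) fences using the k * IQR rule."""
    q1, _, q3 = quantiles(values, n=4)  # the three quartile cut points
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Made-up fares for illustration only
fares = [5.0, 7.5, 6.0, 8.0, 9.5, 7.0, 6.5, 250.0]
lower, upper = iqr_bounds(fares)
outliers = [fare for fare in fares if fare < lower or fare > upper]
print(f"fences: ({lower:.3f}, {upper:.3f}), outliers: {outliers}")
```

On a DataFrame, the same idea would amount to keeping only the rows whose `fare_amount` lies between the two fences.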
|
||||
|
||||
### 3. Feature Engineering - Distance Calculation:
|
||||
|
||||
```python3
|
||||
# Define Haversine function to calculate distance between pickup and drop-off
|
||||
def haversine(lat1, lon1, lat2, lon2):
    # Convert decimal degrees to radians
    lat1, lon1, lat2, lon2 = map(radians, [lat1, lon1, lat2, lon2])
    # Haversine formula
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
    c = 2 * asin(sqrt(a))
    km = 6371 * c  # 6371 km is the mean radius of the Earth
    return km
|
||||
|
||||
# Apply the Haversine formula
|
||||
df['distance_km'] = df.apply(
    lambda x: haversine(x['pickup_latitude'], x['pickup_longitude'],
                        x['dropoff_latitude'], x['dropoff_longitude']),
    axis=1
)
|
||||
|
||||
# Remove zero-distance trips
|
||||
df = df[df['distance_km'] > 0]
|
||||
```
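
For reference, the `haversine` function above computes the following, written here in standard notation, with latitudes φ and longitudes λ in radians and R = 6371 km as in the code:

$$
a = \sin^2\!\left(\frac{\Delta\varphi}{2}\right) + \cos\varphi_1 \cos\varphi_2 \, \sin^2\!\left(\frac{\Delta\lambda}{2}\right),
\qquad
d = 2R \arcsin\!\left(\sqrt{a}\right)
$$

As a quick sanity check, two points on the equator one degree of longitude apart give $a = \sin^2(0.5^\circ)$, so $d = 2R \cdot 0.5^\circ = R \cdot \pi / 180 \approx 111.19$ km.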
|
||||
|
||||
### 4. Correlation Analysis:
|
||||
|
||||
```python3
|
||||
plt.figure(figsize=(10, 6))
|
||||
sns.heatmap(df.corr(), annot=True, cmap='coolwarm')
|
||||
plt.title("Feature Correlation Heatmap")
|
||||
plt.show()
|
||||
```
|
||||
|
||||
### 5. Model Training:
|
||||
|
||||
```python3
|
||||
# Define features and target
|
||||
X = df[['distance_km', 'hour', 'day', 'month', 'year', 'day_of_week']]
|
||||
y = df['fare_amount']
|
||||
|
||||
# Split data into train and test sets
|
||||
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
|
||||
|
||||
# -------------------- Linear Regression --------------------
|
||||
lr_model = LinearRegression()
|
||||
lr_model.fit(X_train, y_train)
|
||||
y_pred_lr = lr_model.predict(X_test)
|
||||
|
||||
# -------------------- Random Forest Regression --------------------
|
||||
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)
|
||||
rf_model.fit(X_train, y_train)
|
||||
y_pred_rf = rf_model.predict(X_test)
|
||||
```
|
||||
|
||||
### 6. Model Evaluation:
|
||||
|
||||
```python3
|
||||
def evaluate_model(y_true, y_pred, model_name):
    r2 = r2_score(y_true, y_pred)
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    mae = mean_absolute_error(y_true, y_pred)
    print(f"\nModel: {model_name}")
    print(f"R² Score: {r2:.4f}")
    print(f"RMSE: {rmse:.4f}")
    print(f"MAE: {mae:.4f}")
    return r2, rmse, mae
|
||||
|
||||
# Evaluate both models
|
||||
lr_scores = evaluate_model(y_test, y_pred_lr, "Linear Regression")
|
||||
rf_scores = evaluate_model(y_test, y_pred_rf, "Random Forest Regressor")
|
||||
```
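
To show what the three scores measure, the sketch below recomputes them from their textbook definitions in plain Python. The helper name `regression_metrics` and the four sample values are illustrative assumptions, not output from the models above.

```python
from math import sqrt

def regression_metrics(y_true, y_pred):
    """Compute R², RMSE and MAE from their definitions."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    # Residual and total sums of squares
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1 - ss_res / ss_tot   # share of variance explained
    rmse = sqrt(ss_res / n)    # root of the mean squared error
    mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n
    return r2, rmse, mae

# Made-up values for illustration
r2, rmse, mae = regression_metrics([3, 5, 7, 9], [2, 5, 8, 9])
print(f"R2 = {r2:.4f}, RMSE = {rmse:.4f}, MAE = {mae:.4f}")
```

A higher R² and lower RMSE and MAE indicate a better fit. RMSE penalizes large errors more heavily than MAE does.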
|
||||
|
||||
### 7. Comparison:
|
||||
|
||||
```python3
|
||||
results = pd.DataFrame({
    'Model': ['Linear Regression', 'Random Forest Regressor'],
    'R2': [lr_scores[0], rf_scores[0]],
    'RMSE': [lr_scores[1], rf_scores[1]],
    'MAE': [lr_scores[2], rf_scores[2]]
})
|
||||
|
||||
print("\nModel Comparison:")
|
||||
print(results)
|
||||
```
|
||||
|
||||
```python3
|
||||
# Plot comparison
|
||||
plt.figure(figsize=(8,5))
|
||||
sns.barplot(x='Model', y='R2', data=results)
|
||||
plt.title("R² Score Comparison between Models")
|
||||
plt.show()
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Miscellaneous
|
||||
|
||||
- [Dataset source](https://www.kaggle.com/datasets/yasserh/uber-fares-dataset)
|
||||
|
||||
---
|
||||
+114
@@ -0,0 +1,114 @@
|
||||
# Practical-2 (Spam Email Detection)
|
||||
|
||||
Problem Statement: Classify the email using the binary classification method. Email Spam detection has two states: a) Normal State – Not Spam, b) Abnormal State – Spam. Use K-Nearest Neighbors and Support Vector Machine for classification. Analyze their performance.
|
||||
|
||||
> [!NOTE]
|
||||
> Dataset available in [Datasets](../Datasets/emails.csv) directory.
|
||||
|
||||
---
|
||||
|
||||
## Steps
|
||||
|
||||
1. Import libraries
|
||||
2. Load dataset
|
||||
3. Data splitting (training and testing)
|
||||
4. KNN
|
||||
5. SVM
|
||||
6. Plotting
|
||||
|
||||
---
|
||||
|
||||
## Code
|
||||
|
||||
### 1. Import libraries:
|
||||
|
||||
```python3
|
||||
import pandas as pd
|
||||
from sklearn.model_selection import train_test_split
|
||||
from sklearn.neighbors import KNeighborsClassifier
|
||||
from sklearn.svm import SVC
|
||||
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
|
||||
import matplotlib.pyplot as plt
|
||||
import seaborn as sns
|
||||
```
|
||||
|
||||
### 2. Load dataset:
|
||||
|
||||
```python3
|
||||
df = pd.read_csv("emails.csv", encoding="ISO-8859-1") # Adjust path if needed
|
||||
|
||||
# Drop unnecessary columns if present
|
||||
if "Email No." in df.columns:
    df = df.drop(columns=["Email No."])
|
||||
|
||||
# Ensure label is integer
|
||||
df["Prediction"] = df["Prediction"].astype(int)
|
||||
|
||||
# Features & target
|
||||
X = df.drop(columns=["Prediction"])
|
||||
y = df["Prediction"]
|
||||
|
||||
# Print basic info
|
||||
print(df.columns)
|
||||
print(df.head(5))
|
||||
```
|
||||
|
||||
### 3. Data splitting (training and testing):
|
||||
|
||||
```python3
|
||||
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
|
||||
```
|
||||
|
||||
### 4. KNN:
|
||||
|
||||
```python3
|
||||
knn = KNeighborsClassifier(n_neighbors=5)
|
||||
knn.fit(X_train, y_train)
|
||||
y_pred_knn = knn.predict(X_test)
|
||||
|
||||
print("\n--- KNN Performance ---")
|
||||
print("Accuracy:", accuracy_score(y_test, y_pred_knn))
|
||||
print("Classification Report:\n", classification_report(y_test, y_pred_knn))
|
||||
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred_knn))
|
||||
```
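
As a rough illustration of what a k-nearest-neighbours classifier does, the sketch below classifies a query point by majority vote among its `k` closest training points, using Euclidean distance. It is a toy version under stated assumptions: the helper `knn_predict` and the 2-D points are made up, and it is not how scikit-learn implements the search internally.

```python
from collections import Counter
from math import dist

def knn_predict(train_points, train_labels, query, k=5):
    """Predict a label by majority vote among the k nearest training points."""
    # Sort training indices by Euclidean distance to the query
    order = sorted(range(len(train_points)),
                   key=lambda i: dist(train_points[i], query))
    votes = Counter(train_labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]

# Toy 2-D data (made up): label 0 = not spam, 1 = spam
points = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5), (5, 6), (6, 5), (6, 6)]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
print(knn_predict(points, labels, (0.5, 0.5), k=3))
print(knn_predict(points, labels, (5.5, 5.5), k=3))
```

Because the vote depends on raw distances, features with large numeric ranges can dominate the result, which is worth keeping in mind when comparing KNN against SVM.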
|
||||
|
||||
### 5. SVM:
|
||||
|
||||
```python3
|
||||
svm = SVC(kernel='linear', random_state=42) # Linear kernel for binary classification
|
||||
svm.fit(X_train, y_train)
|
||||
y_pred_svm = svm.predict(X_test)
|
||||
|
||||
print("\n--- SVM Performance ---")
|
||||
print("Accuracy:", accuracy_score(y_test, y_pred_svm))
|
||||
print("Classification Report:\n", classification_report(y_test, y_pred_svm))
|
||||
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred_svm))
|
||||
```
|
||||
|
||||
### 6. Plotting:
|
||||
|
||||
```python3
|
||||
fig, ax = plt.subplots(1, 2, figsize=(12, 5))
|
||||
|
||||
sns.heatmap(confusion_matrix(y_test, y_pred_knn), annot=True, fmt="d", cmap="Blues", ax=ax[0])
|
||||
ax[0].set_title("KNN Confusion Matrix")
|
||||
ax[0].set_xlabel("Predicted")
|
||||
ax[0].set_ylabel("Actual")
|
||||
|
||||
sns.heatmap(confusion_matrix(y_test, y_pred_svm), annot=True, fmt="d", cmap="Greens", ax=ax[1])
|
||||
ax[1].set_title("SVM Confusion Matrix")
|
||||
ax[1].set_xlabel("Predicted")
|
||||
ax[1].set_ylabel("Actual")
|
||||
|
||||
plt.show()
|
||||
```
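
The figures in the classification reports can be read directly off a confusion matrix. The sketch below assumes the scikit-learn layout for binary labels, `[[TN, FP], [FN, TP]]` (rows are actual classes, columns are predicted classes), and uses made-up counts rather than results from this dataset.

```python
def binary_metrics(cm):
    """Derive accuracy, precision, recall and F1 from a 2x2 confusion matrix."""
    (tn, fp), (fn, tp) = cm
    total = tn + fp + fn + tp
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp)   # of the emails flagged as spam, how many were spam
    recall = tp / (tp + fn)      # of the actual spam emails, how many were caught
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Made-up counts for illustration
cm_example = [[50, 5], [10, 35]]
accuracy, precision, recall, f1 = binary_metrics(cm_example)
print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
```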
|
||||
|
||||
---
|
||||
|
||||
## Miscellaneous
|
||||
|
||||
- [Dataset source](https://www.kaggle.com/datasets/balaka18/email-spam-classification-dataset-csv)
|
||||
|
||||
---
|
||||
@@ -0,0 +1,82 @@
|
||||
# Practical-4 (Gradient Descent Algorithm)
|
||||
|
||||
Problem Statement: Implement Gradient Descent Algorithm to find the local minima of a function. For example, find the local minima of the function y=(x+3)² starting from the point x=2.
|
||||
|
||||
---
|
||||
|
||||
## Steps
|
||||
|
||||
1. Define the function and its derivative
|
||||
2. Initialize parameters for Gradient Descent
|
||||
3. Gradient Descent Loop
|
||||
4. Print the result
|
||||
5. Plotting
|
||||
|
||||
---
|
||||
|
||||
## Code
|
||||
|
||||
### 0. Import libraries:
|
||||
|
||||
```python3
|
||||
import numpy as np
|
||||
import matplotlib.pyplot as plt
|
||||
```
|
||||
|
||||
### 1. Define the function and its derivative:
|
||||
|
||||
```python3
|
||||
def f(x):
    return (x + 3)**2

def grad_f(x):
    return 2 * (x + 3)  # derivative of f(x)
|
||||
```
|
||||
|
||||
### 2. Initialize parameters for Gradient Descent:
|
||||
|
||||
```python3
|
||||
x_current = 2 # starting point
|
||||
learning_rate = 0.1 # step size
|
||||
tolerance = 1e-6 # convergence tolerance
|
||||
max_iterations = 25 # maximum iterations
|
||||
history = [x_current] # storing history
|
||||
```
|
||||
|
||||
### 3. Gradient Descent Loop:
|
||||
|
||||
```python3
|
||||
for i in range(max_iterations):
    gradient = grad_f(x_current)
    x_next = x_current - learning_rate * gradient  # update step

    # Check convergence
    if abs(x_next - x_current) < tolerance:
        print(f"Converged after {i+1} iterations.")
        break

    x_current = x_next
    history.append(x_current)
    print(f"Iteration {i+1}: x = {x_current:.4f}, f(x) = {f(x_current):.4f}")
|
||||
```
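
For this particular quadratic the update can be checked by hand: each step multiplies the distance to the minimum by `1 - 2 * learning_rate`, so after `k` steps `x + 3 = (x0 + 3) * (1 - 2 * learning_rate)**k`. The sketch below wraps the same loop in a hypothetical helper, `run_gd`, to compare the iterates with that closed form and to see how the iteration cap interacts with the tolerance.

```python
def grad_f(x):
    return 2 * (x + 3)  # derivative of (x + 3)**2

def run_gd(x0, learning_rate, tolerance, max_iterations):
    """Return (x, iterations_used, converged) for plain gradient descent."""
    x = x0
    for i in range(max_iterations):
        x_next = x - learning_rate * grad_f(x)
        if abs(x_next - x) < tolerance:
            return x, i + 1, True
        x = x_next
    return x, max_iterations, False

# Same settings as above: the cap of 25 is reached first
x_25, used_25, converged_25 = run_gd(2, 0.1, 1e-6, 25)
closed_form = -3 + 5 * 0.8 ** 25
print(x_25, closed_form, converged_25)

# With a larger cap the tolerance check ends the loop instead
x_long, used_long, converged_long = run_gd(2, 0.1, 1e-6, 1000)
print(x_long, used_long, converged_long)
```

With a learning rate of 0.1, a tolerance of 1e-6 and a cap of 25 iterations, the loop stops at the cap before the tolerance check fires. Raising `max_iterations` or the learning rate (while keeping it below 1 for this function) lets the convergence message appear.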
|
||||
|
||||
### 4. Print the result:
|
||||
|
||||
```python3
|
||||
print("Local minima at x =", x_current)
|
||||
print("Function value at local minima y =", f(x_current))
|
||||
```
|
||||
|
||||
### 5. Plotting:
|
||||
|
||||
```python3
|
||||
plt.plot(history, [f(val) for val in history], marker='o')
|
||||
plt.xlabel("x values")
|
||||
plt.ylabel("f(x)")
|
||||
plt.title("Gradient Descent Convergence")
|
||||
plt.grid()
|
||||
plt.show()
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
+121
@@ -0,0 +1,121 @@
|
||||
# Practical-6 (Clustering)
|
||||
|
||||
Problem Statement: Implement K-Means clustering/ hierarchical clustering on `sales_data_sample.csv` dataset. Determine the number of clusters using the elbow method.
|
||||
|
||||
> [!NOTE]
|
||||
> Dataset available in [Datasets](../Datasets/sales_data_sample.csv) directory.
|
||||
|
||||
---
|
||||
|
||||
## Steps
|
||||
|
||||
1. Import libraries
|
||||
2. Load dataset
|
||||
3. Select numerical features for clustering
|
||||
4. Standardize data
|
||||
5. K-Means clustering
|
||||
6. Hierarchical clustering
|
||||
|
||||
---
|
||||
|
||||
## Code
|
||||
|
||||
### 1. Import libraries:
|
||||
|
||||
```python3
|
||||
import pandas as pd
|
||||
import matplotlib.pyplot as plt
|
||||
from sklearn.preprocessing import StandardScaler
|
||||
from sklearn.cluster import KMeans
|
||||
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
|
||||
import seaborn as sns
|
||||
```
|
||||
|
||||
### 2. Load dataset:
|
||||
|
||||
```python3
|
||||
df = pd.read_csv("sales_data_sample.csv", encoding='latin1', on_bad_lines='skip')
|
||||
print("Dataset shape:", df.shape)
|
||||
print(df.head())
|
||||
```
|
||||
|
||||
### 3. Select numerical features for clustering:
|
||||
|
||||
```python3
|
||||
X = df.select_dtypes(include=['int64', 'float64'])
|
||||
print("Features used for clustering:\n", X.head())
|
||||
|
||||
# Select relevant numeric columns
|
||||
# X = df[['SALES', 'QUANTITYORDERED', 'PRICEEACH']]
|
||||
|
||||
# Handle missing values if any
|
||||
# X = X.dropna()
|
||||
```
|
||||
|
||||
### 4. Standardize data:
|
||||
|
||||
```python3
|
||||
scaler = StandardScaler()
|
||||
X_scaled = scaler.fit_transform(X)
|
||||
```
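
Standardizing means rescaling each column to zero mean and unit variance, so that no single feature dominates the distance calculations used in clustering. The standard-library sketch below applies the z-score formula to one made-up column. It uses the population standard deviation, which matches what `StandardScaler` computes.

```python
from statistics import fmean, pstdev

def standardize(column):
    """Return z-scores: (value - mean) / population standard deviation."""
    mu = fmean(column)
    sigma = pstdev(column)  # population std (divides by n, not n - 1)
    return [(value - mu) / sigma for value in column]

# Made-up column for illustration
z_scores = standardize([2, 4, 4, 4, 5, 5, 7, 9])
print(z_scores)
```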
|
||||
|
||||
### 5. K-Means clustering:
|
||||
|
||||
```python3
|
||||
# Determine optimal number of clusters using Elbow Method
|
||||
wcss = []
|
||||
for k in range(1, 11):
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(X_scaled)
    wcss.append(kmeans.inertia_)
|
||||
|
||||
# Plot Elbow Method
|
||||
plt.figure(figsize=(6,4))
|
||||
plt.plot(range(1, 11), wcss, marker='o')
|
||||
plt.title('Elbow Method')
|
||||
plt.xlabel('Number of clusters (k)')
|
||||
plt.ylabel('Inertia (WCSS)')
|
||||
plt.show()
|
||||
|
||||
# Fit KMeans with chosen number of clusters (example: 3 clusters)
|
||||
kmeans = KMeans(n_clusters=3, random_state=42) # Add n_init=10 param in the function to suppress warnings
|
||||
clusters_kmeans = kmeans.fit_predict(X_scaled)
|
||||
df['KMeans_Cluster'] = clusters_kmeans
|
||||
|
||||
# Visualize clusters
|
||||
sns.scatterplot(x='SALES', y='PRICEEACH', hue='KMeans_Cluster', data=df, palette='viridis')
|
||||
plt.title("K-Means Clustering")
|
||||
plt.show()
|
||||
|
||||
print("\nK-Means Cluster Centers:\n", kmeans.cluster_centers_)
|
||||
print("\nCluster counts:\n", df['KMeans_Cluster'].value_counts())
|
||||
```
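
The elbow is usually read off the plot by eye. As a rough programmatic aid, one possible heuristic (an assumption here, not part of the original practical) is to pick the `k` where the WCSS curve bends most sharply, measured by its largest second difference. The helper `elbow_k` and the sample values are made up for illustration.

```python
def elbow_k(wcss_values, k_start=1):
    """Return the k with the largest second difference of the WCSS curve."""
    best_k, best_bend = None, float("-inf")
    # The first and last points have no neighbour on one side, so skip them
    for i in range(1, len(wcss_values) - 1):
        bend = wcss_values[i - 1] - 2 * wcss_values[i] + wcss_values[i + 1]
        if bend > best_bend:
            best_k, best_bend = k_start + i, bend
    return best_k

# Made-up inertia values for k = 1..6
wcss_example = [1000, 600, 200, 170, 150, 140]
chosen_k = elbow_k(wcss_example)
print(chosen_k)
```

The result is only a suggestion. It is worth comparing it with the plot before fixing `n_clusters`.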
|
||||
|
||||
### 6. Hierarchical clustering:
|
||||
|
||||
```python3
|
||||
# Create linkage matrix
|
||||
Z = linkage(X_scaled, method='ward')
|
||||
|
||||
# Plot dendrogram
|
||||
plt.figure(figsize=(10,5))
|
||||
dendrogram(Z)
|
||||
plt.title('Hierarchical Clustering Dendrogram')
|
||||
plt.xlabel('Samples')
|
||||
plt.ylabel('Distance')
|
||||
plt.show()
|
||||
|
||||
# Assign clusters (example: 3 clusters)
|
||||
clusters_hier = fcluster(Z, t=3, criterion='maxclust')
|
||||
df['Hierarchical_Cluster'] = clusters_hier
|
||||
|
||||
print("\nHierarchical Cluster counts:\n", pd.Series(clusters_hier).value_counts())
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Miscellaneous
|
||||
|
||||
- [Dataset source](https://www.kaggle.com/datasets/kyanyoga/sample-sales-data)
|
||||
|
||||
---
|
||||
+5173
File diff suppressed because one or more lines are too long
File diff suppressed because it is too large
+200001
File diff suppressed because it is too large
Executable
+389
File diff suppressed because one or more lines are too long
Executable
+265
File diff suppressed because one or more lines are too long
Executable
+219
File diff suppressed because one or more lines are too long
Executable
+362
File diff suppressed because one or more lines are too long
BIN
Binary file not shown.
BIN
Binary file not shown.
BIN
Binary file not shown.
BIN
Binary file not shown.
Binary file not shown.
BIN
Binary file not shown.
@@ -10,6 +10,24 @@ This repository contains vital resources for the Machine Learning course under t
|
||||
|
||||
### Codes
|
||||
|
||||
1. [Code-1 (Uber)](Codes/Code-1.md)
|
||||
2. [Code-2 (Spam Email Detection)](Codes/Code-2.md)
|
||||
3. [Code-4 (Gradient Descent Algorithm)](Codes/Code-4.md)
|
||||
4. [Code-6 (Clustering)](Codes/Code-6.md)
|
||||
|
||||
### Jupyter Notebooks
|
||||
|
||||
1. [Notebook-1 (Uber)](Notebooks/Notebook-1.ipynb)
|
||||
2. [Notebook-2 (Spam Email Detection)](Notebooks/Notebook-2.ipynb)
|
||||
3. [Notebook-4 (Gradient Descent Algorithm)](Notebooks/Notebook-4.ipynb)
|
||||
4. [Notebook-6 (Clustering)](Notebooks/Notebook-6.ipynb)
|
||||
|
||||
### Datasets
|
||||
|
||||
1. [Dataset for Practical-1](Datasets/uber.csv)
|
||||
2. [Dataset for Practical-2](Datasets/emails.csv)
|
||||
3. [Dataset for Practical-6](Datasets/sales_data_sample.csv)
|
||||
|
||||
### Assignments
|
||||
|
||||
- Assignment-1:
|
||||
@@ -36,6 +54,8 @@ This repository contains vital resources for the Machine Learning course under t
|
||||
|
||||
### [IN-SEM PYQ Answers](Notes/IN-SEM%20PYQ%20Answers)
|
||||
|
||||
### [END-SEM PYQ Answers](Notes/END-SEM%20PYQ%20Answers)
|
||||
|
||||
---
|
||||
|
||||
## Miscellaneous
|
||||
|
||||