Compare commits

25 Commits

Author SHA1 Message Date
notkshitij 74f3731e77 Upload end-sem pyq for ML, november-december 2025. Provided by Ayush Kalaskar. 2026-03-22 02:18:47 +05:30
notkshitij e7cae4d012 Added end-sem pyq answers for unit 6. Collaborative work by Ayush Kalaskar and Himanshu Patil. 2025-12-06 22:19:15 +05:30
notkshitij 7acb4c9f4c Added end-sem pyq answers for unit 5. Collaborative work by Ayush Kalaskar and Himanshu Patil. 2025-12-06 21:30:35 +05:30
notkshitij 63dd320c66 Minor fix: Changed MAY/JUNE 2022 to NOV/DEC 2022. 2025-12-06 21:18:46 +05:30
notkshitij 8619784c68 Added link to end-sem pyq answers. 2025-12-06 01:40:26 +05:30
notkshitij 124f6b7878 Added end-sem pyq answers for unit 4. Collaborative work by Ayush Kalaskar and Himanshu Patil. 2025-12-06 01:39:51 +05:30
notkshitij d1004db0af Added end-sem pyq answers or unit 3. Collaborative work by Ayush Kalaskar and Himanshu Patil. 2025-12-06 00:54:34 +05:30
notkshitij c708ed57ad Added may-june 2025 end-sem pyq (ml) 2025-12-02 13:53:39 +05:30
notkshitij c313cb0ec9 Fixed hierarchical spelling. 2025-11-05 18:57:34 +05:30
notkshitij a588629812 Added links to codes, notebooks and datasets in readme. 2025-11-03 00:15:17 +05:30
notkshitij c0c22a12e7 Modified file permissions for datasets. 2025-11-03 00:15:05 +05:30
notkshitij 95f1dcc828 Fixed naming for notebooks. 2025-11-03 00:14:40 +05:30
notkshitij ff7638bd70 Improved formatting for markdown codes and fixed title for all. 2025-11-03 00:12:37 +05:30
notkshitij 1432c59bc4 Added code in markdown format for A6. 2025-11-03 00:04:26 +05:30
notkshitij bb9a370a98 Added jupyter notebook for practical A6 and its dataset. 2025-11-03 00:04:08 +05:30
notkshitij c4c460a81f Fixed name for a4 code (in codes dir) 2025-11-03 00:00:04 +05:30
notkshitij a0d06838c2 Added code for practical A4 in markdown format. 2025-11-02 23:34:41 +05:30
notkshitij 14d70779e9 Added jupyter notebook for practical A4. 2025-11-02 23:34:06 +05:30
notkshitij 8cf306ce2a Added code in markdown format for code a2. 2025-11-02 22:03:28 +05:30
notkshitij 0e97128ff2 Added jupyter notebook and dataset for practical a2 (spam email detection) 2025-11-02 22:03:10 +05:30
notkshitij 5df648ac33 Added import libraries part. 2025-11-02 22:02:44 +05:30
notkshitij e5347feffc Changed dataset to dataset source in code a1 2025-11-02 21:59:18 +05:30
notkshitij 9c83c1aab2 Added name in title Code-A1.md 2025-11-02 21:49:16 +05:30
notkshitij fc3b508b39 Added code in markdown format for practical A1 (uber ride) 2025-11-02 20:32:12 +05:30
notkshitij ce5c95856d Added Jupyter notebook and dataset for practical A1 (uber ride) 2025-11-02 20:31:52 +05:30
18 changed files with 209759 additions and 0 deletions
+189
@@ -0,0 +1,189 @@
# Practical-1 (Uber)
Problem Statement: Predict the price of the Uber ride from a given pickup point to the agreed drop-off location.
Perform the following tasks:
1. Pre-process the dataset.
2. Identify outliers.
3. Check the correlation.
4. Implement linear regression and random forest regression models.
5. Evaluate the models and compare their respective scores like R2, RMSE, etc.
> [!NOTE]
> Dataset available in [Datasets](../Datasets/uber.csv) directory.
---
## Steps
0. Importing Libraries
1. Data Loading and Pre-processing
2. Outlier Detection and Removal
3. Feature Engineering (Distance Calculation)
4. Correlation Analysis
5. Model Training (Linear Regression & Random Forest)
6. Model Evaluation
7. Comparison
---
## Code
### 0. Importing Libraries:
```python3
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
from math import radians, cos, sin, asin, sqrt
```
### 1. Data Loading & Preprocessing:
```python3
# Load the dataset
df = pd.read_csv("uber.csv") # change to your local path if needed
print("Initial Data Shape:", df.shape)
print(df.head())
# Drop rows with missing values
df.dropna(inplace=True)
print("After dropping missing values:", df.shape)
# Convert pickup_datetime to a datetime object; unparseable values become NaT
df['pickup_datetime'] = pd.to_datetime(df['pickup_datetime'], errors='coerce')
# Drop rows whose timestamp could not be parsed
df = df.dropna(subset=['pickup_datetime'])
# Extract useful datetime features
df['hour'] = df['pickup_datetime'].dt.hour
df['day'] = df['pickup_datetime'].dt.day
df['month'] = df['pickup_datetime'].dt.month
df['year'] = df['pickup_datetime'].dt.year
df['day_of_week'] = df['pickup_datetime'].dt.dayofweek
# Drop the raw datetime and key columns (not needed as direct features)
df.drop(['pickup_datetime', 'key'], axis=1, inplace=True, errors='ignore')
print("\nColumns after feature extraction:\n", df.columns)
```
### 2. Outlier Detection & Removal:
```python3
# Remove entries with unrealistic fares
df = df[(df['fare_amount'] > 0) & (df['fare_amount'] < 100)]
# Remove unrealistic latitude and longitude values
df = df[(df['pickup_latitude'] <= 90) & (df['pickup_latitude'] >= -90)]
df = df[(df['dropoff_latitude'] <= 90) & (df['dropoff_latitude'] >= -90)]
df = df[(df['pickup_longitude'] <= 180) & (df['pickup_longitude'] >= -180)]
df = df[(df['dropoff_longitude'] <= 180) & (df['dropoff_longitude'] >= -180)]
print("Data shape after removing outliers:", df.shape)
```
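The filter above relies on fixed thresholds. For the "identify outliers" task, the interquartile-range (IQR) rule is one common alternative. The sketch below is only an illustration under stated assumptions: it uses the standard library, the helper names `iqr_bounds` and `find_outliers` are hypothetical, and the fares are made up rather than taken from `uber.csv`.

```python
from statistics import quantiles


def iqr_bounds(values, k=1.5):
    """Return the (lower, upper) fences Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = quantiles(values, n=4)  # the three quartile cut points
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def find_outliers(values, k=1.5):
    """Return the values that fall outside the IQR fences."""
    lower, upper = iqr_bounds(values, k)
    return [v for v in values if v < lower or v > upper]


# Made-up fares, for illustration only
sample_fares = [5.0, 7.5, 6.0, 8.0, 9.5, 7.0, 250.0]
print("Fences:", iqr_bounds(sample_fares))
print("Outliers:", find_outliers(sample_fares))
```

On the real DataFrame, similar fences could be derived from `df['fare_amount'].quantile(0.25)` and `df['fare_amount'].quantile(0.75)`. pandas interpolates quartiles differently from `statistics.quantiles`, so the exact fences may differ slightly.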
### 3. Feature Engineering - Distance Calculation:
```python3
# Define Haversine function to calculate distance between pickup and drop-off
def haversine(lat1, lon1, lat2, lon2):
    # convert decimal degrees to radians
    lat1, lon1, lat2, lon2 = map(radians, [lat1, lon1, lat2, lon2])
    # haversine formula
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
    c = 2 * asin(sqrt(a))
    km = 6371 * c
    return km
# Apply the Haversine formula
df['distance_km'] = df.apply(lambda x: haversine(x['pickup_latitude'], x['pickup_longitude'],
                                                 x['dropoff_latitude'], x['dropoff_longitude']), axis=1)
# Remove zero-distance trips
df = df[df['distance_km'] > 0]
```
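Applying the Haversine function row by row with `df.apply` can be slow on a dataset of this size. A vectorized variant is sketched below as one possible alternative, not the method used in this practical. The name `haversine_np` and the constant `EARTH_RADIUS_KM` are assumptions introduced for illustration.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, same constant as the scalar version


def haversine_np(lat1, lon1, lat2, lon2):
    """Great-circle distance in km for scalars or equal-length arrays of degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    # Clipping guards against tiny floating-point overshoot outside [0, 1]
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(np.clip(a, 0.0, 1.0)))


# Sanity check: a quarter of the equator should be about 6371 * pi / 2 km
quarter = haversine_np(np.array([0.0]), np.array([0.0]),
                       np.array([0.0]), np.array([90.0]))
print(quarter)
```

Because NumPy functions accept pandas Series, the four coordinate columns could be passed directly, for example `haversine_np(df['pickup_latitude'], df['pickup_longitude'], df['dropoff_latitude'], df['dropoff_longitude'])`.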
### 4. Correlation Analysis:
```python3
plt.figure(figsize=(10, 6))
sns.heatmap(df.corr(), annot=True, cmap='coolwarm')
plt.title("Feature Correlation Heatmap")
plt.show()
```
### 5. Model Training:
```python3
# Define features and target
X = df[['distance_km', 'hour', 'day', 'month', 'year', 'day_of_week']]
y = df['fare_amount']
# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# -------------------- Linear Regression --------------------
lr_model = LinearRegression()
lr_model.fit(X_train, y_train)
y_pred_lr = lr_model.predict(X_test)
# -------------------- Random Forest Regression --------------------
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)
y_pred_rf = rf_model.predict(X_test)
```
### 6. Model Evaluation:
```python3
def evaluate_model(y_true, y_pred, model_name):
    r2 = r2_score(y_true, y_pred)
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    mae = mean_absolute_error(y_true, y_pred)
    print(f"\nModel: {model_name}")
    print(f"R² Score: {r2:.4f}")
    print(f"RMSE: {rmse:.4f}")
    print(f"MAE: {mae:.4f}")
    return r2, rmse, mae
# Evaluate both models
lr_scores = evaluate_model(y_test, y_pred_lr, "Linear Regression")
rf_scores = evaluate_model(y_test, y_pred_rf, "Random Forest Regressor")
```
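To make the three scores less opaque, they can be computed directly from their definitions: R² = 1 − SS_res / SS_tot, RMSE = sqrt(SS_res / n) and MAE = mean(|error|). The following is a minimal sketch with made-up numbers. The helper `regression_metrics` is hypothetical and is not part of scikit-learn.

```python
from math import sqrt


def regression_metrics(y_true, y_pred):
    """Return (r2, rmse, mae) computed directly from their definitions."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(e ** 2 for e in errors)                 # residual sum of squares
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)   # total sum of squares
    r2 = 1 - ss_res / ss_tot
    rmse = sqrt(ss_res / n)
    mae = sum(abs(e) for e in errors) / n
    return r2, rmse, mae


# Made-up fares and predictions, for illustration only
r2, rmse, mae = regression_metrics([10, 12, 8, 15], [11, 11, 9, 13])
print(f"R2={r2:.4f}, RMSE={rmse:.4f}, MAE={mae:.4f}")
```

A higher R² and lower RMSE and MAE indicate a better fit. RMSE penalizes large errors more heavily than MAE because the errors are squared before averaging.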
### 7. Comparison:
```python3
results = pd.DataFrame({
    'Model': ['Linear Regression', 'Random Forest Regressor'],
    'R2': [lr_scores[0], rf_scores[0]],
    'RMSE': [lr_scores[1], rf_scores[1]],
    'MAE': [lr_scores[2], rf_scores[2]]
})
print("\nModel Comparison:")
print(results)
```
```python3
# Plot comparison
plt.figure(figsize=(8,5))
sns.barplot(x='Model', y='R2', data=results)
plt.title("R² Score Comparison between Models")
plt.show()
```
---
## Miscellaneous
- [Dataset source](https://www.kaggle.com/datasets/yasserh/uber-fares-dataset)
---
+114
@@ -0,0 +1,114 @@
# Practical-2 (Spam Email Detection)
Problem Statement: Classify the email using the binary classification method. Email Spam detection has two states: a) Normal State: Not Spam, b) Abnormal State: Spam. Use K-Nearest Neighbors and Support Vector Machine for classification. Analyze their performance.
> [!NOTE]
> Dataset available in [Datasets](../Datasets/emails.csv) directory.
---
## Steps
1. Import libraries
2. Load dataset
3. Data splitting (training and testing)
4. KNN
5. SVM
6. Plotting
---
## Code
### 1. Import libraries:
```python3
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns
```
### 2. Load dataset:
```python3
df = pd.read_csv("emails.csv", encoding="ISO-8859-1") # Adjust path if needed
# Drop unnecessary columns if present
if "Email No." in df.columns:
    df = df.drop(columns=["Email No."])
# Ensure label is integer
df["Prediction"] = df["Prediction"].astype(int)
# Features & target
X = df.drop(columns=["Prediction"])
y = df["Prediction"]
# Print basic info
print(df.columns)
print(df.head(5))
```
### 3. Data splitting (training and testing):
```python3
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
```
### 4. KNN:
```python3
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
y_pred_knn = knn.predict(X_test)
print("\n--- KNN Performance ---")
print("Accuracy:", accuracy_score(y_test, y_pred_knn))
print("Classification Report:\n", classification_report(y_test, y_pred_knn))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred_knn))
```
### 5. SVM:
```python3
svm = SVC(kernel='linear', random_state=42) # Linear kernel for binary classification
svm.fit(X_train, y_train)
y_pred_svm = svm.predict(X_test)
print("\n--- SVM Performance ---")
print("Accuracy:", accuracy_score(y_test, y_pred_svm))
print("Classification Report:\n", classification_report(y_test, y_pred_svm))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred_svm))
```
### 6. Plotting:
```python3
fig, ax = plt.subplots(1, 2, figsize=(12, 5))
sns.heatmap(confusion_matrix(y_test, y_pred_knn), annot=True, fmt="d", cmap="Blues", ax=ax[0])
ax[0].set_title("KNN Confusion Matrix")
ax[0].set_xlabel("Predicted")
ax[0].set_ylabel("Actual")
sns.heatmap(confusion_matrix(y_test, y_pred_svm), annot=True, fmt="d", cmap="Greens", ax=ax[1])
ax[1].set_title("SVM Confusion Matrix")
ax[1].set_xlabel("Predicted")
ax[1].set_ylabel("Actual")
plt.show()
```
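The numbers in the classification report can be traced back to the four cells of a binary confusion matrix. The sketch below shows that arithmetic with hypothetical counts. The function `classification_metrics` is an illustrative helper, not a scikit-learn API, and the counts are not results from `emails.csv`.

```python
def classification_metrics(tn, fp, fn, tp):
    """Derive accuracy, precision, recall and F1 for the positive (spam) class."""
    total = tn + fp + fn + tp
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical counts, laid out like scikit-learn's binary confusion matrix:
# [[tn, fp],
#  [fn, tp]]
metrics = classification_metrics(tn=50, fp=10, fn=5, tp=35)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

With the models above, the four counts could be unpacked with `tn, fp, fn, tp = confusion_matrix(y_test, y_pred_knn).ravel()` and passed to the helper.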
---
## Miscellaneous
- [Dataset source](https://www.kaggle.com/datasets/balaka18/email-spam-classification-dataset-csv)
---
+82
@@ -0,0 +1,82 @@
# Practical-4 (Gradient Descent Algorithm)
Problem Statement: Implement Gradient Descent Algorithm to find the local minima of a function. For example, find the local minima of the function y=(x+3)² starting from the point x=2.
---
## Steps
1. Define the function and its derivative
2. Initialize parameters for Gradient Descent
3. Gradient Descent Loop
4. Print the result
5. Plotting
---
## Code
### 0. Import libraries:
```python3
import numpy as np
import matplotlib.pyplot as plt
```
### 1. Define the function and its derivative:
```python3
def f(x):
    return (x + 3)**2
def grad_f(x):
    return 2 * (x + 3) # derivative of f(x)
```
### 2. Initialize parameters for Gradient Descent:
```python3
x_current = 2 # starting point
learning_rate = 0.1 # step size
tolerance = 1e-6 # convergence tolerance
max_iterations = 25 # maximum iterations
history = [x_current] # storing history
```
### 3. Gradient Descent Loop:
```python3
for i in range(max_iterations):
    gradient = grad_f(x_current)
    x_next = x_current - learning_rate * gradient # update step
    # Check convergence
    if abs(x_next - x_current) < tolerance:
        print(f"Converged after {i+1} iterations.")
        break
    x_current = x_next
    history.append(x_current)
    print(f"Iteration {i+1}: x = {x_current:.4f}, f(x) = {f(x_current):.4f}")
```
### 4. Print the result:
```python3
print("Local minima at x =", x_current)
print("Function value at local minima y =", f(x_current))
```
### 5. Plotting:
```python3
plt.plot(history, [f(val) for val in history], marker='o')
plt.xlabel("x values")
plt.ylabel("f(x)")
plt.title("Gradient Descent Convergence")
plt.grid()
plt.show()
```
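For y = (x + 3)² with a learning rate of 0.1, each update is x ← x − 0.1 · 2(x + 3), so the distance to the minimum shrinks by a factor of 0.8 per step. Starting from x = 2, 25 iterations leave x roughly 5 · 0.8²⁵ ≈ 0.019 away from −3, which means a tolerance of 1e-6 is not reached within that budget. The sketch below wraps the same loop in a reusable function with a larger iteration cap. The function name `gradient_descent` and the cap of 1000 are assumptions chosen for illustration.

```python
def gradient_descent(grad, x0, learning_rate=0.1, tolerance=1e-6, max_iterations=1000):
    """Minimise a 1-D function given its derivative; return (x, iterations_used)."""
    x = x0
    for i in range(1, max_iterations + 1):
        x_next = x - learning_rate * grad(x)
        if abs(x_next - x) < tolerance:  # step is small enough: treat as converged
            return x_next, i
        x = x_next
    return x, max_iterations  # budget exhausted without meeting the tolerance


# Same function and starting point as the practical: f(x) = (x + 3)^2, x0 = 2
x_min, n_iter = gradient_descent(lambda x: 2 * (x + 3), x0=2)
print(f"x = {x_min:.6f} after {n_iter} iterations")
```

Returning the iteration count alongside the result makes it easy to tell whether the loop stopped because it converged or because it ran out of iterations.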
---
+121
@@ -0,0 +1,121 @@
# Practical-6 (Clustering)
Problem Statement: Implement K-Means clustering/hierarchical clustering on `sales_data_sample.csv` dataset. Determine the number of clusters using the elbow method.
> [!NOTE]
> Dataset available in [Datasets](../Datasets/sales_data_sample.csv) directory.
---
## Steps
1. Import libraries
2. Load dataset
3. Select numerical features for clustering
4. Standardize data
5. K-Means clustering
6. Hierarchical clustering
---
## Code
### 1. Import libraries:
```python3
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
import seaborn as sns
```
### 2. Load dataset:
```python3
df = pd.read_csv("sales_data_sample.csv", encoding='latin1', on_bad_lines='skip')
print("Dataset shape:", df.shape)
print(df.head())
```
### 3. Select numerical features for clustering:
```python3
X = df.select_dtypes(include=['int64', 'float64'])
print("Features used for clustering:\n", X.head())
# Alternative: select only the relevant numeric columns
# X = df[['SALES', 'QUANTITYORDERED', 'PRICEEACH']]
# Handle missing values if any
# X = X.dropna()
```
### 4. Standardize data:
```python3
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
```
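Standardization rescales each feature to zero mean and unit variance using z = (x − mean) / std, so that features with large numeric ranges do not dominate the distance calculations used in clustering. The sketch below shows the arithmetic for a single feature with made-up values. The helper `standardize` is hypothetical. It uses the population standard deviation, which is also what `StandardScaler` uses.

```python
from statistics import fmean, pstdev


def standardize(values):
    """Return z-scores: (value - mean) / population standard deviation."""
    mu = fmean(values)
    sigma = pstdev(values)  # divides by n, not n - 1
    return [(v - mu) / sigma for v in values]


# Made-up feature values, for illustration only (mean 5, std 2)
z_scores = standardize([2, 4, 4, 4, 5, 5, 7, 9])
print(z_scores)
```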
### 5. K-Means clustering:
```python3
# Determine optimal number of clusters using Elbow Method
wcss = []
for k in range(1, 11):
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(X_scaled)
    wcss.append(kmeans.inertia_)
# Plot Elbow Method
plt.figure(figsize=(6,4))
plt.plot(range(1, 11), wcss, marker='o')
plt.title('Elbow Method')
plt.xlabel('Number of clusters (k)')
plt.ylabel('Inertia (WCSS)')
plt.show()
# Fit KMeans with chosen number of clusters (example: 3 clusters)
kmeans = KMeans(n_clusters=3, random_state=42) # Pass n_init=10 here if your scikit-learn version warns about the n_init default
clusters_kmeans = kmeans.fit_predict(X_scaled)
df['KMeans_Cluster'] = clusters_kmeans
# Visualize clusters
sns.scatterplot(x='SALES', y='PRICEEACH', hue='KMeans_Cluster', data=df, palette='viridis')
plt.title("K-Means Clustering")
plt.show()
print("\nK-Means Cluster Centers:\n", kmeans.cluster_centers_)
print("\nCluster counts:\n", df['KMeans_Cluster'].value_counts())
```
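The quantity plotted in the elbow method, the within-cluster sum of squares (WCSS, exposed by scikit-learn as `inertia_`), is the sum of squared distances from each point to the centroid of its cluster. The sketch below computes it by hand for a toy example. The function `wcss_for` and the points are assumptions for illustration, and the cluster labels are supplied manually rather than found by K-Means.

```python
def wcss_for(points, labels):
    """Sum of squared distances from each point to its cluster centroid."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    total = 0.0
    for members in clusters.values():
        dims = len(members[0])
        # Centroid is the per-dimension mean of the cluster's members
        centroid = [sum(p[d] for p in members) / len(members) for d in range(dims)]
        total += sum((p[d] - centroid[d]) ** 2 for p in members for d in range(dims))
    return total


# Toy 2-D points forming two obvious groups
toy_points = [(1, 1), (1, 2), (5, 5), (6, 5)]
one_cluster = wcss_for(toy_points, [0, 0, 0, 0])
two_clusters = wcss_for(toy_points, [0, 0, 1, 1])
print(one_cluster, two_clusters)
```

The sharp drop from one cluster to two, followed by much smaller gains for additional clusters, is the "elbow" that the plot is meant to reveal.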
### 6. Hierarchical clustering:
```python3
# Create linkage matrix
Z = linkage(X_scaled, method='ward')
# Plot dendrogram
plt.figure(figsize=(10,5))
dendrogram(Z)
plt.title('Hierarchical Clustering Dendrogram')
plt.xlabel('Samples')
plt.ylabel('Distance')
plt.show()
# Assign clusters (example: 3 clusters)
clusters_hier = fcluster(Z, t=3, criterion='maxclust')
df['Hierarchical_Cluster'] = clusters_hier
print("\nHierarchical Cluster counts:\n", pd.Series(clusters_hier).value_counts())
```
---
## Miscellaneous
- [Dataset source](https://www.kaggle.com/datasets/kyanyoga/sample-sales-data)
---
+5173
File diff suppressed because one or more lines are too long
File diff suppressed because it is too large
+200001
File diff suppressed because it is too large
+389
File diff suppressed because one or more lines are too long
+265
File diff suppressed because one or more lines are too long
+219
File diff suppressed because one or more lines are too long
+362
File diff suppressed because one or more lines are too long
+20
@@ -10,6 +10,24 @@ This repository contains vital resources for the Machine Learning course under t
### Codes
1. [Code-1 (Uber)](Codes/Code-1.md)
2. [Code-2 (Spam Email Detection)](Codes/Code-2.md)
3. [Code-4 (Gradient Descent Algorithm)](Codes/Code-4.md)
4. [Code-6 (Clustering)](Codes/Code-6.md)
### Jupyter Notebooks
1. [Notebook-1 (Uber)](Notebooks/Notebook-1.ipynb)
2. [Notebook-2 (Spam Email Detection)](Notebooks/Notebook-2.ipynb)
3. [Notebook-4 (Gradient Descent Algorithm)](Notebooks/Notebook-4.ipynb)
4. [Notebook-6 (Clustering)](Notebooks/Notebook-6.ipynb)
### Datasets
1. [Dataset for Practical-1](Datasets/uber.csv)
2. [Dataset for Practical-2](Datasets/emails.csv)
3. [Dataset for Practical-6](Datasets/sales_data_sample.csv)
### Assignments
- Assignment-1:
@@ -36,6 +54,8 @@ This repository contains vital resources for the Machine Learning course under t
### [IN-SEM PYQ Answers](Notes/IN-SEM%20PYQ%20Answers)
### [END-SEM PYQ Answers](Notes/END-SEM%20PYQ%20Answers)
---
## Miscellaneous