Upload end-sem pyq for ML, november-december 2025. Provided by Ayush Kalaskar.

Added end-sem pyq answers for unit 6. Collaborative work by Ayush Kalaskar and Himanshu Patil.
Added end-sem pyq answers for unit 5. Collaborative work by Ayush Kalaskar and Himanshu Patil.
2026-03-22 02:18:47 +05:30 · 2025-12-06 22:19:15 +05:30 · 2025-12-06 21:30:35 +05:30 · 2025-12-06 21:18:46 +05:30 · 2025-12-06 01:40:26 +05:30 · 2025-12-06 01:39:51 +05:30
18 changed files with 209759 additions and 0 deletions
@@ -0,0 +1,189 @@
+# Practical-1 (Uber)
+
+Problem Statement: Predict the price of the Uber ride from a given pickup point to the agreed drop-off location.
+Perform the following tasks:
+1. Pre-process the dataset.
+2. Identify outliers.
+3. Check the correlation.
+4. Implement linear regression and random forest regression models.
+5. Evaluate the models and compare their respective scores like R2, RMSE, etc.
+
+> [!NOTE]
+> Dataset available in [Datasets](../Datasets/uber.csv) directory.
+
+---
+ 
+## Steps
+
+0. Importing Libraries
+1. Data Loading and Pre-processing
+2. Outlier Detection and Removal
+3. Feature Engineering (Distance Calculation)
+4. Correlation Analysis
+5. Model Training (Linear Regression & Random Forest)
+6. Model Evaluation
+7. Comparison
+
+---
+
+## Code
+
+### 0. Importing Libraries:
+
+```python3
+# Import necessary libraries
+import pandas as pd
+import numpy as np
+import matplotlib.pyplot as plt
+import seaborn as sns
+from sklearn.model_selection import train_test_split
+from sklearn.linear_model import LinearRegression
+from sklearn.ensemble import RandomForestRegressor
+from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
+from math import radians, cos, sin, asin, sqrt
+```
+
+### 1. Data Loading & Preprocessing:
+
+```python3
+# Load the dataset
+df = pd.read_csv("uber.csv")   # change to your local path if needed
+print("Initial Data Shape:", df.shape)
+print(df.head())
+
+# Drop rows with missing values
+df.dropna(inplace=True)
+print("After dropping missing values:", df.shape)
+
+# Convert pickup_datetime to a datetime object; unparseable values become NaT
+df['pickup_datetime'] = pd.to_datetime(df['pickup_datetime'], errors='coerce')
+
+# Drop rows whose timestamp could not be parsed, so the derived features contain no NaN
+df.dropna(subset=['pickup_datetime'], inplace=True)
+
+# Extract useful datetime features
+df['hour'] = df['pickup_datetime'].dt.hour
+df['day'] = df['pickup_datetime'].dt.day
+df['month'] = df['pickup_datetime'].dt.month
+df['year'] = df['pickup_datetime'].dt.year
+df['day_of_week'] = df['pickup_datetime'].dt.dayofweek
+
+# Drop the datetime and key columns (not needed as direct features)
+df.drop(['pickup_datetime', 'key'], axis=1, inplace=True, errors='ignore')
+
+print("\nColumns after feature extraction:\n", df.columns)
+```
+
+### 2. Outlier Detection & Removal:
+
+```python3
+# Remove entries with unrealistic fares
+df = df[(df['fare_amount'] > 0) & (df['fare_amount'] < 100)]
+
+# Remove unrealistic latitude and longitude values
+df = df[(df['pickup_latitude'] <= 90) & (df['pickup_latitude'] >= -90)]
+df = df[(df['dropoff_latitude'] <= 90) & (df['dropoff_latitude'] >= -90)]
+df = df[(df['pickup_longitude'] <= 180) & (df['pickup_longitude'] >= -180)]
+df = df[(df['dropoff_longitude'] <= 180) & (df['dropoff_longitude'] >= -180)]
+
+print("Data shape after removing outliers:", df.shape)
+```
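
The filters above remove values that are impossible or implausible according to fixed thresholds. To identify statistical outliers, one common option is the interquartile-range (IQR) rule. The sketch below is a minimal illustration on made-up fare values, not the method used in this practical; the helper `iqr_bounds`, the sample list, and the multiplier 1.5 are assumptions chosen for illustration.

```python
from statistics import quantiles

def iqr_bounds(values, k=1.5):
    """Return the (lower, upper) fences of the IQR rule."""
    q1, _, q3 = quantiles(values, n=4)  # the three quartile cut points
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical fares; 250.0 is an obviously extreme value
fares = [5.5, 7.0, 8.5, 9.0, 10.5, 12.0, 13.5, 15.0, 250.0]

lower, upper = iqr_bounds(fares)
outliers = [f for f in fares if f < lower or f > upper]
print("Fences:", lower, upper)
print("Outliers:", outliers)
```

On the DataFrame itself, the same quartiles could be obtained with `df['fare_amount'].quantile([0.25, 0.75])`, although pandas interpolates quantiles slightly differently from `statistics.quantiles`.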
+
+### 3. Feature Engineering - Distance Calculation:
+
+```python3
+# Define Haversine function to calculate distance between pickup and drop-off
+def haversine(lat1, lon1, lat2, lon2):
+    # convert decimal degrees to radians
+    lat1, lon1, lat2, lon2 = map(radians, [lat1, lon1, lat2, lon2])
+    # haversine formula
+    dlon = lon2 - lon1
+    dlat = lat2 - lat1
+    a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
+    c = 2 * asin(sqrt(a))
+    km = 6371 * c
+    return km
+
+# Apply the Haversine formula
+df['distance_km'] = df.apply(lambda x: haversine(x['pickup_latitude'], x['pickup_longitude'],
+                                                 x['dropoff_latitude'], x['dropoff_longitude']), axis=1)
+
+# Remove zero-distance trips
+df = df[df['distance_km'] > 0]
+```
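
A quick way to gain confidence in the Haversine implementation is to check it against a distance that is known in advance. The sketch below repeats the formula in self-contained form and checks that one degree of longitude on the equator comes out at 2π·6371/360, roughly 111.19 km. The name `haversine_km` and the clamp on `a` (a guard against floating-point rounding) are additions for illustration and are not part of the code above.

```python
from math import radians, cos, sin, asin, sqrt, pi, isclose

EARTH_RADIUS_KM = 6371  # same radius constant as in the function above

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    lat1, lon1, lat2, lon2 = map(radians, [lat1, lon1, lat2, lon2])
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    a = min(1.0, max(0.0, a))  # keep a inside [0, 1] despite rounding
    return EARTH_RADIUS_KM * 2 * asin(sqrt(a))

one_degree = haversine_km(0.0, 0.0, 0.0, 1.0)
expected = 2 * pi * EARTH_RADIUS_KM / 360
print(f"{one_degree:.2f} km (expected about {expected:.2f} km)")
```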
+
+### 4. Correlation Analysis:
+
+```python3
+plt.figure(figsize=(10, 6))
+sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='coolwarm')
+plt.title("Feature Correlation Heatmap")
+plt.show()
+```
+
+### 5. Model Training:
+
+```python3
+# Define features and target
+X = df[['distance_km', 'hour', 'day', 'month', 'year', 'day_of_week']]
+y = df['fare_amount']
+
+# Split data into train and test sets
+X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
+
+# -------------------- Linear Regression --------------------
+lr_model = LinearRegression()
+lr_model.fit(X_train, y_train)
+y_pred_lr = lr_model.predict(X_test)
+
+# -------------------- Random Forest Regression --------------------
+rf_model = RandomForestRegressor(n_estimators=100, random_state=42)
+rf_model.fit(X_train, y_train)
+y_pred_rf = rf_model.predict(X_test)
+```
+
+### 6. Model Evaluation:
+
+```python3
+def evaluate_model(y_true, y_pred, model_name):
+    r2 = r2_score(y_true, y_pred)
+    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
+    mae = mean_absolute_error(y_true, y_pred)
+    print(f"\nModel: {model_name}")
+    print(f"R² Score: {r2:.4f}")
+    print(f"RMSE: {rmse:.4f}")
+    print(f"MAE: {mae:.4f}")
+    return r2, rmse, mae
+
+# Evaluate both models
+lr_scores = evaluate_model(y_test, y_pred_lr, "Linear Regression")
+rf_scores = evaluate_model(y_test, y_pred_rf, "Random Forest Regressor")
+```
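
To make the three scores less of a black box, the sketch below computes them by hand on a tiny made-up example. It assumes the standard definitions (R² = 1 − SS_res / SS_tot, RMSE = √(mean squared error), MAE = mean absolute error); the function name `regression_metrics` and the sample values are hypothetical.

```python
from math import sqrt

def regression_metrics(y_true, y_pred):
    """Return (R2, RMSE, MAE) computed from their definitions."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)          # total sum of squares
    r2 = 1 - ss_res / ss_tot
    rmse = sqrt(ss_res / n)
    mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n
    return r2, rmse, mae

# Hypothetical fares and predictions
y_true = [10.0, 12.0, 15.0, 20.0]
y_pred = [11.0, 11.0, 16.0, 18.0]

r2, rmse, mae = regression_metrics(y_true, y_pred)
print(f"R2: {r2:.4f}, RMSE: {rmse:.4f}, MAE: {mae:.4f}")
```

A higher R² and lower RMSE and MAE indicate a better fit; RMSE penalises large errors more strongly than MAE because the errors are squared before averaging.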
+
+### 7. Comparison:
+
+```python3
+results = pd.DataFrame({
+    'Model': ['Linear Regression', 'Random Forest Regressor'],
+    'R2': [lr_scores[0], rf_scores[0]],
+    'RMSE': [lr_scores[1], rf_scores[1]],
+    'MAE': [lr_scores[2], rf_scores[2]]
+})
+
+print("\nModel Comparison:")
+print(results)
+```
+
+```python3
+# Plot comparison
+plt.figure(figsize=(8,5))
+sns.barplot(x='Model', y='R2', data=results)
+plt.title("R² Score Comparison between Models")
+plt.show()
+```
+
+---
+
+## Miscellaneous
+
+- [Dataset source](https://www.kaggle.com/datasets/yasserh/uber-fares-dataset)
+
+---
@@ -0,0 +1,114 @@
+# Practical-2 (Spam Email Detection)
+
+Problem Statement: Classify the email using the binary classification method. Email Spam detection has two states: a) Normal State – Not Spam, b) Abnormal State – Spam. Use K-Nearest Neighbors and Support Vector Machine for classification. Analyze their performance. 
+
+> [!NOTE]
+> Dataset available in [Datasets](../Datasets/emails.csv) directory.
+
+---
+ 
+## Steps
+
+1. Import libraries
+2. Load dataset
+3. Data splitting (training and testing)
+4. KNN
+5. SVM
+6. Plotting
+
+---
+
+## Code
+
+### 1. Import libraries:
+
+```python3
+import pandas as pd
+from sklearn.model_selection import train_test_split
+from sklearn.neighbors import KNeighborsClassifier
+from sklearn.svm import SVC
+from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
+import matplotlib.pyplot as plt
+import seaborn as sns
+```
+
+### 2. Load dataset:
+
+```python3
+df = pd.read_csv("emails.csv", encoding="ISO-8859-1")  # Adjust path if needed
+
+# Drop unnecessary columns if present
+if "Email No." in df.columns:
+    df = df.drop(columns=["Email No."])
+
+# Ensure label is integer
+df["Prediction"] = df["Prediction"].astype(int)
+
+# Features & target
+X = df.drop(columns=["Prediction"])
+y = df["Prediction"]
+
+# Print basic info
+print(df.columns)
+print(df.head(5))
+```
+
+### 3. Data splitting (training and testing):
+
+```python3
+X_train, X_test, y_train, y_test = train_test_split(
+    X, y, test_size=0.2, random_state=42, stratify=y
+)
+```
+
+### 4. KNN:
+
+```python3
+knn = KNeighborsClassifier(n_neighbors=5)
+knn.fit(X_train, y_train)
+y_pred_knn = knn.predict(X_test)
+
+print("\n--- KNN Performance ---")
+print("Accuracy:", accuracy_score(y_test, y_pred_knn))
+print("Classification Report:\n", classification_report(y_test, y_pred_knn))
+print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred_knn))
+```
+
+### 5. SVM:
+
+```python3
+svm = SVC(kernel='linear', random_state=42)  # SVM with a linear kernel
+svm.fit(X_train, y_train)
+y_pred_svm = svm.predict(X_test)
+
+print("\n--- SVM Performance ---")
+print("Accuracy:", accuracy_score(y_test, y_pred_svm))
+print("Classification Report:\n", classification_report(y_test, y_pred_svm))
+print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred_svm))
+```
+
+### 6. Plotting:
+
+```python3
+fig, ax = plt.subplots(1, 2, figsize=(12, 5))
+
+sns.heatmap(confusion_matrix(y_test, y_pred_knn), annot=True, fmt="d", cmap="Blues", ax=ax[0])
+ax[0].set_title("KNN Confusion Matrix")
+ax[0].set_xlabel("Predicted")
+ax[0].set_ylabel("Actual")
+
+sns.heatmap(confusion_matrix(y_test, y_pred_svm), annot=True, fmt="d", cmap="Greens", ax=ax[1])
+ax[1].set_title("SVM Confusion Matrix")
+ax[1].set_xlabel("Predicted")
+ax[1].set_ylabel("Actual")
+
+plt.show()
+```
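
The confusion matrices can be turned into precision, recall and F1 by hand, which helps when reading the classification report. The sketch below assumes the scikit-learn layout for binary labels 0 and 1, where rows are actual classes and columns are predicted classes, i.e. `[[TN, FP], [FN, TP]]`. The counts and the helper `spam_metrics` are hypothetical, and the helper does not guard against division by zero.

```python
def spam_metrics(cm):
    """Return (precision, recall, f1, accuracy) for the positive (spam) class."""
    (tn, fp), (fn, tp) = cm
    precision = tp / (tp + fp)  # share of predicted spam that really is spam
    recall = tp / (tp + fn)     # share of actual spam that was caught
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return precision, recall, f1, accuracy

# Hypothetical counts: 90 TN, 10 FP, 5 FN, 45 TP
cm = [[90, 10], [5, 45]]

precision, recall, f1, accuracy = spam_metrics(cm)
print(f"Precision: {precision:.3f}, Recall: {recall:.3f}, F1: {f1:.3f}, Accuracy: {accuracy:.3f}")
```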
+
+---
+
+## Miscellaneous
+
+- [Dataset source](https://www.kaggle.com/datasets/balaka18/email-spam-classification-dataset-csv)
+
+---
@@ -0,0 +1,82 @@
+# Practical-4 (Gradient Descent Algorithm)
+
+Problem Statement: Implement Gradient Descent Algorithm to find the local minima of a function. For example, find the local minima of the function y=(x+3)² starting from the point x=2.
+
+---
+ 
+## Steps
+
+1. Define the function and its derivative
+2. Initialize parameters for Gradient Descent
+3. Gradient Descent Loop
+4. Print the result
+5. Plotting
+
+---
+
+## Code
+
+### 0. Import libraries:
+
+```python3
+import numpy as np
+import matplotlib.pyplot as plt
+```
+
+### 1. Define the function and its derivative:
+
+```python3
+def f(x):
+    return (x + 3)**2
+
+def grad_f(x):
+    return 2 * (x + 3)  # derivative of f(x)
+```
+
+### 2. Initialize parameters for Gradient Descent:
+
+```python3
+x_current = 2          # starting point
+learning_rate = 0.1    # step size
+tolerance = 1e-6       # convergence tolerance
+max_iterations = 100   # maximum iterations (25 stops before the tolerance is reached)
+history = [x_current]  # storing history
+```
+
+### 3. Gradient Descent Loop:
+
+```python3
+for i in range(max_iterations):
+    gradient = grad_f(x_current)
+    x_next = x_current - learning_rate * gradient  # update step
+    
+    # Check convergence
+    if abs(x_next - x_current) < tolerance:
+        print(f"Converged after {i+1} iterations.")
+        break
+    
+    x_current = x_next
+    history.append(x_current)
+    print(f"Iteration {i+1}: x = {x_current:.4f}, f(x) = {f(x_current):.4f}")
+```
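
For this particular function, each update multiplies the distance to the minimum by a constant factor: x_next + 3 = (1 − 2·learning_rate)·(x_current + 3). The choice of learning rate therefore decides whether the iteration converges, oscillates or diverges. The sketch below illustrates this with a hypothetical helper `run_gd` that applies a fixed number of updates; the three learning rates are chosen only for illustration.

```python
def grad_f(x):
    return 2 * (x + 3)  # derivative of (x + 3)**2

def run_gd(learning_rate, steps, x_start=2.0):
    """Apply a fixed number of gradient descent updates and return the final x."""
    x = x_start
    for _ in range(steps):
        x = x - learning_rate * grad_f(x)
    return x

# 0.1 shrinks the distance by 0.8 per step, 1.0 flips its sign, 1.1 grows it by 1.2
for lr in (0.1, 1.0, 1.1):
    x_final = run_gd(lr, steps=10)
    print(f"learning_rate={lr}: x after 10 steps = {x_final:.4f}")
```

With a learning rate of 0.1 the iterate approaches x = −3, with 1.0 it jumps back and forth between 2 and −8, and with 1.1 it moves further away on every step.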
+
+### 4. Print the result:
+
+```python3
+print("Local minima at x =", x_current)
+print("Function value at local minima y =", f(x_current))
+```
+
+### 5. Plotting:
+
+```python3
+plt.plot(history, [f(val) for val in history], marker='o')
+plt.xlabel("x values")
+plt.ylabel("f(x)")
+plt.title("Gradient Descent Convergence")
+plt.grid()
+plt.show()
+```
+
+---
+
@@ -0,0 +1,121 @@
+# Practical-6 (Clustering)
+
+Problem Statement: Implement K-Means clustering / hierarchical clustering on the `sales_data_sample.csv` dataset. Determine the number of clusters using the elbow method.
+
+> [!NOTE]
+> Dataset available in [Datasets](../Datasets/sales_data_sample.csv) directory.
+
+---
+ 
+## Steps
+
+1. Import libraries
+2. Load dataset
+3. Select numerical features for clustering
+4. Standardize data
+5. K-Means clustering
+6. Hierarchical clustering
+
+---
+
+## Code
+
+### 1. Import libraries:
+
+```python3
+import pandas as pd
+import matplotlib.pyplot as plt
+from sklearn.preprocessing import StandardScaler
+from sklearn.cluster import KMeans
+from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
+import seaborn as sns
+```
+
+### 2. Load dataset:
+
+```python3
+df = pd.read_csv("sales_data_sample.csv", encoding='latin1', on_bad_lines='skip')
+print("Dataset shape:", df.shape)
+print(df.head())
+```
+
+### 3. Select numerical features for clustering:
+
+```python3
+X = df.select_dtypes(include=['int64', 'float64'])
+print("Features used for clustering:\n", X.head())
+
+# Select relevant numeric columns
+# X = df[['SALES', 'QUANTITYORDERED', 'PRICEEACH']]
+
+# Handle missing values if any
+# X = X.dropna()
+```
+
+### 4. Standardize data:
+
+```python3
+scaler = StandardScaler()
+X_scaled = scaler.fit_transform(X)
+```
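
Standardization rescales each column to zero mean and unit variance, so that features with large numeric ranges do not dominate the distance calculations used in clustering. The sketch below reproduces the z-score formula z = (x − mean) / std for a single made-up column, using the population standard deviation as `StandardScaler` does; the helper `standardize` and the sample values are hypothetical.

```python
from statistics import mean, pstdev

def standardize(values):
    """Return z-scores: (value - mean) / population standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (divides by n)
    return [(v - mu) / sigma for v in values]

# Hypothetical sales figures
sales = [1000.0, 2000.0, 3000.0, 4000.0, 5000.0]

scaled = standardize(sales)
print([round(z, 3) for z in scaled])
print("mean:", round(mean(scaled), 6), "std:", round(pstdev(scaled), 6))
```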
+
+### 5. K-Means clustering:
+
+```python3
+# Determine optimal number of clusters using Elbow Method
+wcss = []
+for k in range(1, 11):
+    kmeans = KMeans(n_clusters=k, random_state=42)
+    kmeans.fit(X_scaled)
+    wcss.append(kmeans.inertia_)
+
+# Plot Elbow Method
+plt.figure(figsize=(6,4))
+plt.plot(range(1, 11), wcss, marker='o')
+plt.title('Elbow Method')
+plt.xlabel('Number of clusters (k)')
+plt.ylabel('Inertia (WCSS)')
+plt.show()
+
+# Fit KMeans with chosen number of clusters (example: 3 clusters)
+kmeans = KMeans(n_clusters=3, random_state=42)  # pass n_init=10 explicitly to suppress the n_init FutureWarning on older scikit-learn releases
+clusters_kmeans = kmeans.fit_predict(X_scaled)
+df['KMeans_Cluster'] = clusters_kmeans
+
+# Visualize clusters
+sns.scatterplot(x='SALES', y='PRICEEACH', hue='KMeans_Cluster', data=df, palette='viridis')
+plt.title("K-Means Clustering")
+plt.show()
+
+print("\nK-Means Cluster Centers:\n", kmeans.cluster_centers_)
+print("\nCluster counts:\n", df['KMeans_Cluster'].value_counts())
+```
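
The elbow is usually read off the plot by eye. As a rough cross-check, one simple heuristic is to pick the k at which the WCSS curve bends most sharply, measured by the largest second difference. This is only one of several possible heuristics and not part of the practical itself; the function `elbow_by_second_difference` and the inertia values below are hypothetical.

```python
def elbow_by_second_difference(wcss, k_start=1):
    """Return the k whose WCSS value has the largest second difference."""
    best_k, best_curvature = None, float("-inf")
    # The first and last points have no neighbour on one side, so skip them
    for i in range(1, len(wcss) - 1):
        curvature = wcss[i - 1] - 2 * wcss[i] + wcss[i + 1]
        if curvature > best_curvature:
            best_k, best_curvature = k_start + i, curvature
    return best_k

# Hypothetical inertia values for k = 1..6
wcss_example = [100.0, 60.0, 20.0, 18.0, 17.0, 16.0]

suggested_k = elbow_by_second_difference(wcss_example)
print("Suggested k:", suggested_k)
```

With the `wcss` list built in the loop above, the call would be `elbow_by_second_difference(wcss)`, since that list also starts at k = 1.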
+
+### 6. Hierarchical clustering:
+
+```python3
+# Create linkage matrix
+Z = linkage(X_scaled, method='ward')
+
+# Plot dendrogram
+plt.figure(figsize=(10,5))
+dendrogram(Z)
+plt.title('Hierarchical Clustering Dendrogram')
+plt.xlabel('Samples')
+plt.ylabel('Distance')
+plt.show()
+
+# Assign clusters (example: 3 clusters)
+clusters_hier = fcluster(Z, t=3, criterion='maxclust')
+df['Hierarchical_Cluster'] = clusters_hier
+
+print("\nHierarchical Cluster counts:\n", pd.Series(clusters_hier).value_counts())
+```
+
+---
+
+## Miscellaneous
+
+- [Dataset source](https://www.kaggle.com/datasets/kyanyoga/sample-sales-data)
+
+---
@@ -10,6 +10,24 @@ This repository contains vital resources for the Machine Learning course under t

 ### Codes

+1. [Code-1 (Uber)](Codes/Code-1.md)
+2. [Code-2 (Spam Email Detection)](Codes/Code-2.md)
+3. [Code-4 (Gradient Descent Algorithm)](Codes/Code-4.md)
+4. [Code-6 (Clustering)](Codes/Code-6.md)
+
+### Jupyter Notebooks
+
+1. [Notebook-1 (Uber)](Notebooks/Notebook-1.ipynb)
+2. [Notebook-2 (Spam Email Detection)](Notebooks/Notebook-2.ipynb)
+3. [Notebook-4 (Gradient Descent Algorithm)](Notebooks/Notebook-4.ipynb)
+4. [Notebook-6 (Clustering)](Notebooks/Notebook-6.ipynb)
+
+### Datasets
+
+1. [Dataset for Practical-1](Datasets/uber.csv)
+2. [Dataset for Practical-2](Datasets/emails.csv)
+3. [Dataset for Practical-6](Datasets/sales_data_sample.csv)
+
 ### Assignments

 - Assignment-1:
@@ -36,6 +54,8 @@ This repository contains vital resources for the Machine Learning course under t

 ### [IN-SEM PYQ Answers](Notes/IN-SEM%20PYQ%20Answers)

+### [END-SEM PYQ Answers](Notes/END-SEM%20PYQ%20Answers)
+
 ---

 ## Miscellaneous
Author	SHA1	Message	Date
notkshitij	74f3731e77	Upload end-sem pyq for ML, november-december 2025. Provided by Ayush Kalaskar.	2026-03-22 02:18:47 +05:30
notkshitij	e7cae4d012	Added end-sem pyq answers for unit 6. Collaborative work by Ayush Kalaskar and Himanshu Patil.	2025-12-06 22:19:15 +05:30
notkshitij	7acb4c9f4c	Added end-sem pyq answers for unit 5. Collaborative work by Ayush Kalaskar and Himanshu Patil.	2025-12-06 21:30:35 +05:30
notkshitij	63dd320c66	Minor fix: Changed MAY/JUNE 2022 to NOV/DEC 2022.	2025-12-06 21:18:46 +05:30
notkshitij	8619784c68	Added link to end-sem pyq answers.	2025-12-06 01:40:26 +05:30
notkshitij	124f6b7878	Added end-sem pyq answers for unit 4. Collaborative work by Ayush Kalaskar and Himanshu Patil.	2025-12-06 01:39:51 +05:30
notkshitij	d1004db0af	Added end-sem pyq answers or unit 3. Collaborative work by Ayush Kalaskar and Himanshu Patil.	2025-12-06 00:54:34 +05:30
notkshitij	c708ed57ad	Added may-june 2025 end-sem pyq (ml)	2025-12-02 13:53:39 +05:30
notkshitij	c313cb0ec9	Fixed hierarchical spelling.	2025-11-05 18:57:34 +05:30
notkshitij	a588629812	Added links to codes, notebooks and datasets in readme.	2025-11-03 00:15:17 +05:30
notkshitij	c0c22a12e7	Modified file permissions for datasets.	2025-11-03 00:15:05 +05:30
notkshitij	95f1dcc828	Fixed naming for notebooks.	2025-11-03 00:14:40 +05:30
notkshitij	ff7638bd70	Improved formatting for markdown codes and fixed title for all.	2025-11-03 00:12:37 +05:30
notkshitij	1432c59bc4	Added code in markdown format for A6.	2025-11-03 00:04:26 +05:30
notkshitij	bb9a370a98	Added jupyter notebook for practical A6 and its dataset.	2025-11-03 00:04:08 +05:30
notkshitij	c4c460a81f	Fixed name for a4 code (in codes dir)	2025-11-03 00:00:04 +05:30
notkshitij	a0d06838c2	Added code for practical A4 in markdown format.	2025-11-02 23:34:41 +05:30
notkshitij	14d70779e9	Added jupyter notebook for practical A4.	2025-11-02 23:34:06 +05:30
notkshitij	8cf306ce2a	Added code in markdown format for code a2.	2025-11-02 22:03:28 +05:30
notkshitij	0e97128ff2	Added jupyter notebook and dataset for practical a2 (spam email detection)	2025-11-02 22:03:10 +05:30
notkshitij	5df648ac33	Added import libraries part.	2025-11-02 22:02:44 +05:30
notkshitij	e5347feffc	Changed dataset to dataset source in code a1	2025-11-02 21:59:18 +05:30
notkshitij	9c83c1aab2	Added name in title Code-A1.md	2025-11-02 21:49:16 +05:30
notkshitij	fc3b508b39	Added code in markdown format for practical A1 (uber ride)	2025-11-02 20:32:12 +05:30
notkshitij	ce5c95856d	Added Jupyter notebook and dataset for practical A1 (uber ride)	2025-11-02 20:31:52 +05:30