Upload end-sem pyq for ML, november-december 2025. Provided by Ayush Kalaskar.

Added end-sem pyq answers for unit 6. Collaborative work by Ayush Kalaskar and Himanshu Patil.
Added end-sem pyq answers for unit 5. Collaborative work by Ayush Kalaskar and Himanshu Patil.
18 changed files with 209759 additions and 0 deletions
@@ -0,0 +1,189 @@
 # Practical-1 (Uber)
 Problem Statement: Predict the price of the Uber ride from a given pickup point to the agreed drop-off location.
 Perform the following tasks:
 1. Pre-process the dataset.
 2. Identify outliers.
 3. Check the correlation.
 4. Implement linear regression and random forest regression models.
 5. Evaluate the models and compare their respective scores like R2, RMSE, etc.
 > [!NOTE]
 > Dataset available in [Datasets](../Datasets/uber.csv) directory.
 ---
 ## Steps
 0. Importing Libraries
 1. Data Loading and Pre-processing
 2. Outlier Detection and Removal
 3. Feature Engineering (Distance Calculation)
 4. Correlation Analysis
 5. Model Implementation (Linear Regression & Random Forest)
 6. Model Evaluation
 7. Comparison
 ---
 ## Code
 ### 0. Importing Libraries:
 ```python3
 # Import necessary libraries
 import pandas as pd
 import numpy as np
 import matplotlib.pyplot as plt
 import seaborn as sns
 from sklearn.model_selection import train_test_split
 from sklearn.linear_model import LinearRegression
 from sklearn.ensemble import RandomForestRegressor
 from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
 from math import radians, cos, sin, asin, sqrt
 ```
 ### 1. Data Loading & Preprocessing:
 ```python3
 # Load the dataset
 df = pd.read_csv("uber.csv")   # change to your local path if needed
 print("Initial Data Shape:", df.shape)
 print(df.head())
 # Convert pickup_datetime to a datetime object (unparseable values become NaT)
 df['pickup_datetime'] = pd.to_datetime(df['pickup_datetime'], errors='coerce')
 # Drop rows with missing values, including timestamps that failed to parse
 df.dropna(inplace=True)
 print("After dropping missing values:", df.shape)
 # Extract useful datetime features
 df['hour'] = df['pickup_datetime'].dt.hour
 df['day'] = df['pickup_datetime'].dt.day
 df['month'] = df['pickup_datetime'].dt.month
 df['year'] = df['pickup_datetime'].dt.year
 df['day_of_week'] = df['pickup_datetime'].dt.dayofweek
 # Drop the raw datetime and key columns (not needed as direct features)
 df.drop(['pickup_datetime', 'key'], axis=1, inplace=True, errors='ignore')
 print("\nColumns after feature extraction:\n", df.columns)
 ```
 ### 2. Outlier Detection & Removal:
 ```python3
 # Remove entries with unrealistic fares
 df = df[(df['fare_amount'] > 0) & (df['fare_amount'] < 100)]
 # Remove unrealistic latitude and longitude values
 df = df[(df['pickup_latitude'] <= 90) & (df['pickup_latitude'] >= -90)]
 df = df[(df['dropoff_latitude'] <= 90) & (df['dropoff_latitude'] >= -90)]
 df = df[(df['pickup_longitude'] <= 180) & (df['pickup_longitude'] >= -180)]
 df = df[(df['dropoff_longitude'] <= 180) & (df['dropoff_longitude'] >= -180)]
 print("Data shape after removing outliers:", df.shape)
 ```
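The filters above use hand-picked thresholds. As a complementary way to identify outliers, here is a minimal sketch of the interquartile-range (IQR) rule. The `iqr_bounds` helper, the multiplier `k=1.5`, and the sample fares are illustrative assumptions, not part of the original practical.

```python
import pandas as pd

def iqr_bounds(series, k=1.5):
    """Return (lower, upper) fences using the IQR rule (hypothetical helper)."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Made-up fares, with one obviously extreme value
fares = pd.Series([5.0, 6.5, 7.0, 8.0, 9.5, 10.0, 12.0, 250.0])
low, high = iqr_bounds(fares)

# Values outside the fences are flagged as outliers
outliers = fares[(fares < low) | (fares > high)]
print("Fences:", low, high)
print("Outliers:", outliers.tolist())
```

On the real data, the same helper could be applied to `df['fare_amount']` before deciding which rows to drop.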
 ### 3. Feature Engineering - Distance Calculation:
 ```python3
 # Define Haversine function to calculate distance between pickup and drop-off
 def haversine(lat1, lon1, lat2, lon2):
    # convert decimal degrees to radians
    lat1, lon1, lat2, lon2 = map(radians, [lat1, lon1, lat2, lon2])
    # haversine formula
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
    c = 2 * asin(sqrt(a))
    km = 6371 * c
    return km
 # Apply the Haversine formula
 df['distance_km'] = df.apply(lambda x: haversine(x['pickup_latitude'], x['pickup_longitude'],
                                                 x['dropoff_latitude'], x['dropoff_longitude']), axis=1)
 # Remove zero-distance trips
 df = df[df['distance_km'] > 0]
 ```
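The row-wise `df.apply` call above runs the Python function once per row, which can be slow on large files. A vectorized variant with NumPy is sketched below. `haversine_np` is a hypothetical name and the coordinates are made up for illustration.

```python
import numpy as np

def haversine_np(lat1, lon1, lat2, lon2):
    """Great-circle distance in km for arrays of coordinates in decimal degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 6371 * 2 * np.arcsin(np.sqrt(a))  # 6371 km: mean Earth radius

# Two made-up trips: a quarter of the equator, and a zero-length trip
pickup_lat = np.array([0.0, 40.75])
pickup_lon = np.array([0.0, -73.99])
dropoff_lat = np.array([0.0, 40.75])
dropoff_lon = np.array([90.0, -73.99])

distances = haversine_np(pickup_lat, pickup_lon, dropoff_lat, dropoff_lon)
print(distances)
```

Assuming the same column names as above, it could replace the `apply` call with `df['distance_km'] = haversine_np(df['pickup_latitude'], df['pickup_longitude'], df['dropoff_latitude'], df['dropoff_longitude'])`.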
 ### 4. Correlation Analysis:
 ```python3
 plt.figure(figsize=(10, 6))
 sns.heatmap(df.corr(), annot=True, cmap='coolwarm')
 plt.title("Feature Correlation Heatmap")
 plt.show()
 ```
 ### 5. Model Training:
 ```python3
 # Define features and target
 X = df[['distance_km', 'hour', 'day', 'month', 'year', 'day_of_week']]
 y = df['fare_amount']
 # Split data into train and test sets
 X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
 # -------------------- Linear Regression --------------------
 lr_model = LinearRegression()
 lr_model.fit(X_train, y_train)
 y_pred_lr = lr_model.predict(X_test)
 # -------------------- Random Forest Regression --------------------
 rf_model = RandomForestRegressor(n_estimators=100, random_state=42)
 rf_model.fit(X_train, y_train)
 y_pred_rf = rf_model.predict(X_test)
 ```
 ### 6. Model Evaluation:
 ```python3
 def evaluate_model(y_true, y_pred, model_name):
    r2 = r2_score(y_true, y_pred)
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    mae = mean_absolute_error(y_true, y_pred)
    print(f"\nModel: {model_name}")
    print(f"R² Score: {r2:.4f}")
    print(f"RMSE: {rmse:.4f}")
    print(f"MAE: {mae:.4f}")
    return r2, rmse, mae
 # Evaluate both models
 lr_scores = evaluate_model(y_test, y_pred_lr, "Linear Regression")
 rf_scores = evaluate_model(y_test, y_pred_rf, "Random Forest Regressor")
 ```
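To make the three scores concrete, the sketch below computes them by hand with NumPy on four made-up predictions. The arrays are assumptions for illustration, and the formulas are the standard definitions rather than anything specific to this dataset.

```python
import numpy as np

# Made-up targets and predictions
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 6.5, 9.5])

errors = y_true - y_pred

# MAE: mean of the absolute errors
mae = np.mean(np.abs(errors))

# RMSE: square root of the mean squared error
rmse = np.sqrt(np.mean(errors ** 2))

# R2: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(f"MAE: {mae:.4f}, RMSE: {rmse:.4f}, R2: {r2:.4f}")
```

Lower RMSE and MAE and a higher R² indicate a better fit, which is how the two models are compared in the next step.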
 ### 7. Comparison:
 ```python3
 results = pd.DataFrame({
    'Model': ['Linear Regression', 'Random Forest Regressor'],
    'R2': [lr_scores[0], rf_scores[0]],
    'RMSE': [lr_scores[1], rf_scores[1]],
    'MAE': [lr_scores[2], rf_scores[2]]
 })
 print("\nModel Comparison:")
 print(results)
 ```
 ```python3
 # Plot comparison
 plt.figure(figsize=(8,5))
 sns.barplot(x='Model', y='R2', data=results)
 plt.title("R² Score Comparison between Models")
 plt.show()
 ```
 ---
 ## Miscellaneous
 - [Dataset source](https://www.kaggle.com/datasets/yasserh/uber-fares-dataset)
 ---
@@ -0,0 +1,114 @@
 # Practical-2 (Spam Email Detection)
 Problem Statement: Classify the email using the binary classification method. Email Spam detection has two states: a) Normal State – Not Spam, b) Abnormal State – Spam. Use K-Nearest Neighbors and Support Vector Machine for classification. Analyze their performance. 
 > [!NOTE]
 > Dataset available in [Datasets](../Datasets/emails.csv) directory.
 ---
 ## Steps
 1. Import libraries
 2. Load dataset
 3. Data splitting (training and testing)
 4. KNN
 5. SVM
 6. Plotting
 ---
 ## Code
 ### 1. Import libraries:
 ```python3
 import pandas as pd
 from sklearn.model_selection import train_test_split
 from sklearn.neighbors import KNeighborsClassifier
 from sklearn.svm import SVC
 from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
 import matplotlib.pyplot as plt
 import seaborn as sns
 ```
 ### 2. Load dataset:
 ```python3
 df = pd.read_csv("emails.csv", encoding="ISO-8859-1")  # Adjust path if needed
 # Drop unnecessary columns if present
 if "Email No." in df.columns:
    df = df.drop(columns=["Email No."])
 # Ensure label is integer
 df["Prediction"] = df["Prediction"].astype(int)
 # Features & target
 X = df.drop(columns=["Prediction"])
 y = df["Prediction"]
 # Print basic info
 print(df.columns)
 print(df.head(5))
 ```
 ### 3. Data splitting (training and testing):
 ```python3
 X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
 )
 ```
 ### 4. KNN:
 ```python3
 knn = KNeighborsClassifier(n_neighbors=5)
 knn.fit(X_train, y_train)
 y_pred_knn = knn.predict(X_test)
 print("\n--- KNN Performance ---")
 print("Accuracy:", accuracy_score(y_test, y_pred_knn))
 print("Classification Report:\n", classification_report(y_test, y_pred_knn))
 print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred_knn))
 ```
 ### 5. SVM:
 ```python3
 svm = SVC(kernel='linear', random_state=42)  # Linear kernel, a common choice for high-dimensional word-count features
 svm.fit(X_train, y_train)
 y_pred_svm = svm.predict(X_test)
 print("\n--- SVM Performance ---")
 print("Accuracy:", accuracy_score(y_test, y_pred_svm))
 print("Classification Report:\n", classification_report(y_test, y_pred_svm))
 print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred_svm))
 ```
 ### 6. Plotting:
 ```python3
 fig, ax = plt.subplots(1, 2, figsize=(12, 5))
 sns.heatmap(confusion_matrix(y_test, y_pred_knn), annot=True, fmt="d", cmap="Blues", ax=ax[0])
 ax[0].set_title("KNN Confusion Matrix")
 ax[0].set_xlabel("Predicted")
 ax[0].set_ylabel("Actual")
 sns.heatmap(confusion_matrix(y_test, y_pred_svm), annot=True, fmt="d", cmap="Greens", ax=ax[1])
 ax[1].set_title("SVM Confusion Matrix")
 ax[1].set_xlabel("Predicted")
 ax[1].set_ylabel("Actual")
 plt.show()
 ```
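The confusion matrices plotted above can be reduced to a few summary numbers. The sketch below derives accuracy, precision, and recall from a 2x2 matrix laid out the way scikit-learn's `confusion_matrix` returns it (rows are actual classes, columns are predicted classes). The counts are hypothetical and are not results from this dataset.

```python
import numpy as np

# Hypothetical confusion matrix: [[TN, FP], [FN, TP]]
cm = np.array([[700, 35],
               [20, 280]])

tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()   # share of all emails classified correctly
precision = tp / (tp + fp)        # share of predicted spam that is really spam
recall = tp / (tp + fn)           # share of real spam that was caught

print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
```

For spam filtering, precision reflects how often legitimate mail is wrongly flagged, while recall reflects how much spam slips through.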
 ---
 ## Miscellaneous
 - [Dataset source](https://www.kaggle.com/datasets/balaka18/email-spam-classification-dataset-csv)
 ---
@@ -0,0 +1,82 @@
 # Practical-4 (Gradient Descent Algorithm)
 Problem Statement: Implement Gradient Descent Algorithm to find the local minima of a function. For example, find the local minima of the function y=(x+3)² starting from the point x=2.
 ---
 ## Steps
 1. Define the function and its derivative
 2. Initialize parameters for Gradient Descent
 3. Gradient Descent Loop
 4. Print the result
 5. Plotting
 ---
 ## Code
 ### 0. Import libraries:
 ```python3
 import numpy as np
 import matplotlib.pyplot as plt
 ```
 ### 1. Define the function and its derivative:
 ```python3
 def f(x):
    return (x + 3)**2
 def grad_f(x):
    return 2 * (x + 3)  # derivative of f(x)
 ```
 ### 2. Initialize parameters for Gradient Descent:
 ```python3
 x_current = 2          # starting point
 learning_rate = 0.1    # step size
 tolerance = 1e-6       # convergence tolerance
 max_iterations = 25    # maximum iterations
 history = [x_current]  # storing history
 ```
 ### 3. Gradient Descent Loop:
 ```python3
 for i in range(max_iterations):
    gradient = grad_f(x_current)
    x_next = x_current - learning_rate * gradient  # update step
    step_size = abs(x_next - x_current)
    x_current = x_next
    history.append(x_current)
    print(f"Iteration {i+1}: x = {x_current:.4f}, f(x) = {f(x_current):.4f}")
    # Check convergence after recording the latest point
    if step_size < tolerance:
        print(f"Converged after {i+1} iterations.")
        break
 ```
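For this particular function the update can be solved in closed form, which gives a quick sanity check on the loop. Substituting the gradient into the update gives `x_next + 3 = (1 - 2 * learning_rate) * (x_current + 3)`, so after `n` steps the distance to the minimum at `x = -3` shrinks by a factor of `(1 - 2 * learning_rate) ** n`. The sketch below compares this formula with the iterative update. `run_gd` and `closed_form` are hypothetical helper names.

```python
def run_gd(x0, learning_rate, n_steps):
    """Apply the gradient descent update for f(x) = (x + 3)**2 n_steps times."""
    x = x0
    for _ in range(n_steps):
        x = x - learning_rate * 2 * (x + 3)
    return x

def closed_form(x0, learning_rate, n_steps):
    """Closed-form position after n_steps, valid only for this quadratic."""
    return -3 + (1 - 2 * learning_rate) ** n_steps * (x0 + 3)

x_iterative = run_gd(2.0, 0.1, 25)
x_formula = closed_form(2.0, 0.1, 25)
print("Iterative:  ", x_iterative)
print("Closed form:", x_formula)
print("Distance from minimum:", abs(x_formula + 3))
```

With the starting point 2 and learning rate 0.1 used here, 25 steps leave `x` roughly 0.019 away from -3, so the loop ends by reaching `max_iterations` rather than the tolerance.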
 ### 4. Print the result:
 ```python3
 print("Local minima at x =", x_current)
 print("Function value at local minima y =", f(x_current))
 ```
 ### 5. Plotting:
 ```python3
 plt.plot(history, [f(val) for val in history], marker='o')
 plt.xlabel("x values")
 plt.ylabel("f(x)")
 plt.title("Gradient Descent Convergence")
 plt.grid()
 plt.show()
 ```
 ---
@@ -0,0 +1,121 @@
 # Practical-6 (Clustering)
 Problem Statement: Implement K-Means clustering/ hierarchical clustering on `sales_data_sample.csv` dataset. Determine the number of clusters using the elbow method.
 > [!NOTE]
 > Dataset available in [Datasets](../Datasets/sales_data_sample.csv) directory.
 ---
 ## Steps
 1. Import libraries
 2. Load dataset
 3. Select numerical features for clustering
 4. Standardize data
 5. K-Means clustering
 6. Hierarchical clustering
 ---
 ## Code
 ### 1. Import libraries:
 ```python3
 import pandas as pd
 import matplotlib.pyplot as plt
 from sklearn.preprocessing import StandardScaler
 from sklearn.cluster import KMeans
 from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
 import seaborn as sns
 ```
 ### 2. Load dataset:
 ```python3
 df = pd.read_csv("sales_data_sample.csv", encoding='latin1', on_bad_lines='skip')
 print("Dataset shape:", df.shape)
 print(df.head())
 ```
 ### 3. Select numerical features for clustering:
 ```python3
 X = df.select_dtypes(include=['int64', 'float64'])
 print("Features used for clustering:\n", X.head())
 # Select relevant numeric columns
 # X = df[['SALES', 'QUANTITYORDERED', 'PRICEEACH']]
 # Handle missing values if any
 # X = X.dropna()
 ```
 ### 4. Standardize data:
 ```python3
 scaler = StandardScaler()
 X_scaled = scaler.fit_transform(X)
 ```
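`StandardScaler` rescales each column to zero mean and unit variance, so that columns with large values do not dominate the distance calculations. A minimal NumPy sketch of the same transformation on a made-up array (the values are assumptions for illustration):

```python
import numpy as np

# Two made-up columns on very different scales
X_demo = np.array([[1.0, 100.0],
                   [2.0, 200.0],
                   [3.0, 300.0]])

mean = X_demo.mean(axis=0)
std = X_demo.std(axis=0)  # population standard deviation (ddof=0), as StandardScaler uses

X_demo_scaled = (X_demo - mean) / std
print(X_demo_scaled)
```

After scaling, both columns contain the same values, which shows that the original difference in magnitude no longer matters.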
 ### 5. K-Means clustering:
 ```python3
 # Determine optimal number of clusters using Elbow Method
 wcss = []
 for k in range(1, 11):
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(X_scaled)
    wcss.append(kmeans.inertia_)
 # Plot Elbow Method
 plt.figure(figsize=(6,4))
 plt.plot(range(1, 11), wcss, marker='o')
 plt.title('Elbow Method')
 plt.xlabel('Number of clusters (k)')
 plt.ylabel('Inertia (WCSS)')
 plt.show()
 # Fit KMeans with chosen number of clusters (example: 3 clusters)
 kmeans = KMeans(n_clusters=3, random_state=42)  # Pass n_init=10 explicitly to suppress the FutureWarning raised by some scikit-learn versions
 clusters_kmeans = kmeans.fit_predict(X_scaled)
 df['KMeans_Cluster'] = clusters_kmeans
 # Visualize clusters
 sns.scatterplot(x='SALES', y='PRICEEACH', hue='KMeans_Cluster', data=df, palette='viridis')
 plt.title("K-Means Clustering")
 plt.show()
 print("\nK-Means Cluster Centers:\n", kmeans.cluster_centers_)
 print("\nCluster counts:\n", df['KMeans_Cluster'].value_counts())
 ```
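The inertia plotted in the elbow chart is the within-cluster sum of squares (WCSS): the total squared distance from each point to the centroid of its cluster. The sketch below computes it by hand for a tiny made-up dataset. `wcss` is a hypothetical helper, and the cluster labels are assigned manually rather than by `KMeans`.

```python
import numpy as np

def wcss(points, labels):
    """Sum of squared distances from each point to its cluster centroid."""
    total = 0.0
    for label in np.unique(labels):
        members = points[labels == label]
        centroid = members.mean(axis=0)
        total += ((members - centroid) ** 2).sum()
    return total

# Four made-up points forming two obvious groups
points = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])

one_cluster = wcss(points, np.array([0, 0, 0, 0]))
two_clusters = wcss(points, np.array([0, 0, 1, 1]))
print("WCSS with k=1:", one_cluster)
print("WCSS with k=2:", two_clusters)
```

The sharp drop from one cluster to two is the kind of bend the elbow method looks for. Beyond the natural number of groups, adding clusters reduces WCSS only slightly.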
 ### 6. Hierarchical clustering:
 ```python3
 # Create linkage matrix
 Z = linkage(X_scaled, method='ward')
 # Plot dendrogram
 plt.figure(figsize=(10,5))
 dendrogram(Z)
 plt.title('Hierarchical Clustering Dendrogram')
 plt.xlabel('Samples')
 plt.ylabel('Distance')
 plt.show()
 # Assign clusters (example: 3 clusters)
 clusters_hier = fcluster(Z, t=3, criterion='maxclust')
 df['Hierarchical_Cluster'] = clusters_hier
 print("\nHierarchical Cluster counts:\n", pd.Series(clusters_hier).value_counts())
 ```
 ---
 ## Miscellaneous
 - [Dataset source](https://www.kaggle.com/datasets/kyanyoga/sample-sales-data)
 ---
@@ -10,6 +10,24 @@ This repository contains vital resources for the Machine Learning course under t
 ### Codes
 1. [Code-1 (Uber)](Codes/Code-1.md)
 2. [Code-2 (Spam Email Detection)](Codes/Code-2.md)
 3. [Code-4 (Gradient Descent Algorithm)](Codes/Code-4.md)
 4. [Code-6 (Clustering)](Codes/Code-6.md)
 ### Jupyter Notebooks
 1. [Notebook-1 (Uber)](Notebooks/Notebook-1.ipynb)
 2. [Notebook-2 (Spam Email Detection)](Notebooks/Notebook-2.ipynb)
 3. [Notebook-4 (Gradient Descent Algorithm)](Notebooks/Notebook-4.ipynb)
 4. [Notebook-6 (Clustering)](Notebooks/Notebook-6.ipynb)
 ### Datasets
 1. [Dataset for Practical-1](Datasets/uber.csv)
 2. [Dataset for Practical-2](Datasets/emails.csv)
 3. [Dataset for Practical-6](Datasets/sales_data_sample.csv)
 ### Assignments
 - Assignment-1:
@@ -36,6 +54,8 @@ This repository contains vital resources for the Machine Learning course under t
 ### [IN-SEM PYQ Answers](Notes/IN-SEM%20PYQ%20Answers)
 ### [END-SEM PYQ Answers](Notes/END-SEM%20PYQ%20Answers)
 ---
 ## Miscellaneous
Author	SHA1	Message	Date
notkshitij	74f3731e77	Upload end-sem pyq for ML, november-december 2025. Provided by Ayush Kalaskar.	2026-03-22 02:18:47 +05:30
notkshitij	e7cae4d012	Added end-sem pyq answers for unit 6. Collaborative work by Ayush Kalaskar and Himanshu Patil.	2025-12-06 22:19:15 +05:30
notkshitij	7acb4c9f4c	Added end-sem pyq answers for unit 5. Collaborative work by Ayush Kalaskar and Himanshu Patil.	2025-12-06 21:30:35 +05:30
notkshitij	63dd320c66	Minor fix: Changed MAY/JUNE 2022 to NOV/DEC 2022.	2025-12-06 21:18:46 +05:30
notkshitij	8619784c68	Added link to end-sem pyq answers.	2025-12-06 01:40:26 +05:30
notkshitij	124f6b7878	Added end-sem pyq answers for unit 4. Collaborative work by Ayush Kalaskar and Himanshu Patil.	2025-12-06 01:39:51 +05:30
notkshitij	d1004db0af	Added end-sem pyq answers or unit 3. Collaborative work by Ayush Kalaskar and Himanshu Patil.	2025-12-06 00:54:34 +05:30
notkshitij	c708ed57ad	Added may-june 2025 end-sem pyq (ml)	2025-12-02 13:53:39 +05:30
notkshitij	c313cb0ec9	Fixed hierarchical spelling.	2025-11-05 18:57:34 +05:30
notkshitij	a588629812	Added links to codes, notebooks and datasets in readme.	2025-11-03 00:15:17 +05:30
notkshitij	c0c22a12e7	Modified file permissions for datasets.	2025-11-03 00:15:05 +05:30
notkshitij	95f1dcc828	Fixed naming for notebooks.	2025-11-03 00:14:40 +05:30
notkshitij	ff7638bd70	Improved formatting for markdown codes and fixed title for all.	2025-11-03 00:12:37 +05:30
notkshitij	1432c59bc4	Added code in markdown format for A6.	2025-11-03 00:04:26 +05:30
notkshitij	bb9a370a98	Added jupyter notebook for practical A6 and its dataset.	2025-11-03 00:04:08 +05:30
notkshitij	c4c460a81f	Fixed name for a4 code (in codes dir)	2025-11-03 00:00:04 +05:30
notkshitij	a0d06838c2	Added code for practical A4 in markdown format.	2025-11-02 23:34:41 +05:30
notkshitij	14d70779e9	Added jupyter notebook for practical A4.	2025-11-02 23:34:06 +05:30
notkshitij	8cf306ce2a	Added code in markdown format for code a2.	2025-11-02 22:03:28 +05:30
notkshitij	0e97128ff2	Added jupyter notebook and dataset for practical a2 (spam email detection)	2025-11-02 22:03:10 +05:30
notkshitij	5df648ac33	Added import libraries part.	2025-11-02 22:02:44 +05:30
notkshitij	e5347feffc	Changed dataset to dataset source in code a1	2025-11-02 21:59:18 +05:30
notkshitij	9c83c1aab2	Added name in title Code-A1.md	2025-11-02 21:49:16 +05:30
notkshitij	fc3b508b39	Added code in markdown format for practical A1 (uber ride)	2025-11-02 20:32:12 +05:30
notkshitij	ce5c95856d	Added Jupyter notebook and dataset for practical A1 (uber ride)	2025-11-02 20:31:52 +05:30