From 97ed1414ea6f46b194ae25658f9cee49c240b016 Mon Sep 17 00:00:00 2001
From: Kshitij
Date: Fri, 28 Mar 2025 11:07:20 +0530
Subject: [PATCH] Added code for A5 (data analytics 2), i.e. logistic regression.

---
 Codes/Code-A5 (Data Analytics-2).md | 130 ++++++++++++++++++++++++++++
 1 file changed, 130 insertions(+)
 create mode 100644 Codes/Code-A5 (Data Analytics-2).md

diff --git a/Codes/Code-A5 (Data Analytics-2).md b/Codes/Code-A5 (Data Analytics-2).md
new file mode 100644
index 0000000..3204980
--- /dev/null
+++ b/Codes/Code-A5 (Data Analytics-2).md
@@ -0,0 +1,130 @@
+# A5 - Data Analytics-2
+
+---
+
+## Pre-requisites:
+
+- Install the required libraries, `pandas` and `scikit-learn`:
+
+```shell
+pip install pandas
+pip install -U scikit-learn
+```
+
+- Save the dataset [Social_Network_Ads.csv](https://git.kska.io/sppu-te-comp-content/DataScienceAndBigDataAnalytics/src/branch/main/Datasets/Social_Network_Ads.csv) in the same directory as this Jupyter notebook.
+
+---
+
+## Code blocks:
+
+1. Import libraries:
+
+```python
+import pandas as pd
+
+from sklearn.model_selection import train_test_split
+from sklearn.linear_model import LogisticRegression
+from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score
+```
+
+2. Load the dataset from a CSV file into a pandas DataFrame and encode `Gender` as numbers (Male = 0, Female = 1):
+
+```python
+df = pd.read_csv("Social_Network_Ads.csv")
+df["Gender"] = df["Gender"].map({"Male": 0, "Female": 1})  # assign back instead of a chained inplace replace
+df
+```
+
+3. Print the columns of the DataFrame:
+
+```python
+df.columns
+```
+
+4. Define the feature set (`x`) and the target variable (`y`):
+
+```python
+x = df[['User ID', 'Gender', 'Age', 'EstimatedSalary']]
+y = df['Purchased']  # 1-D Series, the shape scikit-learn expects for a target
+```
+
+5. Split the dataset into training and testing sets (75% training, 25% testing):
+
+```python
+x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=29)
+```
+
+6. Create an instance of the Logistic Regression model and fit it to the training data:
+
+```python
+model = LogisticRegression()
+model.fit(x_train, y_train)
+```
+
+7. Make and display predictions on the test set using the trained model:
+
+```python
+y_pred = model.predict(x_test)
+y_pred
+```
+
+8. Compute the model's mean accuracy on the training set:
+
+```python
+model.score(x_train, y_train)
+```
+
+9. Compute the model's mean accuracy on the entire dataset:
+
+```python
+model.score(x, y)
+```
+
+10. Generate and display the confusion matrix for the test-set predictions:
+
+```python
+cm = confusion_matrix(y_test, y_pred)
+cm
+```
+
+11. Unpack the confusion matrix into true negatives (tn), false positives (fp), false negatives (fn), and true positives (tp), then print them:
+
+```python
+tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
+print(tn, fp, fn, tp)
+```
+
+12. Calculate and display the accuracy score of the model on the test set:
+
+```python
+a = accuracy_score(y_test, y_pred)
+a
+```
+
+13. Calculate and display the error rate (1 - accuracy):
+
+```python
+e = 1 - a
+e
+```
+
+14. Calculate the precision score of the model on the test set:
+
+```python
+precision_score(y_test, y_pred)
+```
+
+15. Calculate the recall score of the model on the test set:
+
+```python
+recall_score(y_test, y_pred)
+```
+
+---
+
+## References
+
+1. [Jupyter notebook](https://github.com/ganimtron-10/SPPU-2019-TE-DSBDA-Lab/blob/master/Group-A/Q5.ipynb)
+2. [Dataset source](https://www.kaggle.com/datasets/akram24/social-network-ads)
+
+---
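
Note on steps 11 to 15: accuracy, error rate, precision, and recall can all be derived from the four confusion-matrix cells, which may help when checking the scikit-learn output by hand. The sketch below is not part of the patch. The helper name `metrics_from_counts` and the counts are hypothetical, chosen only for illustration, and are not results from `Social_Network_Ads.csv`.

```python
def metrics_from_counts(tn, fp, fn, tp):
    """Derive accuracy, error rate, precision and recall from confusion-matrix cells."""
    total = tn + fp + fn + tp
    accuracy = (tp + tn) / total          # share of all predictions that were correct
    return {
        "accuracy": accuracy,
        "error_rate": 1 - accuracy,       # same definition as step 13
        # Of everything predicted positive, how much was truly positive?
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        # Of everything truly positive, how much did the model find?
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Hypothetical counts (assumption): NOT taken from the real dataset.
example = metrics_from_counts(tn=50, fp=10, fn=5, tp=35)
for name, value in example.items():
    print(f"{name}: {value:.4f}")
```

Passing the `tn, fp, fn, tp` values printed in step 11 to this helper should reproduce the numbers from steps 12 to 15, provided `Purchased` uses 1 as the positive class.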