
# A5 - Data Analytics-2

---

## Pre-requisites

- Install required libraries: `pandas` & `scikit-learn`

```shell
pip install pandas
pip install -U scikit-learn
```

- Save the dataset [Social_Network_Ads.csv](https://git.kska.io/sppu-te-comp-content/DataScienceAndBigDataAnalytics/src/branch/main/Datasets/Social_Network_Ads.csv) in the same directory as this Jupyter notebook.

---

## Code blocks

1. Import libraries:

```python
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score
```

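2. Load the dataset from the CSV file into a pandas DataFrame and encode `Gender` as numbers (Male = 0, Female = 1):

```python
df = pd.read_csv("Social_Network_Ads.csv")
# Encode Gender numerically; assigning the result back avoids chained in-place modification
df["Gender"] = df["Gender"].map({"Male": 0, "Female": 1})
df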
```
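The encoding step can be checked on a few rows before applying it to the full dataset. The following is a minimal sketch on hypothetical toy data (the `toy` frame and `encoded` name are illustrative assumptions, not part of the notebook). It uses `Series.map`, which returns a new Series and turns any value missing from the dictionary into `NaN`, so counting `NaN` values is a quick way to spot unexpected categories.

```python
import pandas as pd

# Hypothetical toy rows standing in for the Gender column of the dataset
toy = pd.DataFrame({"Gender": ["Male", "Female", "Female", "Male"]})

# map() returns a new Series; values not present in the dictionary become NaN
encoded = toy["Gender"].map({"Male": 0, "Female": 1})

print(encoded.tolist())      # [0, 1, 1, 0]
print(encoded.isna().sum())  # 0, meaning every value was covered by the mapping
```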

3. Print columns of the DataFrame:

```python
df.columns
```

4. Defining the feature set (`x`) and the target variable (`y`):

```python
x = df[['User ID', 'Gender', 'Age', 'EstimatedSalary']]
# Select the target as a 1-D Series; scikit-learn warns when y is a single-column DataFrame
y = df['Purchased']
```

5. Splitting the dataset into training and testing sets (75% training, 25% testing):

```python
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=29)
```
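To see what `test_size=0.25` and `random_state` do, here is a small sketch on hypothetical stand-in data (the `x_demo` and `y_demo` arrays are assumptions for illustration, not the real dataset). With 20 samples, a quarter (5 rows) goes to the test set, and repeating the call with the same `random_state` reproduces the same split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 20 samples with 2 features each
x_demo = np.arange(40).reshape(20, 2)
y_demo = np.array([0, 1] * 10)

x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo, test_size=0.25, random_state=29
)
print(x_tr.shape, x_te.shape)  # (15, 2) (5, 2)

# Repeating the call with the same random_state yields an identical split
x_tr2, x_te2, y_tr2, y_te2 = train_test_split(
    x_demo, y_demo, test_size=0.25, random_state=29
)
print(np.array_equal(x_te, x_te2))  # True
```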

6. Creating an instance of the Logistic Regression model & fitting the model to the training data:

```python
model = LogisticRegression()
model.fit(x_train, y_train)
```

7. Making and displaying predictions on the test set using the trained model:

```python
y_pred = model.predict(x_test)
y_pred
```

8. Evaluating the model's mean accuracy on the training set:

```python
model.score(x_train, y_train)
```

9. Evaluating the model's mean accuracy on the entire dataset (training and testing rows combined):

```python
model.score(x, y)
```
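For a classifier, `score` returns the mean accuracy, that is, the same value as `accuracy_score` applied to the model's own predictions. The sketch below illustrates this on synthetic data from `make_classification` (the data and the names `x_demo`, `y_demo`, and `clf` are assumptions for illustration only). Scores computed on rows the model was trained on tend to be optimistic, so the held-out figure, which in this notebook would be `model.score(x_test, y_test)`, is usually the more informative one.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Synthetic stand-in data: 100 samples, 4 features, 2 classes
x_demo, y_demo = make_classification(n_samples=100, n_features=4, random_state=0)

clf = LogisticRegression().fit(x_demo, y_demo)

# score() computes accuracy on the predictions for the data passed in
score = clf.score(x_demo, y_demo)
manual = accuracy_score(y_demo, clf.predict(x_demo))

print(score == manual)  # True
```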

10. Generating and displaying the confusion matrix to evaluate the model's predictions:

```python
cm = confusion_matrix(y_test, y_pred)
cm
```

11. Unpacking the confusion matrix into true negatives (tn), false positives (fp), false negatives (fn), and true positives (tp), then printing them:

```python
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print(tn, fp, fn, tp)
```
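The unpacking order follows the layout of the matrix: rows are actual classes, columns are predicted classes, so for binary labels `ravel()` yields `tn, fp, fn, tp`. A short sketch with hypothetical labels (the `y_true_demo` and `y_pred_demo` lists are made up for illustration) makes the layout concrete.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels chosen so that all four counts differ where possible
y_true_demo = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred_demo = [0, 1, 1, 1, 0, 1, 1, 0]

cm_demo = confusion_matrix(y_true_demo, y_pred_demo)
print(cm_demo)
# [[2 2]    row 0: actual 0 -> [tn, fp]
#  [1 3]]   row 1: actual 1 -> [fn, tp]

# ravel() flattens row by row, giving tn, fp, fn, tp
tn_demo, fp_demo, fn_demo, tp_demo = cm_demo.ravel()
print(tn_demo, fp_demo, fn_demo, tp_demo)  # 2 2 1 3
```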

12. Calculating and displaying the accuracy score of the model on the test set:

```python
a = accuracy_score(y_test, y_pred)
a
```

13. Calculating and displaying the error rate (1 - accuracy):

```python
e = 1 - a
e
```

14. Calculating the precision score of the model:

```python
precision_score(y_test, y_pred)
```

15. Calculating the recall score of the model:

```python
recall_score(y_test, y_pred)
```
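Accuracy, error rate, precision, and recall can all be derived from the four confusion-matrix counts. The sketch below computes them by hand on hypothetical labels (the `y_true_demo` and `y_pred_demo` lists are illustrative assumptions) and compares the results with scikit-learn's functions.

```python
import math
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_score,
    recall_score,
)

# Hypothetical labels for illustration only
y_true_demo = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred_demo = [0, 1, 1, 1, 0, 1, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)  # share of all predictions that are correct
error_rate = 1 - accuracy                   # share of all predictions that are wrong
precision = tp / (tp + fp)                  # share of predicted positives that are correct
recall = tp / (tp + fn)                     # share of actual positives that were found

print(accuracy, error_rate, precision, recall)  # 0.625 0.375 0.6 0.75

# The hand-computed values agree with scikit-learn's metric functions
print(math.isclose(accuracy, accuracy_score(y_true_demo, y_pred_demo)))    # True
print(math.isclose(precision, precision_score(y_true_demo, y_pred_demo)))  # True
print(math.isclose(recall, recall_score(y_true_demo, y_pred_demo)))        # True
```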

---

## References

1. [Jupyter notebook](https://github.com/ganimtron-10/SPPU-2019-TE-DSBDA-Lab/blob/master/Group-A/Q5.ipynb)
2. [Dataset source](https://www.kaggle.com/datasets/akram24/social-network-ads)

---