Added codes, datasets and Jupyter notebooks directory.
@@ -0,0 +1,160 @@
# A3 - Descriptive Statistics

✅ Tested and working as intended.

---

## Pre-requisites

- Install required libraries: `pandas`

```shell
pip install pandas
```

---

## Code blocks

### Problem Statement - Part 1 (data.csv)

1. Import library:

```python3
import pandas as pd
```

2. Generate data and load into DataFrame:

```python3
# Generate data
data = {
    'age': [25, 30, 22, 40, 55, 60, 33, 28, 45, 50],
    'income': [50000, 60000, 45000, 70000, 80000, 90000, 65000, 62000, 75000, 85000],
    'age_group': ['20-30', '30-40', '20-30', '40-50', '50-60', '50-60', '30-40', '20-30', '40-50', '50-60']
}

# Define data in DataFrame
df = pd.DataFrame(data)
```

3. Group data by `age_group`, compute summary statistics for `income`, and print them:

```python3
# Group the data by age_group and compute summary statistics for 'income'
summary_stats = df.groupby('age_group')['income'].describe()

# Print summary
print(summary_stats)
```
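For a numeric column, `describe()` reports the count, mean, standard deviation, minimum, quartiles, and maximum of each group. As a rough cross-check, a few of those columns can be rebuilt with explicit aggregations. The sketch below assumes the same sample data as step 2; the name `manual_stats` is only illustrative.

```python
import pandas as pd

# Same sample data as in step 2 (only the columns needed here)
data = {
    'income': [50000, 60000, 45000, 70000, 80000, 90000, 65000, 62000, 75000, 85000],
    'age_group': ['20-30', '30-40', '20-30', '40-50', '50-60', '50-60', '30-40', '20-30', '40-50', '50-60']
}
df = pd.DataFrame(data)

# Select the income column of each age group
grouped = df.groupby('age_group')['income']

# Rebuild a subset of the describe() columns with named aggregations
manual_stats = grouped.agg(['count', 'mean', 'std', 'min', 'max'])
print(manual_stats)
```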

4. Group the data by `age_group`; Select `income` column for each of the groups created; Calculate median for `income`:

```python3
# Group the data by age_group; Select income column for each of the groups created; Calculate median for income
median_income = df.groupby('age_group')['income'].median()

# Print the median
print("Median Income by Age Group:")
print(median_income)
```

5. Print column names:

```python3
print("Column Names:", df.columns)
```

6. Modify the dataset to include repeated values and load it into a DataFrame:

```python3
# Modified dataset with repeated values
data = {
    'age': [25, 30, 25, 40, 55, 60, 33, 28, 45, 50, 25, 30, 28, 30, 25],
    'income': [50000, 60000, 50000, 70000, 80000, 90000, 65000, 62000, 75000, 85000, 50000, 60000, 62000, 70000, 75000],
    'age_group': ['20-30', '30-40', '20-30', '40-50', '50-60', '50-60', '30-40', '20-30', '40-50', '50-60', '20-30', '30-40', '20-30', '30-40', '20-30']
}

# Define data in DataFrame
df = pd.DataFrame(data)
```

7. Calculate mode:

```python3
# Calculate the mode of the age and income columns
mode_age = df['age'].mode()
mode_income = df['income'].mode()
print(f"Mode of Age: {mode_age.values}")
print(f"Mode of Income: {mode_income.values}")
```
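`Series.mode()` returns a Series rather than a single number, because several values can tie for the highest frequency. The sketch below assumes the modified dataset from step 6 and collects the mode(s) of `income` for each age group. The name `modes_by_group` is hypothetical, and the plain loop over groups is just one simple way to do this.

```python
import pandas as pd

# Modified dataset from step 6 (only the columns needed here)
data = {
    'income': [50000, 60000, 50000, 70000, 80000, 90000, 65000, 62000, 75000, 85000, 50000, 60000, 62000, 70000, 75000],
    'age_group': ['20-30', '30-40', '20-30', '40-50', '50-60', '50-60', '30-40', '20-30', '40-50', '50-60', '20-30', '30-40', '20-30', '30-40', '20-30']
}
df = pd.DataFrame(data)

# Iterate over the groups and keep every tied mode as a list
modes_by_group = {
    name: incomes.mode().tolist()
    for name, incomes in df.groupby('age_group')['income']
}

for name, modes in modes_by_group.items():
    print(f"Mode(s) of Income for {name}: {modes}")
```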

---

### Problem Statement - Part 2 (iris.csv)

- Save the dataset [iris.csv](https://git.kska.io/sppu-te-comp-content/DataScienceAndBigDataAnalytics/src/branch/main/Datasets/iris.csv) in the same directory as this Jupyter notebook.

1. Load dataset and print first 5 rows:

```python3
# Load iris.csv in the DataFrame
df = pd.read_csv('iris.csv')

print(df.head()) # Print first 5 rows
```

2. Group data by `variety`; Compute summary statistics and percentiles; Display:

```python3
# Group the data by variety and compute summary statistics
summary_stats_species = df.groupby('variety').describe()

# Compute the 25th, 50th and 75th percentiles for each variety
percentiles = df.groupby('variety').quantile([0.25, 0.5, 0.75])

# Display summary statistics and percentiles
print("Summary Statistics by Species:")
print(summary_stats_species)

print("\nPercentiles by Species:")
print(percentiles)
```
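The grouped `quantile()` result is indexed by both the group and the quantile level, which can be hard to read for a single measurement. The sketch below shows one way to turn it into a table with one row per variety and one column per quantile. The `sample` frame is a hypothetical stand-in with made-up values that only borrows the column names from `iris.csv`.

```python
import pandas as pd

# Hypothetical stand-in for iris.csv: made-up values, same column names
sample = pd.DataFrame({
    'variety': ['Setosa'] * 4 + ['Versicolor'] * 4,
    'sepal.length': [5.0, 5.2, 4.8, 5.4, 6.0, 6.4, 6.2, 6.6],
    'sepal.width': [3.0, 3.4, 3.2, 3.6, 2.8, 3.0, 2.6, 3.2],
})

# The result is indexed by (variety, quantile)
percentiles = sample.groupby('variety').quantile([0.25, 0.5, 0.75])

# Pick one column and move the quantile level into the columns
width_table = percentiles['sepal.width'].unstack()
print(width_table)
```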

3. Group the data by variety; Select sepal.width column for each of the groups created; Display summary statistics:

```python3
# Group the data by variety; Select sepal.width column for each of the groups created; Display summary statistics
summary_stats_species = df.groupby('variety')['sepal.width'].describe()

print("\nSummary Statistics by Species for Sepal Width:")
print(summary_stats_species)
```

4. Group by variety and compute the median for numeric columns:

```python3
# Group by variety and compute the median for numeric columns
median_values = df.groupby('variety').median()

print("Median Values by Species:")
print(median_values)
```
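The call above works when every column other than `variety` is numeric. If a dataset also carries a text column, recent pandas versions may refuse to take its median, and passing `numeric_only=True` restricts the computation to numeric columns. The sketch below assumes a made-up `sample` frame with a hypothetical text column named `collector`; neither comes from `iris.csv`.

```python
import pandas as pd

# Hypothetical data: made-up values plus an extra non-numeric column
sample = pd.DataFrame({
    'variety': ['Setosa'] * 4 + ['Versicolor'] * 4,
    'sepal.length': [5.0, 5.2, 4.8, 5.4, 6.0, 6.4, 6.2, 6.6],
    'collector': ['a', 'b', 'a', 'b', 'a', 'b', 'a', 'b'],
})

# Restrict the median to numeric columns so the text column is skipped
median_values = sample.groupby('variety').median(numeric_only=True)
print(median_values)
```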

5. Group the data by variety; Select sepal.length column for each of the groups created; Display median:

```python3
# Group the data by variety; Select sepal.length column for each of the groups created; Display median
median_sepal_length = df.groupby('variety')['sepal.length'].median()
print("Median Sepal Length by Species:")
print(median_sepal_length)
```

6. Calculate & print mode for sepal.width:

```python3
# Calculate & print mode for sepal.width
mode_width = df['sepal.width'].mode()
print(f"Mode of Width: {mode_width.values}")
```

---