Practical-2b (Classification using Deep Neural Network - IMDB Dataset)
Problem Statement: Binary classification using deep neural networks. Example: classify movie reviews as "positive" or "negative" based only on the text content of the reviews. Use the IMDB dataset.
Note
Dataset available in Datasets directory.
Prerequisites
- Install packages using pip: pip install tensorflow keras pandas numpy scikit-learn matplotlib seaborn (tensorflow requires Python 3.9 - 3.12)
- Copy the IMDB Dataset.csv dataset into the same directory as the Jupyter notebook.
Steps
- Import Libraries
- Load Dataset
- Exploratory Data Analysis (EDA)
- Data Cleaning - Strip HTML Tags
- Encode Labels and Separate Features
- Tokenize and Pad Text Sequences
- Split into Training and Testing Sets
- Build the Neural Network Model
- Compile the Model
- Train the Model
- Evaluate the Model on Test Data
- Plot Training vs Validation Accuracy
- Plot Training vs Validation Loss
- Confusion Matrix and Classification Report
Code
1. Import Libraries:
import re
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import confusion_matrix, classification_report
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Embedding, GlobalAveragePooling1D
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
2. Load Dataset:
data = pd.read_csv('IMDB Dataset.csv')
print(data.head())
3. Exploratory Data Analysis (EDA):
print("Shape:", data.shape)
print("\nMissing Values:\n", data.isnull().sum())
print("\nClass Distribution:\n", data['sentiment'].value_counts())
# Visualize class distribution
sns.countplot(x='sentiment', data=data)
plt.title('Sentiment Class Distribution')
plt.show()
# Sample reviews
print("\nSample positive review:\n", data[data['sentiment'] == 'positive']['review'].iloc[0][:300])
print("\nSample negative review:\n", data[data['sentiment'] == 'negative']['review'].iloc[0][:300])
4. Data Cleaning - Strip HTML Tags:
def clean_text(text):
    text = re.sub(r'<.*?>', ' ', text)  # replace HTML tags like <br /> with a space so adjacent words stay separate
    text = text.lower().strip()         # lowercase and trim whitespace
    return text
data['review'] = data['review'].apply(clean_text)
print("Sample cleaned review:\n", data['review'].iloc[0][:300])
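To see what the tag-stripping pattern does, the sketch below runs it on a made-up string. The sample text and the helper name strip_tags are illustrative only and do not come from the dataset. It compares two replacement strings: substituting an empty string joins words that were separated only by a tag, while substituting a space keeps them apart.

```python
import re

def strip_tags(text, replacement):
    """Replace anything that looks like an HTML tag (non-greedy match)."""
    return re.sub(r'<.*?>', replacement, text)

# Made-up sample, not taken from IMDB Dataset.csv
sample = "I loved it<br /><br />the ending was weak"

fused = strip_tags(sample, '')    # tags dropped, neighbouring words joined
spaced = strip_tags(sample, ' ')  # each tag replaced by one space

print(fused)           # I loved itthe ending was weak
print(spaced.split())  # ['I', 'loved', 'it', 'the', 'ending', 'was', 'weak']
```

The difference only matters when no whitespace or punctuation sits next to the tag, but in that case the fused token ("itthe") would be treated as a single unknown word.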
5. Encode Labels and Separate Features:
label_encoder = LabelEncoder()
data['sentiment'] = label_encoder.fit_transform(data['sentiment']) # positive=1, negative=0
X = data['review'].values # input: review text
y = data['sentiment'].values # output: 0 or 1
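The mapping positive=1, negative=0 follows from LabelEncoder numbering the unique labels in sorted order. The plain-Python sketch below imitates that behavior on a made-up label list; it is an illustration of the ordering rule, not scikit-learn's implementation.

```python
# Made-up labels for illustration
labels = ['positive', 'negative', 'negative', 'positive']

# Sorted unique labels get consecutive integers starting at 0
classes = sorted(set(labels))
mapping = {label: idx for idx, label in enumerate(classes)}
encoded = [mapping[label] for label in labels]

print(mapping)  # {'negative': 0, 'positive': 1}
print(encoded)  # [1, 0, 0, 1]
```

In the notebook itself, label_encoder.classes_ can be inspected after fitting to confirm the order.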
6. Tokenize and Pad Text Sequences:
vocab_size = 10000 # keep only the 10,000 most frequent words
max_length = 200 # truncate/pad all reviews to 200 tokens
tokenizer = Tokenizer(num_words=vocab_size, oov_token='<OOV>') # <OOV> stands in for words outside the vocabulary
tokenizer.fit_on_texts(X) # build word index from all reviews (this runs before the train/test split)
sequences = tokenizer.texts_to_sequences(X) # convert each word to its integer index
padded_sequences = pad_sequences(sequences, maxlen=max_length,
                                 padding='post', truncating='post') # pad/truncate at the end to a fixed length
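The tokenize-and-pad step can be pictured with a small plain-Python imitation. This is a sketch of the idea only, not the Keras implementation: the helper names build_word_index and to_padded are hypothetical, the texts are made up, and punctuation filtering is ignored. It assumes index 0 is reserved for padding, index 1 for the out-of-vocabulary token, and that more frequent words get smaller indices.

```python
from collections import Counter

def build_word_index(texts, oov_token='<OOV>'):
    """Rank words by frequency; index 1 is reserved for the OOV token."""
    counts = Counter(word for t in texts for word in t.lower().split())
    index = {oov_token: 1}
    for rank, (word, _) in enumerate(counts.most_common(), start=2):
        index[word] = rank
    return index

def to_padded(text, word_index, num_words, max_length):
    """Map words to indices, then truncate/pad at the end ('post')."""
    ids = [word_index.get(w, 1) for w in text.lower().split()]
    ids = [i if i < num_words else 1 for i in ids]  # rare words fall back to OOV
    ids = ids[:max_length]                          # truncating='post'
    return ids + [0] * (max_length - len(ids))      # padding='post' with zeros

texts = ["the movie was great", "the movie was dull"]  # made-up corpus
word_index = build_word_index(texts)

print(word_index)
print(to_padded("the plot was great", word_index, num_words=100, max_length=6))
# [2, 1, 4, 5, 0, 0]   ("plot" was never seen, so it maps to 1)
print(to_padded("the movie was great", word_index, num_words=5, max_length=6))
# [2, 3, 4, 1, 0, 0]   ("great" has index 5, outside the top num_words)
```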
7. Split into Training and Testing Sets:
X_train, X_test, y_train, y_test = train_test_split(padded_sequences, y, test_size=0.2, random_state=42)
8. Build the Neural Network Model:
model = Sequential()
model.add(Embedding(vocab_size, 16)) # maps each word index to a 16-dim vector
model.add(GlobalAveragePooling1D()) # averages all word vectors into one vector
model.add(Dense(24, activation='relu')) # hidden layer: 24 neurons
model.add(Dense(1, activation='sigmoid')) # output: probability between 0 and 1 (binary)
model.summary()
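As a check on the summary, the parameter counts can be worked out by hand from the layer sizes used above. The variable names below are illustrative; the arithmetic assumes the four layers exactly as listed (embedding of 16 dimensions over 10,000 indices, pooling, a 24-unit dense layer, and a 1-unit output).

```python
vocab_size = 10000
embedding_dim = 16
hidden_units = 24

embedding_params = vocab_size * embedding_dim                # one 16-dim vector per word index
pooling_params = 0                                           # averaging has no weights
hidden_params = embedding_dim * hidden_units + hidden_units  # weights + biases
output_params = hidden_units * 1 + 1                         # weights + bias

total_params = embedding_params + pooling_params + hidden_params + output_params

print(embedding_params)  # 160000
print(hidden_params)     # 408
print(output_params)     # 25
print(total_params)      # 160433
```

Almost all trainable parameters sit in the embedding table, so vocab_size and the embedding dimension dominate the model size.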
9. Compile the Model:
# binary_crossentropy: standard loss for binary classification with a sigmoid output
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
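To make the loss concrete, the sketch below computes binary cross-entropy by hand for a few made-up predictions. It is a minimal illustration of the formula, not the Keras code; the clipping constant eps is an assumption added to avoid log(0).

```python
import math

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean of -(y*log(p) + (1-y)*log(1-p)) over all samples."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # keep p away from 0 and 1
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

# Made-up labels and predicted probabilities
confident_right = binary_crossentropy([1, 0], [0.9, 0.1])
confident_wrong = binary_crossentropy([1, 0], [0.1, 0.9])

print(round(confident_right, 4))  # 0.1054
print(round(confident_wrong, 4))  # 2.3026
```

Confident wrong predictions are penalized far more heavily than confident correct ones are rewarded, which is what pushes the sigmoid output toward the true label during training.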
10. Train the Model:
history = model.fit(X_train, y_train, epochs=10, batch_size=32, validation_split=0.2)
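The two 0.2 fractions apply at different stages: test_size=0.2 holds out test data first, and validation_split=0.2 then holds out part of the remaining training data. The sketch below works through the resulting sizes. The row count of 50,000 is an assumption about IMDB Dataset.csv (check data.shape), and the rounding is approximate, since each library handles edge cases in its own way.

```python
import math

n_reviews = 50000        # assumed row count; confirm with data.shape
test_size = 0.2
validation_split = 0.2
batch_size = 32

n_test = round(n_reviews * test_size)       # held out by train_test_split
n_train = n_reviews - n_test                # passed to model.fit
n_val = round(n_train * validation_split)   # held out by validation_split
n_fit = n_train - n_val                     # actually used for weight updates

steps_per_epoch = math.ceil(n_fit / batch_size)

print(n_test, n_train)        # 10000 40000
print(n_val, n_fit)           # 8000 32000
print(steps_per_epoch)        # 1000
```

Keras takes the validation samples from the end of the arrays passed to fit, before shuffling, so the validation set stays the same across epochs.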
11. Evaluate the Model on Test Data:
loss, accuracy = model.evaluate(X_test, y_test)
print(f"Test Loss: {loss:.4f}")
print(f"Test Accuracy: {accuracy*100:.2f}%")
12. Plot Training vs Validation Accuracy:
plt.plot(history.history['accuracy'], label='Training Accuracy')
plt.plot(history.history['val_accuracy'], label='Validation Accuracy')
plt.title('Model Accuracy Over Epochs')
plt.ylabel('Accuracy')
plt.xlabel('Epoch')
plt.legend()
plt.grid(True)
plt.show()
13. Plot Training vs Validation Loss:
plt.plot(history.history['loss'], label='Training Loss')
plt.plot(history.history['val_loss'], label='Validation Loss')
plt.title('Model Loss Over Epochs')
plt.ylabel('Loss')
plt.xlabel('Epoch')
plt.legend()
plt.grid(True)
plt.show()
14. Confusion Matrix and Classification Report:
y_pred = (model.predict(X_test) > 0.5).astype(int).ravel() # threshold 0.5: prob > 0.5 = positive; ravel() flattens (n, 1) to (n,)
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=['Negative', 'Positive'],
            yticklabels=['Negative', 'Positive'])
plt.title('Confusion Matrix')
plt.ylabel('Actual')
plt.xlabel('Predicted')
plt.show()
print("\nClassification Report:\n", classification_report(y_test, y_pred, target_names=['Negative', 'Positive']))
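The numbers in the classification report can be derived directly from the confusion matrix. The sketch below does this for the positive class using made-up counts (they are not results from this model). It assumes the layout used above: rows are actual classes, columns are predicted classes, with Negative first.

```python
# Hypothetical counts, laid out as [[TN, FP], [FN, TP]]
cm = [[4300, 700],
      [600, 4400]]

tn, fp = cm[0]
fn, tp = cm[1]

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of reviews predicted positive, how many were positive
recall = tp / (tp + fn)      # of actual positive reviews, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy:  {accuracy:.4f}")   # 0.8700
print(f"precision: {precision:.4f}")  # 0.8627
print(f"recall:    {recall:.4f}")     # 0.8800
print(f"f1-score:  {f1:.4f}")         # 0.8713
```

Swapping the roles of the two classes (treating Negative as the class of interest) gives the other row of the report.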