Files

5.8 KiB

Practical-2b (Classification using Deep Neural Network - IMDB Dataset)

Problem Statement: Binary classification using Deep Neural Networks Example: Classify movie reviews into positive" reviews and "negative" reviews, just based on the text content of the reviews. Use IMDB dataset

Note

Dataset available in Datasets directory.


Pre-requisities

  1. Install packages using pip: pip install tensorflow keras pandas numpy scikit-learn matplotlib seaborn (tensorflow requires Python 3.9 - 3.12)
  2. Copy the IMDB Dataset.csv dataset in the same directory as the Jupyter notebook.

Steps

  1. Import Libraries
  2. Load Dataset
  3. Exploratory Data Analysis (EDA)
  4. Data Cleaning - Strip HTML Tags
  5. Encode Labels and Separate Features
  6. Tokenize and Pad Text Sequences
  7. Split into Training and Testing Sets
  8. Build the Neural Network Model
  9. Compile the Model
  10. Train the Model
  11. Evaluate the Model on Test Data
  12. Plot Training vs Validation Accuracy
  13. Plot Training vs Validation Loss
  14. Confusion Matrix and Classification Report

Code

1. Import Libraries:

import re
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import confusion_matrix, classification_report
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Embedding, GlobalAveragePooling1D
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

2. Load Dataset:

data = pd.read_csv('IMDB Dataset.csv')
print(data.head())

3. Exploratory Data Analysis (EDA):

print("Shape:", data.shape)
print("\nMissing Values:\n", data.isnull().sum())
print("\nClass Distribution:\n", data['sentiment'].value_counts())

# Visualize class distribution
sns.countplot(x='sentiment', data=data)
plt.title('Sentiment Class Distribution')
plt.show()

# Sample reviews
print("\nSample positive review:\n", data[data['sentiment'] == 'positive']['review'].iloc[0][:300])
print("\nSample negative review:\n", data[data['sentiment'] == 'negative']['review'].iloc[0][:300])

4. Data Cleaning - Strip HTML Tags:

def clean_text(text):
    text = re.sub(r'<.*?>', '', text)   # remove HTML tags like <br />
    text = text.lower().strip()         # lowercase and trim whitespace
    return text

data['review'] = data['review'].apply(clean_text)
print("Sample cleaned review:\n", data['review'].iloc[0][:300])

5. Encode Labels and Separate Features:

label_encoder = LabelEncoder()
data['sentiment'] = label_encoder.fit_transform(data['sentiment'])  # positive=1, negative=0

X = data['review'].values    # input: review text
y = data['sentiment'].values # output: 0 or 1

6. Tokenize and Pad Text Sequences:

vocab_size = 10000   # keep only top 10,000 most frequent words
max_length = 200     # truncate/pad all reviews to 200 words

tokenizer = Tokenizer(num_words=vocab_size, oov_token='<OOV>')  # <OOV> handles unknown words
tokenizer.fit_on_texts(X)                        # build word index from training text

sequences = tokenizer.texts_to_sequences(X)      # convert each word to its integer index
padded_sequences = pad_sequences(sequences, maxlen=max_length,
                                 padding='post', truncating='post')  # pad/truncate to fixed length

7. Split into Training and Testing Sets:

X_train, X_test, y_train, y_test = train_test_split(padded_sequences, y, test_size=0.2, random_state=42)

8. Build the Neural Network Model:

model = Sequential()
model.add(Embedding(vocab_size, 16))        # maps each word index to a 16-dim vector
model.add(GlobalAveragePooling1D())          # averages all word vectors into one vector
model.add(Dense(24, activation='relu'))      # hidden layer: 24 neurons
model.add(Dense(1, activation='sigmoid'))    # output: probability between 0 and 1 (binary)

model.summary()

9. Compile the Model:

# binary_crossentropy: standard loss for binary classification; sigmoid output
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

10. Train the Model:

history = model.fit(X_train, y_train, epochs=10, batch_size=32, validation_split=0.2)

11. Evaluate the Model on Test Data:

loss, accuracy = model.evaluate(X_test, y_test)
print(f"Test Loss: {loss:.4f}")
print(f"Test Accuracy: {accuracy*100:.2f}%")

12. Plot Training vs Validation Accuracy:

plt.plot(history.history['accuracy'], label='Training Accuracy')
plt.plot(history.history['val_accuracy'], label='Validation Accuracy')
plt.title('Model Accuracy Over Epochs')
plt.ylabel('Accuracy')
plt.xlabel('Epoch')
plt.legend()
plt.grid(True)
plt.show()

13. Plot Training vs Validation Loss:

plt.plot(history.history['loss'], label='Training Loss')
plt.plot(history.history['val_loss'], label='Validation Loss')
plt.title('Model Loss Over Epochs')
plt.ylabel('Loss')
plt.xlabel('Epoch')
plt.legend()
plt.grid(True)
plt.show()

14. Confusion Matrix and Classification Report:

y_pred = (model.predict(X_test) > 0.5).astype(int)  # threshold 0.5: prob > 0.5 = positive

cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=['Negative', 'Positive'],
            yticklabels=['Negative', 'Positive'])
plt.title('Confusion Matrix')
plt.ylabel('Actual')
plt.xlabel('Predicted')
plt.show()

print("\nClassification Report:\n", classification_report(y_test, y_pred, target_names=['Negative', 'Positive']))

Miscellaneous