DataScienceAndBigDataAnalytics/Codes/Code-A7 (Text Analytics).md


A7 - Text Analytics


Pre-requisites

  • In the same directory as this Jupyter notebook, create a text file (e.g. simple.txt) containing a few sentences of sample text; a minimal sketch for generating one is shown below.
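
If you don't already have such a file, this minimal sketch writes a few arbitrary sentences to simple.txt (the wording of the text is only an illustration):

sample_text=("Natural language processing studies how computers handle human language. "
             "Tokenization splits text into sentences and words. "
             "Stemming and lemmatization reduce words to their base forms.")
# Write the sample text next to the notebook so the steps below can read it.
with open('simple.txt','w') as f:
    f.write(sample_text)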

  1. Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import nltk
  2. Open a text file and download all NLTK resources
file=open('simple.txt','r')
nltk.download('all')
print(file)
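
nltk.download('all') fetches every corpus and model NLTK offers, which can take a while. Assuming the steps below are the only ones this notebook needs, a lighter alternative is to download just the relevant resources (nltk is already imported in step 1; exact package names can vary slightly between NLTK versions):

# Sentence/word tokenizers, English stopword list, POS tagger model, and WordNet for lemmatization.
for resource in ['punkt','stopwords','averaged_perceptron_tagger','wordnet']:
    nltk.download(resource)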
  3. Read the content of the opened text file
content=file.read()
print(content)
  4. Import the sentence tokenizer from the NLTK library
from nltk.tokenize import sent_tokenize
  5. Tokenize the content into sentences and print the result
sentence=sent_tokenize(content)
print(sentence)
  6. Use a regular expression tokenizer to extract words from the content
from nltk.tokenize import RegexpTokenizer
tokenizer=RegexpTokenizer(r"\w+")
words=tokenizer.tokenize(content)
print(words)
  7. Use a regular expression tokenizer to extract whitespace from the content
tokenizer=RegexpTokenizer(r"\s")
words=tokenizer.tokenize(content)
print(words)
  8. Import stopwords and the word tokenizer from the NLTK library
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
  9. Retrieve and print the set of English stopwords
stopWords=set(stopwords.words('english'))
print(stopWords)
  10. Tokenize each sentence, filter out stopwords, and perform POS tagging on the filtered words
for sen in sentence:
    Words=word_tokenize(sen)
    filteredWords=[word.lower() for word in Words if word.lower() not in stopWords]
    print(f"Words without stopwords: {filteredWords}")
    print(f"Words with stopwords: {Words}")
    print(f"POS tagging: {nltk.pos_tag(filteredWords)}")
  11. Print the POS tagging of the filtered words once more (this repeats the last iteration of the previous step, so the output covers only the final sentence)
print(f"POS tagging: {nltk.pos_tag(filteredWords)}")
  12. Import stemming and lemmatization tools from the NLTK library
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
  13. Apply stemming to each word of the last tokenized sentence and print the original and stemmed forms
stemmer=PorterStemmer()
for word in Words:
    print(f"{word} - after stemming: {stemmer.stem(word)}")
  14. Apply lemmatization to each word of the last tokenized sentence and print the original and lemmatized forms
lemmatizer=WordNetLemmatizer()
for word in Words:
    print(f"{word} - after lemmatization: {lemmatizer.lemmatize(word)}")
  15. Create a new text by joining the first three sentences from the original content (a space separator keeps the sentences from running together)
sentence=sentence[:3]
new_sentence=[' '.join(sentence)]
new_sentence
  16. Import the TfidfVectorizer from the sklearn library for text feature extraction
from sklearn.feature_extraction.text import TfidfVectorizer
  17. Define a function to calculate the TF-IDF matrix and feature names from a document
def calculate_tfIdf(document):
    # Fit the vectorizer on the document(s), then return the score matrix and the vocabulary.
    vectorizer=TfidfVectorizer()
    tf_matrix=vectorizer.fit_transform(document)
    feature_names=vectorizer.get_feature_names_out()
    return tf_matrix,feature_names
  18. Assign the newly created text to the document variable for TF-IDF calculation
document=new_sentence
  19. Calculate and print the TF-IDF matrix and feature names for the document
tf_matrix,feature_names=calculate_tfIdf(document)
print('TFIDF')
feature_names,tf_matrix.toarray()
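
The raw array is hard to read on its own. One convenient view, assuming the tf_matrix and feature_names from the previous step (and the pandas import from step 1), lays the scores out with one column per term:

# One row per document, one column per vocabulary term; cells hold TF-IDF scores.
tfidf_df=pd.DataFrame(tf_matrix.toarray(),columns=feature_names)
print(tfidf_df)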