DataScienceAndBigDataAnalytics/Codes/Code-A7 (Text Analytics).md


A7 - Text Analytics


Pre-requisites

  • In the same directory as this Jupyter notebook, create a text file (e.g. simple.txt) containing a few sentences of sample text; a minimal sketch for generating one is shown below.
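
If you don't already have such a file, this minimal sketch writes a few arbitrary sentences to simple.txt (the wording of the text is only an illustration):

sample_text=("Natural language processing studies how computers handle human language. "
             "Tokenization splits text into sentences and words. "
             "Stemming and lemmatization reduce words to their base forms.")
# Write the sample text next to the notebook so the steps below can read it.
with open('simple.txt','w') as f:
    f.write(sample_text)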

  1. Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import nltk
  2. Open a text file and download all NLTK resources
file=open('simple.txt','r')
nltk.download('all')
print(file)
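
nltk.download('all') fetches every corpus and model NLTK offers, which can take a while. Assuming the steps below are the only ones this notebook needs, a lighter alternative is to download just the relevant resources (nltk is already imported in step 1; exact package names can vary slightly between NLTK versions):

# Sentence/word tokenizers, English stopword list, POS tagger model, and WordNet for lemmatization.
for resource in ['punkt','stopwords','averaged_perceptron_tagger','wordnet']:
    nltk.download(resource)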
  3. Read the content of the opened text file
content=file.read()
print(content)
  4. Import the sentence tokenizer from the NLTK library
from nltk.tokenize import sent_tokenize
  5. Tokenize the content into sentences and print the result
sentence=sent_tokenize(content)
print(sentence)
  6. Use a regular expression tokenizer to extract words from the content
from nltk.tokenize import RegexpTokenizer
tokenizer=RegexpTokenizer(r"\w+")
words=tokenizer.tokenize(content)
print(words)
  7. Use a regular expression tokenizer to extract whitespace from the content
tokenizer=RegexpTokenizer(r"\s")
words=tokenizer.tokenize(content)
print(words)
  8. Import stopwords and the word tokenizer from the NLTK library
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
  9. Retrieve and print the set of English stopwords
stopWords=set(stopwords.words('english'))
print(stopWords)
  10. Tokenize each sentence, filter out stopwords, and perform POS tagging on the filtered words
for sen in sentence:
    Words=word_tokenize(sen)
    filteredWords=[word.lower() for word in Words if word.lower() not in stopWords]
    print(f"Words without stopwords: {filteredWords}")
    print(f"Words with stopwords: {Words}")
    print(f"POS tagging: {nltk.pos_tag(filteredWords)}")
  11. Print the POS tagging of the filtered words once more (this repeats the last iteration of the previous step, so the output covers only the final sentence)
print(f"POS tagging: {nltk.pos_tag(filteredWords)}")
  12. Import stemming and lemmatization tools from the NLTK library
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
  13. Apply stemming to each word of the last tokenized sentence and print the original and stemmed forms
stemmer=PorterStemmer()
for word in Words:
    print(f"{word} - after stemming: {stemmer.stem(word)}")
  14. Apply lemmatization to each word of the last tokenized sentence and print the original and lemmatized forms
lemmatizer=WordNetLemmatizer()
for word in Words:
    print(f"{word} - after lemmatization: {lemmatizer.lemmatize(word)}")
  15. Create a new text by joining the first three sentences from the original content (a space separator keeps the sentences from running together)
sentence=sentence[:3]
new_sentence=[' '.join(sentence)]
new_sentence
  16. Import the TfidfVectorizer from the sklearn library for text feature extraction
from sklearn.feature_extraction.text import TfidfVectorizer
  17. Define a function to calculate the TF-IDF matrix and feature names from a document
def calculate_tfIdf(document):
    # Fit the vectorizer on the document(s), then return the score matrix and the vocabulary.
    vectorizer=TfidfVectorizer()
    tf_matrix=vectorizer.fit_transform(document)
    feature_names=vectorizer.get_feature_names_out()
    return tf_matrix,feature_names
  18. Assign the newly created text to the document variable for TF-IDF calculation
document=new_sentence
  19. Calculate and print the TF-IDF matrix and feature names for the document
tf_matrix,feature_names=calculate_tfIdf(document)
print('TFIDF')
feature_names,tf_matrix.toarray()
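
The raw array is hard to read on its own. One convenient view, assuming the tf_matrix and feature_names from the previous step (and the pandas import from step 1), lays the scores out with one column per term:

# One row per document, one column per vocabulary term; cells hold TF-IDF scores.
tfidf_df=pd.DataFrame(tf_matrix.toarray(),columns=feature_names)
print(tfidf_df)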