# A10 - Data Visualization-3

---

## Pre-requisites

- In the same directory as this Jupyter notebook, create a text file (e.g., simple.txt) that contains some random text.

---

1. Import libraries

```python3
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import nltk
```

2. Open the text file and download the NLTK resources

```python3
file = open('simple.txt', 'r')
nltk.download('all')   # downloads every NLTK resource; individual packages can be fetched instead
print(file)            # prints the file object, not its contents
```

3. Read the content of the opened text file

```python3
content = file.read()
print(content)
```

4. Import the sentence tokenizer from the NLTK library

```python3
from nltk.tokenize import sent_tokenize
```

5. Tokenize the content into sentences and print the result

```python3
sentence = sent_tokenize(content)
print(sentence)
```

6. Use a regular expression tokenizer to extract words from the content

```python3
from nltk.tokenize import RegexpTokenizer

tokenizer = RegexpTokenizer(r"\w+")   # raw string so \w is not treated as an escape sequence
words = tokenizer.tokenize(content)
print(words)
```

7. Use a regular expression tokenizer to extract whitespace from the content

```python3
tokenizer = RegexpTokenizer(r"\s")
words = tokenizer.tokenize(content)
print(words)
```

8. Import stopwords and the word tokenizer from the NLTK library

```python3
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
```

9. Retrieve and print the set of English stopwords

```python3
stopWords = set(stopwords.words('english'))
print(stopWords)
```

10. Tokenize each sentence, filter out stopwords, and perform POS tagging on the filtered words

```python3
for sen in sentence:
    Words = word_tokenize(sen)
    filteredWords = [word.lower() for word in Words if word.lower() not in stopWords]
    print(f"Words without stopwords: {filteredWords}")
    print(f"Words with stopwords: {Words}")
    print(f"POS tagging: {nltk.pos_tag(filteredWords)}")
```

11. Print the POS tagging again; outside the loop, this only repeats the tags for the last processed sentence

```python3
print(f"POS tagging: {nltk.pos_tag(filteredWords)}")
```

12. Import stemming and lemmatization tools from the NLTK library

```python3
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
```

13. Apply stemming to each word of the last tokenized sentence and print the original and stemmed forms

```python3
stemmer = PorterStemmer()
for word in Words:
    print(f"{word} - After Stemming = {stemmer.stem(word)}")
```

14. Apply lemmatization to each word and print the original and lemmatized forms

```python3
lemmatizer = WordNetLemmatizer()
for word in Words:
    print(f"{word}: {lemmatizer.lemmatize(word)}")
```

15. Create a new single-document list by joining the first three sentences from the original content

```python3
sentence = sentence[:3]
new_sentence = [' '.join(sentence)]   # join with spaces so the sentences do not run together
new_sentence
```

16. Import the TfidfVectorizer from the sklearn library for text feature extraction

```python3
from sklearn.feature_extraction.text import TfidfVectorizer
```

17. Define a function that returns the TF-IDF matrix and feature names for a document collection

```python3
def calculate_tfIdf(document):
    vectorizer = TfidfVectorizer()
    tf_matrix = vectorizer.fit_transform(document)
    feature_names = vectorizer.get_feature_names_out()
    return tf_matrix, feature_names
```

18. Assign the newly created sentence list to the document variable for the TF-IDF calculation

```python3
document = new_sentence
```

19. Calculate and print the TF-IDF matrix and feature names for the document

```python3
tf_matrix, feature_names = calculate_tfIdf(document)
print('TF-IDF')
print(feature_names)
print(tf_matrix.toarray())
```

---
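As an optional follow-up (not part of the original steps), the results above can also be inspected visually and in tabular form. The sketch below is a minimal illustration that assumes the `words` list from step 6 and the `tf_matrix`/`feature_names` pair from step 19 are still in memory; it reuses the `pandas` and `matplotlib` imports from step 1, and the names `freq`, `labels`, `counts`, and `tfidf_df` are illustrative.

```python3
# Frequency plot of the ten most common tokens (assumes `words` from step 6)
freq = nltk.FreqDist(w.lower() for w in words)
labels, counts = zip(*freq.most_common(10))

plt.bar(labels, counts)
plt.xticks(rotation=45)
plt.xlabel('Word')
plt.ylabel('Frequency')
plt.title('Top 10 word frequencies')
plt.tight_layout()
plt.show()

# TF-IDF scores as a labelled table (assumes `tf_matrix` and `feature_names` from step 19)
tfidf_df = pd.DataFrame(tf_matrix.toarray(), columns=feature_names)
print(tfidf_df)
```

---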