# A10 - Data Visualization-3
---
## Pre-requisites
- In the same directory as this Jupyter notebook, create a text file (e.g. `simple.txt`) that contains some random text; a sketch for generating such a file follows below.
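The file can also be generated from Python. The snippet below is only a sketch: both the filename `simple.txt` and the sample sentences are arbitrary placeholders, so adjust them as needed.

```python3
# Write a small sample file next to the notebook (contents are arbitrary).
with open('simple.txt', 'w') as f:
    f.write("Natural language processing is fun. "
            "Tokenization splits text into sentences and words. "
            "Stopwords are removed before further analysis.")
```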
---
1. Import libraries
```python3
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import nltk
```
2. Open a text file and download all NLTK resources
```python3
file=open('simple.txt','r')
nltk.download('all')   # fetches every NLTK corpus and model (large, one-time download)
print(file)
```
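`nltk.download('all')` pulls every corpus and model NLTK offers, which can take a long time. If that is a concern, a lighter alternative (an assumption on my part, not part of the original assignment) is to download only the resources the later steps actually use. Resource names occasionally differ between NLTK releases (newer versions, for instance, also ship a `punkt_tab` variant), so treat this list as a sketch rather than a definitive set.

```python3
import nltk

# Only the resources needed by the steps below.
nltk.download('punkt')                        # sentence / word tokenizer models
nltk.download('stopwords')                    # English stopword list
nltk.download('averaged_perceptron_tagger')   # tagger behind nltk.pos_tag
nltk.download('wordnet')                      # lexicon used by WordNetLemmatizer
```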
3. Read the content of the opened text file
```python3
content=file.read()
print(content)
```
4. Import the sentence tokenizer from the NLTK library
```python3
from nltk.tokenize import sent_tokenize
```
5. Tokenize the content into sentences and print the result
```python3
sentence=sent_tokenize(content)   # list of sentences
print(sentence)
```
6. Use a regular expression tokenizer to extract words from the content
```python3
from nltk.tokenize import RegexpTokenizer

tokenizer=RegexpTokenizer(r"\w+")   # one token per run of word characters
words=tokenizer.tokenize(content)
print(words)
```
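To see how the `\w+` pattern differs from NLTK's default word tokenizer, a quick comparison on a throwaway string can help. The sample sentence below is made up, and `word_tokenize` assumes the tokenizer models downloaded in step 2.

```python3
from nltk.tokenize import RegexpTokenizer, word_tokenize

sample="Dr. Smith isn't here; she left at 5 p.m."
print(RegexpTokenizer(r"\w+").tokenize(sample))   # punctuation dropped, contractions split on the apostrophe
print(word_tokenize(sample))                      # punctuation and clitics ("n't") become separate tokens
```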
7. Use a regular expression tokenizer to extract whitespace from the content
```python3
tokenizer=RegexpTokenizer(r"\s")   # one token per whitespace character
words=tokenizer.tokenize(content)
print(words)
```
8. Import stopwords and word tokenizer from the NLTK library
```python3
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
```
9. Retrieve and print the set of English stopwords
```python3
stopWords=set(stopwords.words('english'))
print(stopWords)
```
10. Tokenize each sentence, filter out stopwords, and perform POS tagging on the filtered words
```python3
for sen in sentence:
    Words=word_tokenize(sen)
    filteredWords=[word.lower() for word in Words if word.lower() not in stopWords]
    print(f"Words without stopwords: {filteredWords}")
    print(f"Words with stopwords: {Words}")
    print(f"POS tagging: {nltk.pos_tag(filteredWords)}")
```
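The tags produced by `nltk.pos_tag` follow the Penn Treebank tagset. As a side note (not part of the assignment), NLTK can print the definition of any tag; this lookup relies on the tagset help data, which the `nltk.download('all')` call in step 2 already includes.

```python3
# Look up what an individual Penn Treebank tag means, e.g. 'NN' or 'VBZ'.
nltk.help.upenn_tagset('NN')
```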
11. Print the POS tagging of the filtered words again (redundant; `filteredWords` only holds the tokens of the last sentence from the loop above)
```python3
print(f"POS tagging: {nltk.pos_tag(filteredWords)}")   # reflects only the last sentence processed above
```
12. Import stemming and lemmatization tools from the NLTK library
```python3
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
```
13. Apply stemming to each word and print the original and stemmed forms
```python3
stemmer=PorterStemmer()
for word in Words:   # Words still holds the tokens of the last sentence from step 10
    print(f"{word} - After Stemming = {stemmer.stem(word)}")
```
14. Apply lemmatization to each word and print the original and lemmatized forms
```python3
lemmatizer=WordNetLemmatizer()
for word in Words:
    print(f"{word}: {lemmatizer.lemmatize(word)}")
```
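Stemming chops suffixes heuristically, while lemmatization maps a word to a dictionary form (by default treating it as a noun). A small comparison on a few hand-picked words, chosen here purely for illustration, makes the difference visible:

```python3
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer=PorterStemmer()
lemmatizer=WordNetLemmatizer()
for w in ["studies", "running", "geese", "better"]:
    print(f"{w}: stem={stemmer.stem(w)}, lemma={lemmatizer.lemmatize(w)}")

# Supplying a part of speech can change the lemma, e.g. treating "better" as an adjective:
print(lemmatizer.lemmatize("better", pos="a"))
```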
15. Create a new, single document by joining the first three sentences from the original content
```python3
sentence=sentence[:3]               # keep only the first three sentences
new_sentence=[' '.join(sentence)]   # a one-element list: one document for TF-IDF
new_sentence
```
16. Import the TfidfVectorizer from the sklearn library for text feature extraction
```python3
from sklearn.feature_extraction.text import TfidfVectorizer
```
17. Define a function to calculate the TF-IDF matrix and feature names from a document
```python3
def calculate_tfIdf(document):
    vectorizer=TfidfVectorizer()
    tf_matrix=vectorizer.fit_transform(document)       # sparse matrix: one row per document, one column per term
    feature_names=vectorizer.get_feature_names_out()
    return tf_matrix,feature_names
```
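As a quick sanity check of the helper (on a made-up two-document corpus, not the assignment data), the returned matrix has one row per document and one column per vocabulary term, and words shared by both documents receive a lower idf weight:

```python3
# Illustrative only: a tiny corpus to inspect the shape and contents of the result.
toy_docs=["the cat sat on the mat", "the dog sat on the log"]
toy_matrix,toy_features=calculate_tfIdf(toy_docs)
print(toy_features)          # vocabulary, one entry per column
print(toy_matrix.shape)      # (2 documents, vocabulary size)
print(toy_matrix.toarray())  # TF-IDF weights per document
```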
18. Assign the newly created sentence to the document variable for TF-IDF calculation
```python3
document=new_sentence
```
19. Calculate and print the TF-IDF matrix and feature names for the document
```python3
tf_matrix,feature_names=calculate_tfIdf(document)
print('TFIDF')
print(feature_names)
print(tf_matrix.toarray())
```
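For a more readable view (optional, and not part of the original steps), the scores can be placed in a pandas DataFrame, since pandas was already imported in step 1:

```python3
# One row per document, one column per term.
tfidf_table=pd.DataFrame(tf_matrix.toarray(), columns=feature_names)
print(tfidf_table)
```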
---