0% found this document useful (0 votes)

10 views1 page

Dsbda 7

The document discusses various natural language processing techniques including tokenization, stop word removal, stemming, lemmatization and part-of-speech tagging using NLTK and TextBlob. It also discusses creating word sets from documents, calculating term frequency-inverse document frequency for text representation.

Uploaded by

monaliauti2

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

10 views1 page

Dsbda 7

Uploaded by

monaliauti2

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 1

In [17]:

# Tokenization using NLTK

from nltk import word_tokenize, sent_tokenize
sent = "Tokenization refers to break down the text into smaller units ."
print(word_tokenize(sent))
print(sent_tokenize(sent))

['Tokenization', 'refers', 'to', 'break', 'down', 'the', 'text', 'into', 'smaller', 'units', '.']
['Tokenization refers to break down the text into smaller units .']

In [29]:
#Text to tokenize
text = "This is a tokenize test"

In [30]:
from nltk.tokenize import word_tokenize
word_tokenize(text)

['This', 'is', 'a', 'tokenize', 'test']

Out[30]:

In [19]:
#STOP WORD REMOVAL|
text = "S&P and NASDAQ are the two most popular indices in US"

In [20]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
stop_words = set(stopwords.words('english'))
text_tokens = word_tokenize(text)
tokens_without_sw= [word for word in text_tokens if not word in stop_words]

print(tokens_without_sw)

['S', '&', 'P', 'NASDAQ', 'two', 'popular', 'indices', 'US']

In [10]:
#steamming
text = "It's a Stemming testing"

In [11]:
parsed_text = word_tokenize(text)

In [22]:
# Initialize stemmer.
from nltk.stem.snowball import SnowballStemmer
stemmer = SnowballStemmer('english')

# Stem each word.

[(word, stemmer.stem(word)) for i, word in enumerate(parsed_text)
if word.lower() != stemmer.stem(parsed_text[i])]

[('Stemming', 'stem'), ('testing', 'test')]

Out[22]:

In [26]:
from nltk.stem import PorterStemmer

# create an object of class PorterStemmer

porter = PorterStemmer()
print(porter.stem("play"))
print(porter.stem("playing"))
print(porter.stem("plays"))
print(porter.stem("played"))

play
play
play
play

In [23]:
#lemmatizenation
text = "This world has a lot of faces "

In [24]:
from textblob import Word
parsed_data= TextBlob(text).words
parsed_data

WordList(['This', 'world', 'has', 'a', 'lot', 'of', 'faces'])

Out[24]:

In [25]:
[(word, word.lemmatize()) for i, word in enumerate(parsed_data)
if word != parsed_data[i].lemmatize()]

[('has', 'ha'), ('faces', 'face')]

Out[25]:

In [27]:
#pos Tagging
text = 'Google is looking at buying U.K. startup for $1 billion'

In [28]:
TextBlob(text).tags

[('Google', 'NNP'),
Out[28]:
('is', 'VBZ'),
('looking', 'VBG'),
('at', 'IN'),
('buying', 'VBG'),
('U.K.', 'NNP'),
('startup', 'NN'),
('for', 'IN'),
('1', 'CD'),
('billion', 'CD')]

In [36]:
import pandas as pd
import numpy as np

In [8]:
#CREATE WORD SET FOR CORPUS
corpus = ['data science is one of the most important fields of science',
'this is one of the best data science courses',
'data scientists analyze data' ]

In [9]:
words_set = set()

for doc in corpus:

words = doc.split(' ')
words_set = words_set.union(set(words))

print('Number of words in the corpus:',len(words_set))

print('The words in the corpus: \n', words_set)

Number of words in the corpus: 14

The words in the corpus:
{'the', 'analyze', 'courses', 'scientists', 'science', 'one', 'fields', 'best', 'important', 'is', 'most', 'this', 'data', 'of'}

In [12]:
import math
from collections import Counter

def calculate_tf(text):
words = text.split()
word_count = Counter(words)
total_words = len(words)
tf = {word: word_count[word] / total_words for word in word_count}
return tf

In [13]:
def calculate_idf(documents):
total_docs = len(documents)
idf = {}
for doc in documents:
words = set(doc.split())
for word in words:
idf[word] = idf.get(word, 0) + 1
for word in idf:
idf[word] = math.log(total_docs / (idf[word] + 1)) # Adding 1 to avoid division by zero
return idf

In [14]:
def calculate_tfidf(tf, idf):
tfidf = {word: tf[word] * idf.get(word, 0) for word in tf}
return tfidf

In [15]:
def represent_document(document, idf):
tf = calculate_tf(document)
tfidf = calculate_tfidf(tf, idf)
return tfidf

In [16]:
documents = [
"This is the first document.",
"This document is the second document.",
"And this is the third one.",
"Is this the first document?",
]

idf = calculate_idf(documents)
document_representation = represent_document(documents[0], idf)
print(document_representation)

{'This': 0.05753641449035617, 'is': 0.0, 'the': -0.044628710262841945, 'first': 0.05753641449035617, 'document.': 0.05753641449035617}

In [ ]:

Montessori For Babies Timeline
100% (3)
Montessori For Babies Timeline
2 pages
34 David VS Agbay and People
100% (5)
34 David VS Agbay and People
2 pages
The Effect of Slum On Property Values in Asaba Metropolis of Delta State
No ratings yet
The Effect of Slum On Property Values in Asaba Metropolis of Delta State
18 pages
NLP Projects
No ratings yet
NLP Projects
4 pages
SK NLP Practical (FS)
No ratings yet
SK NLP Practical (FS)
22 pages
Final_NLP_Lab_File
No ratings yet
Final_NLP_Lab_File
28 pages
DS 7
No ratings yet
DS 7
3 pages
DSBD 7 Ass
No ratings yet
DSBD 7 Ass
9 pages
Rajeev Mishra 20 SCSE1180087
No ratings yet
Rajeev Mishra 20 SCSE1180087
29 pages
Natural Language Processing
No ratings yet
Natural Language Processing
17 pages
x0 Process
No ratings yet
x0 Process
4 pages
Sahil NLP
No ratings yet
Sahil NLP
16 pages
NLP Lab Programs
No ratings yet
NLP Lab Programs
18 pages
Assignment No - 7
No ratings yet
Assignment No - 7
4 pages
NLP Lab Assignment 8
No ratings yet
NLP Lab Assignment 8
14 pages
Clean Data
No ratings yet
Clean Data
4 pages
NLP_Record(Weeks 1-12) (1)
No ratings yet
NLP_Record(Weeks 1-12) (1)
41 pages
Machine Learning For NLP: Vocabulary
No ratings yet
Machine Learning For NLP: Vocabulary
37 pages
ASTW RA03 PracticalManual
No ratings yet
ASTW RA03 PracticalManual
18 pages
3
No ratings yet
3
5 pages
Bling
No ratings yet
Bling
7 pages
NLP FinAL (1)
No ratings yet
NLP FinAL (1)
27 pages
Record
No ratings yet
Record
6 pages
Https Raw - Githubusercontent.com Joelgrus Data-Science-From-Scratch Master Code Natural Language Processing
No ratings yet
Https Raw - Githubusercontent.com Joelgrus Data-Science-From-Scratch Master Code Natural Language Processing
5 pages
NLP Lab1
No ratings yet
NLP Lab1
6 pages
7.TextAnalysis
No ratings yet
7.TextAnalysis
3 pages
NLP Lab
No ratings yet
NLP Lab
18 pages
NLP (1)
No ratings yet
NLP (1)
12 pages
NLP_record_2[10] (1)
No ratings yet
NLP_record_2[10] (1)
18 pages
LP Vi Manual
No ratings yet
LP Vi Manual
77 pages
NLP Practicals
No ratings yet
NLP Practicals
6 pages
Shubham Jade MSC It 31031420010 NLP Practical Journal
No ratings yet
Shubham Jade MSC It 31031420010 NLP Practical Journal
17 pages
DL_5
No ratings yet
DL_5
9 pages
Python Assignment 3
No ratings yet
Python Assignment 3
3 pages
NLP PRATICAL
No ratings yet
NLP PRATICAL
14 pages
HMM Pos Tagging 05
No ratings yet
HMM Pos Tagging 05
1 page
NLP Assignment(917722H031)
No ratings yet
NLP Assignment(917722H031)
18 pages
Soundarya 256 NLP Practs
No ratings yet
Soundarya 256 NLP Practs
14 pages
Lab1 IR
No ratings yet
Lab1 IR
14 pages
Aped For Fake News
No ratings yet
Aped For Fake News
6 pages
EX1
No ratings yet
EX1
6 pages
NLP Cheatsheet
100% (2)
NLP Cheatsheet
18 pages
Laboratory Manual: Faculty of Engineering and Technology Bachelor of Technology
No ratings yet
Laboratory Manual: Faculty of Engineering and Technology Bachelor of Technology
10 pages
Self Evaluation Exercises (1)
No ratings yet
Self Evaluation Exercises (1)
12 pages
Bag of Words 03 and 04 Model
No ratings yet
Bag of Words 03 and 04 Model
4 pages
Untitled
No ratings yet
Untitled
3 pages
NLP Using Python
No ratings yet
NLP Using Python
4 pages
NLP_crecord_mid2
No ratings yet
NLP_crecord_mid2
36 pages
C24064_NLP_LAB MANUAL
No ratings yet
C24064_NLP_LAB MANUAL
28 pages
NLP - Practical List
No ratings yet
NLP - Practical List
14 pages
Sumati
No ratings yet
Sumati
10 pages
C3_W1
No ratings yet
C3_W1
71 pages
AP19110010110 Lab Assignment-2 - Jupyter Notebook
No ratings yet
AP19110010110 Lab Assignment-2 - Jupyter Notebook
18 pages
NLP LAB_MANUAL (1)
No ratings yet
NLP LAB_MANUAL (1)
33 pages
20BCP112 - NLP Lab - LAB - Manual
No ratings yet
20BCP112 - NLP Lab - LAB - Manual
65 pages
Natural Language Processing Lab Manual
No ratings yet
Natural Language Processing Lab Manual
24 pages
Module III
No ratings yet
Module III
42 pages
NLP Lab Manual (1)
No ratings yet
NLP Lab Manual (1)
19 pages
R22 Nlp Python Programs
No ratings yet
R22 Nlp Python Programs
15 pages
a7 dsbda sana
No ratings yet
a7 dsbda sana
15 pages
1st
No ratings yet
1st
3 pages
Profound Python Data Science
From Everand
Profound Python Data Science
Onder Teker
No ratings yet
ch-4 Agriculture Part-1
100% (1)
ch-4 Agriculture Part-1
4 pages
Xi Science and Social:: English Subject of Poris Indah School
No ratings yet
Xi Science and Social:: English Subject of Poris Indah School
39 pages
The Last Leaf Notes 2021-22
No ratings yet
The Last Leaf Notes 2021-22
4 pages
Reformed Apologetics Ministries - A Critique of Shabir Ally's Debate Tactics Pt. 1b
No ratings yet
Reformed Apologetics Ministries - A Critique of Shabir Ally's Debate Tactics Pt. 1b
5 pages
Sociolinguistics
No ratings yet
Sociolinguistics
5 pages
Blackout Poetry
No ratings yet
Blackout Poetry
3 pages
Labor Code of The Philippines
No ratings yet
Labor Code of The Philippines
2 pages
OLEANDER BAN
No ratings yet
OLEANDER BAN
4 pages
Woman Muslim Leaders
No ratings yet
Woman Muslim Leaders
195 pages
PGP Dsba Brochure
No ratings yet
PGP Dsba Brochure
18 pages
Lesson 5 The Immune System
No ratings yet
Lesson 5 The Immune System
30 pages
Letter Requesting Partnership With An Organization (AutoRecovered)
No ratings yet
Letter Requesting Partnership With An Organization (AutoRecovered)
21 pages
Classification of Credits Under The Civil Code
100% (2)
Classification of Credits Under The Civil Code
6 pages
Java How to Program Early Objects 10th Edition Deitel Test Bankinstant download
100% (8)
Java How to Program Early Objects 10th Edition Deitel Test Bankinstant download
44 pages
Treasure Hunters OSR
75% (4)
Treasure Hunters OSR
44 pages
VHDL Notes PDF
100% (1)
VHDL Notes PDF
10 pages
BUS 5411 - Written Assignment Week 2
100% (1)
BUS 5411 - Written Assignment Week 2
3 pages
Engleski Jezik Vjezbe
100% (1)
Engleski Jezik Vjezbe
3 pages
ICT HE Mozambique
No ratings yet
ICT HE Mozambique
18 pages
Viner 2005 Forensic Radiography in South Africa
No ratings yet
Viner 2005 Forensic Radiography in South Africa
24 pages
Chap 003
No ratings yet
Chap 003
76 pages
Violence and Harassment in The Workplace
No ratings yet
Violence and Harassment in The Workplace
6 pages
[Ebooks PDF] download Managing Sport Facilities and Major Events Hans Westerbeek full chapters
100% (9)
[Ebooks PDF] download Managing Sport Facilities and Major Events Hans Westerbeek full chapters
50 pages
Dormant Account Form 27 August 2021
No ratings yet
Dormant Account Form 27 August 2021
1 page
SG4-Focus On Classroom
No ratings yet
SG4-Focus On Classroom
5 pages
Solution Manual for College Physics 7th Edition by Wilson - Available With All Chapters For Instant Download
100% (16)
Solution Manual for College Physics 7th Edition by Wilson - Available With All Chapters For Instant Download
51 pages
SCM 012-130 Sae
No ratings yet
SCM 012-130 Sae
12 pages