NLP Record
WEEK-1
AIM: To install and explore the features of the NLTK and spaCy tools, and to download the wordcloud package and a few corpora.
Installation of NLTK:
1. Go to https://www.python.org/downloads/ and select the latest version for Windows.
2. Click on the downloaded file.
3. Select Install Now.
4. After the setup completes successfully, click Close.
5. In the Windows command prompt, navigate to the location of the pip folder.
6. Enter the command to install NLTK: pip install nltk
7. The installation should complete successfully.
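To verify the installation, NLTK can be imported from the Python interpreter (a quick sanity check; the version printed depends on the release installed):
import nltk
print(nltk.__version__)   # prints the installed version string, e.g. 3.8.1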
Features of NLTK:
Part of Speech tagging
Summarization
Sentiment Analysis
Emotion Detection
Language Detection
Customizable Models
Pre-Built Algorithms
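As a quick illustration of the Part of Speech tagging feature (a minimal sketch; it assumes the punkt and averaged_perceptron_tagger resources are downloaded first):
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
from nltk import word_tokenize, pos_tag
print(pos_tag(word_tokenize("NLTK makes part of speech tagging easy")))
# e.g. [('NLTK', 'NNP'), ('makes', 'VBZ'), ...]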
Installation of spaCy:
Go to the Command Prompt and enter the following commands to install spaCy and its English model.
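The standard commands (from the spaCy documentation; the second one downloads the small English pipeline used below) are:
pip install -U spacy
python -m spacy download en_core_web_sm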
Features of spaCy:
Part-of-Speech tagging
Morphology
Lemmatization
Dependency Parse
Named Entities
Tokenization
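A short sketch exercising several of these features at once (it assumes the en_core_web_sm pipeline installed above; the sample sentence is illustrative):
import spacy
nlp = spacy.load("en_core_web_sm")                 # load the small English pipeline
doc = nlp("Apple is looking at buying a U.K. startup for $1 billion")
for token in doc:
    # tokenization, POS tag, lemma and dependency label per token
    print(token.text, token.pos_, token.lemma_, token.dep_)
for ent in doc.ents:
    print(ent.text, ent.label_)                    # named entities, e.g. Apple ORG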
Installation of wordcloud:
We can install wordcloud using Anaconda by typing the following command in the Anaconda Prompt.
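The standard conda-forge command (pip install wordcloud is the equivalent outside Anaconda) is:
conda install -c conda-forge wordcloud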
Downloading NLTK Corpora:
import nltk
nltk.download()   # opens the NLTK Downloader window
A new window should open, showing the NLTK Downloader. Click on the File menu and select Change Download Directory. Next, select the packages or collections you want to download. After a successful installation, we can test that a corpus (here, the Brown corpus) has been installed as follows:
from nltk.corpus import brown   # import the downloaded corpus
brown.words()
Output:
['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', ...]
WEEK-2
AIM: i) Write a program to implement word tokenizer, sentence and paragraph tokenizer.
ii) Check how many words there are in any corpus. Also check how many distinct words there are.
DESCRIPTION:
Tokenization: Tokenization is the process by which a large quantity of text is divided into smaller parts called tokens. These tokens are very useful for finding patterns and are considered a base step for stemming and lemmatization. Tokenization also helps to substitute sensitive data elements with non-sensitive data elements.
The Natural Language Toolkit has an important module, nltk.tokenize, which comprises the sub-modules
1. word tokenize
2. sentence tokenize
We use the method word_tokenize() to split a sentence into words and sent_tokenize() to split text into sentences.
PROGRAM:
I)
import nltk
nltk.download('punkt')   # tokenizer models needed by word_tokenize/sent_tokenize
from nltk.tokenize import word_tokenize, sent_tokenize
text="Tokenization in NLP is the process by which a large quantity of text is divided into smaller parts called
tokens. Natural language processing is used for building applications such as Text classification, intelligent
chatbot, sentimental analysis, language translation, etc.Natural Language toolkit has very important module
NLTK tokenize sentence which further comprises of sub-modulesWe use the method word_tokenize() to split a
sentence into words. The output of word tokenizer in NLTK can be converted to Data Frame for better text
understanding in machine learning applications.Sub-module available for the above is sent_tokenize. Sentence
tokenizer in Python NLTK is an important feature for machine training."
print(word_tokenize(text))
OUTPUT:
['Tokenization', 'in', 'NLP', 'is', 'the', 'process', 'by', 'which', 'a', 'large', 'quantity', 'of', 'text', 'is', 'divided', 'into',
'smaller', 'parts', 'called', 'tokens', '.', 'Natural', 'language', 'processing', 'is', 'used', 'for', 'building', 'applications',
'such', 'as', 'Text', 'classification', ',', 'intelligent', 'chatbot', ',', 'sentimental', 'analysis', ',', 'language', 'translation',
',', 'etc.Natural', 'Language', 'toolkit', 'has', 'very', 'important', 'module', 'NLTK', 'tokenize', 'sentence', 'which',
'further', 'comprises', 'of', 'sub-modulesWe', 'use', 'the', 'method', 'word_tokenize', '(', ')', 'to', 'split', 'a', 'sentence',
'into', 'words', '.', 'The', 'output', 'of', 'word', 'tokenizer', 'in', 'NLTK', 'can', 'be', 'converted', 'to', 'Data', 'Frame',
'for', 'better', 'text', 'understanding', 'in', 'machine', 'learning', 'applications.Sub-module', 'available', 'for', 'the',
'above', 'is', 'sent_tokenize', '.', 'Sentence', 'tokenizer', 'in', 'Python', 'NLTK', 'is', 'an', 'important', 'feature', 'for',
'machine', 'training', '.']
print(sent_tokenize(text))
OUTPUT:
['Tokenization in NLP is the process by which a large quantity of text is divided into smaller parts called
tokens.', 'Natural language processing is used for building applications such as Text classification, intelligent
chatbot, sentimental analysis, language translation, etc.Natural Language toolkit has very important module
NLTK tokenize sentence which further comprises of sub-modulesWe use the method word_tokenize() to split a
sentence into words.', 'The output of word tokenizer in NLTK can be converted to Data Frame for better text
understanding in machine learning applications.Sub-module available for the above is sent_tokenize.', 'Sentence
tokenizer in Python NLTK is an important feature for machine training.']
II)
import nltk
nltk.download('movie_reviews')            # corpus used for counting
from nltk.corpus import movie_reviews
from nltk import FreqDist
w = movie_reviews.words()                 # all word tokens in the corpus
s = movie_reviews.sents()                 # all sentences in the corpus
count = FreqDist(w)                       # frequency distribution over tokens
count.N()                                 # total number of word tokens
OUTPUT:
1583820
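For the second part of the aim, the number of distinct words is the number of bins in the frequency distribution (a short follow-up using the same count object; count.B() is equivalent to len(count)):
count.B()   # number of distinct word types in the corpus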
WEEK-3
AIM: i) Write a program to implement both user-defined and pre-defined functions to generate
a. Uni-grams
b. Bi-grams
c. Tri-grams
d. N-grams
DESCRIPTION:
N-grams represent a contiguous sequence of N elements from a given text. In Natural Language Processing, N-grams usually refer to sequences of words, where n stands for the number of words you are looking for. The following types of N-grams are usually distinguished:
Unigram — An N-gram with simply one string inside (for example, it can be a unique word —
computer or human from a given sentence, e.g. Natural Language Processing is the ability of a
computer program to understand human language as it is spoken and written).
2-gram or Bigram — Typically a combination of two strings or words that appear in a document:
short-form video or video format will likely be a search result of bigrams in a certain corpus of texts
(and not format video or video short-form, as the word order remains the same).
3-gram or Trigram — An N-gram containing exactly three elements processed together (e.g. short-form
video format or new short-form video).
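Since the aim asks for user-defined functions as well, n-grams can also be generated without NLTK by sliding a window over the token list (a minimal sketch; the name generate_ngrams is illustrative):
def generate_ngrams(sentence, n):
    # split into tokens, then take every contiguous window of size n
    tokens = sentence.split()
    return [tuple(tokens[i:i+n]) for i in range(len(tokens) - n + 1)]

print(generate_ngrams("Sravanthi is a good girl", 2))
# [('Sravanthi', 'is'), ('is', 'a'), ('a', 'good'), ('good', 'girl')]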
PROGRAM:
import nltk
from nltk.util import ngrams   # pre-defined n-gram helper
sen = "Sravanthi is a good girl"
n=1
unigram = ngrams(sen.split(),n)
for i in unigram:
    print(i)
OUTPUT:
('Sravanthi',)
('is',)
('a',)
('good',)
('girl',)
n=2
bigram = ngrams(sen.split(),n)
for i in bigram:
    print(i)
OUTPUT:
('Sravanthi', 'is')
('is', 'a')
('a', 'good')
('good', 'girl')
n=3
trigram = ngrams(sen.split(),n)
for i in trigram:
    print(i)
OUTPUT:
('Sravanthi', 'is', 'a')
('is', 'a', 'good')
('a', 'good', 'girl')
def ngrams_convertor(sen,n):
    # wraps the pre-defined ngrams() helper for any value of n
    ngram = ngrams(sen.split(),n)
    for i in ngram:
        print(i)
n=4
ngrams_convertor(sen,n)
OUTPUT:
('Sravanthi', 'is', 'a', 'good')
('is', 'a', 'good', 'girl')