NLP Record



WEEK-1
AIM: Installation and exploration of the features of the NLTK and spaCy tools. Download WordCloud and a few corpora.

Installation of NLTK:
1. Go to https://www.python.org/downloads/ and select the latest version for Windows.
2. Run the downloaded installer.
3. Select Install Now.
4. After the setup completes successfully, click Close.
5. In the Windows command prompt, navigate to the location of the pip folder.
6. Enter the command to install NLTK: pip install nltk
7. The installation should complete successfully.

Features of NLTK:
Part of Speech tagging

Summarization

Named Entity Recognition

Sentiment Analysis

Emotion Detection

Language Detection

Data Ingestion and Wrangling

Programming Language Support

Drag and Drop

Customizable Models

Pre-Built Algorithms

Installation of spaCy:
Open the Command Prompt and enter the following commands to install spaCy.

pip install -U pip setuptools wheel

pip install -U spacy

python -m spacy download en_core_web_sm

Features of spaCy:
Parts of Speech tagging

Morphology

Lemmatization

Dependency Parse

Named Entities

Tokenization

Merging and Splitting



Sentence Segmentation, etc.

Downloading WordCloud:


Using Command Prompt
WordCloud can be installed on our system by running the given command in the command prompt.

$ pip install wordcloud

Using Anaconda
We can install WordCloud with Anaconda by typing the following command in the Anaconda Prompt.

$ conda install -c conda-forge wordcloud

Downloading a Few Corpora


We can install corpora using the Python interpreter.

Run the Python interpreter and type the following commands:

import nltk

nltk.download()

A new window should open, showing the NLTK Downloader. Click on the File menu and select Change
Download Directory. Next, select the packages or collections you want to download. After successful
installation, we can test that the data has been installed as follows:

from nltk.corpus import brown

brown.words()

Output:

['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', ...]



WEEK-2
AIM: i) Write a program to implement a word tokenizer, and sentence and paragraph tokenizers.

ii) Check how many words there are in a corpus. Also check how many distinct words there are.

DESCRIPTION:

Tokenization: Tokenization is the process by which a large quantity of text is divided into smaller parts called
tokens. These tokens are very useful for finding patterns and are considered a base step for stemming and
lemmatization. Tokenization also helps to substitute sensitive data elements with non-sensitive data elements.

The Natural Language Toolkit has a very important module, nltk.tokenize, which comprises the sub-modules

1. word tokenize
2. sentence tokenize

We use the method word_tokenize() to split a sentence into words and sent_tokenize() to tokenize into
sentences.
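Before turning to NLTK, the idea behind both tokenizers can be sketched in plain Python. The following is a rough, library-free illustration using only the standard re module; the function names simple_word_tokenize and simple_sent_tokenize are our own, and real tokenizers such as NLTK's handle many more edge cases (abbreviations, quotes, ellipses).

```python
import re

def simple_word_tokenize(text):
    # Keep runs of word characters as tokens, and split off each
    # punctuation mark as its own token.
    return re.findall(r"\w+|[^\w\s]", text)

def simple_sent_tokenize(text):
    # Split on whitespace that follows sentence-final punctuation.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

sample = "Tokenization splits text into tokens. It is a base step for stemming."
print(simple_word_tokenize(sample))
print(simple_sent_tokenize(sample))
```

This naive version mis-handles abbreviations like "Dr." or "e.g.", which is exactly why NLTK's trained Punkt tokenizer is preferred in practice.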

PROGRAM:

I)
import nltk

nltk.download('punkt')  # download the Punkt tokenizer models; note the parentheses -
                        # writing nltk.download without them only returns the bound method

from nltk import word_tokenize, sent_tokenize

text="Tokenization in NLP is the process by which a large quantity of text is divided into smaller parts called
tokens. Natural language processing is used for building applications such as Text classification, intelligent
chatbot, sentimental analysis, language translation, etc.Natural Language toolkit has very important module
NLTK tokenize sentence which further comprises of sub-modulesWe use the method word_tokenize() to split a
sentence into words. The output of word tokenizer in NLTK can be converted to Data Frame for better text
understanding in machine learning applications.Sub-module available for the above is sent_tokenize. Sentence
tokenizer in Python NLTK is an important feature for machine training."

print(word_tokenize(text))

OUTPUT:

['Tokenization', 'in', 'NLP', 'is', 'the', 'process', 'by', 'which', 'a', 'large', 'quantity', 'of', 'text', 'is', 'divided', 'into',
'smaller', 'parts', 'called', 'tokens', '.', 'Natural', 'language', 'processing', 'is', 'used', 'for', 'building', 'applications',
'such', 'as', 'Text', 'classification', ',', 'intelligent', 'chatbot', ',', 'sentimental', 'analysis', ',', 'language', 'translation',
',', 'etc.Natural', 'Language', 'toolkit', 'has', 'very', 'important', 'module', 'NLTK', 'tokenize', 'sentence', 'which',
'further', 'comprises', 'of', 'sub-modulesWe', 'use', 'the', 'method', 'word_tokenize', '(', ')', 'to', 'split', 'a', 'sentence',
'into', 'words', '.', 'The', 'output', 'of', 'word', 'tokenizer', 'in', 'NLTK', 'can', 'be', 'converted', 'to', 'Data', 'Frame',
'for', 'better', 'text', 'understanding', 'in', 'machine', 'learning', 'applications.Sub-module', 'available', 'for', 'the',
'above', 'is', 'sent_tokenize', '.', 'Sentence', 'tokenizer', 'in', 'Python', 'NLTK', 'is', 'an', 'important', 'feature', 'for',
'machine', 'training', '.']

print(sent_tokenize(text))

OUTPUT:

['Tokenization in NLP is the process by which a large quantity of text is divided into smaller parts called
tokens.', 'Natural language processing is used for building applications such as Text classification, intelligent
chatbot, sentimental analysis, language translation, etc.Natural Language toolkit has very important module
NLTK tokenize sentence which further comprises of sub-modulesWe use the method word_tokenize() to split a
sentence into words.', 'The output of word tokenizer in NLTK can be converted to Data Frame for better text
understanding in machine learning applications.Sub-module available for the above is sent_tokenize.', 'Sentence
tokenizer in Python NLTK is an important feature for machine training.']

II)

from nltk.corpus import movie_reviews

w = movie_reviews.words()

s = movie_reviews.sents()

from nltk.probability import FreqDist

count = FreqDist(w)

count.N()

OUTPUT:

1583820

print("No of Words in movie_reviews corpus are :",len(w))

print("No of Distinct words in movie_reviews corpus are : ",len(set(w)))

OUTPUT:

No of Words in movie_reviews corpus are : 1583820

No of Distinct words in movie_reviews corpus are : 39768
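The same total/distinct counts can be computed without NLTK using the standard library's collections.Counter, which works like FreqDist above. The small words list here is made-up sample data, not the movie_reviews corpus.

```python
from collections import Counter

# Toy token list standing in for a corpus word list.
words = ["the", "movie", "was", "the", "best", "movie"]

counts = Counter(words)

print("Total words:", sum(counts.values()))   # 6 tokens in all
print("Distinct words:", len(counts))         # 4 unique tokens
print("Most common:", counts.most_common(2))
```

len(set(words)) gives the same distinct count; Counter additionally records each token's frequency, matching what FreqDist.N() and len(FreqDist) report for a real corpus.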



WEEK-3
AIM: i) Write a program to implement both user-defined and pre-defined functions to generate

a. Uni-grams
b. Bi-grams
c. Tri-grams
d. N-grams

DESCRIPTION:

N-grams represent a contiguous sequence of N elements from a given text. In Natural Language
Processing, N-grams commonly refer to sequences of words, where N stands for the number of words you are
looking for. The following types of N-grams are usually distinguished:

- Unigram: an N-gram with just one string inside (for example, a single word such as "computer" or
  "human" from a given sentence, e.g. "Natural Language Processing is the ability of a computer
  program to understand human language as it is spoken and written").
- 2-gram or Bigram: typically a combination of two strings or words that appear in a document:
  "short-form video" or "video format" would likely be bigrams found in a certain corpus of texts
  (and not "format video" or "video short-form", as the word order is preserved).
- 3-gram or Trigram: an N-gram containing three elements processed together (e.g. "short-form
  video format" or "new short-form video").
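The sliding-window idea behind N-grams can be expressed in one line of plain Python with zip(), without NLTK. The function name make_ngrams is our own; for whitespace-split tokens it produces the same tuples as nltk.util.ngrams.

```python
def make_ngrams(tokens, n):
    # Zip n staggered views of the token list: tokens[0:], tokens[1:], ...
    # Each resulting tuple is one window of n consecutive tokens.
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = "Natural Language Processing is fun".split()
print(make_ngrams(tokens, 1))  # unigrams
print(make_ngrams(tokens, 2))  # bigrams
print(make_ngrams(tokens, 3))  # trigrams
```

Because zip() stops at the shortest view, a sentence of k tokens yields exactly k - n + 1 N-grams, which matches the counts in the NLTK outputs below.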

PROGRAM:

import nltk

from nltk.util import ngrams

n = 1

sen = "Sravanthi is a good girl"

unigram = ngrams(sen.split(), n)

for i in unigram:
    print(i)

OUTPUT:

('Sravanthi',)

('is',)

('a',)

('good',)

('girl',)

n = 2

sen = "Sravanthi is a good girl"

bigram = ngrams(sen.split(), n)

for i in bigram:
    print(i)

OUTPUT:

('Sravanthi', 'is')

('is', 'a')

('a', 'good')

('good', 'girl')

n = 3

sen = "Sravanthi is a good girl"

trigram = ngrams(sen.split(), n)

for i in trigram:
    print(i)

OUTPUT:

('Sravanthi', 'is', 'a')

('is', 'a', 'good')

('a', 'good', 'girl')

def ngrams_convertor(sen, n):
    ngram = ngrams(sen.split(), n)
    for i in ngram:
        print(i)

sen = "Sravanthi is a good girl"

n = 4

ngrams_convertor(sen, n)

OUTPUT:

('Sravanthi', 'is', 'a', 'good')

('is', 'a', 'good', 'girl')
