
p4

The document outlines a practical assignment focused on data preprocessing using the NLTK library in Python, including tasks such as tokenization, stemming, part-of-speech tagging, stop words removal, and spell checking. It specifies objectives, expected outcomes, prerequisites, and provides a detailed program logic flow for implementing these tasks. Additionally, it includes safety precautions, resources required, assessment rubrics, and references for further learning.

Uploaded by

Hetal Vasava
Copyright
© All Rights Reserved

Foundation of AI and ML (4351601)

Date: ……………
Practical No. 4: Perform the following data preprocessing on a text/paragraph using
the NLTK library:
a. Write a Python program to tokenize words sentence-wise.
b. Write a Python program that accepts a list of tokenized words and
stems each one to its root word.
c. Write a Python program to identify the part of speech of each word
in the text.
d. Write a Python NLTK program to remove stop words from a given text.
e. Write a Python program to identify and correct misspelled
words in a given text, such as an essay or a letter.

A. Objective: Learn data preprocessing with the NLTK library by writing Python
programs.
B. Expected Program Outcomes (POs): PO1, PO2, PO3, PO4, PO5, PO6, PO7
C. Expected Skills to be developed based on competency:
• Able to apply data preprocessing on text/paragraph using NLTK library.

D. Expected Course Outcomes (COs)
CO4
E. Practical Outcome(PRo)
The program demonstrates NLTK usage by tokenizing the example text and printing
each tokenized sentence.
F. Expected Affective domain Outcome(ADos)
Follow ethical practices
G. Prerequisite Theory:
NLTK (Natural Language Toolkit) is a widely used library for natural language
processing (NLP) in Python. It provides a wide range of functionalities and
resources for tasks such as tokenization, stemming, part-of-speech tagging,
syntactic and semantic analysis, and much more. Here is an overview of the key
components and capabilities of NLTK:
• Tokenization: NLTK offers tokenization functions to break down text into
  individual words or sentences. It provides methods like word_tokenize() and
  sent_tokenize() to split text accordingly.
• Stemming: Stemming is the process of reducing words to their base or root
  form. NLTK includes several stemmers, such as the Porter Stemmer and the
  Snowball Stemmer, which can be used to perform stemming operations on
  words.
• Part-of-Speech (POS) Tagging: POS tagging assigns grammatical tags to words
  based on their context and role in a sentence. NLTK provides the pos_tag()
  function, which uses pre-trained models to identify and tag the part of speech
  for each word in a given text.
• Stop Words Removal: Stop words are common words like "a," "the," "and,"
  etc., that often carry little or no meaningful information. NLTK includes a
  corpus of stop words for various languages. You can use this corpus to filter
  out stop words from your text and focus on more relevant words.
• Named Entity Recognition (NER): NLTK offers NER capabilities to identify and
  classify named entities in text, such as names of persons, organizations,
  locations, and other specified categories.
• Syntax and Semantic Analysis: NLTK provides tools for syntactic and semantic
  analysis, including parsing algorithms, semantic role labeling, and semantic
  similarity calculations.
• WordNet: NLTK integrates WordNet, a large lexical database of English words,
  which provides a rich resource for semantic relationships, synsets (groups of
  synonymous words), and definitions. You can use WordNet to perform tasks
  like word sense disambiguation or synonym expansion.
• Machine Learning Integration: NLTK facilitates the integration of machine
  learning algorithms for various NLP tasks. It provides support for feature
  extraction, classification, clustering, and other machine learning techniques.
• Corpora and Language Resources: NLTK includes numerous pre-processed
  corpora and language resources for tasks like sentiment analysis, text
  classification, language modeling, and more. These resources can be
  leveraged to train models and perform evaluations.
NLTK is highly extensible and allows users to customize and extend its
functionality to suit their requirements. It is widely used by researchers,
students, and professionals in the NLP field because of its comprehensive set of
tools, extensive documentation, and active community support. Overall, NLTK is
a powerful toolkit for a range of NLP tasks and a good starting point for
developing NLP applications in Python.
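
To make the first few capabilities concrete, the sketch below walks one short text through sentence splitting, word tokenization, stop-word removal, and stemming using only the Python standard library. It is a teaching stand-in, not NLTK: the regular expressions are far cruder than sent_tokenize() and word_tokenize(), the suffix stripping is not the Porter algorithm, and STOP_WORDS, SUFFIXES, and all function names are hypothetical choices made for this illustration.

```python
import re

# Hypothetical, deliberately tiny resources for illustration only.
STOP_WORDS = {"a", "an", "the", "and", "is", "are", "in", "on", "of", "to"}
SUFFIXES = ("ing", "ed", "es", "s")  # tried in this order


def split_sentences(text):
    """Naive split: break after '.', '!' or '?' when whitespace follows."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def split_words(sentence):
    """Return words (optionally with one apostrophe part), numbers, and
    single punctuation marks as separate tokens."""
    return re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?|\d+|[^\sA-Za-z\d]", sentence)


def crude_stem(word):
    """Lower-case the word and strip one known suffix, provided at least
    three characters remain. This is NOT the Porter algorithm."""
    lower = word.lower()
    for suffix in SUFFIXES:
        if lower.endswith(suffix) and len(lower) - len(suffix) >= 3:
            return lower[: -len(suffix)]
    return lower


def preprocess(text):
    """Process the text sentence by sentence: keep tokens that start with a
    letter, drop stop words, and stem what is left."""
    result = []
    for sentence in split_sentences(text):
        words = [w for w in split_words(sentence) if w[0].isalpha()]
        kept = [w for w in words if w.lower() not in STOP_WORDS]
        result.append([crude_stem(w) for w in kept])
    return result


print(preprocess("The cats are playing in the garden. Dogs barked!"))
# [['cat', 'play', 'garden'], ['dog', 'bark']]
```

The output keeps one list per sentence, which is the shape task (a) asks for. Note that a stem is only a truncated form and need not be a dictionary word; real stemmers apply many more rules than this four-suffix list.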

Explore more at the following links:

1. https://www.nltk.org/
2. https://realpython.com/nltk-nlp-python/

H. Experimental setup / Program logic flow chart:

Here is the program logic for a Python program that uses NLTK for various NLP tasks:
1. Import the necessary modules and libraries:
   • nltk for NLP functionality
   • Specific classes such as PorterStemmer or WordNetLemmatizer for word
     stemming or lemmatization
2. Define functions for each task:
   • Tokenization:
     - Use word_tokenize() to tokenize words
     - Use sent_tokenize() to tokenize sentences
   • Word Stemming:
     - Initialize a stemmer object (e.g., PorterStemmer())
     - Use the stemmer's stem() method to stem each word
   • Part-of-Speech (POS) Tagging:
     - Use pos_tag() to get POS tags for each word
   • Stop Words Removal:
     - Use stopwords.words() to get a list of stop words for a specific
       language
     - Filter out the stop words from the tokenized words
   • Misspelled Words Correction:
     - Initialize a spell checker object (e.g., SpellChecker() from the
       third-party pyspellchecker package)
     - Use the spell checker's correction() method to correct
       misspelled words
3. Get user input or load text from a file.
4. Perform the desired NLP tasks:
   • Tokenize the text into words and sentences.
   • Stem or lemmatize the words if required.
   • Perform POS tagging on the words.
   • Remove stop words from the text.
   • Correct any misspelled words in the text.
5. Display the results or store them for further processing.

Here is an example program structure that incorporates these steps:

import nltk
from nltk import pos_tag
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import sent_tokenize, word_tokenize
from spellchecker import SpellChecker  # third-party package: pyspellchecker

# One-time download of the NLTK data used below. Newer NLTK releases may ask
# for differently named resources; use the name given in the LookupError.
# nltk.download("punkt")
# nltk.download("stopwords")
# nltk.download("averaged_perceptron_tagger")


# Tokenization
def tokenize_words(text):
    return word_tokenize(text)


def tokenize_sentences(text):
    return sent_tokenize(text)


# Word Stemming
def stem_words(words):
    stemmer = PorterStemmer()
    return [stemmer.stem(word) for word in words]


# POS Tagging
def identify_pos(words):
    return pos_tag(words)


# Stop Words Removal
def remove_stop_words(words):
    stop_words = set(stopwords.words("english"))
    return [word for word in words if word.lower() not in stop_words]


# Misspelled Words Correction
def correct_spelling(words):
    spell = SpellChecker()
    # correction() may return None when it finds no candidate,
    # so fall back to the original word in that case.
    return [spell.correction(word) or word for word in words]


# Example usage
text = "This is an example sentence. And here's another one!"

# Tokenization
words = tokenize_words(text)
sentences = tokenize_sentences(text)

# Word Stemming
stemmed_words = stem_words(words)

# POS Tagging
pos_tags = identify_pos(words)

# Stop Words Removal
filtered_words = remove_stop_words(words)

# Misspelled Words Correction
corrected_words = correct_spelling(words)

# Display the results
print("Tokenized words:", words)
print("Tokenized sentences:", sentences)
print("Stemmed words:", stemmed_words)
print("POS tags:", pos_tags)
print("Filtered words:", filtered_words)
print("Corrected words:", corrected_words)
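
The spell-correction step above hands every token, including punctuation and correctly spelled words, to the checker. The sketch below shows the underlying idea on a small scale: leave known words and non-alphabetic tokens alone, and replace an unknown word with its closest match from a word list. It uses difflib from the standard library as a stand-in; this is not the method the spellchecker package uses, and VOCABULARY, the 0.75 cutoff, and the function names are assumptions made for this illustration.

```python
import difflib

# Hypothetical mini-vocabulary; a real spell checker ships a large word list.
VOCABULARY = ["this", "is", "an", "example", "sentence",
              "with", "spelling", "mistakes"]


def correct_word(word, vocabulary=VOCABULARY, cutoff=0.75):
    """Return the closest vocabulary word, or the word unchanged when it is
    already known, is not purely alphabetic, or has no close match."""
    if not word.isalpha() or word.lower() in vocabulary:
        return word  # punctuation, numbers and known words pass through
    matches = difflib.get_close_matches(word.lower(), vocabulary,
                                        n=1, cutoff=cutoff)
    return matches[0] if matches else word


def correct_words(words):
    """Apply correct_word() to every token in a list."""
    return [correct_word(w) for w in words]


tokens = ["This", "is", "an", "exampel", "sentense",
          "with", "speling", "mistakes", "."]
print(correct_words(tokens))
# ['This', 'is', 'an', 'example', 'sentence', 'with', 'spelling', 'mistakes', '.']
```

Corrected words come back in lower case because the lookup is case-insensitive; restoring the original capitalisation is left out to keep the sketch short.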

I. Resources/Equipment Required

Sr. No. | Instrument/Equipment/Components/Trainer kit | Specification
1 | Computer system with operating system | Windows 7 or higher, macOS, or Linux, with 4 GB or more RAM; Python versions 2.7.X, 3.6.X
2 | Python IDEs and code editors | Jupyter, Spyder, Google Colab; open source: Anaconda Navigator

J. Safety and necessary Precautions followed

• Read the experiment thoroughly before starting and ensure that you
  understand all the steps and concepts involved from underpinning
  theory.
• Keep the workspace clean and organized, free from clutter and unnecessary
  materials.
• Use the software according to its intended purpose and instructions.
• Ensure that all the necessary equipment and software are in good working
  condition.
• Never eat or drink in the lab, as it can cause contamination and create
  safety hazards.
• If any accidents or injuries occur, immediately notify the instructor and seek
  medical attention if necessary.

K. Procedure to be followed/Source code


Students must use the space below to write the source code. Understand and
re-implement different methods for handling data. (Functions must be used extensively.)

Source Code & Output:


_______________________________________________________________________
_______________________________________________________________________
_______________________________________________________________________
_______________________________________________________________________
_______________________________________________________________________
_______________________________________________________________________
_______________________________________________________________________
_______________________________________________________________________
_______________________________________________________________________
_______________________________________________________________________
_______________________________________________________________________
_______________________________________________________________________
_______________________________________________________________________
_______________________________________________________________________
_______________________________________________________________________
_______________________________________________________________________
_______________________________________________________________________
_______________________________________________________________________
_______________________________________________________________________
_______________________________________________________________________
_______________________________________________________________________
_______________________________________________________________________
_______________________________________________________________________
_______________________________________________________________________
_______________________________________________________________________
_______________________________________________________________________
_______________________________________________________________________
_______________________________________________________________________
_______________________________________________________________________
_______________________________________________________________________
_______________________________________________________________________
_______________________________________________________________________
_______________________________________________________________________
_______________________________________________________________________
_______________________________________________________________________
_______________________________________________________________________
_______________________________________________________________________
_______________________________________________________________________
_______________________________________________________________________
_______________________________________________________________________
_______________________________________________________________________
_______________________________________________________________________
_______________________________________________________________________
_______________________________________________________________________
_______________________________________________________________________
_______________________________________________________________________
_______________________________________________________________________
_______________________________________________________________________
_______________________________________________________________________
_______________________________________________________________________
_______________________________________________________________________
_______________________________________________________________________
_______________________________________________________________________
_______________________________________________________________________
_______________________________________________________________________
_______________________________________________________________________
_______________________________________________________________________
_______________________________________________________________________
_______________________________________________________________________
_______________________________________________________________________
_______________________________________________________________________
_______________________________________________________________________
_______________________________________________________________________
_______________________________________________________________________
_______________________________________________________________________
_______________________________________________________________________
_______________________________________________________________________
_______________________________________________________________________
_______________________________________________________________________
_______________________________________________________________________
Output:
M. References / Suggestions
1. https://www.geeksforgeeks.org/machine-learning/
2. https://www.geeksforgeeks.org/natural-language-processing-nlp-tutorial/
3. https://www.tutorialspoint.com/machine_learning_with_python/index.htm

N. Assessment-Rubrics

Criteria | Total Marks | Exceptional (5 Marks) | Satisfactory (4 to 3 Marks) | Developing (2 Marks) | Limited (1 Mark)
Engagement | /5 | Performed practical him/herself | Performed practical with others' help | Watched other students performing practical but not tried him/herself | Present in practical session but not attentively participated in performance
Accuracy | /5 | Accurately done | 1-2 errors/mistakes found | 3-5 errors/mistakes identified | More than 5 errors/mistakes committed
Documentation | /5 | No errors; program is well executed and documented properly | Complete write-up and output tables but presentation is poor | Some of the commands missing with missing outputs | Poor write-up and diagram or missing content
Understanding & Explanation | /5 | Fully understood the performance & can explain perfectly | Understood the performance but cannot explain | Partially understood the performance & can give little explanation | Partially understood and cannot give explanation
Time | /5 | Completed the work within 1 week | Work is submitted later than 1 week but by the end of 2nd week | Work done after 2nd week but before the end of 3rd week | Work submitted after 3 weeks' time

Total Marks: /25 Signature with Date:
