17 - Project Report - NLP-2-27
DECLARATION
We hereby declare that the project entitled “Classification and Analysis of News Reports” submitted for
the CSDL7013 - NLP Lab of final year (Computer Engineering) of Mumbai University syllabus course
work is the work/hypothesis/algorithm and design/mathematical modelling and result analysis for the
Place: Mumbai
Date: 07/10/2024
Don Bosco Institute of Technology, Kurla(W), Mumbai-400 070
DEPARTMENT OF COMPUTER ENGINEERING
CERTIFICATE
This is to certify that the project titled “Classification and Analysis of News Reports” is the bonafide
work carried out by Gavin Pereira, Nicole Saldanha & Ananya Solanki, the students of Final Year,
Department of Computer Engineering, Don Bosco Institute of Technology, Kurla (W), Mumbai-400 070
affiliated to Mumbai University, Mumbai, Maharashtra (India) during the academic year 2024-2025, in
______________________
Place : Mumbai
Date: 07/10/2024
ABSTRACT
In today's digital age, the volume of news content generated and consumed is unprecedented,
making it increasingly challenging for readers to find relevant information amidst the noise.
With countless articles published daily across various platforms, effective categorization is
essential to streamline information retrieval and improve user experience. News classification
helps organize articles into specific categories, enabling users to quickly access topics of
interest and stay informed about current events.
Moreover, the sheer variety of news topics—ranging from politics and economics to health and
technology—necessitates a systematic approach to classification. Manual categorization is not
only time-consuming but also prone to human error and inconsistency. Therefore, automating
this process using Natural Language Processing (NLP) and machine learning techniques
presents a viable solution.
Implementing an automated news classification system enhances efficiency and allows media
organizations to provide personalized content recommendations. Furthermore, it can aid in
sentiment analysis, trend detection, and other analytical tasks that require understanding vast
amounts of textual data. By developing a robust classification model, this project aims to
address the challenges of news categorization, facilitating better access to information and
insights for readers and researchers alike.
CHAPTER 1 : INTRODUCTION
In recent years, Natural Language Processing (NLP) has emerged as a pivotal field in computer science,
focusing on the interaction between computers and human language. With the rapid increase in digital
content, the ability to analyze and interpret vast amounts of text has become essential for various
applications, ranging from sentiment analysis to automated customer support. Text classification, a core
component of NLP, involves categorizing text into predefined classes, enabling organizations to glean
insights from unstructured data effectively.
Despite significant advancements in NLP, challenges persist in achieving high accuracy and efficiency
in text classification. Traditional methods such as Bag of Words (BOW) and Term Frequency-Inverse
Document Frequency (TF-IDF) have laid the groundwork for text representation. However, many
researchers have noted that while these techniques are effective, they often struggle to capture
contextual relationships within the text, leading to suboptimal performance in classification tasks.
Consequently, exploring the nuances of these models and enhancing their performance through
hyperparameter optimization has become an area of active research.
This study aims to investigate the efficacy of the Bag of Words model in text classification compared to
TF-IDF, specifically focusing on optimizing hyperparameters to improve accuracy. The hypothesis
posits that employing BOW with optimized parameters will yield higher accuracy in classifying news
articles compared to TF-IDF, thus demonstrating the potential for improved model performance in text
classification tasks.
To test this hypothesis, a series of experiments were conducted utilizing a dataset of news articles,
employing various text representation methods and tuning model hyperparameters. The findings from
this research are expected to contribute to the broader field of NLP by providing insights into model
selection and optimization techniques, ultimately enhancing the reliability and accuracy of text
classification systems in real-world applications.
CHAPTER 2 : LITERATURE SURVEY
Text classification is a fundamental task in Natural Language Processing (NLP) that involves
categorizing text into predefined categories based on its content. It plays a vital role in various
applications, such as sentiment analysis, spam detection, topic categorization, and more. With the
exponential growth of unstructured text data generated from social media, news articles, and user-
generated content, the need for effective text classification techniques has become increasingly crucial.
8. Sentiment Analysis and Text Classification Using Machine Learning Techniques
The study explores the intersection of sentiment analysis and text classification, employing
machine learning techniques to accurately classify sentiments in textual data. It discusses the
preprocessing steps and features that contribute to model effectiveness.
CHAPTER 3 : PROPOSED SYSTEM
In this chapter, we outline the proposed system for the classification and analysis of news reports using
Natural Language Processing (NLP) and machine learning techniques. The primary objective of this
system is to categorize news articles into distinct categories based on their content, enabling easier
access to information and improved user experiences. The proposed system utilizes a structured
approach to data preprocessing, model selection, and evaluation.
System Overview
The proposed system consists of several key components: data collection, data preprocessing, feature
extraction, model training, and evaluation. Each component plays a crucial role in the overall
effectiveness of the system.
1. Data Collection
The system begins by loading a dataset of news articles from a specified directory. The dataset,
english_news_dataset.csv, contains various features, including the content of the articles
and their corresponding categories. The data is read into a Pandas DataFrame for easy manipulation and
analysis.
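This loading step can be sketched as follows. Since the real english_news_dataset.csv is not reproduced here, a small in-memory sample stands in for it; the column names (Content, Category, Date) are assumptions.

```python
import io

import pandas as pd

# In the project the file is loaded directly, e.g.:
#   df = pd.read_csv("english_news_dataset.csv")
# A small in-memory sample stands in for the real file here; the
# column names (Content, Category, Date) are assumptions.
sample_csv = io.StringIO(
    "Content,Category,Date\n"
    '"Markets rallied after the budget announcement.",business,2024-01-15\n'
    '"The team clinched the series in the final over.",sports,2024-01-16\n'
)
df = pd.read_csv(sample_csv)

print(df.shape)    # prints (2, 3): two articles, three features
print(df.head())   # first rows, to eyeball structure and content
```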
2. Data Preprocessing
Data preprocessing is a critical step in ensuring the quality of input data for machine learning models.
The following preprocessing steps are performed:
• Handling Missing and Duplicated Data: The system checks for and removes any null values
and duplicated entries in the dataset to maintain data integrity.
• Date Formatting: The Date column is converted into a datetime format, allowing for easier
extraction of features such as year, month, and day.
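A minimal sketch of these preprocessing steps on a toy frame; the column names mirror the steps described above but are assumptions.

```python
import pandas as pd

# Toy frame with one duplicate row and one missing value; the column
# names mirror the steps described above but are assumptions.
df = pd.DataFrame({
    "Content": ["a story", "a story", None, "another story"],
    "Category": ["sports", "sports", "politics", "politics"],
    "Date": ["2024-01-15", "2024-01-15", "2024-02-01", "2024-02-02"],
})

df = df.dropna()             # drop rows with null values
df = df.drop_duplicates()    # drop exact duplicate entries

# Convert the Date column and derive simple calendar features
df["Date"] = pd.to_datetime(df["Date"])
df["Year"] = df["Date"].dt.year
df["Month"] = df["Date"].dt.month
df["Day"] = df["Date"].dt.day

print(len(df))   # prints 2: the null row and the duplicate are gone
```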
3. Feature Extraction
The system employs a Bag of Words model for feature extraction, which transforms the text data into a
numerical format that machine learning algorithms can process. This transformation involves
converting the cleaned content into a matrix of token counts, allowing the model to understand the text
data better.
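The token-count transformation can be illustrated with scikit-learn's CountVectorizer; the documents below are illustrative, not from the dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Three tiny documents stand in for the cleaned article content.
docs = [
    "election results announced today",
    "team wins final match today",
    "election campaign enters final week",
]

vectorizer = CountVectorizer()        # Bag of Words: raw token counts
X = vectorizer.fit_transform(docs)    # sparse document-term matrix

print(X.shape)                        # prints (3, 11): 3 docs, 11 terms
print(sorted(vectorizer.vocabulary_)) # the learned vocabulary
```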
4. Model Training
For the classification task, the Multinomial Naive Bayes classifier is selected due to its effectiveness in
dealing with text data. The dataset is split into training and testing sets, with 80% of the data used for
training and 20% for testing. The model is then trained on the training set and evaluated on the testing
set to assess its accuracy.
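The split-and-train step might look as follows; the corpus and labels here are synthetic stand-ins for the real data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Tiny illustrative corpus; real training would use the full dataset.
texts = [
    "stock market rises", "shares fall on earnings", "budget boosts stocks",
    "team wins the cup", "striker scores twice", "coach praises defence",
] * 5    # repeated so an 80/20 split leaves examples in both sets
labels = ["business", "business", "business", "sports", "sports", "sports"] * 5

X = CountVectorizer().fit_transform(texts)
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.2, random_state=42, stratify=labels
)

clf = MultinomialNB()
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))   # accuracy on the held-out 20%
```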
5. Evaluation
The performance of the model is evaluated using various metrics, including accuracy, confusion matrix,
and classification report. Cross-validation is also employed to validate the model's performance across
different subsets of the data. This helps ensure that the model is not overfitting and can generalize well
to unseen data.
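These evaluation steps can be sketched as follows, again on synthetic data; on the real dataset the scores would of course differ.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB

texts = ["markets rally", "shares slump", "team wins", "coach resigns"] * 6
y = ["business", "business", "sports", "sports"] * 6

X = CountVectorizer().fit_transform(texts)
clf = MultinomialNB().fit(X, y)
pred = clf.predict(X)

print(confusion_matrix(y, pred))       # rows: true class, columns: predicted
print(classification_report(y, pred))  # precision, recall, F1 per class

scores = cross_val_score(clf, X, y, cv=3)   # 3-fold cross-validation
print(scores.mean())                        # mean accuracy across folds
```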
6. Hyperparameter Tuning
To further enhance the model's performance, hyperparameter tuning is conducted using Randomized
Search CV. This process identifies the best combination of parameters for the feature extraction and
classification algorithms, optimizing model accuracy.
Once the best model is obtained, it is evaluated against the test set. The results are visualized using bar
charts to compare correct and wrong predictions, providing insights into the model's performance and
areas for improvement.
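A sketch of the tuning step using scikit-learn's RandomizedSearchCV; the search space mirrors the parameters discussed above, but the exact ranges, n_iter, and the toy data are assumptions.

```python
from scipy.stats import uniform
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import RandomizedSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

texts = [
    "markets rally today", "shares slump badly",
    "team wins cup", "striker scores goal",
] * 6
y = ["business", "business", "sports", "sports"] * 6

pipe = Pipeline([
    ("vect", CountVectorizer()),
    ("clf", MultinomialNB()),
])

# Search space mirrors the parameters named above; exact ranges are assumptions.
param_dist = {
    "vect__max_features": [None, 1000, 5000],
    "vect__ngram_range": [(1, 1), (1, 2)],
    "clf__alpha": uniform(0.01, 2.0),    # smoothing strength
}

search = RandomizedSearchCV(pipe, param_dist, n_iter=5, cv=3, random_state=42)
search.fit(texts, y)

print(search.best_params_)   # best combination found by the random search
print(search.best_score_)    # its mean cross-validated accuracy
```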
Conclusion
The proposed system effectively utilizes various NLP and machine learning techniques to classify news
reports into different categories. By implementing a systematic approach to data preprocessing, feature
extraction, model training, and evaluation, the system achieves a robust classification performance.
Future work could involve exploring additional machine learning algorithms and more advanced NLP
techniques, such as transformer models, to further enhance classification accuracy and insight extraction
from news articles.
CHAPTER 4 : IMPLEMENTATION AND RESULT ANALYSIS
We present an analysis of the results obtained from the classification of news articles using a
Multinomial Naive Bayes model with a Bag of Words (BOW) feature extraction approach. The
performance metrics were evaluated based on the accuracy, classification report, and
visualizations of prediction outcomes.
The dataset consisted of news articles that underwent extensive preprocessing, as detailed below.
Upon loading the dataset from the English news articles, a thorough analysis was conducted to
understand its structure:
• Dataset Shape and Quality: The initial exploration revealed the dataset's dimensions,
providing insights into the number of entries and features. The first few rows were
printed to visualize the structure and content of the data, ensuring its relevance and
integrity.
• Missing Values and Duplicates: An assessment for missing values was performed,
revealing any gaps that could affect model training. Similarly, checks for duplicate
entries were essential to maintain the quality of the dataset. The absence of significant
missing values or duplicates suggested that the dataset was largely clean and ready for
further processing.
One significant aspect of the dataset was the presence of class imbalance, where some
categories had significantly fewer instances than others.
To prepare the data for training, several preprocessing steps were implemented, ensuring that
the text data was clean and suitable for the model:
1. Text Cleaning:
◦ HTML Tags: Any HTML tags present in the news articles were removed using
the BeautifulSoup library to prevent interference during analysis.
◦ URLs and Emojis: URLs were stripped from the content to focus solely on the
textual information. Emojis were also removed, as they can introduce ambiguity
and noise in textual data.
◦ Punctuation and Stopwords: The removal of punctuation and common
stopwords further refined the text, allowing the model to focus on more
meaningful words that contribute to classification.
2. Text Normalization:
◦ Label Encoding: The target labels (news categories) were encoded numerically
using LabelEncoder, making them suitable for model training.
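A condensed sketch of the cleaning and encoding steps above. The report uses BeautifulSoup and a standard stopword list; to keep this sketch self-contained, a regex stands in for the HTML parser and a tiny hand-picked stopword set is used.

```python
import re
import string

from sklearn.preprocessing import LabelEncoder

# Stand-ins: a regex replaces BeautifulSoup for tag stripping, and a
# tiny hand-picked stopword set replaces a full stopword list.
STOPWORDS = {"the", "a", "an", "of", "in", "on", "and", "to"}

def clean_text(text: str) -> str:
    text = re.sub(r"<[^>]+>", " ", text)              # strip HTML tags
    text = re.sub(r"https?://\S+", " ", text)         # strip URLs
    text = text.encode("ascii", "ignore").decode()    # drop emojis/non-ASCII
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = [w for w in text.lower().split() if w not in STOPWORDS]
    return " ".join(tokens)

cleaned = clean_text("<p>PM speaks on the budget! See https://example.com \U0001F389</p>")
print(cleaned)   # prints: pm speaks budget see

# Encode the category labels numerically for model training
le = LabelEncoder()
codes = le.fit_transform(["sports", "business", "sports"])
print(codes.tolist())   # prints [1, 0, 1]: classes are sorted alphabetically
```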
After preprocessing, the data was split into training and testing sets to assess model
performance effectively:
• Model Selection: The Multinomial Naive Bayes algorithm was chosen for its suitability
for text classification tasks. It operates based on the probability of words occurring in
different classes, making it ideal for this type of problem.
• Model Training: The model was trained using a pipeline that combined
CountVectorizer for transforming text into feature vectors and the Naive Bayes
classifier for categorization.
• Accuracy Assessment:
◦ The classification report provided detailed metrics, including precision, recall,
and F1-score for each category. These metrics are crucial for understanding the
model's performance across different classes and identifying areas needing
improvement, particularly for categories that may have lower precision or recall.
Cross-Validation and Hyperparameter Tuning
To ensure the model's robustness and generalizability, additional evaluation methods were
employed:
• Cross-Validation:
◦ A bar chart comparing correct and incorrect predictions illustrated the model's
effectiveness visually. The chart showed a substantial number of correct
predictions (green) relative to incorrect ones (red), reinforcing the model's
reliability.
• Final DataFrame of Predictions:
The project effectively demonstrated the application of NLP techniques for categorizing news
reports. Through careful data preprocessing, model selection, and hyperparameter tuning, a
robust classification system was developed. The achieved accuracy and cross-validation results
underscore the potential of using machine learning in text classification tasks.
The cross-validation process using Stratified K-Folds revealed consistent performance across
different data splits. The accuracy scores for each of the three folds were as follows:
• Fold 1: 0.8806
• Fold 2: 0.8800
• Fold 3: 0.8806
This led to a mean accuracy of 0.88, indicating that the model was stable and performed well
regardless of the specific data split. The cross-validation helped ensure that the model's
performance was not overly dependent on any single train-test split.
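The fold-wise evaluation can be reproduced in outline with scikit-learn's StratifiedKFold; the three-fold setup matches the report, while the data here is a synthetic stand-in.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB

texts = ["markets rally", "shares slump", "team wins", "coach resigns"] * 9
y = np.array(["business", "business", "sports", "sports"] * 9)

X = CountVectorizer().fit_transform(texts)

# Three stratified folds, as in the report; each fold preserves the
# class proportions of the full dataset.
skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
scores = cross_val_score(MultinomialNB(), X, y, cv=skf)

for i, s in enumerate(scores, 1):
    print(f"Fold {i}: {s:.4f}")
print(f"Mean accuracy: {scores.mean():.2f}")
```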
Fine-tuning
The randomized search identified the following best parameters:
• max_features: None
• ngram_range: (1, 2)
• alpha: 1.92
This optimized model was then re-trained on the training data, and when tested, it achieved a
significant improvement in accuracy, reaching 0.933 on the test set.
Error Analysis
To further analyze the model's performance, the number of correct and incorrect predictions
was examined:
A visualization of the prediction outcomes highlighted the proportion of correct versus wrong
predictions, with correct predictions (green) vastly outnumbering incorrect ones (red).
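Such a chart can be produced with matplotlib; the counts below are the fine-tuned model's results reported in this chapter.

```python
import matplotlib
matplotlib.use("Agg")   # render off-screen, no display needed
import matplotlib.pyplot as plt

# Counts reported for the fine-tuned model in this chapter.
correct, wrong = 37246, 2696

fig, ax = plt.subplots()
ax.bar(["Correct", "Wrong"], [correct, wrong], color=["green", "red"])
ax.set_ylabel("Number of predictions")
ax.set_title("Correct vs. wrong predictions")
fig.savefig("prediction_outcomes.png")

print(round(correct / (correct + wrong), 3))   # prints 0.933, the implied accuracy
```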
3. Model Accuracy: After fine-tuning, the optimized model achieved an accuracy of
0.933, with 37,246 correct predictions and 2,696 wrong predictions, reflecting a highly
effective classification system.
4. Potential for Improvement: Exploring other machine learning models (e.g., Support
Vector Machines or neural networks) might further enhance accuracy and reduce
misclassification.
Performance Metrics:
1. Accuracy
The initial accuracy score of the Multinomial Naive Bayes model with the Bag of Words
representation was approximately 0.984. This indicates that the model accurately
classified about 98.4% of the news articles in the test set, reflecting a robust
performance.
2. Hyperparameter Optimization
• max_features: None
• ngram_range: (1, 2)
• alpha: 0.1116
Using these hyperparameters significantly improved the model’s accuracy, achieving a
score of 0.984. This demonstrates the model's ability to generalize well with the selected
parameters.
3. Cross-Validation Scores
A bar chart was generated to illustrate the distribution of correct and wrong predictions. The
chart clearly displayed the high number of correct predictions compared to the relatively few
incorrect ones, reinforcing the model's effectiveness.
1. Bag of Words (BOW) vs. TF-IDF: The BOW approach outperformed the TF-IDF
method in this scenario, leading to a higher accuracy. This suggests that for this specific
dataset and classification task, the simpler BOW representation provided more relevant
features for the Naive Bayes model.
2. High Correct Predictions: The model's ability to achieve 39,300 correct predictions
with only 642 incorrect predictions indicates its reliability. Such performance is
promising for real-world applications, where high accuracy is essential for credibility.
3. Potential for Other Models: While the Multinomial Naive Bayes model performed
exceptionally well, exploring other machine learning models could potentially yield even
higher accuracy and correct predictions. Models such as Logistic Regression, Random
Forest, or advanced deep learning approaches may enhance performance further,
especially in more complex datasets.
CONCLUSION
Despite advancements, text classification faces challenges such as handling class imbalance, addressing
noisy data, and effectively representing text data in a form suitable for machine learning models.
Researchers continue to explore innovative techniques for feature extraction, such as Word Embeddings
(Word2Vec, GloVe) and Transformers (BERT, GPT), which improve classification accuracy and
contextual understanding.
The findings from this project underscore the effectiveness of using the Bag of Words (BOW) model
over the Term Frequency-Inverse Document Frequency (TF-IDF) method, demonstrating that BOW
achieved a significantly higher accuracy in text classification tasks. Through careful hyperparameter
tuning, specifically setting max_features to None, ngram_range to (1, 2), and alpha to 0.1116, the
model attained an impressive accuracy of 98.4%.
The performance metrics revealed that the model made 39,300 correct predictions, indicating a robust
classification capability, while only 642 predictions were incorrect. This high ratio of correct predictions
suggests that the model effectively captured the nuances of the dataset, providing a strong foundation
for practical applications.
Furthermore, the exploration of alternative models could potentially yield even higher accuracy and
correct prediction rates. This highlights the importance of continual experimentation and optimization in
the field of text classification. Overall, these insights affirm the value of strategic model selection and
parameter tuning in achieving superior performance in NLP tasks, paving the way for future research
and advancements in this domain.
In summary, text classification is a crucial aspect of NLP, facilitating the organization and interpretation
of vast amounts of textual information. As research and technology continue to progress, the methods
and applications of text classification are expected to expand, further enhancing its significance in
various domains.
REFERENCES
8. Sentiment Analysis and Text Classification Using Machine Learning Techniques
https://ieeexplore.ieee.org/document/8671886
ACKNOWLEDGEMENT
I am especially thankful to the Computer Department for their invaluable guidance throughout this
project. The faculty’s dedication to student development has significantly influenced my learning
experience.
My sincere appreciation goes to Dr. Phiroj Shaikh, HOD and subject in-charge, whose continuous
encouragement and insightful feedback played a crucial role in shaping this work. His expertise
provided me with the clarity and direction needed to navigate the complexities of this project.
I also wish to acknowledge the contributions of my team members, whose collaboration and diverse
perspectives enriched this project significantly. Lastly, I extend my thanks to my family and friends for
their unwavering support and motivation during this journey. Thank you all for your invaluable
contributions.
1. Gavin Pereira B. E. – 47
2. Nicole Saldanha B. E. – 55
3. Ananya Solanki B. E. – 66