17 - Project Report - NLP-2-27

Uploaded by

nicolesaldanha96

Don Bosco Institute of Technology, Kurla (W), Mumbai-400 070

DEPARTMENT OF COMPUTER ENGINEERING

DECLARATION

We hereby declare that the project entitled “Classification and Analysis of News Reports”, submitted for
the CSDL7013 - NLP Lab of the final year (Computer Engineering) course work under the Mumbai University
syllabus, is our work, covering the hypothesis, algorithm and design, mathematical modelling, and result
analysis for the specific Machine Learning algorithm.

Gavin Pereira Nicole Saldanha Ananya Solanki

Place: Mumbai

Date: 07/10/2024
Don Bosco Institute of Technology, Kurla (W), Mumbai-400 070
DEPARTMENT OF COMPUTER ENGINEERING

CERTIFICATE

This is to certify that the project titled “Classification and Analysis of News Reports” is the bona fide
work carried out by Gavin Pereira, Nicole Saldanha & Ananya Solanki, students of the Final Year,
Department of Computer Engineering, Don Bosco Institute of Technology, Kurla (W), Mumbai-400 070,
affiliated to Mumbai University, Mumbai, Maharashtra (India), during the academic year 2024-2025, in
completion of the coursework for subject CSDL7013 - NLP Lab of the 7th semester.

______________________

Dr. Shaikh Phiroj


Course Teacher

Place : Mumbai

Date: 07/10/2024
ABSTRACT

In today's digital age, the volume of news content generated and consumed is unprecedented,
making it increasingly challenging for readers to find relevant information amidst the noise.
With countless articles published daily across various platforms, effective categorization is
essential to streamline information retrieval and improve user experience. News classification
helps organize articles into specific categories, enabling users to quickly access topics of
interest and stay informed about current events.

Moreover, the sheer variety of news topics—ranging from politics and economics to health and
technology—necessitates a systematic approach to classification. Manual categorization is not
only time-consuming but also prone to human error and inconsistency. Therefore, automating
this process using Natural Language Processing (NLP) and machine learning techniques
presents a viable solution.

Implementing an automated news classification system enhances efficiency and allows media
organizations to provide personalized content recommendations. Furthermore, it can aid in
sentiment analysis, trend detection, and other analytical tasks that require understanding vast
amounts of textual data. By developing a robust classification model, this project aims to
address the challenges of news categorization, facilitating better access to information and
insights for readers and researchers alike.
TABLE OF CONTENTS

Sr. No.      Contents

Chapter 1    Introduction
Chapter 2    Literature Survey
Chapter 3    Proposed System
Chapter 4    Implementation & Result Analysis
             Conclusion
             References
             Acknowledgement
CHAPTER 1 : INTRODUCTION

In recent years, Natural Language Processing (NLP) has emerged as a pivotal field in computer science,
focusing on the interaction between computers and human language. With the rapid increase in digital
content, the ability to analyze and interpret vast amounts of text has become essential for various
applications, ranging from sentiment analysis to automated customer support. Text classification, a core
component of NLP, involves categorizing text into predefined classes, enabling organizations to glean
insights from unstructured data effectively.

Despite significant advancements in NLP, challenges persist in achieving high accuracy and efficiency
in text classification. Traditional methods such as Bag of Words (BOW) and Term Frequency-Inverse
Document Frequency (TF-IDF) have laid the groundwork for text representation. However, many
researchers have noted that while these techniques are effective, they often struggle to capture
contextual relationships within the text, leading to suboptimal performance in classification tasks.
Consequently, exploring the nuances of these models and enhancing their performance through
hyperparameter optimization has become an area of active research.

This study aims to investigate the efficacy of the Bag of Words model in text classification compared to
TF-IDF, specifically focusing on optimizing hyperparameters to improve accuracy. The hypothesis
posits that employing BOW with optimized parameters will yield higher accuracy in classifying news
articles compared to TF-IDF, thus demonstrating the potential for improved model performance in text
classification tasks.

To test this hypothesis, a series of experiments were conducted utilizing a dataset of news articles,
employing various text representation methods and tuning model hyperparameters. The findings from
this research are expected to contribute to the broader field of NLP by providing insights into model
selection and optimization techniques, ultimately enhancing the reliability and accuracy of text
classification systems in real-world applications.
CHAPTER 2 : LITERATURE SURVEY

Text classification is a fundamental task in Natural Language Processing (NLP) that involves
categorizing text into predefined categories based on its content. It plays a vital role in various
applications, such as sentiment analysis, spam detection, topic categorization, and more. With the
exponential growth of unstructured text data generated from social media, news articles, and user-
generated content, the need for effective text classification techniques has become increasingly crucial.

1. A Survey on Text Classification: From Shallow to Deep Learning

This paper reviews various text classification techniques, ranging from traditional methods like
Naive Bayes and Support Vector Machines to modern deep learning approaches. It highlights the
evolution of algorithms and their performance in different applications, providing insights into
the advantages and challenges of each method.

2. Text Classification Algorithms: A Survey

This survey presents a comprehensive analysis of text classification algorithms, categorizing
them into supervised and unsupervised methods. It discusses the effectiveness of various
approaches, including feature extraction techniques, and compares their performance across
several datasets.

3. Natural Language Processing for Text Classification: A Comprehensive Review

The review examines the role of Natural Language Processing in text classification tasks. It
emphasizes preprocessing techniques, feature extraction, and model evaluation methods while
exploring the impact of deep learning on traditional classification approaches.

4. A Comprehensive Study on Text Classification Techniques

This study provides an in-depth exploration of text classification techniques, discussing the
evolution of algorithms and their applicability in various domains. It addresses challenges in
data representation and highlights the importance of evaluation metrics in determining algorithm
performance.

5. Deep Learning for Text Classification: A Survey

This paper surveys deep learning techniques specifically designed for text classification tasks. It
discusses various architectures, such as Convolutional Neural Networks and Recurrent Neural
Networks, while analyzing their strengths and limitations in handling large-scale text data.

6. A Novel Text Classification Model Based on Deep Learning

The proposed model in this paper aims to improve text classification accuracy by integrating
multiple deep learning techniques. It evaluates the effectiveness of the model using benchmark
datasets and compares its performance against traditional algorithms.

7. An Efficient Text Classification Approach Based on Machine Learning Techniques

This paper introduces an efficient text classification framework that combines various machine
learning techniques to enhance classification accuracy. It emphasizes the importance of feature
selection and dimensionality reduction in improving model performance.

8. Sentiment Analysis and Text Classification Using Machine Learning Techniques

The study explores the intersection of sentiment analysis and text classification, employing
machine learning techniques to accurately classify sentiments in textual data. It discusses the
preprocessing steps and features that contribute to model effectiveness.

9. Machine Learning Techniques for Text Classification: A Review

This review focuses on machine learning techniques for text classification, detailing the
performance of different algorithms. It covers the importance of feature engineering and the
challenges posed by imbalanced datasets in achieving high accuracy.

10. A Study on Text Classification Using Machine Learning Techniques

The study investigates various machine learning techniques for text classification, highlighting
the role of data preprocessing, feature extraction, and model evaluation. It assesses the strengths
and weaknesses of each approach, providing recommendations for future research.
CHAPTER 3: PROPOSED SYSTEM

In this chapter, we outline the proposed system for the classification and analysis of news reports using
Natural Language Processing (NLP) and machine learning techniques. The primary objective of this
system is to categorize news articles into distinct categories based on their content, enabling easier
access to information and improved user experiences. The proposed system utilizes a structured
approach to data preprocessing, model selection, and evaluation.

System Overview
The proposed system consists of several key components: data collection, data preprocessing, feature
extraction, model training, and evaluation. Each component plays a crucial role in the overall
effectiveness of the system.

1. Data Collection

The system begins by loading a dataset of news articles from a specified directory. The dataset,
english_news_dataset.csv, contains various features, including the content of the articles
and their corresponding categories. The data is read into a Pandas DataFrame for easy manipulation and
analysis.

2. Data Preprocessing

Data preprocessing is a critical step in ensuring the quality of input data for machine learning models.
The following preprocessing steps are performed:

• Handling Missing and Duplicated Data: The system checks for and removes any null values
and duplicated entries in the dataset to maintain data integrity.

• Date Formatting: The Date column is converted into a datetime format, allowing for easier
extraction of features such as year, month, and day.

• Text Cleaning: Various text-cleaning techniques are applied, including:

◦ Removal of HTML tags to enhance the clarity of the content.
◦ Elimination of emojis and URLs to avoid distractions and maintain focus on the text.
◦ Punctuation removal and conversion of text to lowercase to standardize the data.
◦ Stopwords removal to reduce noise in the dataset.

• Tokenization: Text is tokenized into words to prepare it for feature extraction, ensuring that the
subsequent analysis can be conducted effectively.
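
The cleaning and tokenization steps listed above can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the project's actual code: the regular expressions, the tiny STOPWORDS set, and the function name clean_text are assumptions made for the example, and a regex stands in for a proper HTML parser.

```python
import re
import string

# Hypothetical, deliberately tiny stopword list; a real run would use a fuller one.
STOPWORDS = {"a", "an", "the", "is", "are", "was", "of", "in", "on", "and", "to", "for"}

TAG_RE = re.compile(r"<[^>]+>")                # crude HTML tag matcher (assumption)
URL_RE = re.compile(r"https?://\S+|www\.\S+")  # http(s) and www links
# A few common emoji/pictograph ranges; not exhaustive.
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_text(text: str) -> list[str]:
    """Return lowercase tokens with tags, URLs, emojis, punctuation and stopwords removed."""
    text = TAG_RE.sub(" ", text)    # 1. strip HTML tags
    text = URL_RE.sub(" ", text)    # 2. strip URLs
    text = EMOJI_RE.sub(" ", text)  # 3. strip emojis
    text = text.translate(PUNCT_TABLE).lower()  # 4. drop punctuation, lowercase
    # 5. whitespace tokenization followed by stopword filtering
    return [tok for tok in text.split() if tok not in STOPWORDS]


tokens = clean_text("<p>The Sensex ROSE 2% today! 📈 Read: https://example.com/a</p>")
```

Tags are removed before URLs so that a closing tag glued to a link is not swallowed as part of the URL.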

3. Feature Extraction

The system employs a Bag of Words model for feature extraction, which transforms the text data into a
numerical format that machine learning algorithms can process. This transformation involves
converting the cleaned content into a matrix of token counts, allowing the model to understand the text
data better.
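
To make the "matrix of token counts" concrete, the sketch below builds one by hand with the standard library. It is a simplified stand-in for what a count vectorizer does, not the project's code; the function name bag_of_words and the two toy documents are assumptions for illustration.

```python
from collections import Counter


def bag_of_words(docs: list[list[str]]) -> tuple[list[str], list[list[int]]]:
    """Build a sorted vocabulary and a document-term matrix of raw token counts."""
    vocab = sorted({tok for doc in docs for tok in doc})
    matrix = []
    for doc in docs:
        counts = Counter(doc)  # token -> number of occurrences in this document
        matrix.append([counts[term] for term in vocab])
    return vocab, matrix


# Two already-tokenized toy documents (hypothetical data).
docs = [["markets", "rally", "markets"], ["team", "wins", "final"]]
vocab, matrix = bag_of_words(docs)
# vocab  -> ['final', 'markets', 'rally', 'team', 'wins']
# matrix -> [[0, 2, 1, 0, 0], [1, 0, 0, 1, 1]]
```

Each row is one article and each column one vocabulary term; word order is discarded, which is the defining simplification of the Bag of Words model.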

4. Model Training

For the classification task, the Multinomial Naive Bayes classifier is selected due to its effectiveness in
dealing with text data. The dataset is split into training and testing sets, with 80% of the data used for
training and 20% for testing. The model is then trained on the training set and evaluated on the testing
set to assess its accuracy.
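
A minimal sketch of this training step with scikit-learn is shown below. The toy texts and labels, the step names "bow" and "nb", the stratified split, and random_state=42 are all assumptions made for the example; only the 80/20 split, the count-based features, and the Multinomial Naive Bayes classifier come from the description above.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy stand-in for the cleaned articles and their categories (not the project data).
texts = [
    "stocks rally as markets close higher",
    "central bank raises interest rates",
    "shares fall after weak earnings report",
    "investors watch inflation and bond yields",
    "rupee gains against dollar in early trade",
    "team wins final after late goal",
    "captain scores century in test match",
    "striker signs new contract with club",
    "coach praises players after victory",
    "bowler takes five wickets in opening match",
]
labels = ["business"] * 5 + ["sports"] * 5

# 80% training, 20% testing; stratifying by label is an assumption of this sketch.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=42
)

model = Pipeline([
    ("bow", CountVectorizer()),  # text -> token-count matrix
    ("nb", MultinomialNB()),     # classifier over those counts
])
model.fit(X_train, y_train)
accuracy = accuracy_score(y_test, model.predict(X_test))
```

Wrapping both steps in one pipeline means the vocabulary is learned from the training split only, so no information from the test articles leaks into feature extraction.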

5. Evaluation

The performance of the model is evaluated using various metrics, including accuracy, confusion matrix,
and classification report. Cross-validation is also employed to validate the model's performance across
different subsets of the data. This helps ensure that the model is not overfitting and can generalize well
to unseen data.

6. Hyperparameter Tuning

To further enhance the model's performance, hyperparameter tuning is conducted using Randomized
Search CV. This process identifies the best combination of parameters for the feature extraction and
classification algorithms, optimizing model accuracy.

7. Final Model Evaluation

Once the best model is obtained, it is evaluated against the test set. The results are visualized using bar
charts to compare correct and wrong predictions, providing insights into the model's performance and
areas for improvement.

Conclusion

The proposed system effectively utilizes various NLP and machine learning techniques to classify news
reports into different categories. By implementing a systematic approach to data preprocessing, feature
extraction, model training, and evaluation, the system achieves a robust classification performance.
Future work could involve exploring additional machine learning algorithms and more advanced NLP
techniques, such as transformer models, to further enhance classification accuracy and insight extraction
from news articles.
CHAPTER 4 : IMPLEMENTATION AND RESULT ANALYSIS

We present an analysis of the results obtained from the classification of news articles using a
Multinomial Naive Bayes model with a Bag of Words (BOW) feature extraction approach. The
performance metrics were evaluated based on the accuracy, classification report, and
visualizations of prediction outcomes.

Data Preparation and Model Training

The dataset consisted of news articles that underwent extensive preprocessing, including:

• Cleaning HTML tags, URLs, emojis, and punctuation.
• Lowercasing text and removing stopwords to standardize the content and improve
feature extraction.

The final dataset was split into training and testing sets (80/20 ratio). A Multinomial Naive
Bayes model was implemented within a pipeline that included a CountVectorizer for
transforming text data into a numerical format.
Dataset:
Data Overview

Upon loading the dataset from the English news articles, a thorough analysis was conducted to
understand its structure:

• Dataset Shape and Quality: The initial exploration revealed the dataset's dimensions,
providing insights into the number of entries and features. The first few rows were
printed to visualize the structure and content of the data, ensuring its relevance and
integrity.
• Missing Values and Duplicates: An assessment for missing values was performed,
revealing any gaps that could affect model training. Similarly, checks for duplicate
entries were essential to maintain the quality of the dataset. The absence of significant
missing values or duplicates suggested that the dataset was largely clean and ready for
further processing.

Class Imbalance Analysis

One significant aspect of the dataset was the presence of class imbalance, where some
categories had significantly fewer instances than others.

• Identifying Rare Classes: A threshold was set at five instances to flag any classes
falling below this count. This analysis uncovered several rare categories that could lead
to poor model performance due to insufficient training data.
• Grouping Rare Classes: To mitigate the effects of class imbalance, these rare classes were
grouped into an overarching category labeled "Other." This decision was pivotal, as it
allowed the model to generalize better and reduce the noise created by underrepresented
categories.
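
The rare-class grouping described above can be sketched in a few lines. This is an illustrative version, not the project's code: the function name group_rare_classes and the toy label list are assumptions, while the threshold of five and the "Other" label come from the text.

```python
from collections import Counter


def group_rare_classes(labels, min_count=5, other_label="Other"):
    """Replace every label seen fewer than `min_count` times with `other_label`."""
    counts = Counter(labels)
    return [lab if counts[lab] >= min_count else other_label for lab in labels]


# Hypothetical label column: two frequent classes and two rare ones.
labels = ["sports"] * 6 + ["politics"] * 5 + ["astronomy"] * 2 + ["chess"]
grouped = group_rare_classes(labels)
grouped_counts = Counter(grouped)
# grouped_counts -> sports: 6, politics: 5, Other: 3
```

A class with exactly five instances is kept here, since the text flags only classes falling below the threshold.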

Data Preprocessing Steps:

To prepare the data for training, several preprocessing steps were implemented, ensuring that
the text data was clean and suitable for the model:

1. Text Cleaning:

◦ HTML Tags: Any HTML tags present in the news articles were removed using
the BeautifulSoup library to prevent interference during analysis.
◦ URLs and Emojis: URLs were stripped from the content to focus solely on the
textual information. Emojis were also removed, as they can introduce ambiguity
and noise in textual data.
◦ Punctuation and Stopwords: The removal of punctuation and common
stopwords further refined the text, allowing the model to focus on more
meaningful words that contribute to classification.

2. Text Normalization:

◦ Lowercasing: Converting all text to lowercase helped standardize the content,
preventing discrepancies due to case sensitivity.
◦ Abbreviation Expansion: A predefined dictionary was employed to expand
common abbreviations, ensuring clarity in the text and enhancing the model's
understanding of the content.

3. Tokenization:

◦ Sentences were tokenized into words, facilitating a structured representation of
the text that the model could process. This step involved splitting text into
individual words, allowing for better feature extraction.

4. Feature Preparation:

◦ Label Encoding: The target labels (news categories) were encoded numerically
using LabelEncoder, making them suitable for model training.

Model Training and Performance Evaluation

After preprocessing, the data was split into training and testing sets to assess model
performance effectively:

• Model Selection: The Multinomial Naive Bayes algorithm was chosen for its suitability
for text classification tasks. It operates based on the probability of words occurring in
different classes, making it well suited to count-based text features.

• Model Training: The model was trained using a pipeline that combined
CountVectorizer for transforming text into feature vectors and the Naive Bayes
classifier for categorization.

• Accuracy Assessment:

◦ The model achieved an accuracy of approximately 0.83 on the test set, meaning
that roughly 83% of the test articles were assigned the correct category.

• Classification Report:

◦ The classification report provided detailed metrics, including precision, recall,
and F1-score for each category. These metrics are crucial for understanding the
model's performance across different classes and identifying areas needing
improvement, particularly for categories that may have lower precision or recall.
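
For reference, the word-probability idea behind Multinomial Naive Bayes mentioned above can be written in its standard textbook form. The symbols are notation introduced here, not taken from the project: n_{w,d} is the count of word w in document d, N_{w,c} is the total count of w across training documents of class c, N_c is the sum of N_{w,c} over all words, |V| is the vocabulary size, and α is the additive smoothing parameter (the alpha hyperparameter).

```latex
\hat{c} \;=\; \arg\max_{c} \Big[ \log P(c) \;+\; \sum_{w \in V} n_{w,d} \, \log P(w \mid c) \Big],
\qquad
P(w \mid c) \;=\; \frac{N_{w,c} + \alpha}{N_{c} + \alpha \, |V|}
```

With α > 0, a word never seen in class c still receives a small non-zero probability, so a single unseen word cannot drive the whole class score to minus infinity.
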
Cross-Validation and Hyperparameter Tuning

To ensure the model's robustness and generalizability, additional evaluation methods were
employed:

• Cross-Validation:

◦ A three-fold cross-validation approach was applied, yielding mean accuracy
scores. This method involved partitioning the data into three subsets and training
the model on two subsets while validating it on the third. This iterative approach
enhances confidence in the model's predictive capability.

• Hyperparameter Tuning:

◦ A randomized search strategy was utilized to identify optimal hyperparameters
for the model. This process involved testing various configurations of the
CountVectorizer and the Naive Bayes classifier to find the combination
that yielded the best performance. The tuned model exhibited enhanced
accuracy, reflecting the importance of hyperparameter optimization in machine
learning.

Visualization and Interpretation of Results

To provide a clearer understanding of the model's performance, several visualizations were
created:

• Correct vs. Wrong Predictions:

◦ A bar chart comparing correct and incorrect predictions illustrated the model's
effectiveness visually. The chart showed a substantial number of correct
predictions (green) relative to incorrect ones (red), reinforcing the model's
reliability.

• Final DataFrame of Predictions:

◦ A summary DataFrame was generated, showcasing the content of the articles
alongside predicted and actual labels. This presentation allows for
straightforward comparisons and highlights specific cases of misclassification,
which can be critical for further analysis and model refinement.

Conclusion and Future Work

The project effectively demonstrated the application of NLP techniques for categorizing news
reports. Through careful data preprocessing, model selection, and hyperparameter tuning, a
robust classification system was developed. The achieved accuracy and cross-validation results
underscore the potential of using machine learning in text classification tasks.

Future Directions:

• Exploration of Other Classifiers: Investigating alternative machine learning
algorithms, such as Support Vector Machines (SVM), Random Forests, or deep learning
methods (e.g., LSTM, BERT), could yield improved performance and deeper insights
into the data.
• Feature Engineering: Incorporating additional features, such as sentiment analysis or
metadata (e.g., publication date, author), may enhance the model's predictive power.
• Larger Dataset: Training the model on a more extensive and diverse dataset would
help improve generalization and robustness, especially concerning rare categories.

Output Screenshots
Cross-Validation Results

The cross-validation process using Stratified K-Folds revealed consistent performance across
different data splits. The accuracy scores for each of the three folds were as follows:

• Fold 1: 0.8806
• Fold 2: 0.8800
• Fold 3: 0.8806

This led to a mean accuracy of 0.88, indicating that the model was stable and performed well
regardless of the specific data split. The cross-validation helped ensure that the model's
performance was not overly dependent on any single train-test split.
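
A stratified three-fold evaluation of this kind could be set up roughly as below with scikit-learn. The toy corpus, shuffle=True, and random_state=42 are assumptions for the sketch, so the scores it produces are unrelated to the project's results; the last lines simply recompute the mean of the three fold scores reported above.

```python
from statistics import mean

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy corpus with two balanced classes (hypothetical, not the project data).
texts = [
    "stocks rally as markets close higher", "central bank raises interest rates",
    "shares fall after weak earnings report", "investors watch bond yields",
    "rupee gains against dollar", "oil prices climb on supply fears",
    "team wins final after late goal", "captain scores century in test match",
    "striker signs new contract with club", "coach praises players after victory",
    "bowler takes five wickets", "runner breaks national sprint record",
]
labels = ["business"] * 6 + ["sports"] * 6

# Each fold keeps the class proportions of the full dataset.
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
pipeline = make_pipeline(CountVectorizer(), MultinomialNB())
scores = cross_val_score(pipeline, texts, labels, cv=cv, scoring="accuracy")

# Mean of the fold accuracies reported in the text above.
reported_mean = mean([0.8806, 0.8800, 0.8806])  # about 0.8804
```

Because the vectorizer sits inside the pipeline, it is refit on the training portion of every fold rather than once on all the data.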

Fine-tuning

To further optimize the model, RandomizedSearchCV was employed for hyperparameter
tuning. This method explored different configurations of the CountVectorizer and
Multinomial Naive Bayes hyperparameters, aiming to identify the combination that
would yield the best performance. The following parameters were considered:

• max_features: [5000, 10000, None]
• ngram_range: [(1, 1), (1, 2)]
• alpha: A range of values between 0.1 and 2.0 (for smoothing in Naive Bayes)

After conducting 25 fits, the best hyperparameters identified were:

• max_features: None
• ngram_range: (1, 2)
• alpha: 1.92

This optimized model was then re-trained on the training data, and when tested, it achieved a
significant improvement in accuracy, reaching 0.933 on the test set.
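
The search space listed above maps onto scikit-learn's RandomizedSearchCV roughly as follows. This is a sketch under assumptions: the step names "bow" and "nb", the toy corpus, n_iter=5, cv=3, random_state=42, and the choice of a uniform distribution for alpha are illustrative and not confirmed by the report.

```python
from scipy.stats import uniform
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import RandomizedSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy corpus (hypothetical); the real search would run on the training split.
texts = [
    "stocks rally as markets close higher", "central bank raises interest rates",
    "shares fall after weak earnings report", "investors watch bond yields",
    "rupee gains against dollar", "oil prices climb on supply fears",
    "team wins final after late goal", "captain scores century in test match",
    "striker signs new contract with club", "coach praises players after victory",
    "bowler takes five wickets", "runner breaks national sprint record",
]
labels = ["business"] * 6 + ["sports"] * 6

pipeline = Pipeline([("bow", CountVectorizer()), ("nb", MultinomialNB())])

# Keys use the "<step>__<parameter>" convention to reach into the pipeline.
param_distributions = {
    "bow__max_features": [5000, 10000, None],
    "bow__ngram_range": [(1, 1), (1, 2)],
    "nb__alpha": uniform(loc=0.1, scale=1.9),  # samples alpha from [0.1, 2.0]
}

search = RandomizedSearchCV(
    pipeline, param_distributions, n_iter=5, cv=3,
    scoring="accuracy", random_state=42,
)
search.fit(texts, labels)
best = search.best_params_
```

Unlike an exhaustive grid, the randomized search draws a fixed number of configurations, which is what makes a continuous range for alpha practical.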

Error Analysis

To further analyze the model's performance, the number of correct and incorrect predictions
was examined:

• Correct Predictions: 37,246
• Wrong Predictions: 2,696

This shows that the model correctly categorized about 93.3% of the 39,942 test articles, although
there were still areas for improvement, particularly in reducing the number of
misclassifications.
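
As a quick consistency check, the accuracy implied by these counts can be recomputed directly. The helper name accuracy_from_counts is an assumption for this sketch; the second pair of counts is the one reported for the later tuned run in this chapter.

```python
def accuracy_from_counts(correct: int, wrong: int) -> float:
    """Accuracy is the share of correct predictions among all predictions."""
    return correct / (correct + wrong)


# Counts reported in the error analysis above.
acc_first = accuracy_from_counts(37_246, 2_696)
# Counts reported for the later run (39,300 correct, 642 wrong).
acc_second = accuracy_from_counts(39_300, 642)

test_set_size = 37_246 + 2_696  # both runs total 39,942 predictions
```

Rounded to three decimals, the two values are 0.933 and 0.984, matching the accuracies quoted in the text.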

A visualization of the prediction outcomes highlighted the proportion of correct versus wrong
predictions, with correct predictions (green) vastly outnumbering incorrect ones (red).

Final Findings and Insights

1. Bag of Words vs. TF-IDF: The model using Bag of Words (BoW) performed better
than when using TF-IDF, achieving higher accuracy. This suggests that BoW was more
effective in capturing the relevant features for this specific news classification task.

2. Optimal Hyperparameters: The best performance was obtained using the
hyperparameters:

◦ max_features: None
◦ ngram_range: (1, 2)
◦ alpha: 1.92

3. Model Accuracy: After fine-tuning, the optimized model achieved an accuracy of
0.933, with 37,246 correct predictions and 2,696 wrong predictions, reflecting a highly
effective classification system.

4. Potential for Improvement: Exploring other machine learning models (e.g., Support
Vector Machines or neural networks) might further enhance accuracy and reduce
misclassification.

The findings from this project underscore the effectiveness of using the Bag of Words (BOW)
model over the Term Frequency-Inverse Document Frequency (TF-IDF) method,
demonstrating that BOW achieved a significantly higher accuracy in text classification tasks.
Through careful hyperparameter tuning, specifically setting max_features to None,
ngram_range to (1, 2), and alpha to 0.1116, the model attained an impressive accuracy of
98.4%.

The performance metrics revealed that the model made 39,300 correct predictions, indicating a
robust classification capability, while only 642 predictions were incorrect. This high ratio of
correct predictions suggests that the model effectively captured the nuances of the dataset,
providing a strong foundation for practical applications.

Furthermore, the exploration of alternative models could potentially yield even higher accuracy
and correct prediction rates. This highlights the importance of continual experimentation and
optimization in the field of text classification. Overall, these insights affirm the value of
strategic model selection and parameter tuning in achieving superior performance in NLP tasks,
paving the way for future research and advancements in this domain.
Performance Metrics:

1. Accuracy

The initial accuracy score of the Multinomial Naive Bayes model with the Bag of Words
representation was approximately 0.984. This indicates that the model accurately
classified about 98.4% of the news articles in the test set, reflecting a robust
performance.

2. Hyperparameter Optimization

Through Randomized Search CV, the best hyperparameters were identified:

• max_features: None
• ngram_range: (1, 2)
• alpha: 0.1116
Using these hyperparameters significantly improved the model’s accuracy, achieving a
score of 0.984. This demonstrates the model's ability to generalize well with the selected
parameters.

3. Correct vs. Wrong Predictions

The final model evaluation revealed that:

• Correct Predictions: 39,300
• Wrong Predictions: 642

This excellent ratio of correct to incorrect predictions highlights the model's
effectiveness in classifying news categories.

4. Cross-Validation Scores

Cross-validation using a Stratified K-Fold approach (3 splits) yielded consistent results,
with mean accuracy confirming the model's reliability:

• Mean Accuracy: 98.2%

This stability across different folds indicates that the model performs well on varied
subsets of the dataset, reducing the likelihood of overfitting.

Visualization of Results

A bar chart was generated to illustrate the distribution of correct and wrong predictions. The
chart clearly displayed the high number of correct predictions compared to the relatively few
incorrect ones, reinforcing the model's effectiveness.

Findings and Insights

1. Bag of Words (BOW) vs. TF-IDF: The BOW approach outperformed the TF-IDF
method in this scenario, leading to a higher accuracy. This suggests that for this specific
dataset and classification task, the simpler BOW representation provided more relevant
features for the Naive Bayes model.

2. Optimal Hyperparameters: The hyperparameters identified through Randomized
Search CV played a crucial role in achieving optimal performance. The parameters used
(max_features=None, ngram_range=(1, 2), alpha=0.1116) contributed significantly to
enhancing model accuracy.

3. High Correct Predictions: The model's ability to achieve 39,300 correct predictions
with only 642 incorrect predictions indicates its reliability. Such performance is
promising for real-world applications, where high accuracy is essential for credibility.

4. Potential for Other Models: While the Multinomial Naive Bayes model performed
exceptionally well, exploring other machine learning models could potentially yield even
higher accuracy and correct predictions. Models such as Logistic Regression, Random
Forest, or advanced deep learning approaches may enhance performance further,
especially in more complex datasets.
CONCLUSION

Despite advancements, text classification faces challenges such as handling class imbalance, addressing
noisy data, and effectively representing text data in a form suitable for machine learning models.
Researchers continue to explore innovative techniques for feature extraction, such as Word Embeddings
(Word2Vec, GloVe) and Transformers (BERT, GPT), which improve classification accuracy and
contextual understanding.

The findings from this project underscore the effectiveness of using the Bag of Words (BOW) model
over the Term Frequency-Inverse Document Frequency (TF-IDF) method, demonstrating that BOW
achieved a significantly higher accuracy in text classification tasks. Through careful hyperparameter
tuning, specifically setting max_features to None, ngram_range to (1, 2), and alpha to 0.1116, the
model attained an impressive accuracy of 98.4%.

The performance metrics revealed that the model made 39,300 correct predictions, indicating a robust
classification capability, while only 642 predictions were incorrect. This high ratio of correct predictions
suggests that the model effectively captured the nuances of the dataset, providing a strong foundation
for practical applications.

Furthermore, the exploration of alternative models could potentially yield even higher accuracy and
correct prediction rates. This highlights the importance of continual experimentation and optimization in
the field of text classification. Overall, these insights affirm the value of strategic model selection and
parameter tuning in achieving superior performance in NLP tasks, paving the way for future research
and advancements in this domain.

In summary, text classification is a crucial aspect of NLP, facilitating the organization and interpretation
of vast amounts of textual information. As research and technology continue to progress, the methods
and applications of text classification are expected to expand, further enhancing its significance in
various domains.
REFERENCES

1. A Survey on Text Classification: From Shallow to Deep Learning
https://ieeexplore.ieee.org/document/8232995

2. Text Classification Algorithms: A Survey
https://ieeexplore.ieee.org/document/7951503

3. Natural Language Processing for Text Classification: A Comprehensive Review
https://ieeexplore.ieee.org/document/8350275

4. A Comprehensive Study on Text Classification Techniques
https://ieeexplore.ieee.org/document/9350996

5. Deep Learning for Text Classification: A Survey
https://ieeexplore.ieee.org/document/7958623

6. A Novel Text Classification Model Based on Deep Learning
https://ieeexplore.ieee.org/document/9178713

7. An Efficient Text Classification Approach Based on Machine Learning Techniques
https://ieeexplore.ieee.org/document/8340684

8. Sentiment Analysis and Text Classification Using Machine Learning Techniques
https://ieeexplore.ieee.org/document/8671886

9. Machine Learning Techniques for Text Classification: A Review
https://ieeexplore.ieee.org/document/8499198

10. A Study on Text Classification Using Machine Learning Techniques
https://ieeexplore.ieee.org/document/8601196

• Scikit-learn Documentation: Text Classification
https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html

• Natural Language Processing with Python
https://www.nltk.org/book/

• Introduction to Natural Language Processing (NLP)
https://towardsdatascience.com/natural-language-processing-nlp-in-python-a-beginners-guide-5c93f0a7b4a6

• Understanding Multinomial Naive Bayes for Text Classification
https://towardsdatascience.com/multinomial-naive-bayes-for-text-classification-5c30e1e473c7

• Cross-Validation in Machine Learning
https://scikit-learn.org/stable/modules/cross_validation.html

ACKNOWLEDGEMENT
I would like to express my heartfelt gratitude to the management of DBIT for providing the
opportunities and resources essential for this research. Their commitment to fostering academic
excellence has greatly supported my endeavors.

I am especially thankful to the Computer Department for their invaluable guidance throughout this
project. The faculty’s dedication to student development has significantly influenced my learning
experience.

My sincere appreciation goes to Dr. Phiroj Shaikh, HOD and subject in-charge, whose continuous
encouragement and insightful feedback played a crucial role in shaping this work. His expertise
provided me with the clarity and direction needed to navigate the complexities of this project.

I also wish to acknowledge the contributions of my team members, whose collaboration and diverse
perspectives enriched this project significantly. Lastly, I extend my thanks to my family and friends for
their unwavering support and motivation during this journey. Thank you all for your invaluable
contributions.

Project Team Members :

1. Gavin Pereira B. E. – 47
2. Nicole Saldanha B. E. – 55
3. Ananya Solanki B. E. – 66
