17 - Project Report - NLP-2-27
DECLARATION
We hereby declare that the project entitled “Classification and Analysis of News Reports” submitted for
the CSDL7013 - NLP Lab of final year (Computer Engineering) of Mumbai University syllabus course
work is the work/hypothesis/algorithm and design/mathematical modelling and result analysis for the
Place: Mumbai
Date: 07/10/2024
Don Bosco Institute of Technology, Kurla(W), Mumbai-400 070
DEPARTMENT OF COMPUTER ENGINEERING
CERTIFICATE
This is to certify that the project titled “Classification and Analysis of News Reports” is the bonafide
work carried out by Gavin Pereira, Nicole Saldanha & Ananya Solanki, the students of Final Year,
Department of Computer Engineering, Don Bosco Institute of Technology, Kurla (W), Mumbai-400 070
affiliated to Mumbai University, Mumbai, Maharashtra (India) during the academic year 2024-2025, in
______________________
Place : Mumbai
Date: 07/10/2024
ABSTRACT
In today's digital age, the volume of news content generated and consumed is unprecedented,
making it increasingly challenging for readers to find relevant information amidst the noise.
With countless articles published daily across various platforms, effective categorization is
essential to streamline information retrieval and improve user experience. News classification
helps organize articles into specific categories, enabling users to quickly access topics of
interest and stay informed about current events.
Moreover, the sheer variety of news topics—ranging from politics and economics to health and
technology—necessitates a systematic approach to classification. Manual categorization is not
only time-consuming but also prone to human error and inconsistency. Therefore, automating
this process using Natural Language Processing (NLP) and machine learning techniques
presents a viable solution.
Implementing an automated news classification system enhances efficiency and allows media
organizations to provide personalized content recommendations. Furthermore, it can aid in
sentiment analysis, trend detection, and other analytical tasks that require understanding vast
amounts of textual data. By developing a robust classification model, this project aims to
address the challenges of news categorization, facilitating better access to information and
insights for readers and researchers alike.
CHAPTER 1 : INTRODUCTION
In recent years, Natural Language Processing (NLP) has emerged as a pivotal field in computer science,
focusing on the interaction between computers and human language. With the rapid increase in digital
content, the ability to analyze and interpret vast amounts of text has become essential for various
applications, ranging from sentiment analysis to automated customer support. Text classification, a core
component of NLP, involves categorizing text into predefined classes, enabling organizations to glean
insights from unstructured data effectively.
Despite significant advancements in NLP, challenges persist in achieving high accuracy and efficiency
in text classification. Traditional methods such as Bag of Words (BOW) and Term Frequency-Inverse
Document Frequency (TF-IDF) have laid the groundwork for text representation. However, many
researchers have noted that while these techniques are effective, they often struggle to capture
contextual relationships within the text, leading to suboptimal performance in classification tasks.
Consequently, exploring the nuances of these models and enhancing their performance through
hyperparameter optimization has become an area of active research.
This study aims to investigate the efficacy of the Bag of Words model in text classification compared to
TF-IDF, specifically focusing on optimizing hyperparameters to improve accuracy. The hypothesis
posits that employing BOW with optimized parameters will yield higher accuracy in classifying news
articles compared to TF-IDF, thus demonstrating the potential for improved model performance in text
classification tasks.
To test this hypothesis, a series of experiments were conducted utilizing a dataset of news articles,
employing various text representation methods and tuning model hyperparameters. The findings from
this research are expected to contribute to the broader field of NLP by providing insights into model
selection and optimization techniques, ultimately enhancing the reliability and accuracy of text
classification systems in real-world applications.
CHAPTER 2 : LITERATURE SURVEY
Text classification is a fundamental task in Natural Language Processing (NLP) that involves
categorizing text into predefined categories based on its content. It plays a vital role in various
applications, such as sentiment analysis, spam detection, topic categorization, and more. With the
exponential growth of unstructured text data generated from social media, news articles, and user-
generated content, the need for effective text classification techniques has become increasingly crucial.
8. Sentiment Analysis and Text Classification Using Machine Learning Techniques
The study explores the intersection of sentiment analysis and text classification, employing
machine learning techniques to accurately classify sentiments in textual data. It discusses the
preprocessing steps and features that contribute to model effectiveness.
CHAPTER 3 : PROPOSED SYSTEM
In this chapter, we outline the proposed system for the classification and analysis of news reports using
Natural Language Processing (NLP) and machine learning techniques. The primary objective of this
system is to categorize news articles into distinct categories based on their content, enabling easier
access to information and improved user experiences. The proposed system utilizes a structured
approach to data preprocessing, model selection, and evaluation.
System Overview
The proposed system consists of several key components: data collection, data preprocessing, feature
extraction, model training, and evaluation. Each component plays a crucial role in the overall
effectiveness of the system.
1. Data Collection
The system begins by loading a dataset of news articles from a specified directory. The dataset,
english_news_dataset.csv, contains various features, including the content of the articles
and their corresponding categories. The data is read into a Pandas DataFrame for easy manipulation and
analysis.
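This loading step can be sketched as follows. Since the real english_news_dataset.csv is not reproduced here, a small in-memory sample stands in for it; the column names (Content, Category, Date) are assumptions.

```python
import io

import pandas as pd

# In the project the file is loaded directly, e.g.:
#   df = pd.read_csv("english_news_dataset.csv")
# A small in-memory sample stands in for the real file here; the
# column names (Content, Category, Date) are assumptions.
sample_csv = io.StringIO(
    "Content,Category,Date\n"
    '"Markets rallied after the budget announcement.",business,2024-01-15\n'
    '"The team clinched the series in the final over.",sports,2024-01-16\n'
)
df = pd.read_csv(sample_csv)

print(df.shape)    # prints (2, 3): two articles, three features
print(df.head())   # first rows, to eyeball structure and content
```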
2. Data Preprocessing
Data preprocessing is a critical step in ensuring the quality of input data for machine learning models.
The following preprocessing steps are performed:
• Handling Missing and Duplicated Data: The system checks for and removes any null values
and duplicated entries in the dataset to maintain data integrity.
• Date Formatting: The Date column is converted into a datetime format, allowing for easier
extraction of features such as year, month, and day.
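A minimal sketch of these preprocessing steps on a toy frame; the column names mirror the steps described above but are assumptions.

```python
import pandas as pd

# Toy frame with one duplicate row and one missing value; the column
# names mirror the steps described above but are assumptions.
df = pd.DataFrame({
    "Content": ["a story", "a story", None, "another story"],
    "Category": ["sports", "sports", "politics", "politics"],
    "Date": ["2024-01-15", "2024-01-15", "2024-02-01", "2024-02-02"],
})

df = df.dropna()             # drop rows with null values
df = df.drop_duplicates()    # drop exact duplicate entries

# Convert the Date column and derive simple calendar features
df["Date"] = pd.to_datetime(df["Date"])
df["Year"] = df["Date"].dt.year
df["Month"] = df["Date"].dt.month
df["Day"] = df["Date"].dt.day

print(len(df))   # prints 2: the null row and the duplicate are gone
```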
3. Feature Extraction
The system employs a Bag of Words model for feature extraction, which transforms the text data into a
numerical format that machine learning algorithms can process. This transformation involves
converting the cleaned content into a matrix of token counts, allowing the model to understand the text
data better.
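The token-count transformation can be illustrated with scikit-learn's CountVectorizer; the documents below are illustrative, not from the dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Three tiny documents stand in for the cleaned article content.
docs = [
    "election results announced today",
    "team wins final match today",
    "election campaign enters final week",
]

vectorizer = CountVectorizer()        # Bag of Words: raw token counts
X = vectorizer.fit_transform(docs)    # sparse document-term matrix

print(X.shape)                        # prints (3, 11): 3 docs, 11 terms
print(sorted(vectorizer.vocabulary_)) # the learned vocabulary
```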
4. Model Training
For the classification task, the Multinomial Naive Bayes classifier is selected due to its effectiveness in
dealing with text data. The dataset is split into training and testing sets, with 80% of the data used for
training and 20% for testing. The model is then trained on the training set and evaluated on the testing
set to assess its accuracy.
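The split-and-train step might look as follows; the corpus and labels here are synthetic stand-ins for the real data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Tiny illustrative corpus; real training would use the full dataset.
texts = [
    "stock market rises", "shares fall on earnings", "budget boosts stocks",
    "team wins the cup", "striker scores twice", "coach praises defence",
] * 5    # repeated so an 80/20 split leaves examples in both sets
labels = ["business", "business", "business", "sports", "sports", "sports"] * 5

X = CountVectorizer().fit_transform(texts)
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.2, random_state=42, stratify=labels
)

clf = MultinomialNB()
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))   # accuracy on the held-out 20%
```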
5. Evaluation
The performance of the model is evaluated using various metrics, including accuracy, confusion matrix,
and classification report. Cross-validation is also employed to validate the model's performance across
different subsets of the data. This helps ensure that the model is not overfitting and can generalize well
to unseen data.
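These evaluation steps can be sketched as follows, again on synthetic data; on the real dataset the scores would of course differ.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB

texts = ["markets rally", "shares slump", "team wins", "coach resigns"] * 6
y = ["business", "business", "sports", "sports"] * 6

X = CountVectorizer().fit_transform(texts)
clf = MultinomialNB().fit(X, y)
pred = clf.predict(X)

print(confusion_matrix(y, pred))       # rows: true class, columns: predicted
print(classification_report(y, pred))  # precision, recall, F1 per class

scores = cross_val_score(clf, X, y, cv=3)   # 3-fold cross-validation
print(scores.mean())                        # mean accuracy across folds
```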
6. Hyperparameter Tuning
To further enhance the model's performance, hyperparameter tuning is conducted using Randomized
Search CV. This process identifies the best combination of parameters for the feature extraction and
classification algorithms, optimizing model accuracy.
Once the best model is obtained, it is evaluated against the test set. The results are visualized using bar
charts to compare correct and wrong predictions, providing insights into the model's performance and
areas for improvement.
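A sketch of the tuning step using scikit-learn's RandomizedSearchCV; the search space mirrors the parameters discussed above, but the exact ranges, n_iter, and the toy data are assumptions.

```python
from scipy.stats import uniform
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import RandomizedSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

texts = [
    "markets rally today", "shares slump badly",
    "team wins cup", "striker scores goal",
] * 6
y = ["business", "business", "sports", "sports"] * 6

pipe = Pipeline([
    ("vect", CountVectorizer()),
    ("clf", MultinomialNB()),
])

# Search space mirrors the parameters named above; exact ranges are assumptions.
param_dist = {
    "vect__max_features": [None, 1000, 5000],
    "vect__ngram_range": [(1, 1), (1, 2)],
    "clf__alpha": uniform(0.01, 2.0),    # smoothing strength
}

search = RandomizedSearchCV(pipe, param_dist, n_iter=5, cv=3, random_state=42)
search.fit(texts, y)

print(search.best_params_)   # best combination found by the random search
print(search.best_score_)    # its mean cross-validated accuracy
```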
Conclusion
The proposed system effectively utilizes various NLP and machine learning techniques to classify news
reports into different categories. By implementing a systematic approach to data preprocessing, feature
extraction, model training, and evaluation, the system achieves a robust classification performance.
Future work could involve exploring additional machine learning algorithms and more advanced NLP
techniques, such as transformer models, to further enhance classification accuracy and insight extraction
from news articles.
CHAPTER 4 : IMPLEMENTATION AND RESULT ANALYSIS
We present an analysis of the results obtained from the classification of news articles using a
Multinomial Naive Bayes model with a Bag of Words (BOW) feature extraction approach. The
performance metrics were evaluated based on the accuracy, classification report, and
visualizations of prediction outcomes.
The dataset consisted of news articles that underwent extensive preprocessing, as detailed below.
Upon loading the dataset from the English news articles, a thorough analysis was conducted to
understand its structure:
• Dataset Shape and Quality: The initial exploration revealed the dataset's dimensions,
providing insights into the number of entries and features. The first few rows were
printed to visualize the structure and content of the data, ensuring its relevance and
integrity.
• Missing Values and Duplicates: An assessment for missing values was performed,
revealing any gaps that could affect model training. Similarly, checks for duplicate
entries were essential to maintain the quality of the dataset. The absence of significant
missing values or duplicates suggested that the dataset was largely clean and ready for
further processing.
One significant aspect of the dataset was the presence of class imbalance, where some
categories had significantly fewer instances than others.
To prepare the data for training, several preprocessing steps were implemented, ensuring that
the text data was clean and suitable for the model:
1. Text Cleaning:
◦ HTML Tags: Any HTML tags present in the news articles were removed using
the BeautifulSoup library to prevent interference during analysis.
◦ URLs and Emojis: URLs were stripped from the content to focus solely on the
textual information. Emojis were also removed, as they can introduce ambiguity
and noise in textual data.
◦ Punctuation and Stopwords: The removal of punctuation and common
stopwords further refined the text, allowing the model to focus on more
meaningful words that contribute to classification.
2. Text Normalization:
◦ Label Encoding: The target labels (news categories) were encoded numerically
using LabelEncoder, making them suitable for model training.
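A condensed sketch of the cleaning and encoding steps above. The report uses BeautifulSoup and a standard stopword list; to keep this sketch self-contained, a regex stands in for the HTML parser and a tiny hand-picked stopword set is used.

```python
import re
import string

from sklearn.preprocessing import LabelEncoder

# Stand-ins: a regex replaces BeautifulSoup for tag stripping, and a
# tiny hand-picked stopword set replaces a full stopword list.
STOPWORDS = {"the", "a", "an", "of", "in", "on", "and", "to"}

def clean_text(text: str) -> str:
    text = re.sub(r"<[^>]+>", " ", text)              # strip HTML tags
    text = re.sub(r"https?://\S+", " ", text)         # strip URLs
    text = text.encode("ascii", "ignore").decode()    # drop emojis/non-ASCII
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = [w for w in text.lower().split() if w not in STOPWORDS]
    return " ".join(tokens)

cleaned = clean_text("<p>PM speaks on the budget! See https://example.com \U0001F389</p>")
print(cleaned)   # prints: pm speaks budget see

# Encode the category labels numerically for model training
le = LabelEncoder()
codes = le.fit_transform(["sports", "business", "sports"])
print(codes.tolist())   # prints [1, 0, 1]: classes are sorted alphabetically
```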
After preprocessing, the data was split into training and testing sets to assess model
performance effectively:
• Model Selection: The Multinomial Naive Bayes algorithm was chosen for its suitability
for text classification tasks. It operates based on the probability of words occurring in
different classes, making it ideal for this type of problem.
• Model Training: The model was trained using a pipeline that combined
CountVectorizer for transforming text into feature vectors and the Naive Bayes
classifier for categorization.
• Accuracy Assessment:
◦ The classification report provided detailed metrics, including precision, recall,
and F1-score for each category. These metrics are crucial for understanding the
model's performance across different classes and identifying areas needing
improvement, particularly for categories that may have lower precision or recall.
Cross-Validation and Hyperparameter Tuning
To ensure the model's robustness and generalizability, additional evaluation methods were
employed:
• Cross-Validation:
◦ A bar chart comparing correct and incorrect predictions illustrated the model's
effectiveness visually. The chart showed a substantial number of correct
predictions (green) relative to incorrect ones (red), reinforcing the model's
reliability.
• Final DataFrame of Predictions:
The project effectively demonstrated the application of NLP techniques for categorizing news
reports. Through careful data preprocessing, model selection, and hyperparameter tuning, a
robust classification system was developed. The achieved accuracy and cross-validation results
underscore the potential of using machine learning in text classification tasks.
The cross-validation process using Stratified K-Folds revealed consistent performance across
different data splits. The accuracy scores for each of the three folds were as follows:
• Fold 1: 0.8806
• Fold 2: 0.8800
• Fold 3: 0.8806
This led to a mean accuracy of 0.88, indicating that the model was stable and performed well
regardless of the specific data split. The cross-validation helped ensure that the model's
performance was not overly dependent on any single train-test split.
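The fold-wise evaluation can be reproduced in outline with scikit-learn's StratifiedKFold; the three-fold setup matches the report, while the data here is a synthetic stand-in.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB

texts = ["markets rally", "shares slump", "team wins", "coach resigns"] * 9
y = np.array(["business", "business", "sports", "sports"] * 9)

X = CountVectorizer().fit_transform(texts)

# Three stratified folds, as in the report; each fold preserves the
# class proportions of the full dataset.
skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
scores = cross_val_score(MultinomialNB(), X, y, cv=skf)

for i, s in enumerate(scores, 1):
    print(f"Fold {i}: {s:.4f}")
print(f"Mean accuracy: {scores.mean():.2f}")
```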
Fine-tuning
The randomized search identified the following best parameters:
• max_features: None
• ngram_range: (1, 2)
• alpha: 1.92
This optimized model was then re-trained on the training data, and when tested, it achieved a
significant improvement in accuracy, reaching 0.933 on the test set.
Error Analysis
To further analyze the model's performance, the number of correct and incorrect predictions
was examined:
A visualization of the prediction outcomes highlighted the proportion of correct versus wrong
predictions, with correct predictions (green) vastly outnumbering incorrect ones (red).
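Such a chart can be produced with matplotlib; the counts below are the fine-tuned model's results reported in this chapter.

```python
import matplotlib
matplotlib.use("Agg")   # render off-screen, no display needed
import matplotlib.pyplot as plt

# Counts reported for the fine-tuned model in this chapter.
correct, wrong = 37246, 2696

fig, ax = plt.subplots()
ax.bar(["Correct", "Wrong"], [correct, wrong], color=["green", "red"])
ax.set_ylabel("Number of predictions")
ax.set_title("Correct vs. wrong predictions")
fig.savefig("prediction_outcomes.png")

print(round(correct / (correct + wrong), 3))   # prints 0.933, the implied accuracy
```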
3. Model Accuracy: After fine-tuning, the optimized model achieved an accuracy of
0.933, with 37,246 correct predictions and 2,696 wrong predictions, reflecting a highly
effective classification system.
4. Potential for Improvement: Exploring other machine learning models (e.g., Support
Vector Machines or neural networks) might further enhance accuracy and reduce
misclassification.
Performance Metrics:
1. Accuracy
The initial accuracy score of the Multinomial Naive Bayes model with the Bag of Words
representation was approximately 0.984. This indicates that the model accurately
classified about 98.4% of the news articles in the test set, reflecting a robust
performance.
2. Hyperparameter Optimization
• max_features: None
• ngram_range: (1, 2)
• alpha: 0.1116
Using these hyperparameters significantly improved the model’s accuracy, achieving a
score of 0.984. This demonstrates the model's ability to generalize well with the selected
parameters.
3. Cross-Validation Scores
A bar chart was generated to illustrate the distribution of correct and wrong predictions. The
chart clearly displayed the high number of correct predictions compared to the relatively few
incorrect ones, reinforcing the model's effectiveness.
1. Bag of Words (BOW) vs. TF-IDF: The BOW approach outperformed the TF-IDF
method in this scenario, leading to a higher accuracy. This suggests that for this specific
dataset and classification task, the simpler BOW representation provided more relevant
features for the Naive Bayes model.
2. High Correct Predictions: The model's ability to achieve 39,300 correct predictions
with only 642 incorrect predictions indicates its reliability. Such performance is
promising for real-world applications, where high accuracy is essential for credibility.
3. Potential for Other Models: While the Multinomial Naive Bayes model performed
exceptionally well, exploring other machine learning models could potentially yield even
higher accuracy and correct predictions. Models such as Logistic Regression, Random
Forest, or advanced deep learning approaches may enhance performance further,
especially in more complex datasets.
CONCLUSION
Despite advancements, text classification faces challenges such as handling class imbalance, addressing
noisy data, and effectively representing text data in a form suitable for machine learning models.
Researchers continue to explore innovative techniques for feature extraction, such as Word Embeddings
(Word2Vec, GloVe) and Transformers (BERT, GPT), which improve classification accuracy and
contextual understanding.
The findings from this project underscore the effectiveness of using the Bag of Words (BOW) model
over the Term Frequency-Inverse Document Frequency (TF-IDF) method, demonstrating that BOW
achieved a significantly higher accuracy in text classification tasks. Through careful hyperparameter
tuning, specifically setting max_features to None, ngram_range to (1, 2), and alpha to 0.1116, the
model attained an impressive accuracy of 98.4%.
The performance metrics revealed that the model made 39,300 correct predictions, indicating a robust
classification capability, while only 642 predictions were incorrect. This high ratio of correct predictions
suggests that the model effectively captured the nuances of the dataset, providing a strong foundation
for practical applications.
Furthermore, the exploration of alternative models could potentially yield even higher accuracy and
correct prediction rates. This highlights the importance of continual experimentation and optimization in
the field of text classification. Overall, these insights affirm the value of strategic model selection and
parameter tuning in achieving superior performance in NLP tasks, paving the way for future research
and advancements in this domain.
In summary, text classification is a crucial aspect of NLP, facilitating the organization and interpretation
of vast amounts of textual information. As research and technology continue to progress, the methods
and applications of text classification are expected to expand, further enhancing its significance in
various domains.
REFERENCES
8. Sentiment Analysis and Text Classification Using Machine Learning Techniques
https://ieeexplore.ieee.org/document/8671886
ACKNOWLEDGEMENT
I am especially thankful to the Computer Department for their invaluable guidance throughout this
project. The faculty’s dedication to student development has significantly influenced my learning
experience.
My sincere appreciation goes to Dr. Phiroj Shaikh, HOD and subject in-charge, whose continuous
encouragement and insightful feedback played a crucial role in shaping this work. His expertise
provided me with the clarity and direction needed to navigate the complexities of this project.
I also wish to acknowledge the contributions of my team members, whose collaboration and diverse
perspectives enriched this project significantly. Lastly, I extend my thanks to my family and friends for
their unwavering support and motivation during this journey. Thank you all for your invaluable
contributions.
1. Gavin Pereira B. E. – 47
2. Nicole Saldanha B. E. – 55
3. Ananya Solanki B. E. – 66