
Government Polytechnic, Pune

(An Autonomous Institute of Government of Maharashtra)

A Micro Project Report on

“Text Classification of News Article using Python”

Under The Subject

DATA MINING [CM5107]


COMPUTER ENGINEERING

Submitted By

2106036 – Ritesh Ganesh Chinchole
2106039 – Rushikesh Pundlik Dawale
2106040 – Shreyas Sunil Deobhankar
2106046 – Vishal Sandeepan Devkate

Under The Guidance Of

Dr. S. B. Nikam
Department of Computer Engineering
Government Polytechnic Pune
YEAR 2023-24

CERTIFICATE

This is to certify that the mini-project work entitled

“Text Classification of News Article using Python”

is a bona fide work carried out by

2106036 – Ritesh Chinchole
2106039 – Rushikesh Dawale
2106040 – Shreyas Deobhankar
2106046 – Vishal Devkate

of Third Year, in partial fulfillment of the requirements for the completion of the course Data Mining (CM5107) – EVEN 2024 of the Diploma in Computer Engineering from Government Polytechnic, Pune. The report has been approved as it satisfies the academic requirements in respect of Micro Project work prescribed for the course.

Micro Project Guide Head of Department Principal

(Dr. S. B. Nikam) (Mrs. J. R. Hange) (Dr. R. K. Patil)



ACKNOWLEDGEMENT
First and foremost, we express our heartfelt gratitude to all those who contributed to the
success of this endeavor. Our sincere appreciation goes to Government Polytechnic Pune
for providing us with guidance and invaluable advice throughout this journey. We extend
special thanks and regards to our mentor, Dr. S. B. Nikam Sir, whose unwavering support
and encouragement have been instrumental in our progress.

Furthermore, we would like to acknowledge the leadership of our esteemed institution, under the guidance of Principal Rajendra Patil Sir. His visionary leadership and unwavering commitment to excellence have inspired us to strive for greatness. We are also grateful to our Head of Department, Mrs. J. R. Hange, for her guidance and support in all our endeavors.

It is through the collective efforts of these esteemed individuals that we have been able to
develop the skills and confidence necessary to excel as third-year students. Their support
and enthusiasm have been a source of motivation for us throughout this journey.

As we reflect on our achievements, we recognize the invaluable role played by each member of our academic community. Together, we have achieved milestones and overcome challenges, and for that, we are truly grateful. We look forward to continuing our journey with the same dedication and enthusiasm, fueled by the unwavering support of our mentors and institution.

ABSTRACT

This report presents a Micro Project focused on text classification of news articles using
Python, submitted as part of the requirements for the Data Mining course. The report
provides a comprehensive overview of text classification techniques, specifically
focusing on the application of Python for this purpose.

The report delves into the fundamentals of text classification, elucidating its significance
and relevance in the context of processing large volumes of news articles. It highlights
the importance of accurate classification in effectively organizing and analyzing vast
amounts of textual data.

A primary focus of the report is the implementation of text classification using Python, with a specific emphasis on the Decision Tree algorithm. The report provides a detailed discussion of the Decision Tree algorithm's principles, methodology, and application in classifying news articles.

Through clear and concise explanations, the report aims to equip readers with a solid
understanding of text classification techniques, as well as practical insights into
implementing these techniques using Python. Overall, this report serves as a valuable
resource for individuals seeking to explore and apply text classification methodologies in
the domain of news article analysis.

INDEX

Sr. No  CONTENT
1  Acknowledgement
2  Abstract
3  Chapter 1: Introduction
4  Chapter 2: Exploring Text Classification Techniques
5  Chapter 3: Data Collection and Preprocessing
6  Chapter 4: Visualization for Text Classification
7  Chapter 5: Overview of Text Classification Algorithms
8  Chapter 6: Text Classification Pipeline
9  Chapter 7: Code and Snippets
10  Conclusion
11  Reference

Chapter 1: Introduction

Data Mining:

Data mining is the process of extracting useful patterns, insights, and knowledge from
large datasets. It involves various steps such as data cleaning, integration, selection,
transformation, data mining, evaluation, presentation, and knowledge extraction.

Knowledge Discovery in Databases (KDD) Process:


The KDD process encompasses several stages:

Understanding the Domain and Defining Goals: Identifying the problem domain
and specifying the objectives of the data mining project.
Data Cleaning: Removing noise and inconsistencies from the datasets to ensure data
quality.
Data Integration: Combining data from multiple sources to create a unified dataset.
Data Selection: Selecting relevant subsets of data for analysis based on the project
goals.
Data Transformation: Converting and normalizing data into suitable formats for
analysis.
Data Mining: Applying various algorithms and techniques to discover patterns,
associations, or trends in the data.
Evaluation: Assessing the quality and effectiveness of the discovered patterns using
metrics and validation techniques.
Presentation: Communicating the findings and insights to stakeholders through
visualizations, reports, or presentations.
Knowledge Extraction: Deriving actionable knowledge and insights from the
discovered patterns to support decision-making.

KDD in Data Mining

Data mining is the act of automatically searching large stores of information to find trends and patterns that go beyond simple analysis procedures. Data mining utilizes complex mathematical algorithms to segment data and evaluate the probability of future events. Data mining is also called Knowledge Discovery in Databases (KDD).

Data Mining is a process used by organizations to extract specific data from huge
databases to solve business problems. It primarily turns raw data into useful
information.

Data mining is similar to data science in that it is carried out by a person, in a specific situation, on a particular data set, with an objective. The process includes various types of services such as text mining, web mining, audio and video mining, pictorial data mining, and social media mining. It is done through software that may be simple or highly specialized. By outsourcing data mining, all the work can be done faster and at lower operating cost. Specialized firms can also use new technologies to collect data that is impossible to locate manually.

Example:
A retail company analyzes customer purchase data to tailor marketing strategies based on
purchase patterns and customer segments, leading to increased sales and customer
satisfaction.

Chapter 2: Exploring Text Classification Techniques

Classification
Text classification techniques play a crucial role in natural language processing (NLP) by
automatically assigning predefined categories or labels to textual data based on its
content. These techniques encompass a range of methodologies, including rule-based
systems, machine learning algorithms, and deep learning approaches. Rule-based systems
rely on predefined rules or patterns to classify text, while machine learning algorithms
leverage labeled training data to learn patterns and make predictions. Deep learning
techniques, such as recurrent neural networks (RNNs) and convolutional neural networks
(CNNs), excel at capturing intricate patterns in text data.

Previous studies and research in text classification of news articles have explored various
aspects, including feature selection, model performance optimization, and domain-
specific challenges. Researchers have investigated the effectiveness of different
algorithms, feature engineering techniques, and data preprocessing methods to enhance
classification accuracy and robustness.

Reviewing relevant Python libraries and tools for text classification reveals a rich
ecosystem that facilitates the development and deployment of classification models.
Libraries like NLTK (Natural Language Toolkit), Scikit-learn, and TensorFlow offer a
wide range of functionalities for text preprocessing, feature extraction, model training,
and evaluation. Additionally, tools like spaCy provide efficient linguistic annotations and
preprocessing capabilities, while Gensim offers advanced topic modeling and similarity
analysis functionalities. The versatility and accessibility of these libraries and tools
empower researchers and practitioners to efficiently tackle text classification tasks with
Python.
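
As a minimal sketch of how two of these libraries fit together, the snippet below uses NLTK's stopword list to filter common words before scikit-learn's TfidfVectorizer builds numerical features; the sample headlines are hypothetical placeholders.

# A minimal sketch (not the project's full pipeline): NLTK supplies a
# stopword list, scikit-learn turns the cleaned text into TF-IDF features.
import nltk
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer

nltk.download('stopwords', quiet=True)  # fetch the stopword list once
stop_words = stopwords.words('english')

# Hypothetical sample headlines used only for illustration
docs = ["Stock markets rally after earnings reports",
        "Local team wins the championship final"]

vectorizer = TfidfVectorizer(stop_words=stop_words, lowercase=True)
features = vectorizer.fit_transform(docs)  # sparse document-term matrix
print(features.shape, len(vectorizer.get_feature_names_out()))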
Chapter 3: Data Collection and Preprocessing

Description of the Dataset:


The dataset used in this project comprises news articles collected from
various sources, covering diverse topics such as business, technology,
politics, sports, and entertainment. Each article is labeled with its
corresponding category, allowing for supervised learning in text
classification tasks.

Data Collection Methods:


The dataset was obtained through web scraping techniques from reputable
news websites and archives. It involved retrieving articles from multiple
sources, ensuring a broad coverage of topics and perspectives.
Additionally, data augmentation techniques may have been employed to
increase the diversity and size of the dataset.
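
A minimal sketch of such a scraping step is shown below; the URL and the tags selected are hypothetical placeholders, not the actual sources used for this dataset.

# Hypothetical scraping sketch using requests and BeautifulSoup
import requests
from bs4 import BeautifulSoup

url = 'https://example.com/news/article-1'  # placeholder article URL
html = requests.get(url, timeout=10).text
soup = BeautifulSoup(html, 'html.parser')

# Take the headline from the first <h1> and join <p> tags into the body
title_tag = soup.find('h1')
title = title_tag.get_text(strip=True) if title_tag else ''
body = ' '.join(p.get_text(strip=True) for p in soup.find_all('p'))
print(title, body[:100])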

Preprocessing Steps:
1. Removing HTML Tags: Initial preprocessing involved stripping HTML tags from the text data, as web-scraped articles often contain HTML formatting that is irrelevant for analysis. This step ensures that only the textual content of the articles is retained.

2. Handling Special Characters: Special characters, such as punctuation marks and symbols, were addressed next. These characters can introduce noise and hinder the effectiveness of text analysis algorithms. Therefore, they were either removed or replaced with appropriate representations.

3. Text Lowercasing: To standardize the text and avoid redundancy in the feature space, all words were converted to lowercase. This step ensures that words with the same meaning but different cases (e.g., "Text" and "text") are treated identically during analysis.

4. Stopword Removal: Stopwords, common words that contribute little to the overall meaning of the text (e.g., "the," "is," "and"), were eliminated from the dataset. This helps reduce noise and computational complexity during analysis.

5. Lemmatization: Finally, lemmatization was applied to reduce words to their base or root form. This step ensures that variations of words (e.g., "running," "ran," "runs") are treated as the same word, thereby improving the accuracy and interpretability of the classification model.

By meticulously performing these preprocessing steps, the dataset was refined and prepared for subsequent stages of analysis, such as feature extraction and model training. This ensures that the text classification model can effectively learn meaningful patterns and relationships from the data, ultimately enhancing its predictive performance. A sketch of helper functions implementing these steps follows.
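
The following is one possible implementation of the five steps above, written as small helper functions. The function names match those called in the Chapter 7 code (remove_tags, special_char, convert_lower, remove_stopwords, lemmatize_word); the bodies are an illustrative sketch, not the project's verbatim code.

import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download('stopwords', quiet=True)
nltk.download('wordnet', quiet=True)

def remove_tags(text):
    # Step 1: strip HTML tags left over from web scraping
    return re.sub(r'<[^>]+>', '', text)

def special_char(text):
    # Step 2: keep only letters, digits, and spaces
    return re.sub(r'[^A-Za-z0-9 ]', ' ', text)

def convert_lower(text):
    # Step 3: lowercase so "Text" and "text" are treated identically
    return text.lower()

def remove_stopwords(text):
    # Step 4: drop common words like "the", "is", "and"
    stop_words = set(stopwords.words('english'))
    return ' '.join(w for w in text.split() if w not in stop_words)

def lemmatize_word(text):
    # Step 5: reduce words to a base form (e.g., "runs" -> "run")
    lemmatizer = WordNetLemmatizer()
    return ' '.join(lemmatizer.lemmatize(w) for w in text.split())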

Data Preprocessing

Chapter 4: Visualization for Text Classification

Bar Charts for Text Classification:

Bar Graph

Bar charts visually represent the distribution of categories within a text classification
dataset. Each bar corresponds to a category, like business or sports, and its height
indicates the frequency of articles in that category. This visualization helps analysts
understand the dataset's category distribution quickly.

Pie Charts for Text Classification:

Pie Charts

Pie charts offer a visual breakdown of category proportions in a text classification dataset. Each slice represents a category, with its size indicating the proportion of articles in that category. Pie charts provide a clear overview of category distribution and help identify dominant topics in the dataset.
Page|

Word Clouds for Text Classification:

Word Cloud

Word clouds display the most common terms in a text corpus, with word size indicating
frequency. In text classification, word clouds for each category highlight prevalent terms
associated with topics like business or politics. They provide a quick glimpse into the
prominent themes within each category, aiding in interpretation and analysis.
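
As a minimal sketch of these three visualizations, assuming a pandas DataFrame named dataset with 'Text' and 'Category' columns (as used in Chapter 7) and the third-party wordcloud package:

import matplotlib.pyplot as plt
from wordcloud import WordCloud

# Category distribution as a bar chart and a pie chart
counts = dataset['Category'].value_counts()
counts.plot(kind='bar', title='Articles per category')
plt.show()
counts.plot(kind='pie', autopct='%1.1f%%', title='Category proportions')
plt.show()

# One word cloud per category, built from that category's article text
for category in counts.index:
    text = ' '.join(dataset.loc[dataset['Category'] == category, 'Text'])
    wc = WordCloud(width=800, height=400, background_color='white').generate(text)
    plt.imshow(wc, interpolation='bilinear')
    plt.axis('off')
    plt.title(category)
    plt.show()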

Chapter 5: Overview of Text Classification Algorithms

Decision Tree:
A decision tree is a tree-like structure where each internal node represents a feature, each branch represents a decision based on that feature, and each leaf node represents the outcome or class label. It is a popular algorithm for classification tasks due to its simplicity and interpretability.
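
To make the node/branch/leaf structure concrete, the sketch below fits scikit-learn's DecisionTreeClassifier on a tiny hypothetical dataset and prints the learned tree as text; the feature values and labels are invented for illustration.

from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical features: [word_count, contains_score_keyword]
X = [[120, 1], [90, 1], [400, 0], [350, 0]]
y = ['sports', 'sports', 'business', 'business']

clf = DecisionTreeClassifier(random_state=0).fit(X, y)
# Internal nodes test a feature, branches are the decisions,
# and leaves carry the predicted class label
print(export_text(clf, feature_names=['word_count', 'has_score_word']))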

Logistic Regression:
Logistic regression is a linear model used for binary
classification. It estimates the probability that a given input
belongs to a particular category. Despite its name, it is
commonly used for classification tasks and is known for its
simplicity and efficiency.

Random Forest:
Random forest is an ensemble learning method that
constructs multiple decision trees during training and
outputs the mode of the classes for classification. It
improves upon decision trees by reducing overfitting and
increasing accuracy through averaging predictions from
multiple trees.

Naive Bayes:
Naive Bayes is a probabilistic classifier based on Bayes'
theorem with the assumption of independence between
features. It is particularly effective for text classification
tasks and is known for its simplicity, speed, and scalability.
Support Vector Machines (SVM):
Support vector machines are powerful supervised learning
models used for classification and regression tasks. SVMs find
the hyperplane that best separates classes in high-dimensional
space. They are effective in cases with complex decision
boundaries and are versatile due to the use of different kernel
functions.

K-Nearest Neighbors (KNN):
K-nearest neighbors is a simple and intuitive algorithm that classifies objects based on the majority class among their k nearest neighbors in feature space. It is non-parametric and lazy-learning, meaning it does not make assumptions about the underlying data distribution and does not require an explicit training phase.

Chapter 6: Text Classification Pipeline: Data Preparation, Model Building, and Evaluation

The implementation of the text classification pipeline involves three main steps: data preparation, model building, and model evaluation.

1. Data Preparation: In this step, the raw text data is preprocessed to make it suitable for model training. This includes tasks such as removing HTML tags, handling special characters, converting text to lowercase, removing stopwords, and lemmatization. Additionally, the text data is vectorized using techniques like CountVectorizer or TF-IDF vectorization to convert it into numerical feature vectors.

2. Model Building: Once the data is prepared, various classification models are built using machine learning algorithms such as Logistic Regression, Random Forest, Naive Bayes, Support Vector Machines (SVM), Decision Trees, and K-Nearest Neighbors (KNN). Each model is trained on the preprocessed text data to learn patterns and associations between the features and the target labels.

3. Model Evaluation: After training, the models are evaluated using metrics such as accuracy, precision, recall, and F1-score to assess their performance on unseen data. This involves splitting the dataset into training and testing sets, making predictions on the test data using the trained models, and comparing the predicted labels with the actual labels to calculate evaluation metrics.

Overall, the text classification pipeline involves preparing the data, building machine learning models, and evaluating their performance to determine the most effective approach for classifying text data into predefined categories.
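
A minimal end-to-end sketch of this pipeline, assuming a DataFrame named dataset with 'Text' and 'Category' columns (as in Chapter 7): scikit-learn's Pipeline chains the vectorizer and one classifier, and classification_report prints accuracy, precision, recall, and F1-score in one call.

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

X_train, X_test, y_train, y_test = train_test_split(
    dataset['Text'], dataset['Category'],
    test_size=0.3, random_state=0)

# Vectorization and classification chained into a single estimator
pipe = Pipeline([
    ('tfidf', TfidfVectorizer(max_features=5000)),
    ('clf', LogisticRegression(max_iter=1000)),
])
pipe.fit(X_train, y_train)

# Per-category precision, recall, and F1-score, plus overall accuracy
print(classification_report(y_test, pipe.predict(X_test)))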

Chapter 7: Code and Snippets

Python Code:

# 'dataset' is assumed to be a pandas DataFrame with 'Text' and 'Category'
# columns, loaded earlier; the preprocessing helpers (remove_tags, etc.)
# are those described in Chapter 3.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB, GaussianNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_recall_fscore_support as score

y = dataset['Category']

# Vectorize text data (GaussianNB needs a dense array, hence .toarray())
cv = CountVectorizer(max_features=5000)
x = cv.fit_transform(dataset['Text']).toarray()

# Train-test split
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.3, random_state=0, shuffle=True)
print(len(x_train))
print(len(x_test))

perform_list = []

def run_model(model_name, est_c, est_pnlty):
    # est_c and est_pnlty are unused placeholders kept from the original code
    mdl = None
    if model_name == 'Logistic Regression':
        mdl = LogisticRegression()
    elif model_name == 'Random Forest':
        mdl = RandomForestClassifier(n_estimators=100, criterion='entropy',
                                     random_state=0)
    elif model_name == 'Multinomial Naive Bayes':
        mdl = MultinomialNB(alpha=1.0, fit_prior=True)
    elif model_name == 'Support Vector Classifier':
        mdl = SVC()
    elif model_name == 'Decision Tree Classifier':
        mdl = DecisionTreeClassifier()
    elif model_name == 'K Nearest Neighbour':
        mdl = KNeighborsClassifier(n_neighbors=10, metric='minkowski', p=4)
    elif model_name == 'Gaussian Naive Bayes':
        mdl = GaussianNB()

    # Fit the model and predict on the held-out test set
    mdl.fit(x_train, y_train)
    y_pred = mdl.predict(x_test)

    # Performance metrics
    accuracy = round(accuracy_score(y_test, y_pred) * 100, 2)

    # Get precision, recall, f1 scores (micro-averaged across categories)
    precision, recall, f1score, support = score(y_test, y_pred, average='micro')

    print(f'Test Accuracy Score of Basic {model_name}: {accuracy}%')
    print(f'Precision : {precision}')
    print(f'Recall : {recall}')
    print(f'F1-score : {f1score}')

    # Add performance parameters to list
    perform_list.append({
        'Model': model_name,
        'Test Accuracy': round(accuracy, 2),
        'Precision': round(precision, 2),
        'Recall': round(recall, 2),
        'F1': round(f1score, 2),
    })

# Call run_model for each model
run_model('Logistic Regression', est_c=None, est_pnlty=None)
run_model('Random Forest', est_c=None, est_pnlty=None)
run_model('Multinomial Naive Bayes', est_c=None, est_pnlty=None)
run_model('Support Vector Classifier', est_c=None, est_pnlty=None)
run_model('Decision Tree Classifier', est_c=None, est_pnlty=None)
run_model('K Nearest Neighbour', est_c=None, est_pnlty=None)
run_model('Gaussian Naive Bayes', est_c=None, est_pnlty=None)

model_performance = pd.DataFrame(data=perform_list)
model_performance = model_performance[['Model', 'Test Accuracy',
                                       'Precision', 'Recall', 'F1']]

# Report the best-performing model by test accuracy
best_row = model_performance.loc[model_performance['Test Accuracy'].idxmax()]
print('The best accuracy of model is', best_row['Test Accuracy'],
      'from', best_row['Model'])

# Fit RandomForestClassifier as the final model
classifier = RandomForestClassifier(n_estimators=100, criterion='entropy',
                                    random_state=0)
classifier.fit(x_train, y_train)

# Example prediction using the preprocessing helpers from Chapter 3
example_text = ('Hour ago, I contemplated retirement for a lot of reasons. '
                'I felt like people were not sensitive enough to my injuries. '
                'I felt like a lot of people were backed, why not me? I have '
                'done no less. I have won a lot of games for the team, and I '
                'am not feeling backed, said Ashwin')
example_text = remove_tags(example_text)
example_text = special_char(example_text)
example_text = convert_lower(example_text)
example_text = remove_stopwords(example_text)
example_text = lemmatize_word(example_text)
example_text_vector = cv.transform([example_text])
predicted_category = classifier.predict(example_text_vector)
print(f'Predicted category for example text: {predicted_category[0]}')

Output:

1) Metadata
2) Bar Graph
3) Pie Chart
4) Word Clouds (one per category)

5) Algorithm Results:

i) Basic Logistic Regression
ii) Basic Random Forest
iii) Basic Multinomial Naive Bayes
iv) Basic Support Vector Classifier
v) Basic Decision Tree Classifier
vi) Basic K Nearest Neighbour
vii) Basic Gaussian Naive Bayes

6) Best Algorithm

7) Final Prediction
CONCLUSION

In summary, the text classification of news articles employs a structured pipeline involving data
preparation, model building, and evaluation. Through the utilization of various classification
algorithms and evaluation metrics, accurate categorization of news articles is achievable. This
process facilitates efficient information retrieval and analysis, contributing to informed decision-
making and enhancing user experience in accessing news content.

REFERENCE
• Kaggle (www.kaggle.com)
• Coursera
• Han, J., Kamber, M., & Pei, J., Data Mining: Concepts and Techniques
