
Government Polytechnic, Pune

(An Autonomous Institute of Government of Maharashtra)

A Micro Project Report on

“Text Classification of News Article using Python”

Under The Subject

DATA MINING [CM5107]


COMPUTER ENGINEERING

Submitted By

2106036 – Ritesh Ganesh Chinchole
2106039 – Rushikesh Pundlik Dawale
2106040 – Shreyas Sunil Deobhankar
2106046 – Vishal Sandeepan Devkate

Under The Guidance Of

Dr. S. B. Nikam
Department of Computer Engineering
Government Polytechnic Pune
YEAR 2023-24

CERTIFICATE

This is to certify that the mini-project work entitled

“Text Classification of News Article using Python”

is a bona fide work carried out by

2106036 – Ritesh Chinchole
2106039 – Rushikesh Dawale
2106040 – Shreyas Deobhankar
2106046 – Vishal Devkate

of Third Year, in partial fulfillment of the requirements for the completion of the course Data Mining (CM5107) – EVEN 2024 of the Diploma in Computer Engineering from Government Polytechnic, Pune. The report has been approved as it satisfies the academic requirements in respect of Micro Project work prescribed for the course.

Micro Project Guide Head of Department Principal

(Dr. S. B. Nikam) (Mrs. J. R. Hange) (Dr. R. K. Patil)



ACKNOWLEDGEMENT
First and foremost, we express our heartfelt gratitude to all those who contributed to the
success of this endeavor. Our sincere appreciation goes to Government Polytechnic Pune
for providing us with guidance and invaluable advice throughout this journey. We extend
special thanks and regards to our mentor, Dr. S. B. Nikam Sir, whose unwavering support
and encouragement have been instrumental in our progress.

Furthermore, we would like to acknowledge the leadership of our esteemed institution, under the guidance of Principal Rajendra Patil Sir. His visionary leadership and unwavering commitment to excellence have inspired us to strive for greatness. We are also grateful to our Head of Department, Mrs. J. R. Hange, for her guidance and support in all our endeavors.

It is through the collective efforts of these esteemed individuals that we have been able to
develop the skills and confidence necessary to excel as third-year students. Their support
and enthusiasm have been a source of motivation for us throughout this journey.

As we reflect on our achievements, we recognize the invaluable role played by each member of our academic community. Together, we have achieved milestones and overcome challenges, and for that, we are truly grateful. We look forward to continuing our journey with the same dedication and enthusiasm, fueled by the unwavering support of our mentors and institution.

ABSTRACT

This report presents a Micro Project focused on text classification of news articles using
Python, submitted as part of the requirements for the Data Mining course. The report
provides a comprehensive overview of text classification techniques, specifically
focusing on the application of Python for this purpose.

The report delves into the fundamentals of text classification, elucidating its significance
and relevance in the context of processing large volumes of news articles. It highlights
the importance of accurate classification in effectively organizing and analyzing vast
amounts of textual data.

A primary focus of the report is the implementation of text classification using Python, with a specific emphasis on the Decision Tree algorithm. The report provides a detailed discussion of the Decision Tree algorithm's principles, methodology, and application in classifying news articles.

Through clear and concise explanations, the report aims to equip readers with a solid
understanding of text classification techniques, as well as practical insights into
implementing these techniques using Python. Overall, this report serves as a valuable
resource for individuals seeking to explore and apply text classification methodologies in
the domain of news article analysis.

INDEX

Sr. No  CONTENT
1  Acknowledgement
2  Abstract
3  Chapter 1: Introduction
4  Chapter 2: Exploring Text Classification Techniques
5  Chapter 3: Data Collection and Preprocessing
6  Chapter 4: Visualization for Text Classification
7  Chapter 5: Overview of Text Classification Algorithms
8  Chapter 6: Text Classification Pipeline
9  Chapter 7: Code and Snippets
10  Conclusion
11  Reference

Chapter 1: Introduction

Data Mining:

Data mining is the process of extracting useful patterns, insights, and knowledge from
large datasets. It involves various steps such as data cleaning, integration, selection,
transformation, data mining, evaluation, presentation, and knowledge extraction.

Knowledge Discovery in Databases (KDD) Process:


The KDD process encompasses several stages:

Understanding the Domain and Defining Goals: Identifying the problem domain
and specifying the objectives of the data mining project.
Data Cleaning: Removing noise and inconsistencies from the datasets to ensure data
quality.
Data Integration: Combining data from multiple sources to create a unified dataset.
Data Selection: Selecting relevant subsets of data for analysis based on the project
goals.
Data Transformation: Converting and normalizing data into suitable formats for
analysis.
Data Mining: Applying various algorithms and techniques to discover patterns,
associations, or trends in the data.
Evaluation: Assessing the quality and effectiveness of the discovered patterns using
metrics and validation techniques.
Presentation: Communicating the findings and insights to stakeholders through
visualizations, reports, or presentations.
Knowledge Extraction: Deriving actionable knowledge and insights from the
discovered patterns to support decision-making.

KDD in Data Mining

Data mining is the act of automatically searching large stores of information to find trends and patterns that go beyond simple analysis procedures. Data mining utilizes complex mathematical algorithms to segment data and evaluate the probability of future events. Data mining is also called Knowledge Discovery in Databases (KDD).

Data Mining is a process used by organizations to extract specific data from huge
databases to solve business problems. It primarily turns raw data into useful
information.

Data mining is similar to data science in that it is carried out by a person, in a specific situation, on a particular data set, with an objective. The process includes various types of services such as text mining, web mining, audio and video mining, pictorial data mining, and social media mining. It is done through software that may be simple or highly specialized. By outsourcing data mining, all the work can be done faster and at lower operating cost. Specialized firms can also use new technologies to collect data that is impossible to locate manually.

Example:
A retail company analyzes customer purchase data to tailor marketing strategies based on
purchase patterns and customer segments, leading to increased sales and customer
satisfaction.

Chapter 2: Exploring Text Classification Techniques

Classification
Text classification techniques play a crucial role in natural language processing (NLP) by
automatically assigning predefined categories or labels to textual data based on its
content. These techniques encompass a range of methodologies, including rule-based
systems, machine learning algorithms, and deep learning approaches. Rule-based systems
rely on predefined rules or patterns to classify text, while machine learning algorithms
leverage labeled training data to learn patterns and make predictions. Deep learning
techniques, such as recurrent neural networks (RNNs) and convolutional neural networks
(CNNs), excel at capturing intricate patterns in text data.

Previous studies and research in text classification of news articles have explored various
aspects, including feature selection, model performance optimization, and domain-
specific challenges. Researchers have investigated the effectiveness of different
algorithms, feature engineering techniques, and data preprocessing methods to enhance
classification accuracy and robustness.

Reviewing relevant Python libraries and tools for text classification reveals a rich
ecosystem that facilitates the development and deployment of classification models.
Libraries like NLTK (Natural Language Toolkit), Scikit-learn, and TensorFlow offer a
wide range of functionalities for text preprocessing, feature extraction, model training,
and evaluation. Additionally, tools like spaCy provide efficient linguistic annotations and
preprocessing capabilities, while Gensim offers advanced topic modeling and similarity
analysis functionalities. The versatility and accessibility of these libraries and tools
empower researchers and practitioners to efficiently tackle text classification tasks with
Python.
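
As a minimal sketch of how two of these libraries fit together, the snippet below uses NLTK's stopword list to filter common words before scikit-learn's TfidfVectorizer builds numerical features; the sample headlines are hypothetical placeholders.

# A minimal sketch (not the project's full pipeline): NLTK supplies a
# stopword list, scikit-learn turns the cleaned text into TF-IDF features.
import nltk
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer

nltk.download('stopwords', quiet=True)  # fetch the stopword list once
stop_words = stopwords.words('english')

# Hypothetical sample headlines used only for illustration
docs = ["Stock markets rally after earnings reports",
        "Local team wins the championship final"]

vectorizer = TfidfVectorizer(stop_words=stop_words, lowercase=True)
features = vectorizer.fit_transform(docs)  # sparse document-term matrix
print(features.shape, len(vectorizer.get_feature_names_out()))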
Chapter 3: Data Collection and Preprocessing

Description of the Dataset:


The dataset used in this project comprises news articles collected from
various sources, covering diverse topics such as business, technology,
politics, sports, and entertainment. Each article is labeled with its
corresponding category, allowing for supervised learning in text
classification tasks.

Data Collection Methods:


The dataset was obtained through web scraping techniques from reputable
news websites and archives. It involved retrieving articles from multiple
sources, ensuring a broad coverage of topics and perspectives.
Additionally, data augmentation techniques may have been employed to
increase the diversity and size of the dataset.
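
A minimal sketch of such a scraping step is shown below; the URL and the tags selected are hypothetical placeholders, not the actual sources used for this dataset.

# Hypothetical scraping sketch using requests and BeautifulSoup
import requests
from bs4 import BeautifulSoup

url = 'https://example.com/news/article-1'  # placeholder article URL
html = requests.get(url, timeout=10).text
soup = BeautifulSoup(html, 'html.parser')

# Take the headline from the first <h1> and join <p> tags into the body
title_tag = soup.find('h1')
title = title_tag.get_text(strip=True) if title_tag else ''
body = ' '.join(p.get_text(strip=True) for p in soup.find_all('p'))
print(title, body[:100])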

Preprocessing Steps:
1. Removing HTML Tags: Initial preprocessing involved stripping HTML tags from the text data, as web-scraped articles often contain HTML formatting that is irrelevant for analysis. This step ensures that only the textual content of the articles is retained.

2. Handling Special Characters: Special characters, such as punctuation marks and symbols, were addressed next. These characters can introduce noise and hinder the effectiveness of text analysis algorithms. Therefore, they were either removed or replaced with appropriate representations.

3. Text Lowercasing: To standardize the text and avoid redundancy in the feature space, all words were converted to lowercase. This step ensures that words with the same meaning but different cases (e.g., "Text" and "text") are treated identically during analysis.

4. Stopword Removal: Stopwords, common words that contribute little to the overall meaning of the text (e.g., "the," "is," "and"), were eliminated from the dataset. This helps reduce noise and computational complexity during analysis.

5. Lemmatization: Finally, lemmatization was applied to reduce words to their base or root form. This step ensures that variations of words (e.g., "running," "ran," "runs") are treated as the same word, thereby improving the accuracy and interpretability of the classification model.

By meticulously performing these preprocessing steps, the dataset was refined and prepared for subsequent stages of analysis, such as feature extraction and model training. This ensures that the text classification model can effectively learn meaningful patterns and relationships from the data, ultimately enhancing its predictive performance. A sketch of helper functions implementing these steps follows.
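
The following is one possible implementation of the five steps above, written as small helper functions. The function names match those called in the Chapter 7 code (remove_tags, special_char, convert_lower, remove_stopwords, lemmatize_word); the bodies are an illustrative sketch, not the project's verbatim code.

import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download('stopwords', quiet=True)
nltk.download('wordnet', quiet=True)

def remove_tags(text):
    # Step 1: strip HTML tags left over from web scraping
    return re.sub(r'<[^>]+>', '', text)

def special_char(text):
    # Step 2: keep only letters, digits, and spaces
    return re.sub(r'[^A-Za-z0-9 ]', ' ', text)

def convert_lower(text):
    # Step 3: lowercase so "Text" and "text" are treated identically
    return text.lower()

def remove_stopwords(text):
    # Step 4: drop common words like "the", "is", "and"
    stop_words = set(stopwords.words('english'))
    return ' '.join(w for w in text.split() if w not in stop_words)

def lemmatize_word(text):
    # Step 5: reduce words to a base form (e.g., "runs" -> "run")
    lemmatizer = WordNetLemmatizer()
    return ' '.join(lemmatizer.lemmatize(w) for w in text.split())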

Data Preprocessing

Chapter 4: Visualization for Text Classification

Bar Charts for Text Classification:

Bar Graph

Bar charts visually represent the distribution of categories within a text classification
dataset. Each bar corresponds to a category, like business or sports, and its height
indicates the frequency of articles in that category. This visualization helps analysts
understand the dataset's category distribution quickly.

Pie Charts for Text Classification:

Pie Charts

Pie charts offer a visual breakdown of category proportions in a text classification dataset. Each slice represents a category, with its size indicating the proportion of articles in that category. Pie charts provide a clear overview of category distribution and help identify dominant topics in the dataset.
Page|

Word Clouds for Text Classification:

Word Cloud

Word clouds display the most common terms in a text corpus, with word size indicating
frequency. In text classification, word clouds for each category highlight prevalent terms
associated with topics like business or politics. They provide a quick glimpse into the
prominent themes within each category, aiding in interpretation and analysis.
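
As a minimal sketch of these three visualizations, assuming a pandas DataFrame named dataset with 'Text' and 'Category' columns (as used in Chapter 7) and the third-party wordcloud package:

import matplotlib.pyplot as plt
from wordcloud import WordCloud

# Category distribution as a bar chart and a pie chart
counts = dataset['Category'].value_counts()
counts.plot(kind='bar', title='Articles per category')
plt.show()
counts.plot(kind='pie', autopct='%1.1f%%', title='Category proportions')
plt.show()

# One word cloud per category, built from that category's article text
for category in counts.index:
    text = ' '.join(dataset.loc[dataset['Category'] == category, 'Text'])
    wc = WordCloud(width=800, height=400, background_color='white').generate(text)
    plt.imshow(wc, interpolation='bilinear')
    plt.axis('off')
    plt.title(category)
    plt.show()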

Chapter 5: Overview of Text Classification Algorithms

Decision Tree:
A decision tree is a tree-like structure where each internal node represents a feature, each branch represents a decision based on that feature, and each leaf node represents the outcome or class label. It is a popular algorithm for classification tasks due to its simplicity and interpretability.
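
To make the node/branch/leaf structure concrete, the sketch below fits scikit-learn's DecisionTreeClassifier on a tiny hypothetical dataset and prints the learned tree as text; the feature values and labels are invented for illustration.

from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical features: [word_count, contains_score_keyword]
X = [[120, 1], [90, 1], [400, 0], [350, 0]]
y = ['sports', 'sports', 'business', 'business']

clf = DecisionTreeClassifier(random_state=0).fit(X, y)
# Internal nodes test a feature, branches are the decisions,
# and leaves carry the predicted class label
print(export_text(clf, feature_names=['word_count', 'has_score_word']))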

Logistic Regression:
Logistic regression is a linear model used for binary
classification. It estimates the probability that a given input
belongs to a particular category. Despite its name, it is
commonly used for classification tasks and is known for its
simplicity and efficiency.

Random Forest:
Random forest is an ensemble learning method that
constructs multiple decision trees during training and
outputs the mode of the classes for classification. It
improves upon decision trees by reducing overfitting and
increasing accuracy through averaging predictions from
multiple trees.

Naive Bayes:
Naive Bayes is a probabilistic classifier based on Bayes'
theorem with the assumption of independence between
features. It is particularly effective for text classification
tasks and is known for its simplicity, speed, and scalability.
Support Vector Machines (SVM):
Support vector machines are powerful supervised learning
models used for classification and regression tasks. SVMs find
the hyperplane that best separates classes in high-dimensional
space. They are effective in cases with complex decision
boundaries and are versatile due to the use of different kernel
functions.

K-Nearest Neighbors (KNN):
K-nearest neighbors is a simple and intuitive algorithm that classifies objects based on the majority class among their k nearest neighbors in feature space. It is non-parametric and lazy-learning, meaning it does not make assumptions about the underlying data distribution and does not require an explicit training phase.

Chapter 6: Text Classification Pipeline: Data Preparation, Model Building, and Evaluation

The implementation of the text classification pipeline involves three main steps: data preparation, model building, and model evaluation.

1. Data Preparation: In this step, the raw text data is preprocessed to make it suitable for model training. This includes tasks such as removing HTML tags, handling special characters, converting text to lowercase, removing stopwords, and lemmatization. Additionally, the text data is vectorized using techniques like CountVectorizer or TF-IDF vectorization to convert it into numerical feature vectors.

2. Model Building: Once the data is prepared, various classification models are built using machine learning algorithms such as Logistic Regression, Random Forest, Naive Bayes, Support Vector Machines (SVM), Decision Trees, and K-Nearest Neighbors (KNN). Each model is trained on the preprocessed text data to learn patterns and associations between the features and the target labels.

3. Model Evaluation: After training, the models are evaluated using metrics such as accuracy, precision, recall, and F1-score to assess their performance on unseen data. This involves splitting the dataset into training and testing sets, making predictions on the test data using the trained models, and comparing the predicted labels with the actual labels to calculate evaluation metrics.

Overall, the text classification pipeline involves preparing the data, building machine learning models, and evaluating their performance to determine the most effective approach for classifying text data into predefined categories.
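
A minimal end-to-end sketch of this pipeline, assuming a DataFrame named dataset with 'Text' and 'Category' columns (as in Chapter 7): scikit-learn's Pipeline chains the vectorizer and one classifier, and classification_report prints accuracy, precision, recall, and F1-score in one call.

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

X_train, X_test, y_train, y_test = train_test_split(
    dataset['Text'], dataset['Category'],
    test_size=0.3, random_state=0)

# Vectorization and classification chained into a single estimator
pipe = Pipeline([
    ('tfidf', TfidfVectorizer(max_features=5000)),
    ('clf', LogisticRegression(max_iter=1000)),
])
pipe.fit(X_train, y_train)

# Per-category precision, recall, and F1-score, plus overall accuracy
print(classification_report(y_test, pipe.predict(X_test)))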

Chapter 7: Code and Snippets

Python Code:

# 'dataset' is assumed to be a pandas DataFrame with 'Text' and 'Category'
# columns, loaded earlier; the preprocessing helpers (remove_tags, etc.)
# are those described in Chapter 3.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB, GaussianNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_recall_fscore_support as score

y = dataset['Category']

# Vectorize text data (GaussianNB needs a dense array, hence .toarray())
cv = CountVectorizer(max_features=5000)
x = cv.fit_transform(dataset['Text']).toarray()

# Train-test split
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.3, random_state=0, shuffle=True)
print(len(x_train))
print(len(x_test))

perform_list = []

def run_model(model_name, est_c, est_pnlty):
    # est_c and est_pnlty are unused placeholders kept from the original code
    mdl = None
    if model_name == 'Logistic Regression':
        mdl = LogisticRegression()
    elif model_name == 'Random Forest':
        mdl = RandomForestClassifier(n_estimators=100, criterion='entropy',
                                     random_state=0)
    elif model_name == 'Multinomial Naive Bayes':
        mdl = MultinomialNB(alpha=1.0, fit_prior=True)
    elif model_name == 'Support Vector Classifier':
        mdl = SVC()
    elif model_name == 'Decision Tree Classifier':
        mdl = DecisionTreeClassifier()
    elif model_name == 'K Nearest Neighbour':
        mdl = KNeighborsClassifier(n_neighbors=10, metric='minkowski', p=4)
    elif model_name == 'Gaussian Naive Bayes':
        mdl = GaussianNB()

    # Fit the model and predict on the held-out test set
    mdl.fit(x_train, y_train)
    y_pred = mdl.predict(x_test)

    # Performance metrics
    accuracy = round(accuracy_score(y_test, y_pred) * 100, 2)

    # Get precision, recall, f1 scores (micro-averaged across categories)
    precision, recall, f1score, support = score(y_test, y_pred, average='micro')

    print(f'Test Accuracy Score of Basic {model_name}: {accuracy}%')
    print(f'Precision : {precision}')
    print(f'Recall : {recall}')
    print(f'F1-score : {f1score}')

    # Add performance parameters to list
    perform_list.append({
        'Model': model_name,
        'Test Accuracy': round(accuracy, 2),
        'Precision': round(precision, 2),
        'Recall': round(recall, 2),
        'F1': round(f1score, 2),
    })

# Call run_model for each model
run_model('Logistic Regression', est_c=None, est_pnlty=None)
run_model('Random Forest', est_c=None, est_pnlty=None)
run_model('Multinomial Naive Bayes', est_c=None, est_pnlty=None)
run_model('Support Vector Classifier', est_c=None, est_pnlty=None)
run_model('Decision Tree Classifier', est_c=None, est_pnlty=None)
run_model('K Nearest Neighbour', est_c=None, est_pnlty=None)
run_model('Gaussian Naive Bayes', est_c=None, est_pnlty=None)

model_performance = pd.DataFrame(data=perform_list)
model_performance = model_performance[['Model', 'Test Accuracy',
                                       'Precision', 'Recall', 'F1']]

# Report the best-performing model by test accuracy
best_row = model_performance.loc[model_performance['Test Accuracy'].idxmax()]
print('The best accuracy of model is', best_row['Test Accuracy'],
      'from', best_row['Model'])

# Fit RandomForestClassifier as the final model
classifier = RandomForestClassifier(n_estimators=100, criterion='entropy',
                                    random_state=0)
classifier.fit(x_train, y_train)

# Example prediction using the preprocessing helpers from Chapter 3
example_text = ('Hour ago, I contemplated retirement for a lot of reasons. '
                'I felt like people were not sensitive enough to my injuries. '
                'I felt like a lot of people were backed, why not me? I have '
                'done no less. I have won a lot of games for the team, and I '
                'am not feeling backed, said Ashwin')
example_text = remove_tags(example_text)
example_text = special_char(example_text)
example_text = convert_lower(example_text)
example_text = remove_stopwords(example_text)
example_text = lemmatize_word(example_text)
example_text_vector = cv.transform([example_text])
predicted_category = classifier.predict(example_text_vector)
print(f'Predicted category for example text: {predicted_category[0]}')

Output:

1) Metadata
2) Bar Graph
3) Pie Chart
4) Word Clouds (one per category)

5) Algorithm Results:

i) Basic Logistic Regression
ii) Basic Random Forest
iii) Basic Multinomial Naive Bayes
iv) Basic Support Vector Classifier
v) Basic Decision Tree Classifier
vi) Basic K Nearest Neighbour
vii) Basic Gaussian Naive Bayes

6) Best Algorithm

7) Final Prediction
CONCLUSION

In summary, the text classification of news articles employs a structured pipeline involving data
preparation, model building, and evaluation. Through the utilization of various classification
algorithms and evaluation metrics, accurate categorization of news articles is achievable. This
process facilitates efficient information retrieval and analysis, contributing to informed decision-
making and enhancing user experience in accessing news content.

REFERENCE
• Kaggle (www.kaggle.com)
• Coursera
• Han, J., Kamber, M., & Pei, J., Data Mining: Concepts and Techniques
