Data Mining Report
Submitted By
Dr. S. B. Nikam Sir
Department of Computer Engineering
Government Polytechnic Pune
YEAR 2023-24
CERTIFICATE
This is to certify that the report has been submitted by students of class 3rd Year in partial fulfillment of the requirements for the completion of the course Data Mining (CM5107) - EVEN 2024 of the Diploma in Computer Engineering from Government Polytechnic, Pune. The report has been approved as it satisfies the academic requirements in respect of the Micro Project work prescribed for the course.
ACKNOWLEDGEMENT
First and foremost, we express our heartfelt gratitude to all those who contributed to the
success of this endeavor. Our sincere appreciation goes to Government Polytechnic Pune
for providing us with guidance and invaluable advice throughout this journey. We extend
special thanks and regards to our mentor, Dr. S. B. Nikam Sir, whose unwavering support
and encouragement have been instrumental in our progress.
It is through the collective efforts of these esteemed individuals that we have been able to
develop the skills and confidence necessary to excel as third-year students. Their support
and enthusiasm have been a source of motivation for us throughout this journey.
ABSTRACT
This report presents a Micro Project focused on text classification of news articles using
Python, submitted as part of the requirements for the Data Mining course. The report
provides a comprehensive overview of text classification techniques, specifically
focusing on the application of Python for this purpose.
The report delves into the fundamentals of text classification, elucidating its significance
and relevance in the context of processing large volumes of news articles. It highlights
the importance of accurate classification in effectively organizing and analyzing vast
amounts of textual data.
Through clear and concise explanations, the report aims to equip readers with a solid
understanding of text classification techniques, as well as practical insights into
implementing these techniques using Python. Overall, this report serves as a valuable
resource for individuals seeking to explore and apply text classification methodologies in
the domain of news article analysis.
Chapter 1 : Introduction
Data Mining:
Data mining is the process of extracting useful patterns, insights, and knowledge from
large datasets. It involves various steps such as data cleaning, integration, selection,
transformation, data mining, evaluation, presentation, and knowledge extraction.
Understanding the Domain and Defining Goals: Identifying the problem domain
and specifying the objectives of the data mining project.
Data Cleaning: Removing noise and inconsistencies from the datasets to ensure data
quality.
Data Integration: Combining data from multiple sources to create a unified dataset.
Data Selection: Selecting relevant subsets of data for analysis based on the project
goals.
Data Transformation: Converting and normalizing data into suitable formats for
analysis.
Data Mining: Applying various algorithms and techniques to discover patterns,
associations, or trends in the data.
Evaluation: Assessing the quality and effectiveness of the discovered patterns using
metrics and validation techniques.
Presentation: Communicating the findings and insights to stakeholders through
visualizations, reports, or presentations.
Knowledge Extraction: Deriving actionable knowledge and insights from the
discovered patterns to support decision-making.
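As a toy illustration, the steps above can be sketched in Python. All records, field names, and values here are invented for demonstration; a real project would work with full datasets and proper mining algorithms.

```python
# Toy walk-through of the knowledge-discovery steps listed above.
# All records and field names are made up for illustration.

raw_sales = [
    {"id": 1, "amount": "100", "region": "west"},
    {"id": 2, "amount": None, "region": "west"},   # noisy record
    {"id": 3, "amount": "250", "region": "east"},
]
raw_customers = [{"id": 1, "segment": "retail"}, {"id": 3, "segment": "retail"}]

# Data cleaning: drop records with missing values.
cleaned = [r for r in raw_sales if r["amount"] is not None]

# Data integration: join the two sources on a shared key.
segments = {c["id"]: c["segment"] for c in raw_customers}
integrated = [{**r, "segment": segments.get(r["id"])} for r in cleaned]

# Data selection: keep only the fields relevant to the goal.
selected = [{"amount": r["amount"], "segment": r["segment"]} for r in integrated]

# Data transformation: convert amounts into a numeric format.
transformed = [{**r, "amount": float(r["amount"])} for r in selected]

# Data mining step: a trivial aggregation standing in for a real algorithm.
total_by_segment = {}
for r in transformed:
    total_by_segment[r["segment"]] = total_by_segment.get(r["segment"], 0.0) + r["amount"]

print(total_by_segment)  # → {'retail': 350.0}
```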
Data mining is the act of automatically searching large stores of information to
find trends and patterns that go beyond simple analysis procedures. It applies
complex mathematical algorithms to segments of the data and evaluates the
probability of future events. Data mining is also called Knowledge Discovery in
Databases (KDD).
Data Mining is a process used by organizations to extract specific data from huge
databases to solve business problems. It primarily turns raw data into useful
information.
Data mining is similar to data science in that it is carried out by a person, in a
specific situation, on a particular data set, with an objective. The process includes
various types of services such as text mining, web mining, audio and video mining,
pictorial data mining, and social media mining. It is done through software that may
be simple or highly specialized. By outsourcing data mining, the work can be done
faster at lower operating costs, and specialized firms can use new technologies to
collect data that would be impossible to locate manually.
Example:
A retail company analyzes customer purchase data to tailor marketing strategies based on
purchase patterns and customer segments, leading to increased sales and customer
satisfaction.
Chapter 2: Classification
Text classification techniques play a crucial role in natural language processing (NLP) by
automatically assigning predefined categories or labels to textual data based on its
content. These techniques encompass a range of methodologies, including rule-based
systems, machine learning algorithms, and deep learning approaches. Rule-based systems
rely on predefined rules or patterns to classify text, while machine learning algorithms
leverage labeled training data to learn patterns and make predictions. Deep learning
techniques, such as recurrent neural networks (RNNs) and convolutional neural networks
(CNNs), excel at capturing intricate patterns in text data.
Previous studies and research in text classification of news articles have explored various
aspects, including feature selection, model performance optimization, and domain-
specific challenges. Researchers have investigated the effectiveness of different
algorithms, feature engineering techniques, and data preprocessing methods to enhance
classification accuracy and robustness.
Reviewing relevant Python libraries and tools for text classification reveals a rich
ecosystem that facilitates the development and deployment of classification models.
Libraries like NLTK (Natural Language Toolkit), Scikit-learn, and TensorFlow offer a
wide range of functionalities for text preprocessing, feature extraction, model training,
and evaluation. Additionally, tools like spaCy provide efficient linguistic annotations and
preprocessing capabilities, while Gensim offers advanced topic modeling and similarity
analysis functionalities. The versatility and accessibility of these libraries and tools
empower researchers and practitioners to efficiently tackle text classification tasks with
Python.
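As a brief illustration of this ecosystem, the sketch below trains a Scikit-learn pipeline combining TF-IDF feature extraction with logistic regression. The training sentences and labels are invented for demonstration and are far too few for a real model:

```python
# Minimal text-classification pipeline with Scikit-learn.
# The training sentences and labels below are made up for illustration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

train_texts = [
    "stocks rally as markets close higher",
    "company profits beat quarterly estimates",
    "team wins the championship final",
    "player scores twice in league match",
]
train_labels = ["business", "business", "sport", "sport"]

clf = Pipeline([
    ("tfidf", TfidfVectorizer()),     # text -> weighted term-frequency features
    ("model", LogisticRegression()),  # linear classifier on those features
])
clf.fit(train_texts, train_labels)

print(clf.predict(["profits beat estimates"]))  # → ['business']
```

The same pipeline object handles preprocessing and classification together, which is one reason Scikit-learn is so widely used for this task.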
Chapter 3: Data Collection and Preprocessing
Preprocessing Steps:
1. Removing HTML Tags: Initial preprocessing involved stripping HTML
tags from the text data, as web-scraped articles often contain HTML
formatting that is irrelevant for analysis. This step ensures that only the
textual content of the articles is retained.
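A minimal regex-based sketch of this step is shown below; real projects often use an HTML parser such as BeautifulSoup's get_text() instead, since regular expressions can mishandle unusual markup:

```python
import re

def remove_tags(text: str) -> str:
    """Strip HTML tags and collapse leftover whitespace (regex-based sketch)."""
    no_tags = re.sub(r"<[^>]+>", " ", text)
    return re.sub(r"\s+", " ", no_tags).strip()

html = "<p>Markets <b>rallied</b> today.</p>"
print(remove_tags(html))  # → Markets rallied today.
```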
Data Preprocessing
Bar Graph
Bar charts visually represent the distribution of categories within a text classification
dataset. Each bar corresponds to a category, like business or sports, and its height
indicates the frequency of articles in that category. This visualization helps analysts
understand the dataset's category distribution quickly.
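The counts behind such a bar chart can be computed directly; the sketch below uses invented category labels and prints a text-only stand-in for the chart (with Matplotlib, the same counts would feed plt.bar):

```python
from collections import Counter

# Hypothetical category labels for a handful of articles.
categories = ["business", "sport", "business", "politics", "sport", "business"]
counts = Counter(categories)

# Text-only stand-in for a bar chart: one row of '#' marks per category.
for category, n in counts.most_common():
    print(f"{category:10s} {'#' * n}  ({n})")
```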
Pie Charts
Pie charts show each category's share of the dataset as a slice of the whole, making it
easy to see at a glance which categories dominate and which are under-represented.
Word Cloud
Word clouds display the most common terms in a text corpus, with word size indicating
frequency. In text classification, word clouds for each category highlight prevalent terms
associated with topics like business or politics. They provide a quick glimpse into the
prominent themes within each category, aiding in interpretation and analysis.
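A word cloud is drawn from per-term frequencies; the sketch below computes such frequencies for an invented one-line corpus (a rendering library such as the wordcloud package would then scale each word by its count):

```python
from collections import Counter

# Invented miniature corpus and stopword list for illustration.
corpus = "market shares rise as market confidence grows and shares rally"
stopwords = {"as", "and"}

# Frequencies of the remaining words: these drive the word sizes in a cloud.
frequencies = Counter(
    word for word in corpus.lower().split() if word not in stopwords
)
print(frequencies.most_common(3))
```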
Decision Tree:
A decision tree is a tree-like structure in which each internal node represents a
feature, each branch represents a decision based on that feature, and each leaf node
represents the outcome or class label. It is a popular algorithm for classification
tasks due to its simplicity and interpretability.
Logistic Regression:
Logistic regression is a linear model used for binary classification. It estimates the
probability that a given input belongs to a particular category. Despite its name, it is
commonly used for classification tasks and is known for its simplicity and efficiency.
Random Forest:
Random forest is an ensemble learning method that constructs multiple decision
trees during training and outputs the mode of the classes for classification. It
improves upon decision trees by reducing overfitting and increasing accuracy
through averaging predictions from multiple trees.
Naive Bayes:
Naive Bayes is a probabilistic classifier based on Bayes' theorem with the
assumption of independence between features. It is particularly effective for text
classification tasks and is known for its simplicity, speed, and scalability.
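To make the idea concrete, here is a from-scratch toy multinomial Naive Bayes with add-one (Laplace) smoothing, trained on invented sentences. This is an illustrative sketch, not the project's actual implementation, which relies on library classifiers:

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels):
    """Collect per-class word counts, class counts, and the vocabulary."""
    word_counts = defaultdict(Counter)   # per-class word frequencies
    class_counts = Counter(labels)       # document counts per class (priors)
    vocab = set()
    for doc, label in zip(docs, labels):
        words = doc.lower().split()
        word_counts[label].update(words)
        vocab.update(words)
    return word_counts, class_counts, vocab

def predict_nb(model, doc):
    """Pick the class maximizing log prior + smoothed log likelihoods."""
    word_counts, class_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, float("-inf")
    for label in class_counts:
        score = math.log(class_counts[label] / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for word in doc.lower().split():
            # Add-one smoothing avoids zero probability for unseen words.
            score += math.log(
                (word_counts[label][word] + 1) / (total_words + len(vocab))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label

docs = ["stocks rise on profits", "markets fall on fears",
        "team wins the match", "striker scores a goal"]
labels = ["business", "business", "sport", "sport"]
model = train_nb(docs, labels)
print(predict_nb(model, "profits rise"))  # → business
```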
Support Vector Machines (SVM):
Support vector machines are powerful supervised learning models used for
classification and regression tasks. SVMs find the hyperplane that best separates
classes in high-dimensional space. They are effective in cases with complex decision
boundaries and are versatile due to the use of different kernel functions.
Python Code:
# Assumed context: `dataset` is a pandas DataFrame with 'Text' and 'Category'
# columns, a CountVectorizer `cv` has been fitted on the training text, and the
# preprocessing helpers (remove_tags, special_char, convert_lower,
# remove_stopwords, lemmatize_word) are defined earlier in the project.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

x = dataset['Text']
y = dataset['Category']

# Train-test split
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.3, random_state=0, shuffle=True)
print(len(x_train))
print(len(x_test))

# Fit RandomForestClassifier on the vectorized training text
classifier = RandomForestClassifier(
    n_estimators=100, criterion='entropy', random_state=0)
classifier.fit(cv.transform(x_train), y_train)
y_pred = classifier.predict(cv.transform(x_test))

# Performance metrics
accuracy = round(accuracy_score(y_test, y_pred) * 100, 2)

# perform_list collects one row of metrics per model evaluated
perform_list = []
model_performance = pd.DataFrame(data=perform_list)
model_performance = model_performance[
    ['Model', 'Test Accuracy', 'Precision', 'Recall', 'F1']]
model = model_performance['Model']
max_value = model_performance['Test Accuracy'].max()
print('The best accuracy of model is', max_value, 'from Random Forest')

# Example prediction
example_text = ('Hour ago, I contemplated retirement for a lot of reasons. '
                'I felt like people were not sensitive enough to my injuries. '
                'I felt like a lot of people were backed, why not me? I have '
                'done no less. I have won a lot of games for the team, and I '
                'am not feeling backed, said Ashwin')
example_text = remove_tags(example_text)
example_text = special_char(example_text)
example_text = convert_lower(example_text)
example_text = remove_stopwords(example_text)
example_text = lemmatize_word(example_text)
example_text_vector = cv.transform([example_text])
predicted_category = classifier.predict(example_text_vector)
print(f'Predicted category for example text: {predicted_category[0]}')
Output:
1) Metadata
2) Bar Graph
3) Pie Chart
4) Word Cloud
5) Algorithm Results
6) Best Algorithm
7) Final Prediction
CONCLUSION
In summary, the text classification of news articles employs a structured pipeline involving data
preparation, model building, and evaluation. Through the utilization of various classification
algorithms and evaluation metrics, accurate categorization of news articles is achievable. This
process facilitates efficient information retrieval and analysis, contributing to informed decision-
making and enhancing user experience in accessing news content.
REFERENCE
• Kaggle (www.kaggle.com)
• Coursera (www.coursera.org)
• Jiawei Han, Micheline Kamber, and Jian Pei, Data Mining: Concepts and Techniques