Unit 1 Data Mining
What is Data?
Data is distinct pieces of information, usually formatted in a special way. Data
can be measured, collected, reported, and analyzed, whereupon it is often
visualized using graphs, images, or other analysis tools. Raw data (“unprocessed
data”) may be a collection of numbers or characters before it has been “cleaned”
and corrected by researchers.
What is Information?
Information is data that has been processed, organized, or structured in a way
that makes it meaningful, valuable, and useful.
Categories of Data
Data can be categorized into two main types:
Structured Data: This type of data is organized into a specific format, making
it easy to search, analyze, and process. Structured data is found in relational
databases and includes information such as numbers, dates, and categories.
Unstructured Data: Unstructured data does not conform to a specific structure
or format. It may include text documents, images, videos, and other data
that is not easily organized or analyzed without additional processing.
What is Data Mining?
Definition: Data mining is the process of extracting useful patterns,
relationships, and insights from large datasets using statistical techniques,
machine learning algorithms, and database systems. It plays a crucial role in
modern industries by helping organizations uncover hidden trends that support
data-driven decision-making.
Purpose: The primary purpose of data mining is to extract valuable knowledge
and information from large volumes of data that might be hidden or not readily
apparent. It involves using advanced statistical and machine learning techniques
to identify patterns and trends.
Functions: Data mining algorithms and techniques are applied to the data to
identify associations, clusters, classifications, and anomalies. It helps in
understanding customer behavior, predicting trends, detecting fraud, and making
data-driven business decisions.
Usage: Data mining is widely used in areas such as marketing analysis, customer
segmentation, recommendation systems, fraud detection, healthcare research, and
financial forecasting.
Goals of Data Mining:
• The goal of data mining is to extract useful information from large datasets
and use it to make predictions or inform decision-making.
• Data mining is important because it allows organizations to uncover
insights and trends in their data that would be difficult or impossible to
discover manually.
• This can help organizations make better decisions, improve their
operations, and gain a competitive advantage.
Data Mining History and Origins
One of the earliest and most influential pioneers associated with data mining was
Dr. Herbert Simon, a Nobel laureate in economics who is widely regarded as one of
the founding figures of artificial intelligence. In the 1950s and 1960s, Simon and
his colleagues developed early algorithms and techniques for extracting useful
information and insights from data, work that helped lay the groundwork for
methods such as clustering, classification, and decision trees.
In the 1980s and 1990s, the field of data mining continued to evolve, and new
algorithms and techniques were developed to address the challenges of working
with large and complex data sets. The development of data mining software and
platforms, such as SAS, SPSS, and RapidMiner, made it easier for organizations
to apply data mining techniques to their data.
In recent years, the availability of large data sets and the growth of cloud
computing and big data technologies have made data mining even more powerful
and widely used. Today, data mining is a crucial tool for many organizations and
industries and is used to extract valuable insights and information from data sets
in a wide range of domains.
Tasks of Data Mining
1. Classification: Categorizing data into predefined classes.
2. Clustering: Grouping similar data points together.
3. Regression: Predicting numerical values based on data relationships.
4. Association Rule Mining: Discovering interesting relationships between
variables.
5. Anomaly Detection: Identifying unusual patterns in data.
6. Text Mining: Extracting insights from unstructured text data.
7. Prediction and Forecasting: Predicting future trends based on historical data.
8. Pattern Mining: Identifying recurring patterns in sequential data.
9. Feature Selection and Dimensionality Reduction: Identifying relevant features
and reducing dataset complexity.
3. Data Selection and Transformation: Here, relevant data subsets are selected
for analysis based on the mining goals. The selected data may also undergo
transformation to better suit the mining algorithms.
4. Data Mining Engine: This is the core component where various data mining
algorithms are applied to the prepared data to discover patterns, trends, and
insights.
5. Pattern Evaluation: Once patterns are discovered, they need to be evaluated
for their relevance, validity, and usefulness. This step often involves statistical
techniques and domain expertise.
6. Knowledge Presentation: Finally, the discovered knowledge is presented to
users in a comprehensible format, such as reports, visualizations, or dashboards,
to aid in decision making.
Throughout this process, feedback loops may exist where insights gained from
the data mining results inform subsequent data selection, cleaning, or mining
steps, creating a continuous improvement cycle.
Data Mining Process
The data mining process can be described in five steps.
• Step 1: Collection – First, data is collected, organized, and loaded into a data
warehouse. The data is stored and managed either in the cloud or on in-house servers.
• Step 2: Understanding – In this step, data scientists and business analysts examine the
properties of the data and conduct an in-depth analysis in the context of a particular
problem statement defined by the company. This is done using querying,
visualization, and reporting.
• Step 3: Preparation – Once the sources of the available data are confirmed, the
data is cleaned, constructed, and formatted into the required form. In this step,
additional data can also be explored in greater depth, guided by the insights
uncovered in the previous stage.
• Step 4: Modeling – In this stage, modeling techniques are selected for the prepared
dataset. A data model is like a diagram that reflects and describes the
relationships between the different types of information stored in the database.
• Step 5: Evaluation – The model results are evaluated in the context of the business
objectives. In this phase, new business requirements may be raised because of new
patterns discovered in the model results or because of other factors.
Classification of Data Mining Systems
Classification Based on the Mined Databases
A data mining system can be classified based on the types of databases that have
been mined. A database system can be further segmented based on distinct
principles, such as data models, types of data, etc., which further assist in
classifying a data mining system.
For example, if we classify a data mining system by the data model, it may be a
relational, transactional, object-relational, or data warehouse mining system.
Classification Based on the Type of Knowledge Mined
A data mining system categorized based on the kind of knowledge mined may have
the following functionalities:
1. Characterization
2. Discrimination
3. Association and Correlation Analysis
4. Classification
5. Prediction
6. Outlier Analysis
7. Evolution Analysis
Classification Based on the Techniques Utilized
A data mining system can also be classified based on the type of techniques that
it incorporates.
These techniques can be described by the degree of user interaction involved or
by the methods of analysis employed.
Classification Based on the Applications Adapted
Data mining systems can also be classified by the applications they are adapted
to, such as:
1. Finance
2. Telecommunications
3. DNA
4. Stock Markets
5. E-mail
What is KDD (Knowledge Discovery in Databases)?
KDD is a computer science field specializing in extracting previously unknown
and interesting information from raw data. KDD is the whole process of trying to
make sense of data by developing appropriate methods or techniques. The
following steps are included in the KDD process:
Data Cleaning
Data cleaning is defined as the removal of noisy and irrelevant data from the
collection.
• Cleaning in case of missing values.
• Cleaning noisy data, where noise is a random error or variance in a measured
variable.
• Cleaning with data discrepancy detection and data transformation tools.
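The first two cleaning cases above can be illustrated with a minimal sketch. It assumes missing values are marked as None and are filled with the mean of the observed values, and that noise is reduced by "smoothing by bin means". Both are common textbook choices and not the only options; the function names and sample values are hypothetical.

```python
from statistics import mean

def fill_missing_with_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

def smooth_by_bin_means(values, bin_size):
    """Sort the values, split them into bins of bin_size, and replace
    every value with the mean of its bin."""
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        bin_values = ordered[start:start + bin_size]
        smoothed.extend([mean(bin_values)] * len(bin_values))
    return smoothed

# Hypothetical sample data.
ages = [25, None, 35, 30, None]
cleaned = fill_missing_with_mean(ages)      # [25, 30, 35, 30, 30]

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
smoothed = smooth_by_bin_means(prices, 3)   # [9, 9, 9, 22, 22, 22, 29, 29, 29]
```

Mean imputation is simple but can distort the distribution when many values are missing, so in practice the choice of fill value depends on the data.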
Data Integration
Data integration is defined as combining heterogeneous data from multiple sources
into a common source (data warehouse). Data integration is carried out using data
migration tools, data synchronization tools, and the ETL (Extract-Transform-Load)
process.
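As a rough illustration of integration, the sketch below runs a tiny extract, transform, load pipeline over two in-memory sources. The source names, field names, and the list standing in for the warehouse are all hypothetical; a real pipeline would read from databases or files.

```python
# Two hypothetical sources describing customers with different field names.
crm_rows = [{"cust_id": 1, "full_name": "Asha Rao", "city": "pune"}]
billing_rows = [
    {"customer": 1, "amount_paid": "120.50"},
    {"customer": 2, "amount_paid": "80.00"},
]

def extract():
    """Return the raw rows from each source."""
    return crm_rows, billing_rows

def transform(crm, billing):
    """Bring both sources to one schema keyed by customer id."""
    unified = {}
    for row in crm:
        unified[row["cust_id"]] = {
            "customer_id": row["cust_id"],
            "name": row["full_name"],
            "city": row["city"].title(),
            "total_paid": 0.0,
        }
    for row in billing:
        # Customers known only to billing get a record with unknown name and city.
        record = unified.setdefault(
            row["customer"],
            {"customer_id": row["customer"], "name": None, "city": None, "total_paid": 0.0},
        )
        record["total_paid"] += float(row["amount_paid"])
    return list(unified.values())

def load(records, warehouse):
    """Append the unified records to the target store."""
    warehouse.extend(records)

warehouse = []  # stand-in for a data warehouse table
load(transform(*extract()), warehouse)
```

After the run, the warehouse list holds one record per customer, including customer 2, who appears only in the billing source.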
Data Selection
Data selection is defined as the process where data relevant to the analysis is
decided and retrieved from the data collection. For this, we can use methods such
as neural networks, decision trees, Naive Bayes, clustering, and regression.
Data Transformation
Data transformation is defined as the process of transforming data into the
appropriate form required by the mining procedure. Data transformation is a
two-step process:
• Data Mapping: Assigning elements from source base to destination to
capture transformations.
• Code generation: Creation of the actual transformation program.
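The two steps above can be sketched as follows: a mapping table plays the role of data mapping, and a small function that applies it plays the role of the generated transformation program. The field names and conversion rules are hypothetical and chosen only for illustration.

```python
# Data mapping: destination field -> (source field, conversion function).
# All field names here are hypothetical.
FIELD_MAP = {
    "customer_id": ("CustID", int),
    "signup_year": ("SignupDate", lambda s: int(s[:4])),
    "country": ("Country", str.upper),
}

def apply_mapping(source_row, field_map):
    """Build a destination row by converting each mapped source field."""
    return {
        dest: convert(source_row[src])
        for dest, (src, convert) in field_map.items()
    }

row = {"CustID": "42", "SignupDate": "2021-03-15", "Country": "in"}
mapped = apply_mapping(row, FIELD_MAP)
# mapped == {"customer_id": 42, "signup_year": 2021, "country": "IN"}
```

Keeping the mapping as data, separate from the code that applies it, makes it easier to review and change the transformation rules.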
Data Mining
Data mining is defined as the application of techniques to extract potentially
useful patterns. It transforms task-relevant data into patterns and decides the
purpose of the model, using classification or characterization.
Pattern Evaluation
Pattern evaluation is defined as identifying the truly interesting patterns
representing knowledge, based on given measures. It finds the interestingness
score of each pattern and uses summarization and visualization to make the data
understandable to the user.
Knowledge Representation
This involves presenting the results in a way that is meaningful and can be used
to make decisions.
Difference between KDD and Data Mining
Definition
• KDD: A process of identifying valid, novel, potentially useful, and ultimately
understandable patterns and relationships in data.
• Data Mining: A process of extracting useful and valuable information or
patterns from large data sets.
Objective
• KDD: To find useful knowledge from data.
• Data Mining: To extract useful information from data.
Techniques Used
• KDD: Data cleaning, data integration, data selection, data transformation, data
mining, pattern evaluation, and knowledge representation and visualization.
• Data Mining: Association rules, classification, clustering, regression, decision
trees, neural networks, and dimensionality reduction.
Output
• KDD: Structured information, such as rules and models, that can be used to
make decisions or predictions.
• Data Mining: Patterns, associations, or insights that can be used to improve
decision-making or understanding.
Focus
• KDD: The discovery of useful knowledge, rather than simply finding patterns in
data.
• Data Mining: The discovery of patterns or relationships in data.
Role of Domain Expertise
• KDD: Domain expertise is important, as it helps in defining the goals of the
process, choosing appropriate data, and interpreting the results.
• Data Mining: Domain expertise is likewise important, as it helps in defining the
goals of the process, choosing appropriate data, and interpreting the results.
What is the difference between DBMS and Data mining?
What is OLAP?
Applications of OLAP:
➢ Database Marketing
➢ Marketing and sales analysis
Data mining is used for future data prediction, whereas OLAP is used for analyzing
past data.
What are Data Mining Techniques?
Data mining techniques are algorithms and methods used to extract information
and insights from data sets.
1. Regression
Regression is a data mining technique that is used to model the relationship
between a dependent variable and one or more independent variables. In
regression analysis, the goal is to fit a mathematical model to the data that can be
used to make predictions or forecasts about the dependent variable based on the
values of the independent variables.
There are many different types of regression models, including linear regression,
logistic regression, and non-linear regression. In general, regression models are
used to answer questions such as:
• What is the relationship between the dependent and independent variables?
• How well does the model fit the data?
• How accurate are the predictions or forecasts made by the model?
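The three questions above can be made concrete with simple linear regression fitted by ordinary least squares. This is a minimal sketch: the slope and intercept describe the relationship, and R-squared indicates how well the line fits. The data (hours studied versus exam score) is hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def r_squared(xs, ys, slope, intercept):
    """Share of the variance in y explained by the fitted line."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

# Hypothetical data: hours studied and exam scores.
hours = [1, 2, 3, 4, 5]
scores = [52, 54, 61, 63, 70]

slope, intercept = fit_line(hours, scores)       # slope 4.5, intercept 46.5
fit_quality = r_squared(hours, scores, slope, intercept)
predicted_for_6_hours = slope * 6 + intercept    # 73.5
```

Here each additional hour is associated with about 4.5 more points, and the R-squared value of roughly 0.96 suggests the line fits this small sample closely.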
2. Classification
Classification is a data mining technique that is used to predict the class or
category of an item or instance based on its characteristics or attributes. There are
many different types of classification models, including decision trees, k-nearest
neighbours, and support vector machines. In general, classification models are
used to answer questions such as:
• What is the relationship between the classes and the attributes?
• How well does the model fit the data?
• How accurate are the predictions made by the model?
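As one example of a classification model, the sketch below implements k-nearest neighbours with a majority vote. It is a minimal illustration on hypothetical two-feature data, not a tuned classifier; the labels "low" and "high" are made up.

```python
from collections import Counter
from math import dist

def knn_predict(train, query, k=3):
    """Predict the label of query by majority vote among its k nearest
    training points. train is a list of (features, label) pairs."""
    nearest = sorted(train, key=lambda item: dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical labelled data with two numeric attributes per item.
train = [
    ((1.0, 1.0), "low"), ((1.5, 2.0), "low"), ((2.0, 1.5), "low"),
    ((6.0, 6.5), "high"), ((7.0, 7.0), "high"), ((6.5, 8.0), "high"),
]

label_a = knn_predict(train, (1.2, 1.4))   # "low"
label_b = knn_predict(train, (6.8, 7.2))   # "high"
```

Accuracy would normally be measured on items held out from the training set, by comparing predicted labels with the known ones.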
3. Clustering
Clustering is a data mining technique that is used to group items or instances in a
data set into clusters or groups based on their similarity or proximity. In clustering
analysis, the goal is to identify and explore the natural structure or organization
of the data, and to uncover hidden patterns and relationships.
There are many different types of clustering algorithms, including k-means
clustering, hierarchical clustering, and density-based clustering. In general,
clustering is used to answer questions such as:
• What is the natural structure or organization of the data?
• What are the main clusters or groups in the data?
• How similar or dissimilar are the items in the data set?
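A minimal k-means sketch shows how clusters emerge from repeated assignment and averaging. To keep the run deterministic, the initial centroids are passed in explicitly (here taken from the data); real implementations usually pick them randomly or with a seeding heuristic. The points are hypothetical.

```python
from math import dist

def k_means(points, centroids, iterations=10):
    """Alternate between assigning points to the nearest centroid and
    moving each centroid to the mean of its assigned points."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        centroids = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster)) if cluster else centroid
            for cluster, centroid in zip(clusters, centroids)
        ]
    return centroids, clusters

# Hypothetical 2-D points forming two visible groups.
points = [(1, 1), (1.5, 2), (1, 0.5), (8, 8), (9, 9.5), (8.5, 9)]
centroids, clusters = k_means(points, centroids=[(1, 1), (9, 9.5)])
```

On this data the two clusters each end up with three points, and the distance between a point and its centroid gives a simple measure of how similar it is to the rest of its group.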
4. Association rule mining
Association rule mining is a data mining technique that is used to identify and
explore relationships between items or attributes in a data set. In association rule
mining, the goal is to identify patterns and rules that describe the co-occurrence
or occurrence of items or attributes in the data set and to evaluate the strength and
significance of these patterns and rules.
There are many different algorithms and methods for association rule mining,
including the Apriori algorithm and the FP-growth algorithm. In general,
association rule mining is used to answer questions such as:
• What are the main patterns and rules in the data?
• How strong and significant are these patterns and rules?
• What are the implications of these patterns and rules for the data set and
the domain?
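The strength of a rule is usually measured with support, confidence, and lift. The sketch below computes these for a hypothetical set of market-basket transactions. It only evaluates given rules and does not implement Apriori or FP-growth, which are methods for finding frequent itemsets efficiently.

```python
# Hypothetical market-basket transactions.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "eggs"},
    {"bread", "butter", "eggs"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Estimated probability of the consequent given the antecedent."""
    return support(antecedent | consequent, transactions) / support(antecedent, transactions)

def lift(antecedent, consequent, transactions):
    """Confidence relative to the baseline frequency of the consequent.
    Values above 1 suggest a positive association."""
    return confidence(antecedent, consequent, transactions) / support(consequent, transactions)

s = support({"bread", "butter"}, transactions)        # 3 of 5 transactions
c = confidence({"butter"}, {"bread"}, transactions)   # every butter basket has bread
l = lift({"butter"}, {"bread"}, transactions)
```

In this sample the rule butter -> bread holds in every basket containing butter, and its lift is above 1 because bread appears in only four of the five baskets overall.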
5. Dimensionality Reduction
Dimensionality reduction is a data mining technique that is used to reduce the
number of dimensions or features in a data set while retaining as much
information and structure as possible. There are many different methods for
dimensionality reduction, including principal component analysis (PCA),
independent component analysis (ICA), and singular value decomposition
(SVD). In general, dimensionality reduction is used to answer questions such as:
• What are the main dimensions or features in the data set?
• How much information and structure can be retained in a lower-
dimensional space?
• How can the data be visualized and analyzed in a lower-dimensional space?
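For intuition, PCA on two features can be worked out in closed form, because the covariance matrix is only 2 x 2. The sketch below is limited to that special case and uses hypothetical data; with more features one would normally rely on an eigendecomposition or SVD routine from a numerical library.

```python
from math import atan2, cos, sin, sqrt

def pca_2d(points):
    """Project 2-D points onto their first principal component.
    Returns the direction, the 1-D projections, and the share of variance kept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    centered = [(x - mean_x, y - mean_y) for x, y in points]

    # Entries of the sample covariance matrix [[var_x, cov_xy], [cov_xy, var_y]].
    var_x = sum(x * x for x, _ in centered) / (n - 1)
    var_y = sum(y * y for _, y in centered) / (n - 1)
    cov_xy = sum(x * y for x, y in centered) / (n - 1)

    # Eigenvalues of a symmetric 2x2 matrix.
    half_trace = (var_x + var_y) / 2
    spread = sqrt(((var_x - var_y) / 2) ** 2 + cov_xy ** 2)
    largest, smallest = half_trace + spread, half_trace - spread

    # Direction of the eigenvector that belongs to the largest eigenvalue.
    angle = 0.5 * atan2(2 * cov_xy, var_x - var_y)
    direction = (cos(angle), sin(angle))

    projected = [x * direction[0] + y * direction[1] for x, y in centered]
    explained = largest / (largest + smallest)
    return direction, projected, explained

# Hypothetical data lying close to the line y = 2x.
points = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8)]
direction, projected, explained = pca_2d(points)
```

Because these points lie almost on a straight line, a single component retains over 99% of the variance, so the data can be analyzed in one dimension with little loss.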
6. Anomaly Detection: Anomaly detection identifies outliers or anomalies in
data that deviate from normal patterns. It is used for detecting fraud, network
intrusions, and equipment failures. Techniques include statistical methods,
clustering-based approaches, and machine learning algorithms such as isolation
forests and one-class SVM.
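One of the simplest statistical methods is the z-score rule, sketched below: a value is flagged when it lies more than a chosen number of standard deviations from the mean. The threshold of 2.0 and the transaction amounts are assumptions for illustration. In small samples a single extreme value inflates the standard deviation, so the threshold needs care.

```python
from statistics import mean, pstdev

def z_score_outliers(values, threshold=2.0):
    """Return the values whose distance from the mean exceeds
    threshold population standard deviations."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # all values are identical, so nothing stands out
    return [v for v in values if abs(v - mu) / sigma > threshold]

# Hypothetical transaction amounts with one suspicious value.
amounts = [20, 22, 19, 21, 23, 20, 22, 250]
outliers = z_score_outliers(amounts)   # [250]
```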
7. Sequential Pattern Mining: Sequential pattern mining discovers patterns that
occur sequentially or temporally in data. It is used in applications such as
analyzing customer behavior over time or identifying patterns in sequences of
events. Examples include the PrefixSpan algorithm and the GSP (Generalized
Sequential Pattern) algorithm.
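The idea can be illustrated with a much simpler procedure than PrefixSpan or GSP: count in how many sequences one event occurs before another, and keep the ordered pairs that meet a minimum support. The sketch below handles only patterns of length two, and the click-stream sessions are hypothetical.

```python
from collections import Counter
from itertools import combinations

def frequent_ordered_pairs(sequences, min_support):
    """Count, once per sequence, each ordered pair (a, b) where a occurs
    somewhere before b, and keep pairs seen in at least min_support sequences."""
    counts = Counter()
    for seq in sequences:
        # combinations() preserves positional order, so a always precedes b.
        pairs = {(a, b) for a, b in combinations(seq, 2)}
        counts.update(pairs)
    return {pair: n for pair, n in counts.items() if n >= min_support}

# Hypothetical website sessions.
sessions = [
    ["home", "search", "cart", "checkout"],
    ["home", "cart", "checkout"],
    ["search", "home", "cart"],
]
patterns = frequent_ordered_pairs(sessions, min_support=2)
```

Here "home" followed later by "cart" appears in all three sessions, while "home" followed by "search" appears in only one and is dropped.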
8. Text Mining: Text mining techniques extract useful information from
unstructured text data. This includes tasks such as sentiment analysis, topic
modeling, named entity recognition, and document classification. Techniques
such as natural language processing (NLP) and machine learning algorithms are
commonly used in text mining.
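A basic first step in many text mining tasks is turning raw text into term counts. The sketch below tokenizes a few hypothetical reviews, drops a small hand-picked stop-word list (an assumption, far shorter than real lists), and reports the most frequent terms.

```python
import re
from collections import Counter

# A tiny illustrative stop-word list; real lists are much longer.
STOP_WORDS = {"the", "a", "is", "and", "to", "of", "was", "it", "but"}

def term_counts(documents):
    """Count lowercase word tokens across documents, ignoring stop words."""
    counts = Counter()
    for doc in documents:
        tokens = re.findall(r"[a-z']+", doc.lower())
        counts.update(t for t in tokens if t not in STOP_WORDS)
    return counts

# Hypothetical customer reviews.
reviews = [
    "The delivery was fast and the packaging was great",
    "Great product, fast delivery",
    "Delivery was late but the product is great",
]
counts = term_counts(reviews)
top_two = counts.most_common(2)
```

Counts like these are the input for further steps such as document classification or topic modeling.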
Benefits of Data Mining
Improved decision-making: Data mining can provide valuable insights that can
help organizations make better decisions by identifying patterns and trends in
large data sets.
Increased efficiency: Data mining can automate repetitive and time-consuming
tasks, such as data cleaning and preparation, which can help organizations save
time and resources.
Enhanced competitiveness: Data mining can help organizations gain a
competitive edge by uncovering new business opportunities and identifying areas
for improvement.
Improved customer service: Data mining can help organizations better
understand their customers and tailor their products and services to meet their
needs.
Fraud detection: Data mining can be used to identify fraudulent activities by
detecting unusual patterns and anomalies in data.
Predictive modeling: Data mining can be used to build models that can predict
future events and trends, which can be used to make proactive decisions.
New product development: Data mining can be used to identify new product
opportunities by analyzing customer purchase patterns and preferences.
Risk management: Data mining can be used to identify potential risks by
analyzing data on customer behavior, market conditions, and other factors.
Challenges and Issues in Data Mining
1] Data Quality
The quality of data used in data mining is one of the most significant challenges.
The accuracy, completeness, and consistency of the data affect the accuracy of
the results obtained. The data may contain errors, omissions, duplications, or
inconsistencies, which may lead to inaccurate results.
To address these challenges, data mining practitioners must apply data cleaning
and data preprocessing techniques to improve the quality of the data.
2] Data Complexity
Data complexity refers to the vast amounts of data generated by various sources,
such as sensors, social media, and the internet of things (IoT). The complexity of
the data may make it challenging to process, analyze, and understand. In addition,
the data may be in different formats, making it challenging to integrate into a
single dataset.
To address this challenge, data mining practitioners use advanced techniques such
as clustering, classification, and association rule mining.
3] Data Privacy and Security
Data privacy and security is another significant challenge in data mining. As more
data is collected, stored, and analyzed, the risk of data breaches and cyber-attacks
increases. The data may contain personal, sensitive, or confidential information
that must be protected. Moreover, data privacy regulations such as GDPR, CCPA,
and HIPAA impose strict rules on how data can be collected, used, and shared.
To address this challenge, data mining practitioners must apply data
anonymization and data encryption techniques to protect the privacy and security
of the data. Data anonymization involves removing personally identifiable
information (PII) from the data, while data encryption involves using algorithms
to encode the data to make it unreadable to unauthorized users.
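The removal of PII can be sketched as below. The record fields, the set of PII field names, and the salt are hypothetical. Replacing the customer id with a salted hash is strictly pseudonymization, which is weaker than full anonymization because records can still be linked to each other. Encryption is not shown here, since it requires a vetted cryptography library rather than hand-written code.

```python
import hashlib

# Hypothetical list of fields treated as personally identifiable.
PII_FIELDS = {"name", "email", "phone"}

def anonymize(record, salt):
    """Drop direct identifiers and replace the customer id with a salted hash."""
    cleaned = {k: v for k, v in record.items() if k not in PII_FIELDS}
    digest = hashlib.sha256((salt + str(record["customer_id"])).encode("utf-8"))
    cleaned["customer_id"] = digest.hexdigest()[:16]
    return cleaned

record = {
    "customer_id": 1001,
    "name": "Asha Rao",
    "email": "asha@example.com",
    "phone": "555-0100",
    "amount": 250,
}
safe_record = anonymize(record, salt="replace-with-a-secret-salt")
```

The resulting record keeps the fields needed for analysis, such as the amount, while the name, email, and phone number are gone.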
4] Scalability
Data mining algorithms must be scalable to handle large datasets efficiently. As
the size of the dataset increases, the time and computational resources required to
perform data mining operations also increase.
To address this challenge, data mining practitioners use distributed computing
frameworks such as Hadoop and Spark.
5] Interpretability
Data mining algorithms can produce complex models that are difficult to
interpret. This is because the algorithms use a combination of statistical and
mathematical techniques to identify patterns and relationships in the data.
To address this challenge, data mining practitioners use visualization techniques
to represent the data and the models visually.
Data Mining Applications
Data mining is used by a wide range of organizations and individuals across many
different industries and domains. Some examples of who uses data mining
include:
Businesses and Enterprises – Many businesses and enterprises use data mining
to extract useful insights and information from their data, in order to make better
decisions, improve their operations, and gain a competitive advantage. For
example, a retail company might use data mining to identify customer trends and
preferences or to predict demand for its products.
Government Agencies and Organizations – Government agencies and
organizations use data mining to analyze data related to their operations and the
population they serve, in order to make better decisions and improve their
services. For example, a health department might use data mining to identify
patterns and trends in public health data or to predict the spread of infectious
diseases.
Academic and Research Institutions – Academic and research institutions use
data mining to analyze data from their research projects and experiments, in order
to identify patterns, relationships, and trends in the data. For example, a university
might use data mining to analyze data from a clinical trial or to explore the
relationships between different variables in a social science study.
Individuals – Many individuals use data mining to analyze their own data, in
order to better understand and manage their personal information and activities.
For example, a person might use data mining to analyze their financial data and
identify patterns in their spending or to analyze their social media data and
understand their online behavior and interactions.