1. Course title: Introduction to Data Mining and Data Warehousing
CHAPTER ONE
WHAT IS DATA MINING?
Data mining is the process of searching and analyzing a large batch of raw data in order to
identify patterns and extract useful information.
Companies use data mining software to learn more about their customers. It can help them to
develop more effective marketing strategies, increase sales, and decrease costs. Data mining
relies on effective data collection, warehousing, and computer processing.
Data mining is the process of extracting knowledge or insights from large amounts of
data using various statistical and computational techniques. The data can be structured,
semi-structured or unstructured, and can be stored in various forms such as databases,
data warehouses, and data lakes.
The primary goal of data mining is to discover hidden patterns and relationships in the
data that can be used to make informed decisions or predictions. This involves
exploring the data using various techniques such as clustering, classification, regression
analysis, association rule mining, and anomaly detection.
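As a rough illustration of one of the techniques named above (clustering), the sketch below groups customers by a single number, their monthly spending, using a simplified one-dimensional k-means. The function name, the spending figures, and the starting centers are all assumptions made for this example; real projects would normally rely on an established library.

```python
def kmeans_1d(values, centers, iterations=10):
    """Very small 1-D k-means: assign each value to its nearest center,
    then move each center to the mean of its assigned values."""
    centers = list(centers)
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for v in values:
            # Index of the center closest to this value
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Recompute each center; keep the old one if its cluster is empty
        centers = [
            sum(c) / len(c) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return centers, clusters


# Hypothetical monthly spending for six customers
spending = [10, 12, 11, 95, 100, 105]
centers, clusters = kmeans_1d(spending, centers=[0, 50])
print(centers)   # two group averages: low spenders and high spenders
print(clusters)
```

The two resulting groups could be read as a "low spending" and a "high spending" customer segment.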
Data mining has a wide range of applications across various industries, including
marketing, finance, healthcare, and telecommunications. For example, in marketing,
data mining can be used to identify customer segments and target marketing campaigns,
while in healthcare, it can be used to identify risk factors for diseases and develop
personalized treatment plans.
Data mining is the process of analyzing a large batch of information to discern
trends and patterns.
Data mining can be used by corporations for everything from learning about
what customers are interested in or want to buy to fraud detection and spam
filtering.
Data mining programs break down patterns and connections in data based on
what information users request or provide.
Social media companies use data mining techniques to commodify their users
in order to generate profit.
This use of data mining has come under criticism lately as users are often
unaware of the data mining happening with their personal information,
especially when it is used to influence preferences.
Data warehouse and data mining
S. No. | Basis of Comparison | Data Warehousing | Data Mining
1 | Definition | A data warehouse is a database system that is designed for analytical processing instead of transactional work. | Data mining is the process of analyzing data patterns.
2 | Process | Data is stored periodically. | Data is analyzed regularly.
3 | Purpose | Data warehousing is the process of extracting and storing data to allow easier reporting. | Data mining is the use of pattern recognition logic to identify patterns.
4 | Managing Authorities | Data warehousing is solely carried out by engineers. | Data mining is carried out by business users with the help of engineers.
5 | Data Handling | Data warehousing is the process of pooling all relevant data together. | Data mining is considered a process of extracting data from large data sets.
6 | Functionality | Data warehouses are subject-oriented, integrated, time-variant, and non-volatile. | AI, statistics, databases, and machine learning systems are all used in data mining technologies.
7 | Task | Data warehousing is the process of extracting and storing data in order to make reporting more efficient. | Pattern recognition logic is used in data mining to find patterns.
8 | Uses | It extracts data and stores it in an orderly format, making reporting easier and faster. | This procedure employs pattern recognition tools to aid in the identification of access patterns.
9 | Examples | A data warehouse adds value when it is connected with operational business systems such as CRM (Customer Relationship Management) systems. | Data mining aids in the creation of suggestive patterns of key parameters, such as customer purchasing behavior, items, and sales. As a result, businesses will be able to make the required adjustments to their operations and production.
Data Mining | Statistics
Data used is numeric or non-numeric. | Data used is numeric.
Inductive process (generation of new hypotheses from data). | Deductive process (does not involve making any predictions).
Data cleaning is done in data mining. | Clean data is used to apply statistical methods.
Explores and gathers data first, then builds a model to detect patterns and make theories. | Provides theories to test using statistics.
Suitable for large data sets. | Suitable for smaller data sets.
Needs less user interaction to validate the model, hence easy to automate. | Needs user interaction to validate the model, hence difficult to automate.
It is an algorithm which learns from data without using any programming rule. | ... relationship in data in the form of ...
Skills required for data mining are Classification, Clustering, Neural Networks, Association, Estimation, and Sequence-based Analysis. | Skills required for statistics are Descriptive Statistics and Inferential Statistics.
Applications are Financial Data Analysis, Retail Industry, Telecommunication Industry. | Applications are Demography, Actuarial Science, Biostatistics, Quality Control.
Advantages and Disadvantages of Data Mining
Advantages | Disadvantages
It helps gather reliable information | Data mining tools are complex and require training to use
Helps businesses make operational adjustments | Data mining techniques are not infallible
Helps to make informed decisions | Rising privacy concerns
It helps detect risks and fraud | Data mining requires large databases
Helps to understand behaviours, trends and discover hidden patterns | Expensive
Helps to analyse very large quantities of data quickly |
Pros of Data Mining
It drives profitability and efficiency
It can be applied to any type of data and business problem
It can reveal hidden information and trends
Cons of Data Mining
Complexity
Results and benefits are not guaranteed
It can be expensive
Applications of Data Mining
Large quantities of data are now being accumulated, and the amount of data collected is
said to almost double every year. Data mining techniques are used to extract knowledge
from this massive data. Data mining is used almost everywhere that a large amount of data
is stored and processed. For example, banks typically use data mining to find prospective
customers who could be interested in credit cards, personal loans, or insurance. Since
banks have the transaction details and detailed profiles of their customers, they analyze
this data to find patterns that help them predict which customers could be interested in
personal loans and similar products.
Basically, the motive behind mining data, whether commercial or scientific, is the same –
the need to find useful information in data to enable better decision-making or a better
understanding of the world around us.
“Extraction of interesting information or patterns from data in large databases is known
as data mining.”
According to William J. Frawley, “Data mining, or KDD (Knowledge Discovery in
Databases) as it is also known, is the nontrivial extraction of implicit, previously
unknown, and potentially useful information from data.”
Technically, data mining is the computational process of analyzing data from different
perspectives, dimensions, angles and categorizing/summarizing it into meaningful
information. Data Mining can be applied to any type of data e.g. Data Warehouses,
Transactional Databases, Relational Databases, Multimedia Databases, Spatial Databases,
Time-series Databases, World Wide Web.
Data mining provides competitive advantages in the knowledge economy. It does this by
providing the maximum knowledge needed to rapidly make valuable business decisions
despite the enormous amounts of available data.
There are many measurable benefits that have been achieved in different application
areas from data mining. So, let’s discuss different applications of Data Mining:
Scientific Analysis: Scientific simulations generate large volumes of data every day. This
includes data collected from nuclear laboratories, data about human psychology, etc. Data
mining techniques are capable of analyzing these data. We can now capture and store new
data faster than we can analyze the old data already accumulated.
Examples of scientific analysis:
Sequence analysis in bioinformatics
Classification of astronomical objects
Medical decision support
Intrusion Detection: A network intrusion refers to any unauthorized activity on a
digital network. Network intrusions often involve stealing valuable network resources.
Data mining techniques play a vital role in detecting intrusions, network attacks, and
anomalies. These techniques help in selecting and refining useful and relevant
information from large data sets, and in classifying relevant data for an Intrusion
Detection System. An Intrusion Detection System generates alarms when network traffic
shows signs of a foreign invasion of the system. For example:
Detect security violations
Misuse Detection
Anomaly Detection
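To make the anomaly detection idea above more concrete, here is a minimal sketch that flags unusual values with a z-score (how many standard deviations a value lies from the mean). The traffic counts, the function name, and the threshold of 2.5 are assumptions chosen for illustration; a real Intrusion Detection System would use far richer features and methods.

```python
from statistics import mean, stdev


def find_anomalies(values, threshold=2.5):
    """Return (index, value) pairs whose z-score exceeds the threshold."""
    mu = mean(values)
    sigma = stdev(values)
    if sigma == 0:
        return []  # all values identical, nothing stands out
    return [
        (i, v) for i, v in enumerate(values)
        if abs(v - mu) / sigma > threshold
    ]


# Hypothetical requests-per-minute counts observed from one host
requests_per_minute = [52, 48, 50, 47, 53, 49, 51, 50, 48, 950, 52, 49]
print(find_anomalies(requests_per_minute))  # the burst of 950 requests is flagged
```

A simple statistical rule like this only catches values that differ sharply from the rest; it is shown here because it is easy to follow, not because it is sufficient on its own.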
Business Transactions: Every transaction in the business industry is often recorded for
perpetuity. Such transactions are usually time-related and can be inter-business deals or
intra-business operations. Using this data effectively and within a reasonable time frame
for competitive decision-making is the most important problem to solve for businesses
that struggle to survive in a highly competitive world. Data mining helps to analyze
these business transactions, identify marketing approaches, and support decision-making.
Examples:
Direct mail targeting
Stock trading
Customer segmentation
Churn prediction (Churn prediction is one of the most popular Big Data use cases in
business)
Market Basket Analysis: Market Basket Analysis is a technique for the careful study of
purchases made by customers in a supermarket. It identifies patterns of items that
customers frequently purchase together. Companies can use this analysis to promote deals,
offers, and sales, and data mining techniques help to carry out the analysis. Examples:
Data mining concepts are used in sales and marketing to provide better customer
service, improve cross-selling opportunities, and increase direct mail response rates.
Customer retention, through pattern identification and prediction of likely
defections, is made possible by data mining.
Risk assessment and fraud detection also use data mining to identify
inappropriate or unusual behavior.
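The market basket idea described above can be sketched in a few lines. The example counts how often pairs of items appear in the same basket (support) and how often one item is bought given that another was bought (confidence). The baskets, function names, and the 0.5 support cut-off are assumptions made for illustration only.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(baskets, min_support=0.5):
    """Return item pairs whose support (share of baskets containing both
    items) is at least min_support."""
    n = len(baskets)
    pair_counts = Counter()
    for basket in baskets:
        # Sort so that each pair is always counted under the same key
        for pair in combinations(sorted(set(basket)), 2):
            pair_counts[pair] += 1
    return {
        pair: count / n
        for pair, count in pair_counts.items()
        if count / n >= min_support
    }


def confidence(baskets, antecedent, consequent):
    """Share of baskets containing `antecedent` that also contain `consequent`."""
    with_antecedent = [b for b in baskets if antecedent in b]
    if not with_antecedent:
        return 0.0
    return sum(1 for b in with_antecedent if consequent in b) / len(with_antecedent)


# Hypothetical supermarket baskets
baskets = [
    ["bread", "milk"],
    ["bread", "butter", "milk"],
    ["bread", "butter"],
    ["milk", "eggs"],
]
print(frequent_pairs(baskets))
print(confidence(baskets, "butter", "bread"))
```

A pair with high support and high confidence is a candidate for a joint offer or for placing the items near each other in the store.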
Education: To analyze the education sector, data mining uses the Educational Data
Mining (EDM) method. This method generates patterns that can be used by both learners
and educators. Using EDM, we can perform educational tasks such as:
Predicting student admission in higher education
Student profiling
Predicting student performance
Evaluating teachers' teaching performance
Curriculum development
Predicting student placement opportunities
Research: Data mining techniques can perform prediction, classification, clustering,
association, and grouping of data in research. The rules generated by data mining are
used to derive results. In most technical research in data mining, we create a training
model and a testing model. The training/testing approach is a strategy for measuring the
accuracy of the proposed model. It is called Train/Test because we split the data set
into two sets: a training data set and a testing data set. The training data set is used
to build the model, whereas the testing data set is used to evaluate it. Examples:
Classification of uncertain data.
Information-based clustering.
Decision support system
Web Mining
Domain-driven data mining
IoT (Internet of Things) and Cybersecurity
Smart farming IoT (Internet of Things)
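The Train/Test strategy described under Research can be sketched as follows. The data set (hours studied and whether the student passed), the function names, and the 25% test share are assumptions for illustration; the "model" is deliberately trivial, a single threshold chosen on the training set and then scored on the unseen testing set.

```python
import random


def split_train_test(records, test_fraction=0.25, seed=42):
    """Shuffle the records and split them into (training set, testing set)."""
    rng = random.Random(seed)        # fixed seed so the split is repeatable
    shuffled = list(records)
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def accuracy(predict, examples):
    """Fraction of examples for which predict(features) equals the label."""
    correct = sum(1 for features, label in examples if predict(features) == label)
    return correct / len(examples)


# Hypothetical data: (hours studied, passed?)
data = [(hours, hours >= 5) for hours in range(1, 21)]
train, test = split_train_test(data, test_fraction=0.25)

# "Training": pick the pass threshold that fits the training set best
best_threshold = max(
    range(1, 21),
    key=lambda t: accuracy(lambda hours: hours >= t, train),
)

# "Testing": measure how well that threshold works on data it has not seen
test_accuracy = accuracy(lambda hours: hours >= best_threshold, test)
print(len(train), len(test), best_threshold, test_accuracy)
```

Keeping the testing set out of the training step is what makes the measured accuracy a fair estimate of how the model behaves on new data.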
Healthcare and Insurance: A pharmaceutical company can examine its recent sales force
activity and its outcomes to improve the targeting of high-value physicians and to figure
out which marketing activities will have the best effect in the upcoming months. In the
insurance sector, data mining can help to predict which customers will buy new policies,
identify behavior patterns of risky customers, and identify fraudulent behavior of
customers.
Claims analysis, i.e., which medical procedures are claimed together.
Identify successful medical therapies for different illnesses.
Characterize patient behavior to predict office visits.
Transportation: A diversified transportation company with a large direct sales force can
apply data mining to identify the best prospects for its services. A large consumer
merchandise organization can apply data mining to improve its business cycle with
retailers.
Determine the distribution schedules among outlets.
Analyze loading patterns.
Financial/Banking Sector: A credit card company can leverage its vast warehouse of
customer transaction data to identify customers most likely to be interested in a new
credit product.
Credit card fraud detection.
Identify ‘Loyal’ customers.
Extraction of information related to customers.
Determine credit card spending by customer groups.
Data Mining vs Statistics
Data Mining | Statistics
Explorative: dig out the data first, discover novel patterns, and then make theories. | Confirmative: provide theory first and then test it using various statistical tools.
Involves data cleaning. | Statistical methods are applied on clean data.
Usually involves working with large datasets. | Usually involves working with small datasets.
Makes generous use of heuristic thinking. | There is no scope for heuristic thinking.
Inductive process. | Deductive (does not involve making any predictions).
Numeric and non-numeric data. | Numeric data.
Less concerned about data collection. | More concerned about data collection.
Some of the popular data mining methods include Estimation, Classification, Neural Networks, Clustering, Association, and Visualization. | Some of the popular statistical methods include Inferential and Descriptive Statistics.
Challenges of Data Mining
Data mining, the process of extracting knowledge from data, has become increasingly
important as the amount of data generated by individuals, organizations, and machines
has grown exponentially. However, data mining is not without its challenges.
1] Data Quality
The quality of data used in data mining is one of the most significant challenges. The
accuracy, completeness, and consistency of the data affect the accuracy of the results
obtained. The data may contain errors, omissions, duplications, or inconsistencies, which
may lead to inaccurate results. Moreover, the data may be incomplete, meaning that some
attributes or values are missing, making it challenging to obtain a complete understanding
of the data.
Data quality issues can arise due to a variety of reasons, including data entry errors, data
storage issues, data integration problems, and data transmission errors. To address these
challenges, data mining practitioners must apply data cleaning and data preprocessing
techniques to improve the quality of the data. Data cleaning involves detecting and
correcting errors, while data preprocessing involves transforming the data to make it
suitable for data mining.
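A minimal sketch of the data cleaning step described above is shown below. It trims stray whitespace, treats empty strings as missing values, drops records that lack a required field, and removes exact duplicates. The record layout, field names, and function name are assumptions made for this example.

```python
def clean_records(records, required_fields):
    """Return a cleaned copy of `records` (a list of dictionaries)."""
    seen = set()
    cleaned = []
    for record in records:
        # Normalise text values: strip whitespace, treat "" as missing
        normalised = {}
        for key, value in record.items():
            if isinstance(value, str):
                value = value.strip()
                if value == "":
                    value = None
            normalised[key] = value

        # Omission: skip records that lack a required value
        if any(normalised.get(field) is None for field in required_fields):
            continue

        # Duplication: skip records already seen after normalisation
        fingerprint = tuple(sorted(normalised.items(), key=lambda kv: kv[0]))
        if fingerprint in seen:
            continue
        seen.add(fingerprint)
        cleaned.append(normalised)
    return cleaned


# Hypothetical raw customer records
raw = [
    {"id": 1, "name": " Ada ", "city": "Lagos"},
    {"id": 1, "name": "Ada", "city": "Lagos"},    # duplicate once trimmed
    {"id": 2, "name": "", "city": "Abuja"},       # missing name
    {"id": 3, "name": "Chidi", "city": None},     # city is optional here
]
print(clean_records(raw, required_fields=["id", "name"]))
```

Dropping incomplete records is only one option; depending on the analysis, missing values may instead be filled in with a default or an estimate.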
2] Data Complexity
Data complexity refers to the volume and variety of data generated by sources such as
sensors, social media, and the Internet of Things (IoT). The complexity of the data may
make it challenging to process, analyze, and understand. In addition, the data may be in
different formats, making it challenging to integrate into a single dataset.
To address this challenge, data mining practitioners use advanced techniques such as
clustering, classification, and association rule mining. These techniques help to identify
patterns and relationships in the data, which can then be used to gain insights and make
predictions.
3] Data Privacy and Security
Data privacy and security is another significant challenge in data mining. As more data is
collected, stored, and analyzed, the risk of data breaches and cyber-attacks increases. The
data may contain personal, sensitive, or confidential information that must be protected.
Moreover, data privacy regulations such as GDPR, CCPA, and HIPAA impose strict
rules on how data can be collected, used, and shared.
To address this challenge, data mining practitioners must apply data anonymization and
data encryption techniques to protect the privacy and security of the data. Data
anonymization involves removing personally identifiable information (PII) from the data,
while data encryption involves using algorithms to encode the data to make it unreadable
to unauthorized users.
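The removal of personally identifiable information described above might look like the sketch below. It drops the fields assumed to be PII and replaces the customer identifier with a keyed hash so that records can still be linked without exposing the original ID. The field names, the key, and the function name are assumptions for illustration. Strictly speaking, a keyed hash is pseudonymization rather than full anonymization, and whether it satisfies a given regulation has to be checked separately.

```python
import hashlib
import hmac

# Assumed list of fields that count as PII in this hypothetical data set
PII_FIELDS = {"name", "email", "phone"}


def pseudonymize(record, secret_key, id_field="customer_id"):
    """Drop PII fields and replace the identifier with a keyed hash."""
    # Keep only the fields that are not listed as PII
    out = {key: value for key, value in record.items() if key not in PII_FIELDS}
    # HMAC-SHA256 of the original ID; the same ID always maps to the same token
    digest = hmac.new(
        secret_key,
        str(record[id_field]).encode("utf-8"),
        hashlib.sha256,
    ).hexdigest()
    out[id_field] = digest[:16]  # shortened token, for readability only
    return out


# Hypothetical transaction record
record = {
    "customer_id": 1017,
    "name": "A. Customer",
    "email": "a.customer@example.com",
    "amount": 250.0,
}
print(pseudonymize(record, secret_key=b"demo-key-change-me"))
```

The secret key must be stored separately from the data; anyone who holds both could recompute the tokens for known IDs.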
4] Scalability
Data mining algorithms must be scalable to handle large datasets efficiently. As the size
of the dataset increases, the time and computational resources required to perform data
mining operations also increase. Moreover, the algorithms must be able to handle
streaming data, which is generated continuously and must be processed in real-time.
To address this challenge, data mining practitioners use distributed computing
frameworks such as Hadoop and Spark. These frameworks distribute the data and
processing across multiple nodes, making it possible to process large datasets quickly and
efficiently.
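The idea of distributing data and processing across nodes can be imitated on one machine. The sketch below splits a list of transactions into partitions, counts each partition independently (the "map" step), and then merges the partial counts (the "reduce" step). It runs in a single process and does not use the Hadoop or Spark APIs; the data and function names are assumptions for illustration.

```python
from collections import Counter
from functools import reduce


def partition(items, n_parts):
    """Split items into n_parts roughly equal partitions."""
    return [items[i::n_parts] for i in range(n_parts)]


def map_count(chunk):
    """Map step: count payment types within one partition."""
    return Counter(chunk)


def reduce_counts(left, right):
    """Reduce step: merge two partial counts."""
    return left + right


# Hypothetical payment types from a transaction log
transactions = ["card", "cash", "card", "transfer", "card", "cash"]

# Each partition could be processed on a different node
partials = [map_count(chunk) for chunk in partition(transactions, 3)]
total = reduce(reduce_counts, partials, Counter())
print(total)
```

Because each partition is counted without looking at the others, the map step can run in parallel; only the small partial results need to be brought together at the end.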
5] Interpretability
Data mining algorithms can produce complex models that are difficult to interpret. This is
because the algorithms use a combination of statistical and mathematical techniques to
identify patterns and relationships in the data. Moreover, the models may not be intuitive,
making it challenging to understand how the model arrived at a particular conclusion.
To address this challenge, data mining practitioners use visualization techniques to
represent the data and the models visually. Visualization makes it easier to understand the
patterns and relationships in the data and to identify the most important variables.
6] Ethics
Data mining raises ethical concerns related to the collection, use, and dissemination of
data. The data may be used to discriminate against certain groups, violate privacy rights,
or perpetuate existing biases. Moreover, data mining algorithms may not be transparent,
making it challenging to detect biases or discrimination.