Data Mining Notes
Financial data in the banking and finance industry is generally reliable and of high quality, which facilitates systematic data analysis and data mining. Some typical cases are as
follows −
Design and construction of data warehouses for multidimensional data analysis and
data mining.
Retail Industry
Data mining has great application in the retail industry because the industry collects large amounts of data on sales, customer purchasing history, goods transportation, consumption, and services.
Data mining in the retail industry helps in identifying customer buying patterns and trends, which leads to improved quality of customer service and better customer retention and satisfaction. Here is
a list of examples of data mining in the retail industry −
Design and Construction of data warehouses based on the benefits of data mining.
Customer Retention.
Telecommunication Industry
Today the telecommunication industry is one of the fastest-growing industries, providing various services such as fax, pager, cellular phone, internet messenger, images, e-mail, web data transmission, etc. Due to the development of new computer and communication technologies, the telecommunication industry is expanding rapidly. This is why data mining has become very important in helping to understand the business. Data mining in the telecommunication industry helps in identifying telecommunication patterns, catching fraudulent activities, making better use of resources, and improving quality of service. Here is a list of examples for which data mining improves telecommunication services −
Multidimensional Analysis of Telecommunication data.
In recent times, we have seen tremendous growth in fields of biology such as genomics, proteomics, functional genomics, and biomedical research. Biological data mining is a very important part of bioinformatics.
Following are the aspects in which data mining contributes to biological data analysis −
Discovery of structural patterns and analysis of genetic networks and protein pathways.
2. Define KDD
KDD (Knowledge Discovery in Databases) is the process of extracting useful, previously unknown, and potentially valuable information from large datasets. Its typical steps are data selection, preprocessing (cleaning), transformation, data mining, and interpretation/evaluation of the discovered patterns. KDD is an iterative process, and it usually requires multiple passes through these steps to extract accurate knowledge from the data.
Classification is a supervised learning technique used to categorize data into predefined classes
or labels. It involves building a model based on input features and their corresponding target
labels. The model is trained on a labeled dataset, where each data instance is associated with a
known class or category.
The goal of classification is to learn a mapping function from input features to output classes,
which can then be used to predict the class labels of new, unseen data instances. Common
classification algorithms include Decision Trees, Naive Bayes, Support Vector Machines (SVM),
and Neural Networks.
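A minimal sketch of this classification workflow, assuming scikit-learn is available; the synthetic dataset below merely stands in for real labeled data.

# Minimal classification sketch with scikit-learn (assumed available).
# The synthetic data stands in for real labeled training data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=200, n_features=4, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

clf = SVC(kernel="rbf")          # a Support Vector Machine classifier
clf.fit(X_train, y_train)        # learn the mapping from features to class labels
y_pred = clf.predict(X_test)     # predict classes of new, unseen instances
print("Accuracy:", accuracy_score(y_test, y_pred))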
Clustering:
Clustering is an unsupervised learning technique used to group similar data points together
based on their intrinsic properties or characteristics. Unlike classification, clustering does not
require labeled data and aims to discover hidden patterns or structures within the data.
Clustering algorithms partition the data into clusters or groups, where data points within the
same cluster are more similar to each other than to those in other clusters. The goal is to
maximize intra-cluster similarity and minimize inter-cluster similarity.
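A minimal clustering sketch, assuming scikit-learn; the blob data is synthetic and the choice of three clusters is arbitrary, purely for illustration.

# Minimal clustering sketch: group unlabeled points with K-means.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)  # unlabeled data (true labels ignored)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)          # assign each point to one of 3 clusters
print(labels[:10])                      # cluster indices discovered without any labels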
Anomaly Detection:
Anomaly detection involves identifying rare or unusual patterns or instances in data that do not
conform to expected behavior. It is used to detect outliers, anomalies, or fraudulent activities in
various domains such as finance, cybersecurity, and healthcare. Examples include detecting
fraudulent transactions, network intrusions, or equipment failures.
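A small anomaly-detection sketch, assuming scikit-learn; the "transaction amounts" below are random numbers with a few injected extremes, purely for illustration.

# Flag unusual points with an Isolation Forest (unsupervised anomaly detection).
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
amounts = rng.normal(100, 15, size=(500, 1))          # "normal" transaction amounts
amounts[:5] = [[900], [850], [5], [1200], [3]]        # a few injected anomalies

detector = IsolationForest(contamination=0.01, random_state=0)
flags = detector.fit_predict(amounts)                  # -1 = anomaly, 1 = normal
print("Anomalies found:", (flags == -1).sum())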
Privacy − This is a loaded issue. In recent years privacy concerns have taken on a more important role in American society as merchants, insurance companies, and government agencies amass warehouses of personal records. The concerns that people have over the collection of this data naturally extend to the analytic capabilities applied to that data. Users of data mining should start thinking about how their use of this technology will be affected by legal issues associated with privacy.
Profiling − Data mining and profiling is a developing field that attempts to organize, understand, analyze, reason about, and use the explosion of data in this information age. The process involves using algorithms and experience to extract patterns or anomalies that are very complex, difficult, or time-consuming to identify.
Unauthorized Use − Trends obtained through data mining, even when intended for marketing or other ethical purposes, can be misused. Unethical businesses or people can use the data obtained through data mining to take advantage of vulnerable people or discriminate against a specific group of people. Furthermore, data mining techniques are not 100 percent accurate; thus mistakes do occur, which can have serious consequences.
6. What is the difference between data mining and normal query processing?
Decision Trees:
Decision trees are a popular machine learning technique used for classification and regression
tasks. They represent a tree-like structure where each internal node represents a decision
based on an attribute, each branch represents the outcome of the decision, and each leaf node
represents the class label or predicted value.
In classification tasks, decision trees recursively split the dataset into subsets based on the
values of attributes, aiming to maximize the purity of each subset (i.e., minimize impurity or
uncertainty). This splitting process continues until the data is fully classified or a stopping
criterion is met.
Decision trees are interpretable and easy to understand, making them useful for explaining the
decision-making process. However, they may suffer from overfitting, where the model captures
noise or irrelevant details in the training data, leading to poor generalization on unseen data.
Example: A decision tree can be used to predict whether a customer will purchase a product
based on demographic features such as age, income, and location. The tree would make
decisions at each node (e.g., if age < 30, if income > $50,000), ultimately leading to a prediction
of whether the customer will buy the product or not.
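A hedged sketch of that example, assuming scikit-learn; the customer records, thresholds, and the location encoding below are invented for illustration.

# Predict "will buy" from age, income, and location with a decision tree.
# All customer rows are made up; location is encoded as 0 = urban, 1 = rural.
from sklearn.tree import DecisionTreeClassifier, export_text

X = [[25, 40000, 0], [45, 80000, 1], [35, 52000, 0], [52, 30000, 1], [28, 60000, 0]]
y = [0, 1, 1, 0, 1]                      # 1 = bought the product, 0 = did not

tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(X, y)
print(export_text(tree, feature_names=["age", "income", "location"]))
print(tree.predict([[29, 55000, 0]]))    # prediction for a new customer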
The Apriori algorithm is a well-known algorithm for association rule mining. It works by
generating candidate itemsets of increasing sizes and pruning those that do not meet a
minimum support threshold. Association rules are then generated from frequent itemsets,
where a rule consists of an antecedent (left-hand side) and a consequent (right-hand side).
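A compact, pure-Python sketch of the Apriori idea (level-wise candidate generation, support counting, and rule derivation); the toy transactions and the 50% support / 70% confidence thresholds are arbitrary choices for illustration.

# Toy Apriori-style mining: find frequent itemsets, then derive association rules.
from itertools import combinations

transactions = [{"bread", "milk"}, {"bread", "diaper", "beer"},
                {"milk", "diaper", "beer"}, {"bread", "milk", "diaper"},
                {"bread", "milk", "beer"}]
min_support, min_conf = 0.5, 0.7
n = len(transactions)

def support(itemset):
    return sum(itemset <= t for t in transactions) / n

# Level-wise search: only extend itemsets that are already frequent (Apriori property).
items = {frozenset([i]) for t in transactions for i in t}
frequent, current = {}, {s for s in items if support(s) >= min_support}
while current:
    frequent.update({s: support(s) for s in current})
    current = {a | b for a in current for b in current
               if len(a | b) == len(a) + 1 and support(a | b) >= min_support}

# Rules X -> Y with confidence = support(X union Y) / support(X).
for itemset, sup in frequent.items():
    for ante_size in range(1, len(itemset)):
        for ante in map(frozenset, combinations(itemset, ante_size)):
            conf = sup / frequent[ante]
            if conf >= min_conf:
                print(set(ante), "->", set(itemset - ante),
                      f"support={sup:.2f} confidence={conf:.2f}")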
Data warehouses serve as the foundation for data mining by providing a centralized repository
of high-quality, integrated data for analysis. The structured and consolidated nature of data
warehouses facilitates efficient data mining operations.
Data mining leverages the rich historical data stored in the data warehouse to uncover valuable
insights, patterns, and relationships that can inform strategic decision-making, improve
operational efficiency, and drive business growth.
In summary, while data warehousing focuses on the storage and management of data, data
mining focuses on the analysis and extraction of knowledge from that data. Together, they form
an integrated approach to harnessing the power of data for business intelligence and decision
support.
• Interactive mining of knowledge at multiple levels of abstraction. - The data mining process
needs to be interactive because it allows users to focus the search for patterns, providing and
refining data mining requests based on returned results.
• Data mining query languages and ad hoc data mining. - Data Mining Query language that
allows the user to describe ad hoc mining tasks, should be integrated with a data warehouse
query language and optimized for efficient and flexible data mining.
• Presentation and visualization of data mining results. - Once the patterns are discovered, they need to be expressed in high-level languages and visual representations. These representations should be easily understandable by the users.
• Handling noisy or incomplete data. - Data cleaning methods are required that can handle noise and incomplete objects while mining data regularities. If such data cleaning methods are not available, the accuracy of the discovered patterns will be poor.
• Efficiency and scalability of data mining algorithms. - In order to effectively extract information from the huge amounts of data in databases, data mining algorithms must be efficient and scalable.
11. Explain the differences between Knowledge discovery and data mining.
12. Explain data cleaning.
Data cleaning, also known as data cleansing or data scrubbing, is the process of identifying and
correcting errors, inconsistencies, inaccuracies, and missing values in a dataset to improve its
quality and reliability for analysis. It is a crucial step in the data preprocessing pipeline before
performing data mining or analysis tasks. Here's an explanation of data cleaning:
The first step in data cleaning is to identify errors and inconsistencies in the dataset. This may
include detecting missing values, outliers, duplicates, incorrect data formats, typographical
errors, and other anomalies that could affect the integrity and accuracy of the data.
Missing values are common in datasets and can arise due to various reasons such as data entry
errors, equipment malfunction, or intentional non-responses. Data cleaning involves handling
missing values by either imputing them with estimated values (e.g., mean, median, mode),
removing them from the dataset, or using advanced imputation techniques such as regression
or k-nearest neighbors.
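A small sketch of these options using pandas (assumed available); the column names and values are made up.

# Handle missing values: drop incomplete rows, or impute with a summary statistic.
import pandas as pd
import numpy as np

df = pd.DataFrame({"age": [25, np.nan, 40, 35], "income": [50000, 62000, np.nan, 58000]})

dropped = df.dropna()                                  # option 1: remove incomplete rows
imputed = df.fillna({"age": df["age"].mean(),          # option 2: mean imputation
                     "income": df["income"].median()}) # or median, mode, a predictive model, etc.
print(imputed)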
Data cleaning involves correcting errors and inconsistencies in the dataset to ensure its
accuracy and reliability. This may include correcting typographical errors, standardizing data
formats, resolving inconsistencies between related attributes, and reconciling conflicting
information from different sources.
• The Star Schema data model is the simplest type of Data Warehouse schema.
• It is also known as Star Join Schema and is optimized for querying large data sets.
In the following star schema example, the fact table is at the center and contains keys to every dimension table, such as Dealer_ID, Model_ID, Date_ID, Product_ID, and Branch_ID, along with other attributes like units sold and revenue.
5. Dimensional Model
6. Fact Tables
7. Dimension Tables
8. Metadata Repository
9. Data Marts
The first step in OLAP involves creating a multidimensional data cube, also known as a
hypercube or OLAP cube. A data cube represents data in multiple dimensions, such as time,
geography, product, and sales, allowing for analysis from different perspectives.
Dimensional Hierarchies:
Each dimension in the data cube is organized into a hierarchical structure with levels of
granularity. For example, a time dimension may have levels such as year, quarter, month, and
day. Similarly, a product dimension may have levels such as category, subcategory, and
product.
OLAP cubes are precomputed and aggregated at various levels of granularity to improve query
performance. Aggregation involves summarizing and consolidating data across different
dimensions. Precomputation involves calculating and storing aggregated values for faster
retrieval during analysis.
OLAP allows users to slice, dice, and drill-down into the data cube to explore data from
different perspectives. Slicing involves selecting a subset of data along one or more dimensions.
Dicing involves viewing data from multiple dimensions simultaneously. Drill-down involves
navigating from higher-level summaries to lower-level details.
OLAP enables users to roll-up or roll-down data along hierarchical dimensions to aggregate or
disaggregate data. Rolling up involves summarizing data from lower-level to higher-level
dimensions (e.g., from daily to monthly sales). Rolling down involves decomposing data from
higher-level to lower-level dimensions (e.g., from yearly to quarterly sales).
OLAP allows users to pivot or rotate the data cube to view it from different angles or
orientations. Pivoting involves reorienting the dimensions to change the perspective of analysis.
Rotation involves swapping the dimensions to explore data from alternative viewpoints.
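A rough illustration of slicing, roll-up, and pivoting on a tiny flat sales table using pandas; the table, column names, and hierarchy (year > quarter) are invented for illustration, and a real OLAP server would precompute such aggregates in the cube.

# Emulate basic OLAP operations (slice, roll-up, pivot) on a flat sales table.
import pandas as pd

sales = pd.DataFrame({
    "year":    [2023, 2023, 2023, 2024, 2024, 2024],
    "quarter": ["Q1", "Q2", "Q1", "Q1", "Q2", "Q2"],
    "region":  ["East", "East", "West", "West", "East", "West"],
    "revenue": [100, 120, 90, 110, 130, 95],
})

slice_q1 = sales[sales["quarter"] == "Q1"]                      # slice: fix one dimension
rollup = sales.groupby("year")["revenue"].sum()                 # roll-up: quarter -> year
pivot = sales.pivot_table(values="revenue", index="region",     # pivot: rotate dimensions
                          columns="quarter", aggfunc="sum")
print(rollup, pivot, sep="\n\n")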
b. Subject-Oriented:
A data warehouse is designed to focus on specific subject areas or business processes relevant
to the organization. It organizes data around key business topics such as sales, marketing,
finance, and inventory, enabling users to analyze data within the context of their business
domain.
c. Time-Variant:
Data warehouses typically maintain historical data over time, allowing users to analyze trends,
patterns, and changes in data over different time periods. Historical data is valuable for trend
analysis, forecasting, and decision-making.
Data warehouses provide a centralized and integrated view of data from multiple sources,
allowing organizations to make informed decisions based on comprehensive insights. By
analyzing data stored in the data warehouse, decision-makers can identify trends, patterns, and
correlations, enabling them to make strategic decisions that drive business growth and
competitive advantage.
Data warehouses support advanced analytics and reporting capabilities, enabling organizations
to derive actionable intelligence from their data. With the ability to perform complex queries,
ad-hoc analysis, and generate custom reports, users can gain deeper insights into business
performance, customer behavior, market trends, and operational efficiency. This enhanced
business intelligence empowers organizations to optimize processes, improve customer
satisfaction, and capitalize on opportunities in the marketplace.
Data sources are where data originates from, such as operational databases, spreadsheets,
CRM systems, ERP systems, and external sources like social media platforms.
ETL Process:
ETL stands for Extract, Transform, and Load. In this process, data is extracted from various
sources, transformed to fit the data warehouse schema, and loaded into the data warehouse.
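A toy end-to-end ETL sketch in Python; the CSV content, field names, and target "warehouse" (an in-memory SQLite database) are all assumptions made for illustration.

# Toy ETL: extract raw CSV text, transform it, load it into a SQLite "warehouse".
import csv, io, sqlite3

raw = "order_id,amount,country\n1, 10.5 ,us\n2, 20.0 ,DE\n"      # extract (source data)

rows = []
for rec in csv.DictReader(io.StringIO(raw)):                     # transform: clean and conform
    rows.append((int(rec["order_id"]), float(rec["amount"]), rec["country"].strip().upper()))

warehouse = sqlite3.connect(":memory:")                          # load into the warehouse
warehouse.execute("CREATE TABLE fact_orders (order_id INT, amount REAL, country TEXT)")
warehouse.executemany("INSERT INTO fact_orders VALUES (?, ?, ?)", rows)
print(warehouse.execute("SELECT country, SUM(amount) FROM fact_orders GROUP BY country").fetchall())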
The data warehouse database is the central repository that stores integrated, cleansed, and
structured data from multiple sources. It is optimized for analytical querying and reporting.
The data warehouse schema defines the structure and organization of data within the data
warehouse. Common schema designs include star schema, snowflake schema, and galaxy
schema.
Data Marts:
Data marts are subsets of the data warehouse that focus on specific subject areas or business
units. They contain pre-aggregated or summarized data tailored to the needs of a particular
user group or analytical application.
OLAP Tools:
OLAP (Online Analytical Processing) tools provide users with the ability to query, analyze, and
visualize data stored in the data warehouse. They support interactive analysis and reporting,
allowing users to gain insights and make informed decisions.
Metadata Repository:
The metadata repository stores metadata, which is data about the data in the data warehouse.
It includes information about data sources, data transformations, data definitions, and data
lineage.
• Data reduction - Obtains a reduced representation of the data volume that nevertheless produces the same or similar analytical results.
• Data discretization - Data discretization is a data preprocessing technique used in data mining and machine learning to reduce the number of values of continuous attributes (or features) by dividing them into intervals or categories, as sketched below.
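A quick sketch of discretization with pandas; the age values and the bin edges are made-up choices.

# Discretize a continuous attribute into labeled intervals with pandas.
import pandas as pd

ages = pd.Series([22, 35, 47, 58, 63, 29, 41])
bins = pd.cut(ages, bins=[0, 30, 50, 100], labels=["young", "middle", "senior"])
print(bins.value_counts())      # each age is now a category instead of a raw number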
22. How does a snowflake schema differ from a star schema? Name two
advantages and two disadvantages of the snowflake schema.
Difference:
Star Schema: In a star schema, the dimensional model consists of a central fact table
surrounded by multiple dimension tables. Each dimension table is directly connected to the fact
table, forming a star-like structure.
Snowflake Schema: In a snowflake schema, the dimensional model is normalized, meaning that
dimension tables are further divided into sub-dimension tables. This results in a more complex
structure resembling a snowflake, with multiple levels of normalization.
Advantages of the snowflake schema:
Reduced Data Redundancy: Normalizing dimension tables into sub-dimension tables removes duplicated attribute values and saves storage space.
Improved Data Consistency: With normalized tables, updates and modifications to dimension attributes are applied consistently across multiple sub-dimension tables, ensuring data integrity.
Disadvantages of the snowflake schema:
Query Performance: The snowflake schema may suffer from slower query performance compared to the star schema, especially when dealing with deep levels of normalization and a large number of joins.
Increased Complexity: The normalized, multi-level structure is harder to design, understand, and maintain than the simpler star schema.
In dimensional modeling, the transaction record is divided into either "facts," which are typically numerical transaction data, or "dimensions," which are the reference information that gives context to the facts.
For example, a sale transaction can be broken down into facts such as the number of products ordered and the price paid for the products, and into dimensions such as order date, customer name, product number, order ship-to and bill-to locations, and the salesperson responsible for receiving the order.
• To produce a database architecture that is easy for end users to understand and write queries against.
• To maximize the efficiency of queries. It achieves these goals by minimizing the number of tables and the relationships between them.
Data cleaning is one of the essential steps in data preprocessing, aimed at identifying and
correcting errors, inconsistencies, and missing values in the dataset. Here's an explanation of
data cleaning:
Identifying Missing Values:
The first step in data cleaning is to identify missing values in the dataset. Missing values can
occur due to various reasons such as data entry errors, equipment malfunction, or
non-responses.
Once missing values are identified, they need to be handled appropriately. There are several
techniques for handling missing values, including: Deleting Rows or Columns: If the number of
missing values is small compared to the total dataset, deleting rows or columns containing
missing values may be a viable option.
Imputation:
Imputation involves replacing missing values with estimated or imputed values. Common
imputation methods include mean imputation, median imputation, mode imputation, or using
predictive models to estimate missing values.
Data cleaning also involves correcting errors and inconsistencies in the dataset. This may
include correcting typographical errors, standardizing data formats, resolving inconsistencies
between related attributes, and reconciling conflicting information from different sources.
Handling Outliers: Outliers are data points that significantly deviate from the rest of the data.
Data cleaning may involve identifying and handling outliers using techniques such as trimming,
winsorizing, or transforming the data to reduce the impact of outliers on the analysis.
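A brief sketch of percentile-based winsorizing with NumPy; the data and the 5th/95th-percentile cutoffs are illustrative choices.

# Winsorize: clip extreme values to chosen percentiles to soften outlier impact.
import numpy as np

values = np.array([12, 14, 13, 15, 11, 240, 13, 12, -90, 14], dtype=float)
low, high = np.percentile(values, [5, 95])
winsorized = np.clip(values, low, high)     # extremes pulled back to the cutoffs
print(winsorized)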
Support is a measure used in association rule mining to indicate the frequency or occurrence of a particular itemset in a dataset. It represents the proportion of transactions in the dataset that contain the itemset. Mathematically, support is calculated as the number of transactions containing the itemset divided by the total number of transactions in the dataset. Higher support values indicate that the itemset is more frequently occurring in the dataset.
Confidence:
Confidence is a measure used in association rule mining to indicate the reliability or strength of the association between two itemsets. It represents the conditional probability that a transaction containing one itemset also contains another itemset. Mathematically, confidence is calculated as the number of transactions containing both itemsets divided by the number of transactions containing the first itemset.
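These two definitions can be checked on a tiny example; the toy transactions below are invented purely for illustration.

# Compute support and confidence for the rule {bread} -> {milk} on toy transactions.
transactions = [{"bread", "milk"}, {"bread"}, {"milk", "diaper"}, {"bread", "milk", "beer"}]

def support(itemset):
    return sum(itemset <= t for t in transactions) / len(transactions)

sup_rule = support({"bread", "milk"})                 # transactions containing both / all transactions
confidence = sup_rule / support({"bread"})            # P(milk | bread)
print(f"support={sup_rule:.2f}, confidence={confidence:.2f}")   # 0.50 and 0.67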
Frequent Sets:
Frequent sets refer to sets of items that appear together in transactions with a frequency above a specified threshold. In association rule mining, frequent sets are crucial as they form the basis for discovering meaningful associations between items.
Association Rule:
An association rule is a relationship between two sets of items in a dataset, typically expressed in the form X→Y. It indicates that if itemset X appears in a transaction, then itemset Y is likely to appear as well.
Association rules are discovered through analysis of frequent itemsets and are evaluated based
on measures like support and confidence.
The Apriori algorithm can suffer from high computational complexity, especially when dealing
with large datasets or datasets with a large number of unique items.
Generating candidate itemsets and scanning the dataset multiple times to calculate support can
be computationally intensive, leading to increased runtime and memory requirements.
As the number of items and transactions in the dataset grows, the number of candidate
itemsets also increases exponentially, resulting in longer execution times and scalability issues.
The Apriori algorithm may encounter challenges when dealing with sparse datasets where most
itemsets have low support.
In datasets with sparse or uneven distribution of items, the algorithm may generate a large
number of candidate itemsets, many of which may have low support and are eventually
pruned.
Sparse datasets can result in inefficient memory usage and longer runtime, as the algorithm
spends time generating and processing candidate itemsets that do not contribute significantly
to the discovery of frequent itemsets and association rules.
27. What is Apriori property?
The Apriori property is an important concept in association rule mining, particularly in the
Apriori algorithm. It states that if an itemset is frequent, then all of its subsets must also be
frequent. In other words, if an itemset is considered frequent (i.e., its support is above a
specified threshold), then all of its subsets must meet the same support threshold.
This property simplifies the process of generating candidate itemsets in the Apriori algorithm.
Instead of considering all possible combinations of items, the algorithm only generates
candidate itemsets from frequent itemsets, reducing the search space and improving efficiency.
Incremental Apriori
Revealing Hidden Patterns: Association rule mining uncovers hidden patterns and
relationships within data.
Fast Query Performance: Queries in star schema are fast due to simplified joins
between the fact table and dimension tables, leading to quick data retrieval.
Efficient Data Access: Star schema enables rapid access to specific data points within
the dataset, crucial for decision-making and analysis.
A discrete value is one of a finite or countably infinite set of values, for example, age, size, etc. Models where the target values are continuous are called regression models; continuous variables are floating-point variables. These two kinds of models together are called CART (Classification and Regression Trees). CART uses the Gini index as its classification metric.
Bayes' theorem states:
P(A|B) = [ P(B|A) × P(A) ] / P(B)
The denominator P(B) can be expanded as P(B) = P(B|A) × P(A) + P(B|not A) × P(not A), where P(not A) = 1 − P(A).
Bayes' theorem consists of several terms whose names are given based on the context of its application in the equation.
33. What are the advantages and disadvantages of decision trees over other classification methods?
Advantages Of Decision Tree Classification
1. Decision tree classification does not require any domain knowledge; hence, it is appropriate for the knowledge discovery process.
2. The representation of data in the form of a tree is easily understood by humans and is intuitive.
Disadvantages Of Decision Tree Classification
1. Sometimes decision trees become very complex, and these are called overfitted trees.
2. The decision trees may return a biased solution if some class label dominates the training data.
#1) Prepruning: In this approach, the construction of the decision tree is stopped early. It means it is decided not to further partition the branches. The last node constructed becomes the leaf node, and this leaf node may hold the most frequent class among the tuples. The attribute selection measures are used to find out the weightage of the split. Threshold values are prescribed to decide which splits are regarded as useful. If the partitioning of a node results in a split that falls below the threshold, the process is halted.
#2) Postpruning: This method removes the outlier branches from a fully grown tree. The
unwanted branches are removed and replaced by a leaf node denoting the most frequent class
label. This technique requires more computation than prepruning, however, it is more reliable.
The pruned trees are more precise and compact when compared to unpruned trees but they
carry a disadvantage of replication and repetition. Repetition occurs when the same attribute is
tested again and again along a branch of a tree. Replication occurs when the duplicate subtrees
are present within the tree. These issues can be solved by multivariate splits.
The “Naive” part of the name indicates the simplifying assumption made by the Naïve Bayes classifier. The classifier assumes that the features used to describe an observation are conditionally independent, given the class label. The “Bayes” part of the name refers to Reverend Thomas Bayes, an 18th-century statistician and theologian who formulated Bayes’ theorem. Consider a fictional dataset that describes the weather conditions for playing a game of golf. Given the weather conditions, each tuple classifies the conditions as fit (“Yes”) or unfit (“No”) for playing golf.
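A hedged sketch of Naive Bayes on a small made-up "play golf" table, assuming scikit-learn; the rows, columns, and values below are illustrative stand-ins, not the original dataset from the notes.

# Naive Bayes on categorical weather features (illustrative data).
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder
from sklearn.naive_bayes import CategoricalNB

data = pd.DataFrame({
    "Outlook": ["Sunny", "Sunny", "Overcast", "Rainy", "Rainy", "Overcast"],
    "Windy":   ["False", "True",  "False",    "False", "True",  "True"],
    "Play":    ["No",    "No",    "Yes",      "Yes",   "No",    "Yes"],
})

encoder = OrdinalEncoder()                       # turn categories into integer codes
X = encoder.fit_transform(data[["Outlook", "Windy"]])
y = data["Play"]

model = CategoricalNB()                          # assumes conditional independence of features
model.fit(X, y)

new_day = encoder.transform([["Sunny", "False"]])
print(model.predict(new_day))                    # predicted class for the new conditions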
Logistic regression is a linear model used for binary classification tasks. It estimates the probability that a given instance belongs to a particular class using a logistic (or sigmoid) function. Despite its name, it is a classification algorithm rather than a regression one.
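A minimal logistic regression sketch with scikit-learn; the synthetic data merely stands in for a real binary-labeled dataset.

# Binary classification with logistic regression: class probabilities via the sigmoid.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=3, n_informative=3,
                           n_redundant=0, random_state=1)
clf = LogisticRegression().fit(X, y)
print(clf.predict_proba(X[:3]))   # estimated probability of each class per instance
print(clf.predict(X[:3]))         # thresholded class predictions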
Decision Trees:
Decision trees partition the feature space into regions, making decisions based on simple rules inferred from the data. They're intuitive and easy to interpret, making them suitable for visualization and explaining the model's predictions. Common algorithms include CART (Classification and Regression Trees) and ID3 (Iterative Dichotomiser 3).
SVM is a powerful supervised learning algorithm used for classification and regression tasks. It
finds the hyperplane that best separates classes in the feature space with the maximum
margin. SVM can handle high-dimensional data and is effective even in cases where the number
of features exceeds the number of samples.
Naive Bayes:
Naive Bayes is a probabilistic classifier based on Bayes' theorem with a strong independence assumption between features. Despite its simplicity, it often performs well in text classification and other tasks with high-dimensional data.
Decision trees make decisions by recursively splitting the data based on the values of features. The splitting criterion aims to maximize the homogeneity (or purity) of the resulting subsets. Common criteria include Gini impurity, entropy, and classification error.
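These three criteria can be written directly; a small sketch follows, with class proportions computed from a hypothetical label list.

# Splitting criteria for decision trees, computed from class proportions.
from collections import Counter
from math import log2

def proportions(labels):
    counts = Counter(labels)
    return [c / len(labels) for c in counts.values()]

def gini(labels):                 # 1 - sum(p_i^2)
    return 1 - sum(p * p for p in proportions(labels))

def entropy(labels):              # -sum(p_i * log2(p_i))
    return -sum(p * log2(p) for p in proportions(labels))

def classification_error(labels): # 1 - max(p_i)
    return 1 - max(proportions(labels))

node = ["yes", "yes", "yes", "no"]          # hypothetical labels at one node
print(gini(node), entropy(node), classification_error(node))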
Nodes:
Decision trees consist of nodes that represent decision points or questions about the features. The initial node is called the root node, and subsequent nodes are called internal nodes. Internal nodes represent feature attributes, and leaf nodes represent class labels.
Branches:
Branches emanate from nodes and represent the possible outcomes or decisions based on the
value of a feature. Each branch leads to a child node corresponding to a specific value of the
feature.
Decision Rules:
The decision rules at each node are based on the splitting criteria. These rules determine which branch to follow based on the feature values of the instances.
WEKA stands for Waikato Environment for Knowledge Analysis. It is a popular suite of machine
learning software written in Java, developed at the University of Waikato in New Zealand.
WEKA provides a comprehensive set of tools for data preprocessing, classification, regression,
clustering, association rules mining, and visualization.
User-Friendly Interface: WEKA offers a user-friendly graphical interface that allows users to
perform various machine learning tasks without the need for extensive programming
knowledge.
Open-Source: WEKA is open-source software distributed under the GNU General Public License
(GPL). This means that users can access the source code, modify it according to their needs, and
contribute to its development.
Integration with Java: Since WEKA is implemented in Java, it seamlessly integrates with
Java-based applications and frameworks, making it easy to incorporate machine learning
capabilities into Java projects.
Educational Resources: WEKA is widely used in academic settings for teaching and learning
purposes. It provides numerous educational resources, including tutorials, documentation, and
example datasets, which are valuable for students and researchers studying machine learning
and data mining concepts.
Scalability: While WEKA may not be as scalable as some other machine learning libraries
designed for big data processing, it still performs well on medium-sized datasets and is suitable
for prototyping and experimentation.
40.What is Clustering? What are different types of clustering?
Clustering is a technique used in unsupervised learning to group similar data points together based on their characteristics or features. The goal of clustering is to partition a dataset into groups, or clusters, where data points within the same cluster are more similar to each other than to those in other clusters. Here are different types of clustering methods:
Partitioning Methods:
Partitioning methods divide the dataset into non-overlapping clusters, and each data point belongs to exactly one cluster. Examples include K-means clustering and K-medoids clustering.
Hierarchical Methods:
Hierarchical clustering creates a hierarchy of clusters where clusters can contain sub-clusters. It can be agglomerative, starting with individual data points as clusters and iteratively merging them, or divisive, starting with all data points in one cluster and recursively splitting them. Examples include Agglomerative Hierarchical Clustering (bottom-up) and Divisive Hierarchical Clustering (top-down).
Density-Based Methods:
Density-based methods group together data points that are densely packed in the feature space, forming regions of high density separated by regions of low density. Clusters can be irregularly shaped and of varying sizes. Examples include DBSCAN (Density-Based Spatial Clustering of Applications with Noise) and OPTICS (Ordering Points To Identify the Clustering Structure).
Distribution-Based Methods:
Distribution-based methods model the distribution of the data using probability density functions. They assume that data points are generated from a mixture of probability distributions. Examples include Gaussian Mixture Models (GMM) and Expectation-Maximization (EM) clustering.
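A short sketch contrasting a density-based and a distribution-based method on the same synthetic data, assuming scikit-learn; the parameters below (eps, number of components) are illustrative guesses rather than tuned values.

# Compare DBSCAN (density-based) and a Gaussian Mixture Model (distribution-based).
from sklearn.datasets import make_moons
from sklearn.cluster import DBSCAN
from sklearn.mixture import GaussianMixture

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)   # two crescent-shaped groups

db_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)      # handles irregular shapes; -1 = noise
gmm_labels = GaussianMixture(n_components=2, random_state=0).fit_predict(X)

print("DBSCAN clusters:", set(db_labels))
print("GMM clusters:   ", set(gmm_labels))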