Data Mining Notes

1. Give any two applications of data mining.

Financial Data Analysis

Financial data in the banking and financial industry is generally reliable and of high quality, which facilitates systematic data analysis and data mining. Some typical cases are as follows −

 Design and construction of data warehouses for multidimensional data analysis and
data mining.

 Loan payment prediction and customer credit policy analysis.

 Classification and clustering of customers for targeted marketing.

 Detection of money laundering and other financial crimes.

Retail Industry

Data mining has great application in the retail industry because retailers collect large amounts of data on sales, customer purchasing history, goods transportation, consumption, and services. Data mining in the retail industry helps identify customer buying patterns and trends, leading to improved quality of customer service and better customer retention and satisfaction. Here is a list of examples of data mining in the retail industry −

 Design and Construction of data warehouses based on the benefits of data mining.

 Multidimensional analysis of sales, customers, products, time and region.

 Analysis of effectiveness of sales campaigns.

 Customer Retention.

Telecommunication Industry

Today the telecommunication industry is one of the fastest-emerging industries, providing services such as fax, pager, cellular phone, internet messaging, images, e-mail, and web data transmission. Due to the development of new computer and communication technologies, the telecommunication industry is expanding rapidly. This is why data mining has become very important for understanding the business. Data mining in the telecommunication industry helps identify telecommunication patterns, catch fraudulent activities, make better use of resources, and improve quality of service. Here is a list of examples for which data mining improves telecommunication services −
 Multidimensional Analysis of Telecommunication data.

 Fraudulent pattern analysis.

 Identification of unusual patterns.

Biological Data Analysis

In recent times, we have seen tremendous growth in fields of biology such as genomics, proteomics, functional genomics, and biomedical research. Biological data mining is a very important part of bioinformatics.

Following are the aspects in which data mining contributes to biological data analysis −

 Semantic integration of heterogeneous, distributed genomic and proteomic databases.

 Alignment, indexing, similarity search, and comparative analysis of multiple nucleotide sequences.

 Discovery of structural patterns and analysis of genetic networks and protein pathways.

 Association and path analysis.

 Visualization tools in genetic data analysis.

2. Define KDD
KDD (Knowledge Discovery in Databases) is the process of extracting useful, previously unknown, and potentially valuable information from large datasets. KDD is an iterative process: steps such as data selection, preprocessing, transformation, data mining, and pattern evaluation are typically repeated several times to extract accurate knowledge from the data.

3. Explain different data mining tasks.


Classification:

Classification is a supervised learning technique used to categorize data into predefined classes
or labels. It involves building a model based on input features and their corresponding target
labels. The model is trained on a labeled dataset, where each data instance is associated with a
known class or category.

The goal of classification is to learn a mapping function from input features to output classes,
which can then be used to predict the class labels of new, unseen data instances. Common
classification algorithms include Decision Trees, Naive Bayes, Support Vector Machines (SVM),
and Neural Networks.
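To make this concrete, here is a minimal classification sketch in Python, assuming scikit-learn is available and using synthetic data (neither the library nor the data comes from these notes):

```python
# Minimal classification sketch (scikit-learn assumed; synthetic data, not from the notes).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Labeled training data: each instance has 4 features and a known class (0 or 1).
X, y = make_classification(n_samples=200, n_features=4, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

clf = SVC(kernel="rbf")            # any classifier listed above could be substituted here
clf.fit(X_train, y_train)          # learn the mapping from features to class labels
print("accuracy:", accuracy_score(y_test, clf.predict(X_test)))  # evaluate on unseen data
```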

Clustering:

Clustering is an unsupervised learning technique used to group similar data points together
based on their intrinsic properties or characteristics. Unlike classification, clustering does not
require labeled data and aims to discover hidden patterns or structures within the data.

Clustering algorithms partition the data into clusters or groups, where data points within the
same cluster are more similar to each other than to those in other clusters. The goal is to
maximize intra-cluster similarity and minimize inter-cluster similarity.

Common clustering algorithms include K-means clustering, hierarchical clustering, DBSCAN (Density-Based Spatial Clustering of Applications with Noise), and Gaussian Mixture Models (GMM).
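A minimal clustering sketch, assuming scikit-learn and a small hand-made set of points (both are illustrative assumptions):

```python
# Minimal clustering sketch (scikit-learn assumed; toy 2-D points).
import numpy as np
from sklearn.cluster import KMeans

# Unlabeled points; no class labels are needed for clustering.
X = np.array([[1.0, 2.0], [1.2, 1.8], [0.8, 2.1],
              [8.0, 8.5], [8.3, 8.1], [7.9, 8.4]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)       # assign each point to one of 2 clusters
print(labels)                        # e.g. [0 0 0 1 1 1] (cluster ids may swap)
print(kmeans.cluster_centers_)       # the learned cluster centroids
```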

Association Rule Mining:

Association rule mining aims to discover interesting relationships or associations between variables in large datasets. It is often used in market basket analysis to identify patterns in consumer behavior, such as frequent itemsets or purchasing patterns. Examples include discovering that customers who buy bread are also likely to buy milk, or that certain medical symptoms often co-occur.

Anomaly Detection:

Anomaly detection involves identifying rare or unusual patterns or instances in data that do not
conform to expected behavior. It is used to detect outliers, anomalies, or fraudulent activities in
various domains such as finance, cybersecurity, and healthcare. Examples include detecting
fraudulent transactions, network intrusions, or equipment failures.
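A small anomaly detection sketch, assuming scikit-learn's Isolation Forest and hypothetical transaction amounts:

```python
# Minimal anomaly detection sketch (scikit-learn assumed; hypothetical amounts).
import numpy as np
from sklearn.ensemble import IsolationForest

# Mostly "normal" transactions plus one obvious outlier.
X = np.array([[100.0], [102.0], [98.0], [101.0], [99.0], [5000.0]])

detector = IsolationForest(contamination=0.2, random_state=0)
flags = detector.fit_predict(X)      # -1 marks anomalies, +1 marks normal points
print(flags)                         # the 5000.0 transaction should be flagged -1
```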

4. What are Social implication of Data mining?


There are various social implications of data mining which are as follows −

Privacy − Privacy is a loaded issue. In recent years privacy concerns have taken on a more important role in American society as merchants, insurance companies, and government agencies amass warehouses of personal records. The concerns that people have over the collection of this data generally extend to the analytic capabilities applied to that data. Users of data mining should start thinking about how their use of this technology will be affected by legal issues associated with privacy.

Profiling − Data mining and profiling is a developing field that attempts to organize, understand, analyze, reason about, and use the explosion of data in this information age. The process involves using algorithms and experience to extract patterns or anomalies that are very complex, difficult, or time-consuming to identify.

Unauthorized Use − Trends obtained through data mining, intended for marketing or other ethical purposes, can be misused. Unethical businesses or individuals can use data obtained through data mining to take advantage of vulnerable people or to discriminate against a specific group of people. Furthermore, data mining techniques are not 100 percent accurate; mistakes do occur, and they can have serious consequences.

5. Define Data Mining


Data mining is the process of automatically discovering useful information in large data repositories; a human analyst might take weeks to find the same information. More fully, data mining is the process of discovering patterns, relationships, and insights from large datasets using techniques from statistics, machine learning, and database systems. It involves extracting valuable information and knowledge from raw data, often stored in databases, data warehouses, or other data repositories.

6. What is the difference between data mining and normal query processing?

7. Explain the differences between "Explorative Data Mining" and "Predictive Data Mining" and give one example of each.
8. Explain different data mining tasks.

Decision Trees:
Decision trees are a popular machine learning technique used for classification and regression
tasks. They represent a tree-like structure where each internal node represents a decision
based on an attribute, each branch represents the outcome of the decision, and each leaf node
represents the class label or predicted value.

In classification tasks, decision trees recursively split the dataset into subsets based on the
values of attributes, aiming to maximize the purity of each subset (i.e., minimize impurity or
uncertainty). This splitting process continues until the data is fully classified or a stopping
criterion is met.

Decision trees are interpretable and easy to understand, making them useful for explaining the
decision-making process. However, they may suffer from overfitting, where the model captures
noise or irrelevant details in the training data, leading to poor generalization on unseen data.

Example: A decision tree can be used to predict whether a customer will purchase a product
based on demographic features such as age, income, and location. The tree would make
decisions at each node (e.g., if age < 30, if income > $50,000), ultimately leading to a prediction
of whether the customer will buy the product or not.
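A rough sketch of that customer example, with hypothetical age/income data and scikit-learn assumed as the tooling:

```python
# Sketch of the customer-purchase example above (hypothetical data; scikit-learn assumed).
from sklearn.tree import DecisionTreeClassifier, export_text

# Features: [age, income]; label: 1 = purchased, 0 = did not purchase.
X = [[25, 30000], [45, 80000], [35, 60000], [22, 20000], [52, 90000], [30, 40000]]
y = [0, 1, 1, 0, 1, 0]

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
print(export_text(tree, feature_names=["age", "income"]))  # show the learned decision rules
print(tree.predict([[28, 55000]]))   # predict whether a new customer will buy
```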

Association Rule Mining:

Association rule mining is a technique used to discover interesting relationships or associations between variables in large datasets. It is commonly applied in market basket analysis to identify patterns in consumer behavior, such as frequently co-occurring items in transactions.

The Apriori algorithm is a well-known algorithm for association rule mining. It works by
generating candidate itemsets of increasing sizes and pruning those that do not meet a
minimum support threshold. Association rules are then generated from frequent itemsets,
where a rule consists of an antecedent (left-hand side) and a consequent (right-hand side).

Example: Suppose a supermarket wants to analyze customer purchase data to identify purchasing patterns. Association rule mining can reveal associations like "Customers who buy milk are also likely to buy bread" or "Customers who buy diapers are also likely to buy baby wipes." These insights can inform marketing strategies, product placement, and promotions to increase sales and customer satisfaction.
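The following is a simplified, pure-Python sketch of the level-wise idea behind Apriori on a hypothetical transaction set; it is not the full algorithm (it omits candidate generation from smaller frequent itemsets):

```python
# Simplified level-wise frequent-itemset search on toy transactions (illustrative only).
from itertools import combinations

transactions = [
    {"milk", "bread", "butter"},
    {"milk", "bread"},
    {"bread", "diapers", "wipes"},
    {"milk", "diapers", "wipes"},
    {"milk", "bread", "wipes"},
]
min_support = 0.4                            # itemset must appear in >= 40% of transactions

def support(itemset):
    # fraction of transactions that contain every item in the itemset
    return sum(itemset <= t for t in transactions) / len(transactions)

items = sorted({i for t in transactions for i in t})
frequent = {}
for size in (1, 2):                          # candidate itemsets of size 1, then size 2
    for combo in combinations(items, size):
        s = support(frozenset(combo))
        if s >= min_support:                 # keep only itemsets meeting minimum support
            frequent[frozenset(combo)] = s

# Confidence of one example rule: {milk} -> {bread}
conf = support({"milk", "bread"}) / support({"milk"})
print(frequent)
print("confidence(milk -> bread) =", conf)   # 0.6 / 0.8 = 0.75
```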

9. What is the relation between data warehousing and data mining?


Data warehousing and data mining are complementary concepts that work together to enable
effective data-driven decision-making within organizations.

Data warehouses serve as the foundation for data mining by providing a centralized repository
of high-quality, integrated data for analysis. The structured and consolidated nature of data
warehouses facilitates efficient data mining operations.

Data mining leverages the rich historical data stored in the data warehouse to uncover valuable
insights, patterns, and relationships that can inform strategic decision-making, improve
operational efficiency, and drive business growth.

In summary, while data warehousing focuses on the storage and management of data, data
mining focuses on the analysis and extraction of knowledge from that data. Together, they form
an integrated approach to harnessing the power of data for business intelligence and decision
support.

10. What are the key issues in data Mining?


• Mining different kinds of knowledge in databases. - The needs of different users are not the same; different users may be interested in different kinds of knowledge. Therefore data mining needs to cover a broad range of knowledge discovery tasks.

• Interactive mining of knowledge at multiple levels of abstraction. - The data mining process
needs to be interactive because it allows users to focus the search for patterns, providing and
refining data mining requests based on returned results.

• Incorporation of background knowledge. - Background knowledge can be used to guide the discovery process and to express the discovered patterns, not only in concise terms but also at multiple levels of abstraction.

• Data mining query languages and ad hoc data mining. - A data mining query language that allows the user to describe ad hoc mining tasks should be integrated with a data warehouse query language and optimized for efficient and flexible data mining.

• Presentation and visualization of data mining results. - Once patterns are discovered, they need to be expressed in high-level languages and visual representations. These representations should be easily understandable by users.

• Handling noisy or incomplete data. - Data cleaning methods are required that can handle noise and incomplete objects while mining data regularities. Without such methods, the accuracy of the discovered patterns will be poor.

• Pattern evaluation. - This refers to the interestingness of the discovered patterns. Patterns that merely represent common knowledge or lack novelty are not interesting, so discovered patterns must be evaluated for genuine interestingness.

• Efficiency and scalability of data mining algorithms. - To effectively extract information from the huge amounts of data in databases, data mining algorithms must be efficient and scalable.

11. Explain the differences between Knowledge discovery and data mining.
12. Explain data cleaning.
Data cleaning, also known as data cleansing or data scrubbing, is the process of identifying and
correcting errors, inconsistencies, inaccuracies, and missing values in a dataset to improve its
quality and reliability for analysis. It is a crucial step in the data preprocessing pipeline before
performing data mining or analysis tasks. Here's an explanation of data cleaning:

Identifying Errors and Inconsistencies:

The first step in data cleaning is to identify errors and inconsistencies in the dataset. This may
include detecting missing values, outliers, duplicates, incorrect data formats, typographical
errors, and other anomalies that could affect the integrity and accuracy of the data.

Handling Missing Values:

Missing values are common in datasets and can arise due to various reasons such as data entry
errors, equipment malfunction, or intentional non-responses. Data cleaning involves handling
missing values by either imputing them with estimated values (e.g., mean, median, mode),
removing them from the dataset, or using advanced imputation techniques such as regression
or k-nearest neighbors.

Correcting Errors and Inconsistencies:

Data cleaning involves correcting errors and inconsistencies in the dataset to ensure its
accuracy and reliability. This may include correcting typographical errors, standardizing data
formats, resolving inconsistencies between related attributes, and reconciling conflicting
information from different sources.
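A minimal data-cleaning sketch using pandas on hypothetical customer records (the library and the data are assumptions, not part of the original notes):

```python
# Minimal data-cleaning sketch with pandas (hypothetical customer records).
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "age":    [25, np.nan, 35, 200, 29],        # one missing value, one implausible outlier
    "city":   ["Pune", "pune", "Mumbai", "Pune", None],
    "income": [40000, 52000, np.nan, 61000, 45000],
})

df = df.drop_duplicates()                                # remove duplicate rows
df["age"] = df["age"].fillna(df["age"].median())         # impute missing age with the median
df["income"] = df["income"].fillna(df["income"].mean())  # impute missing income with the mean
df["city"] = df["city"].str.title()                      # standardize inconsistent text formats
df = df[df["age"] <= 120]                                # drop an obviously erroneous age value
print(df)
```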

13. Differentiate between OLTP & OLAP.

14. Define Star Schema with diagram.


A star schema is a data warehouse schema in which the center of the star is a single fact table connected to a number of associated dimension tables.

• It is known as star schema as its structure resembles a star.

• The Star Schema data model is the simplest type of Data Warehouse schema.

• It is also known as Star Join Schema and is optimized for querying large data sets.

In a typical star schema example, the fact table sits at the center and contains keys to every dimension table (such as Dealer_ID, Model_ID, Date_ID, Product_ID, and Branch_ID) along with measure attributes like units sold and revenue.

15. List the components of a data warehouse.


1. Operational Data Sources

2. ETL (Extract, Transform, Load) Process

3. Data Warehouse Database

4. Data Warehouse Schema

5. Dimensional Model

6. Fact Tables

7. Dimension Tables

8. Metadata Repository

9. Data Marts

10. Query and Reporting Tools

16. Explain the basic operations of OLAP.


Data Cube Creation:

The first step in OLAP involves creating a multidimensional data cube, also known as a
hypercube or OLAP cube. A data cube represents data in multiple dimensions, such as time,
geography, product, and sales, allowing for analysis from different perspectives.

Dimensional Hierarchies:

Each dimension in the data cube is organized into a hierarchical structure with levels of
granularity. For example, a time dimension may have levels such as year, quarter, month, and
day. Similarly, a product dimension may have levels such as category, subcategory, and
product.

Aggregation and Precomputation:

OLAP cubes are precomputed and aggregated at various levels of granularity to improve query
performance. Aggregation involves summarizing and consolidating data across different
dimensions. Precomputation involves calculating and storing aggregated values for faster
retrieval during analysis.

Slice, Dice, and Drill-Down:

OLAP allows users to slice, dice, and drill-down into the data cube to explore data from
different perspectives. Slicing involves selecting a subset of data along one or more dimensions.
Dicing involves viewing data from multiple dimensions simultaneously. Drill-down involves
navigating from higher-level summaries to lower-level details.

Roll-Up and Roll-Down:

OLAP enables users to roll-up or roll-down data along hierarchical dimensions to aggregate or
disaggregate data. Rolling up involves summarizing data from lower-level to higher-level
dimensions (e.g., from daily to monthly sales). Rolling down involves decomposing data from
higher-level to lower-level dimensions (e.g., from yearly to quarterly sales).

Pivoting and Rotation:

OLAP allows users to pivot or rotate the data cube to view it from different angles or
orientations. Pivoting involves reorienting the dimensions to change the perspective of analysis.
Rotation involves swapping the dimensions to explore data from alternative viewpoints.
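OLAP servers implement these operations natively; as a rough illustration only, similar slice, dice, roll-up, drill-down, and pivot effects can be mimicked with pandas on hypothetical sales data:

```python
# Rough pandas illustration of slice, dice, roll-up, drill-down, and pivot (hypothetical data).
import pandas as pd

sales = pd.DataFrame({
    "year":    [2023, 2023, 2023, 2024, 2024, 2024],
    "quarter": ["Q1", "Q2", "Q1", "Q1", "Q2", "Q2"],
    "region":  ["East", "East", "West", "East", "West", "West"],
    "product": ["A", "B", "A", "A", "B", "A"],
    "revenue": [100, 150, 120, 130, 170, 110],
})

slice_2023 = sales[sales["year"] == 2023]                               # slice: fix one dimension
dice = sales[(sales["region"] == "East") & (sales["product"] == "A")]   # dice: select a sub-cube
rollup = sales.groupby(["year"])["revenue"].sum()                       # roll-up: quarter -> year
drilldown = sales.groupby(["year", "quarter"])["revenue"].sum()         # drill-down: year -> quarter
pivot = sales.pivot_table(index="region", columns="year",
                          values="revenue", aggfunc="sum")              # pivot: rotate the view
print(rollup, drilldown, pivot, sep="\n\n")
```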

17. Explain the need for and basic characteristics of a data warehouse.


a. Centralized Repository:
A data warehouse serves as a centralized repository that consolidates data from multiple
operational sources into a single, unified location. This facilitates easy access to integrated and
standardized data for analysis and reporting purposes.

b. Subject-Oriented:

A data warehouse is designed to focus on specific subject areas or business processes relevant
to the organization. It organizes data around key business topics such as sales, marketing,
finance, and inventory, enabling users to analyze data within the context of their business
domain.

c. Time-Variant:

Data warehouses typically maintain historical data over time, allowing users to analyze trends,
patterns, and changes in data over different time periods. Historical data is valuable for trend
analysis, forecasting, and decision-making.

d. Integrated and Consistent:

Data in a data warehouse undergoes a process of integration, cleansing, transformation, and standardization to ensure consistency and quality. This involves resolving inconsistencies, harmonizing data formats, and reconciling discrepancies between different data sources.

18. What is Data Warehousing?


A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of
data in support of management’s decision-making process.

Data warehousing is the process of constructing and using a data warehouse.

19. State any two advantages of data warehouse.


Improved Decision-Making:

Data warehouses provide a centralized and integrated view of data from multiple sources,
allowing organizations to make informed decisions based on comprehensive insights. By
analyzing data stored in the data warehouse, decision-makers can identify trends, patterns, and
correlations, enabling them to make strategic decisions that drive business growth and
competitive advantage.

Enhanced Business Intelligence:

Data warehouses support advanced analytics and reporting capabilities, enabling organizations
to derive actionable intelligence from their data. With the ability to perform complex queries,
ad-hoc analysis, and generate custom reports, users can gain deeper insights into business
performance, customer behavior, market trends, and operational efficiency. This enhanced
business intelligence empowers organizations to optimize processes, improve customer
satisfaction, and capitalize on opportunities in the marketplace.

20. Explain in brief the architecture of data warehousing.


Data Sources:

Data sources are where data originates from, such as operational databases, spreadsheets,
CRM systems, ERP systems, and external sources like social media platforms.

ETL Process:

ETL stands for Extract, Transform, and Load. In this process, data is extracted from various
sources, transformed to fit the data warehouse schema, and loaded into the data warehouse.

Data Warehouse Database:

The data warehouse database is the central repository that stores integrated, cleansed, and
structured data from multiple sources. It is optimized for analytical querying and reporting.

Data Warehouse Schema:

The data warehouse schema defines the structure and organization of data within the data
warehouse. Common schema designs include star schema, snowflake schema, and galaxy
schema.

Data Marts:

Data marts are subsets of the data warehouse that focus on specific subject areas or business
units. They contain pre-aggregated or summarized data tailored to the needs of a particular
user group or analytical application.

OLAP Tools:

OLAP (Online Analytical Processing) tools provide users with the ability to query, analyze, and
visualize data stored in the data warehouse. They support interactive analysis and reporting,
allowing users to gain insights and make informed decisions.

Metadata Repository:

The metadata repository stores metadata, which is data about the data in the data warehouse.
It includes information about data sources, data transformations, data definitions, and data
lineage.

21. What are the various tasks involved in data preprocessing?


• Data cleaning - Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies.

• Data integration - Integration of multiple databases, data cubes, or files.

• Data transformation - Normalization and aggregation.

• Data reduction - Obtains a reduced representation of the data in volume that produces the same or similar analytical results.

• Data discretization - A preprocessing technique used in data mining and machine learning to reduce the number of distinct values of continuous attributes (or features) by dividing them into intervals or categories (see the sketch after this list).
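Here is the sketch referred to above: a minimal example of two of these steps, normalization and discretization, assuming pandas and scikit-learn and a hypothetical age attribute:

```python
# Minimal preprocessing sketch: normalization and discretization (pandas/scikit-learn assumed).
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

ages = pd.DataFrame({"age": [18, 22, 35, 47, 61, 75]})

# Data transformation: min-max normalization to the range [0, 1].
ages["age_scaled"] = MinMaxScaler().fit_transform(ages[["age"]]).ravel()

# Data discretization: divide the continuous attribute into labelled intervals.
ages["age_group"] = pd.cut(ages["age"], bins=[0, 30, 50, 100],
                           labels=["young", "middle", "senior"])
print(ages)
```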

22. How does a snowflake schema differ from a star schema? Name two
advantages and two disadvantages of the snowflake schema.
Difference:

Star Schema: In a star schema, the dimensional model consists of a central fact table
surrounded by multiple dimension tables. Each dimension table is directly connected to the fact
table, forming a star-like structure.

Snowflake Schema: In a snowflake schema, the dimensional model is normalized, meaning that
dimension tables are further divided into sub-dimension tables. This results in a more complex
structure resembling a snowflake, with multiple levels of normalization.

Advantages of Snowflake Schema:

Reduced Redundancy: Normalization in the snowflake schema reduces data redundancy by eliminating duplicate data in sub-dimension tables.

Improved Data Consistency: With normalized tables, updates and modifications to dimension
attributes are applied consistently across multiple sub-dimension tables, ensuring data
integrity.

Disadvantages of Snowflake Schema:

Increased Complexity: The snowflake schema introduces additional complexity due to its normalized structure, requiring more joins between tables, which can impact query performance.

Query Performance: The snowflake schema may suffer from slower query performance
compared to the star schema, especially when dealing with deep levels of normalization and a
large number of joins.

23. Explain dimensional data modelling.


Dimensional modeling represents data in a form suited to cube operations, making it a more suitable logical data representation for OLAP data management. Dimensional modeling was developed by Ralph Kimball and consists of "fact" and "dimension" tables.

In dimensional modeling, the transaction record is divided into either "facts," which are
frequently numerical transaction data, or "dimensions," which are the reference information
that gives context to the facts.

For example, a sales transaction can be broken down into facts, such as the number of products ordered and the price paid for the products, and into dimensions, such as order date, customer name, product number, order ship-to and bill-to locations, and the salesperson responsible for receiving the order.

Objectives of Dimensional Modeling

The purposes of dimensional modeling are:

• To produce a database architecture that is easy for end users to understand and to write queries against.

• To maximize the efficiency of queries. It achieves these goals by minimizing the number of tables and the relationships between them.

24. Define data preprocessing and explain any one step.


Data preprocessing refers to the process of preparing and cleaning raw data to make it suitable
for analysis or modeling. It involves various techniques and operations to transform, clean, and
enhance the quality of the data before it is used for data mining, machine learning, or other
analytical tasks. Some common steps involved in data preprocessing include data cleaning, data
transformation, data integration, and data reduction.

Example: Data Cleaning

Data cleaning is one of the essential steps in data preprocessing, aimed at identifying and
correcting errors, inconsistencies, and missing values in the dataset. Here's an explanation of
data cleaning:
Identifying Missing Values:

The first step in data cleaning is to identify missing values in the dataset. Missing values can
occur due to various reasons such as data entry errors, equipment malfunction, or
non-responses.

Handling Missing Values:

Once missing values are identified, they need to be handled appropriately. There are several techniques for handling missing values, including:

Deleting Rows or Columns: If the number of missing values is small compared to the total dataset, deleting the rows or columns containing them may be a viable option.

Imputation:

Imputation involves replacing missing values with estimated or imputed values. Common
imputation methods include mean imputation, median imputation, mode imputation, or using
predictive models to estimate missing values.

Correcting Errors and Inconsistencies:

Data cleaning also involves correcting errors and inconsistencies in the dataset. This may
include correcting typographical errors, standardizing data formats, resolving inconsistencies
between related attributes, and reconciling conflicting information from different sources.

Handling Outliers: Outliers are data points that significantly deviate from the rest of the data.
Data cleaning may involve identifying and handling outliers using techniques such as trimming,
winsorizing, or transforming the data to reduce the impact of outliers on the analysis.

25. Define frequent sets, support, confidence, and association rule.


Support:

Support is a measure used in association rule mining to indicate the frequency or occurrence of a particular itemset in a dataset. It represents the proportion of transactions in the dataset that contain the itemset. Mathematically, support is calculated as the number of transactions containing the itemset divided by the total number of transactions in the dataset. Higher support values indicate that the itemset occurs more frequently in the dataset.

Confidence:

Confidence is a measure used in association rule mining to indicate the reliability or strength of the association between two itemsets. It represents the conditional probability that a transaction containing one itemset also contains another itemset. Mathematically, confidence is calculated as the number of transactions containing both itemsets divided by the number of transactions containing the first itemset.

Frequent Sets:

Frequent sets refer to sets of items that appear together in transactions with a frequency above a specified threshold. In association rule mining, frequent sets are crucial as they form the basis for discovering meaningful associations between items.

Association Rule:

An association rule is a relationship between two sets of items in a dataset, typically expressed in the form X → Y. It indicates that if itemset X appears in a transaction, then itemset Y is likely to appear as well.

Association rules are discovered through analysis of frequent itemsets and are evaluated based
on measures like support and confidence.
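A small worked example with hypothetical numbers: suppose a store records 100 transactions, 40 of which contain {milk} and 20 of which contain {milk, bread}. Then support({milk, bread}) = 20 / 100 = 0.20, and confidence(milk → bread) = support({milk, bread}) / support({milk}) = 0.20 / 0.40 = 0.50, i.e. half of the customers who buy milk also buy bread.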

26. State two drawbacks of the Apriori algorithm.


Computational Complexity:

The Apriori algorithm can suffer from high computational complexity, especially when dealing
with large datasets or datasets with a large number of unique items.

Generating candidate itemsets and scanning the dataset multiple times to calculate support can
be computationally intensive, leading to increased runtime and memory requirements.

As the number of items and transactions in the dataset grows, the number of candidate
itemsets also increases exponentially, resulting in longer execution times and scalability issues.

Handling Sparse Data:

The Apriori algorithm may encounter challenges when dealing with sparse datasets where most
itemsets have low support.

In datasets with sparse or uneven distribution of items, the algorithm may generate a large
number of candidate itemsets, many of which may have low support and are eventually
pruned.

Sparse datasets can result in inefficient memory usage and longer runtime, as the algorithm
spends time generating and processing candidate itemsets that do not contribute significantly
to the discovery of frequent itemsets and association rules.
27. What is Apriori property?
The Apriori property is an important concept in association rule mining, particularly in the
Apriori algorithm. It states that if an itemset is frequent, then all of its subsets must also be
frequent. In other words, if an itemset is considered frequent (i.e., its support is above a
specified threshold), then all of its subsets must meet the same support threshold.

This property simplifies the process of generating candidate itemsets in the Apriori algorithm.
Instead of considering all possible combinations of items, the algorithm only generates
candidate itemsets from frequent itemsets, reducing the search space and improving efficiency.

28. Name some variants of Apriori Algorithm.

 Some variants of and alternatives to the Apriori algorithm include:

 FP-Growth (Frequent Pattern Growth)

 Eclat (Equivalence Class Clustering and Bottom-Up Lattice Traversal)

 Parallel and Distributed Apriori

 Incremental Apriori

 Apriori Hybrid Algorithms

29. Define Association Rule Mining

Association rule mining is the task of discovering relationships of the form X → Y between sets of items in large datasets, evaluated using measures such as support and confidence. Its key benefits include:

 Revealing Hidden Patterns: Association rule mining uncovers hidden patterns and relationships within data.

 Enhancing Decision-Making: It aids decision-making by providing insights into customer behavior and market trends.

 Optimizing Business Strategies: It helps optimize business strategies such as product placement and pricing.

 Personalizing Recommendations: It enables personalized recommendations for users based on their preferences.

 Improving Inventory Management: It optimizes inventory levels and supply chain management by understanding item associations.

 Detecting Fraud and Anomalies: It assists in detecting fraud and anomalies in various domains, including finance and healthcare.

30. State various advantages of Star Schema.

 Simple Structure: The star schema has a simple, easy-to-understand structure resembling a star, making it intuitive for users to grasp.

 Fast Query Performance: Queries in a star schema are fast due to simplified joins between the fact table and dimension tables, leading to quick data retrieval.

 Efficient Data Access: The star schema enables rapid access to specific data points within the dataset, which is crucial for decision-making and analysis.

 Scalability: It accommodates growing datasets seamlessly, allowing the addition of new dimensions or measures without compromising performance.

 Simplified Maintenance: Changes to the schema are straightforward, reducing maintenance effort and ensuring agility in adapting to evolving business needs.

 Enhanced Reporting: The star schema facilitates comprehensive reporting and analysis by providing a clear view of data relationships, aiding informed decision-making.

 Supports OLAP Operations: It is well suited for Online Analytical Processing (OLAP) operations, enabling multidimensional analysis and advanced analytical tasks.

31. State the meaning of CART.


CART (Classification and Regression Trees) is a decision tree algorithm for building models. A decision tree model whose target values are discrete is called a classification model.

A discrete value comes from a finite or countably infinite set of values, for example age group or size category. Models whose target values are continuous (typically floating-point numbers) are called regression models. Together, these two kinds of models are called CART. For classification, CART uses the Gini index as its splitting criterion.
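A brief illustrative sketch of the Gini index computation for a node; the formula 1 − Σ pᵢ² over the class proportions is standard, but the code itself is not from the notes:

```python
# Gini index for a set of class labels: 1 - sum(p_i^2) over the class proportions.
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

print(gini(["yes", "yes", "no", "no"]))    # 0.5 -> maximally impure two-class node
print(gini(["yes", "yes", "yes", "yes"]))  # 0.0 -> pure node, nothing left to split
```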

32. What is decision tree?


Decision Tree is used to build classification and regression models. It is used to create data
models that will predict class labels or values for the decision-making process. The models are
built from the training dataset fed to the system (supervised learning). Using a decision tree, we
can visualize the decisions that make it easy to understand and thus it is a popular data mining
technique.

34. Explain Bayes Theorem


Bayes' theorem is used to calculate a conditional probability where intuition often fails. Although widely used in probability, the theorem is also applied in machine learning. Its uses in machine learning include fitting a model to a training dataset and developing classification models.

The formula of Bayes' theorem:

Bayes' theorem is a way of calculating a conditional probability when the joint probability is not available. Sometimes the denominator P(B) cannot be accessed directly; in such cases it can be calculated as:

P(B) = P(B|A) * P(A) + P(B|not A) * P(not A)

Substituting this alternate calculation of P(B) into the theorem gives:

P(A|B) = P(B|A) * P(A) / (P(B|A) * P(A) + P(B|not A) * P(not A))

Also, if we have P(A), then P(not A) can be calculated as:

P(not A) = 1 - P(A)

Similarly, if we have P(not B|not A), then P(B|not A) can be calculated as:

P(B|not A) = 1 - P(not B|not A)

Bayes Theorem of Conditional Probability

Bayes' theorem consists of several terms whose names are given based on the context of its application in the equation.
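A numeric check of the formula with hypothetical values (a disease-test style example, chosen only for illustration):

```python
# Numeric check of Bayes' theorem with hypothetical values:
# A = "has condition", B = "test is positive".
p_A = 0.01                 # prior: 1% of the population has the condition
p_B_given_A = 0.95         # sensitivity: P(positive | condition)
p_B_given_notA = 0.05      # false positive rate: P(positive | no condition)

p_notA = 1 - p_A
p_B = p_B_given_A * p_A + p_B_given_notA * p_notA   # total probability of a positive test
p_A_given_B = (p_B_given_A * p_A) / p_B             # Bayes' theorem
print(round(p_A_given_B, 3))   # about 0.161: most positives are still false positives
```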

33. What are the advantages and disadvantages of decision tress over other
classification methods?
Advantages Of Decision Tree Classification
1. Decision tree classification does not require any domain knowledge; hence, it is appropriate for the knowledge discovery process.

2. The representation of data in the form of the tree is easily understood by humans and it is
intuitive.

3. It can handle multidimensional data.

4. It is a quick process with good accuracy.

Disadvantages Of Decision Tree Classification

1. Sometimes decision trees become very complex and these are called overfitted trees.

2. The decision tree algorithm may not be an optimal solution.

3. The decision trees may return a biased solution if some class label dominates it.

34. State two ways of pruning a tree.


Pruning is the method of removing unused branches from the decision tree. Some branches of the decision tree might represent outliers or noisy data. Tree pruning removes these unwanted branches, which reduces the complexity of the tree and helps in effective predictive analysis. It reduces overfitting by removing unimportant branches from the tree.

There are two ways of pruning the tree:

#1) Prepruning: In this approach, construction of the decision tree is stopped early, i.e. it is decided not to partition the branches further. The last node constructed becomes a leaf node, and this leaf node may hold the most frequent class among its tuples. Attribute selection measures are used to find the weightage of each split, and threshold values are prescribed to decide which splits are regarded as useful. If partitioning a node would fall below the threshold, the process is halted.

#2) Postpruning: This method removes the outlier branches from a fully grown tree. The
unwanted branches are removed and replaced by a leaf node denoting the most frequent class
label. This technique requires more computation than prepruning, however, it is more reliable.
The pruned trees are more precise and compact when compared to unpruned trees but they
carry a disadvantage of replication and repetition. Repetition occurs when the same attribute is
tested again and again along a branch of a tree. Replication occurs when the duplicate subtrees
are present within the tree. These issues can be solved by multivariate splits.
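An illustrative sketch of both approaches using scikit-learn's decision tree (the library choice and the parameter values are assumptions): prepruning limits growth up front, while cost-complexity pruning (ccp_alpha) prunes a fully grown tree back.

```python
# Sketch of pre- and post-pruning with scikit-learn (library choice is an assumption).
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Prepruning: stop growth early with limits on depth and leaf size.
pre = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5, random_state=0).fit(X, y)

# Postpruning: grow fully, then prune back using cost-complexity pruning (ccp_alpha).
post = DecisionTreeClassifier(ccp_alpha=0.02, random_state=0).fit(X, y)

print("pre-pruned leaves: ", pre.get_n_leaves())
print("post-pruned leaves:", post.get_n_leaves())
```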

35. Explain Naive Bayes Classification.


Naive Bayes classifiers are a collection of classification algorithms based on Bayes' theorem. It is not a single algorithm but a family of algorithms that share a common principle: every pair of features being classified is assumed to be independent of the others. One of the simplest and most effective classification algorithms, the Naive Bayes classifier aids in the rapid development of machine learning models with fast prediction capabilities.

The "Naive" part of the name indicates the simplifying assumption made by the Naive Bayes classifier: it assumes that the features used to describe an observation are conditionally independent, given the class label. The "Bayes" part of the name refers to Reverend Thomas Bayes, an 18th-century statistician and theologian who formulated Bayes' theorem. Consider, for example, a fictional dataset that describes the weather conditions for playing a game of golf; given the weather conditions, each tuple classifies the conditions as fit ("Yes") or unfit ("No") for playing golf.
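A tiny Naive Bayes sketch on weather-style data; the integer encoding, the toy rows, and the use of scikit-learn's CategoricalNB are all illustrative assumptions:

```python
# Tiny Naive Bayes sketch on weather-style data (hypothetical encoding; scikit-learn assumed).
from sklearn.naive_bayes import CategoricalNB

# Features: [outlook, windy] encoded as integers
# outlook: 0 = sunny, 1 = overcast, 2 = rainy;  windy: 0 = no, 1 = yes
X = [[0, 0], [0, 1], [1, 0], [2, 0], [2, 1], [1, 1], [0, 0], [2, 0]]
y = ["no", "no", "yes", "yes", "no", "yes", "no", "yes"]

clf = CategoricalNB().fit(X, y)
print(clf.predict([[1, 0]]))        # predicted play decision for an overcast, calm day
print(clf.predict_proba([[1, 0]]))  # class probabilities under the independence assumption
```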

36. Explain different classification Techniques.


Logistic Regression:

Logistic regression is a linear model used for binary classification tasks. It estimates the probability that a given instance belongs to a particular class using a logistic (or sigmoid) function. Despite its name, it is a classification algorithm rather than a regression one.

Decision Trees:

Decision trees partition the feature space into regions, making decisions based on simple rules
inferred from the data.They're intuitive and easy to interpret, making them suitable for
visualization and explaining the model's predictions. Common algorithms include CART
(Classification and Regression Trees) and ID3 (Iterative Dichotomiser 3).

Support Vector Machines (SVM):

SVM is a powerful supervised learning algorithm used for classification and regression tasks. It
finds the hyperplane that best separates classes in the feature space with the maximum
margin. SVM can handle high-dimensional data and is effective even in cases where the number
of features exceeds the number of samples.

K-Nearest Neighbors (KNN):


KNN is a non-parametric, lazy learning algorithm used for both classification and regression tasks. It classifies a new instance by a majority vote of its k nearest neighbors in the feature space. KNN doesn't involve training a model; instead, it memorizes the entire training dataset.

Naive Bayes:

Naive Bayes is a probabilistic classifier based on Bayes' theorem with a strong independence assumption between features. Despite its simplicity, it often performs well in text classification and other tasks with high-dimensional data.

37. Describe the essential features of decision trees in the context of classification.


Splitting Criteria:

Decision trees make decisions by recursively splitting the data based on the values of features. The splitting criterion aims to maximize the homogeneity (or purity) of the resulting subsets. Common criteria include Gini impurity, entropy, and classification error.

Nodes:

Decision trees consist of nodes that represent decision points or questions about the features. The initial node is called the root node, and subsequent nodes are called internal nodes. Internal nodes represent feature attributes, and leaf nodes represent class labels.

Branches:

Branches emanate from nodes and represent the possible outcomes or decisions based on the
value of a feature. Each branch leads to a child node corresponding to a specific value of the
feature.

Decision Rules:

The decision rules at each node are based on the splitting criteria. These rules determine which branch to follow based on the feature values of the instances.

38. Define sequential pattern mining.


Sequential pattern mining is a data mining technique focused on identifying recurring
sequences or patterns of events/items within sequential datasets. It involves extracting
meaningful patterns that reveal the order in which events or items occur over time, enabling
insights into temporal dependencies, trends, or associations present in the data.

39. What is WEKA? What are the advantages of WEKA?

WEKA stands for Waikato Environment for Knowledge Analysis. It is a popular suite of machine
learning software written in Java, developed at the University of Waikato in New Zealand.
WEKA provides a comprehensive set of tools for data preprocessing, classification, regression,
clustering, association rules mining, and visualization.

Advantages of WEKA include:

User-Friendly Interface: WEKA offers a user-friendly graphical interface that allows users to
perform various machine learning tasks without the need for extensive programming
knowledge.

Comprehensive Collection of Algorithms: WEKA provides a wide range of machine learning algorithms, including decision trees, support vector machines, neural networks, k-nearest neighbors, and many others. This extensive collection allows users to experiment with different algorithms and select the most suitable one for their data.

Open-Source: WEKA is open-source software distributed under the GNU General Public License
(GPL). This means that users can access the source code, modify it according to their needs, and
contribute to its development.

Integration with Java: Since WEKA is implemented in Java, it seamlessly integrates with
Java-based applications and frameworks, making it easy to incorporate machine learning
capabilities into Java projects.

Educational Resources: WEKA is widely used in academic settings for teaching and learning
purposes. It provides numerous educational resources, including tutorials, documentation, and
example datasets, which are valuable for students and researchers studying machine learning
and data mining concepts.

Scalability: While WEKA may not be as scalable as some other machine learning libraries
designed for big data processing, it still performs well on medium-sized datasets and is suitable
for prototyping and experimentation.
40. What is clustering? What are the different types of clustering?
Clustering is a technique used in unsupervised learning to group similar data points together based on their characteristics or features. The goal of clustering is to partition a dataset into groups, or clusters, where data points within the same cluster are more similar to each other than to those in other clusters. Here are different types of clustering methods:

Partitioning Methods:

Partitioning methods divide the dataset into non-overlapping clusters; each data point belongs to exactly one cluster. Examples include K-means clustering and K-medoids clustering.

Hierarchical Methods:

Hierarchical clustering creates a hierarchy of clusters in which clusters can contain sub-clusters. It can be agglomerative, starting with individual data points as clusters and iteratively merging them, or divisive, starting with all data points in one cluster and recursively splitting them. Examples include Agglomerative Hierarchical Clustering (bottom-up) and Divisive Hierarchical Clustering (top-down).

Density-Based Methods:

Density-based methods group together data points that are densely packed in the feature space, forming regions of high density separated by regions of low density. Clusters can be irregularly shaped and of varying sizes. Examples include DBSCAN (Density-Based Spatial Clustering of Applications with Noise) and OPTICS (Ordering Points To Identify the Clustering Structure).

Distribution-Based Methods:

Distribution-based methods model the distribution of the data using probability density functions. They assume that data points are generated from a mixture of probability distributions. Examples include Gaussian Mixture Models (GMM) and Expectation-Maximization (EM) clustering.
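A quick comparison of a hierarchical and a density-based method on toy points, assuming scikit-learn (illustrative only):

```python
# Quick comparison of two clustering families on toy data (scikit-learn assumed).
import numpy as np
from sklearn.cluster import DBSCAN, AgglomerativeClustering

X = np.array([[1, 1], [1.1, 0.9], [0.9, 1.2],     # dense group 1
              [5, 5], [5.1, 4.9], [4.9, 5.2],     # dense group 2
              [9, 1]])                            # an isolated point

hier = AgglomerativeClustering(n_clusters=2).fit_predict(X)   # hierarchical (agglomerative)
dens = DBSCAN(eps=0.5, min_samples=2).fit_predict(X)          # density-based; -1 marks noise
print("agglomerative:", hier)
print("dbscan:       ", dens)    # the isolated point should come out as noise (-1)
```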
