6th_SEM Data Science Notes

A Data Warehouse is a centralized repository for integrated, historical data from multiple sources, supporting decision-making and analysis. It includes types such as Enterprise Data Warehouse, Data Mart, and Operational Data Store, each serving different analytical needs. Key processes involve conceptual modeling, schema design, and OLAP operations, with a focus on data security and efficient data handling.

UNIT 1: DATA WAREHOUSE

1. What is a Data Warehouse, what are its types, and why is it needed?
A Data Warehouse is a centralized repository that stores
integrated, historical, and structured data from multiple sources for
reporting and analysis. It supports decision-making by enabling efficient
querying and data mining.

Types of Data Warehouse:


1. Enterprise Data Warehouse (EDW): Covers all organizational data for
company-wide analysis.
• A centralized repository that stores integrated, historical data from
across the entire organization.
• Supports strategic decision-making, data analysis, and business
intelligence.
• Data is cleaned, transformed, and stored from multiple sources.
• Optimized for complex queries, reporting, and trend analysis.
• Data is typically read-only for users.

2. Data Mart: A smaller, focused data warehouse for specific departments or topics.
• A subset of a data warehouse, focused on a specific subject or
department (e.g., sales, HR).
• Stores historical data for analytical and business intelligence
purposes.
• Used for complex queries, trend analysis, and reporting.
• Data is typically cleaned, integrated, and summarized.
• Supports decision-making for a specific group.
3. Operational Data Store (ODS): Stores real-time, current data for daily
operations, not for long-term analysis.
• It is a real-time or near real-time data storage system.
• It stores current operational data from multiple sources.
• Used for routine operational reporting and analysis.
• Acts as a staging area before data is moved to a data warehouse.

Need for Data Warehouse:


• Provides historical insights for trend analysis.
• Supports business intelligence (BI) and analytics.
• Integrates data from multiple heterogeneous sources.
• Enhances query performance and reporting efficiency.

2. Describe the process of conceptual modelling of a Data Warehouse. Key features and goals of a Data Warehouse.
Conceptual modelling: Conceptual modelling defines the high-level
structure of a Data Warehouse, focusing on how data is organized and
related.
Steps in Conceptual Modelling:
1. Identify Business Needs – Define objectives and key metrics.
2. Determine Key Entities & Relationships – Identify facts and
dimensions.
3. Select Schema Type – Choose Star, Snowflake, or Galaxy schema.
4. Define Concept Hierarchies – Establish levels of data granularity
(e.g., Year → Quarter → Month).
5. Validate & Refine – Ensure the model supports business intelligence
requirements.
Key features of a Data Warehouse:
• Central Storage: Stores large amounts of data from different
sources in one place.
• Historical Data: Keeps past data for analysis over time.
• Data Integration: Combines data from various systems for better
analysis.
• Support for Decision-Making: Helps users make quick and informed
decisions.

Goals of a Data Warehouse:


• Support Decisions: Help managers make better choices with
accurate data.
• Improve Data Analysis: Make it easier to analyze large amounts of
data.
• Provide Quick Access: Allow fast retrieval of data for users.
• Combine Data: Bring together data from different sources for a
clear view.

3. Schemas in a Data Warehouse & Data Warehouse Architecture.


A schema is a blueprint or structure that defines how data is organized
in a data warehouse. It shows how different tables relate to each other.
Types of Schemas:
1. Star Schema
o Central fact table connected to multiple dimension tables.
o Simple and fast for queries.
o Example: Sales fact table linked to product, time, customer,
and region dimensions (see the sketch after this list).
2. Snowflake Schema
o Extension of star schema where dimension tables are
normalized (split into sub-tables).
o More complex but saves space.
o Example: Product dimension split into product and category
tables.
3. Galaxy Schema (Fact Constellation)
o Contains multiple fact tables sharing dimension tables.
o Used for complex systems with multiple business processes.
o Example: Sales and shipment fact tables sharing customer and
product dimensions.
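As a minimal sketch (not part of the original notes), the star schema described above can be imitated with pandas DataFrames: a sales fact table holds foreign keys into product and time dimension tables, and a typical query joins them and aggregates a measure. All table and column names here are hypothetical.

```python
import pandas as pd

# Dimension tables (hypothetical columns)
dim_product = pd.DataFrame({
    "product_id": [1, 2],
    "product_name": ["Pen", "Notebook"],
    "category": ["Stationery", "Stationery"],
})
dim_time = pd.DataFrame({
    "time_id": [101, 102],
    "month": ["Jan", "Feb"],
    "year": [2023, 2023],
})

# Fact table: foreign keys plus numeric measures
fact_sales = pd.DataFrame({
    "product_id": [1, 2, 1],
    "time_id": [101, 101, 102],
    "units_sold": [10, 5, 7],
    "sales_amount": [100.0, 250.0, 70.0],
})

# A typical star-schema query: join the fact table to its dimensions,
# then aggregate a measure by dimension attributes.
report = (fact_sales
          .merge(dim_product, on="product_id")
          .merge(dim_time, on="time_id")
          .groupby(["category", "month"])["sales_amount"].sum())
print(report)
```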

Data Warehouse Architecture Main Components:


1. Data Sources
o Includes operational databases, flat files, CRM, ERP systems,
etc.
o Provides raw data for analysis.
2. ETL (Extract, Transform, Load) Process
o Extract: Collects data from various sources.
o Transform: Cleans, formats, and integrates the data.
o Load: Loads data into the data warehouse.
3. Data Warehouse Storage
o Central repository for integrated and historical data.
o Optimized for query and analysis (not for transactions).
4. Metadata
o Data about the data (e.g., source, format, meaning).
o Helps in data governance and user navigation.
5. OLAP Engine
o Supports Online Analytical Processing for complex queries,
slicing, dicing, roll-up, drill-down, etc.
6. Front-End Tools / Presentation Layer
o Includes dashboards, reports, data visualization, and business
intelligence tools.
o Used by analysts and decision-makers.

Types of Architectures:
1. Single-Tier Architecture
o Rarely used.
o Combines all functions in one layer; not scalable.
2. Two-Tier Architecture
o Separates data sources from the data warehouse.
o Limited scalability and performance.
3. Three-Tier Architecture (Most Common)
o Bottom Tier: Data sources + ETL tools
o Middle Tier: Data warehouse + OLAP engine
o Top Tier: Front-end tools for data access and analysis

4. Explain the concept of Data Cubes in Data Warehouse modelling. What is Data Warehousing Security?
=> A Data Cube is a multidimensional representation of data in a Data
Warehouse, used for OLAP (Online Analytical Processing). It allows data
to be viewed from multiple perspectives by organizing it into dimensions
(e.g., Time, Product, Location) and measures (e.g., Sales, Revenue).
Key Features:
• Supports fast aggregation and slicing/dicing of data.
• Enables drill-down (detailed view) and roll-up (summarized view)
operations.
• Improves query performance for complex analytics.
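A rough illustration (my own, not from the notes): a small data cube can be emulated with a pandas pivot table, treating year, product, and location as dimensions and sales as the measure. The column names and values are invented.

```python
import pandas as pd

# Hypothetical transaction-level data with three dimensions and one measure
df = pd.DataFrame({
    "year":     [2023, 2023, 2023, 2024, 2024],
    "product":  ["Pen", "Pen", "Book", "Pen", "Book"],
    "location": ["Kolkata", "Delhi", "Kolkata", "Delhi", "Delhi"],
    "sales":    [100, 150, 200, 120, 180],
})

# Each cell of this pivot corresponds to one cell of a (product x year)
# slice of the cube, with sales summed over all locations.
cube_slice = df.pivot_table(index="product", columns="year",
                            values="sales", aggfunc="sum", fill_value=0)
print(cube_slice)

# A finer-grained view of the cube: aggregate over all three dimensions
cube = df.groupby(["year", "product", "location"])["sales"].sum()
print(cube)
```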

Data Warehousing Security


Data warehousing security means protecting the data stored in a data
warehouse from unauthorized access, loss, or misuse. Since a data
warehouse stores a huge amount of sensitive and important data,
keeping it safe is very important.
Main Goals of Data Warehouse Security:
Confidentiality: Only the right people should see the data.
Integrity: The data should not be changed or damaged by mistake or on
purpose.
Availability: The system should work properly and be accessible when
needed.
Key Security Measures:

• Authentication: Checks who is trying to access the system (e.g., login ID & password).
• Authorization: Gives permission to users based on their role (e.g., analyst can view, admin can edit).
• Data Encryption: Converts data into a secret code to protect it while storing or transferring.
• Access Control: Limits what data a user can see or change.
• Firewalls: Block outside attacks from hackers.
• Backup and Recovery: Keeps copies of data in case of loss or damage.
• Audit Logs: Keep records of who accessed or changed the data, and when.
5. OLAP and Its Types.
OLAP (Online Analytical Processing) is a technology that allows users to
analyze large amounts of data quickly from different angles. It helps in
complex data analysis, like summarizing and exploring data for business
insights.
Types of OLAP:
1. MOLAP (Multidimensional OLAP): Uses specialized storage to
organize data in a cube format, making analysis fast. Good for
complex calculations.
2. ROLAP (Relational OLAP): Uses standard relational databases,
flexible and handles large data volumes, but might be slower.
3. HOLAP (Hybrid OLAP): Combines both MOLAP and ROLAP, offering
fast analysis and handling big data efficiently.

6. Describe the process of conceptual modeling of a Data Warehouse.


Conceptual modeling defines the high-level structure of a Data
Warehouse, focusing on how data is organized and related.
Steps in Conceptual Modeling:
1. Identify Business Needs – Define objectives and key metrics.
2. Determine Key Entities & Relationships – Identify facts
(measurable data) and dimensions (contextual data).
3. Select Schema Type – Choose Star, Snowflake, or Galaxy schema.
4. Define Concept Hierarchies – Establish levels of data granularity
(e.g., Year → Quarter → Month).
5. Validate & Refine – Ensure the model supports business
intelligence requirements.
7. What are concept hierarchies in a Data Warehouse?
A Concept Hierarchy is a way to organize data into multiple levels of
granularity, allowing flexible analysis. It helps in drill-down and roll-up
operations in OLAP.
Example:
• Location Hierarchy: Country → State → City → Store
• Time Hierarchy: Year → Quarter → Month → Day
Benefits:
• Enables detailed and summarized analysis.
• Improves data navigation and reporting flexibility.
• Supports faster decision-making in BI systems.

8. Measures, Dimension Attributes, and Fact Tables


Fact Tables - store quantitative data for analysis. They contain facts
(numerical values) and foreign keys that reference dimension tables.
Example: A sales fact table might include total sales amount, units sold,
and keys to product and time dimension tables.
Dimension Attributes - provide context for facts. They describe the
characteristics of the data stored in fact tables and help with data
categorization.
Example: A product dimension might include attributes like product ID,
name, category, and brand.
Measures - are the actual numerical values stored in the fact table that
you want to analyze. They represent aggregated data.
Example: In the sales table, revenue and quantity sold are measures.
9. OLAP Operations in the Multidimensional Data Model. OLAP services.
OLAP (Online Analytical Processing) operations allow users to perform
complex queries and analysis on multidimensional data. Key operations
include:
• Slice: Selects a single dimension from a cube, reducing the
dimensionality.
Example: Viewing sales data for only the year 2023.
• Dice: Selects two or more dimensions to create a sub-cube.
Example: Analyzing sales data for specific products in a particular
region.
• Drill Down (or Up): Navigates through levels of data; drilling down
goes to finer details, while drilling up summarizes data.
Example: From annual sales to monthly sales figures.
• Roll-up: Moves from detailed data to summarized data by climbing
up a data hierarchy.
Example: Daily sales → Monthly sales → Yearly sales.
• Pivot (or Rotate): Reorients the data view for easier analysis
without altering the data itself.
Example: Changing rows to columns to see product performance
across different regions.
Main OLAP Services:
• Multidimensional Data View: Organizes data into cubes with
dimensions like time, location, product, etc.
Example: View sales by product, by region, and by year.
• Fast Query Performance: Allows quick responses even for large data
sets using pre-aggregated (summary) data.
• Complex Analysis Support: Supports advanced calculations like totals,
averages, growth rates, etc.
• Security and Access Control: Gives permission-based access so only
the right people can view or change data.
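The slice, dice, roll-up, drill-down, and pivot operations above can be mimicked on a plain DataFrame. This is only an analogy sketch with invented columns, not a real OLAP engine.

```python
import pandas as pd

sales = pd.DataFrame({
    "year":    [2023, 2023, 2023, 2024, 2024, 2024],
    "month":   ["Jan", "Feb", "Jan", "Jan", "Feb", "Feb"],
    "product": ["Pen", "Pen", "Book", "Pen", "Book", "Book"],
    "region":  ["East", "West", "East", "West", "East", "West"],
    "amount":  [100, 120, 200, 110, 210, 190],
})

# Slice: fix one dimension (only the year 2023)
slice_2023 = sales[sales["year"] == 2023]

# Dice: restrict two or more dimensions (Pen sales in the East region)
dice = sales[(sales["product"] == "Pen") & (sales["region"] == "East")]

# Roll-up: move to a coarser level (monthly detail -> yearly totals)
rollup = sales.groupby("year")["amount"].sum()

# Drill-down: go back to finer detail (year -> year and month)
drilldown = sales.groupby(["year", "month"])["amount"].sum()

# Pivot (rotate): reorient the view without changing the data
pivoted = sales.pivot_table(index="region", columns="year",
                            values="amount", aggfunc="sum")
print(rollup, drilldown, pivoted, sep="\n\n")
```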
10. Describe Data pre-processing and its steps. What is an aggregate in DW?
Design Process of a Data Warehouse and Its Usage
Data preprocessing - is the first step in data mining or machine
learning. It means cleaning and preparing raw data so it can be used for
analysis or modelling. We can think of it as “cleaning and arranging the data” before using it.
Main Steps in Data Preprocessing:
1. Data Cleaning – Fix or remove wrong, missing, or duplicate data.
Example: Filling empty values or removing errors.
2. Data Integration – Combine data from different sources.
Example: Joining sales data from two different branches.
3. Data Transformation – Convert data into a proper format.
Example: Changing dates to a common format like DD/MM/YYYY.
4. Data Reduction – Make the data smaller without losing important
info.
Example: Keeping only needed columns.
5. Data Discretization – Convert continuous data into categories.
Example: Age 0–18 = “Child”, 19–60 = “Adult”, 60+ = “Senior”
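A small pandas sketch of the steps above (all column names and values are invented, and this assumes pandas is installed):

```python
import pandas as pd

raw = pd.DataFrame({
    "name": ["Asha", "Ravi", "Asha", "Meena"],
    "age":  [25, None, 25, 70],
    "city": ["Kolkata", "kolkata", "Kolkata", "Delhi"],
    "join_date": ["2023-01-05", "2023-02-05", "2023-01-05", "2023-03-10"],
})

# 1. Data cleaning: fill missing ages with the mean, drop duplicate rows
raw["age"] = raw["age"].fillna(raw["age"].mean())
clean = raw.drop_duplicates().copy()

# 2/3. Integration & transformation: fix case mismatches and put dates
#      into one common DD/MM/YYYY format
clean["city"] = clean["city"].str.title()
clean["join_date"] = pd.to_datetime(clean["join_date"]).dt.strftime("%d/%m/%Y")

# 4. Data reduction: keep only the columns needed for analysis
reduced = clean[["age", "city"]].copy()

# 5. Discretization: turn continuous age into categories
reduced["age_group"] = pd.cut(reduced["age"], bins=[0, 18, 60, 120],
                              labels=["Child", "Adult", "Senior"])
print(reduced)
```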

Aggregate: An aggregate is a summary of data made by combining it in some way, usually by adding, counting, or averaging values. Aggregates help make queries faster by storing already-calculated results.
Example of Aggregate:
Suppose a person has daily sales:
• Jan 1: ₹1000
• Jan 2: ₹1200
• Jan 3: ₹1100
Monthly Aggregate:
January Sales = ₹3300 (sum of Jan 1, Jan 2, Jan 3…)
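The same monthly aggregate can be pre-computed with a one-line groupby; a tiny sketch with the rows above (pandas assumed):

```python
import pandas as pd

daily = pd.DataFrame({
    "date":  pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-03"]),
    "sales": [1000, 1200, 1100],
})

# Aggregate: compute the monthly total once so later queries are fast
monthly = daily.groupby(daily["date"].dt.to_period("M"))["sales"].sum()
print(monthly)   # period 2024-01 -> 3300
```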

The design process of a data warehouse includes the following steps:


1. Requirement Analysis: Identify the analytical needs of the
business.
2. Choosing the Right Model: Decide between a star schema or
snowflake schema based on requirements.
3. Data Source Integration: Gather and integrate data from various
sources.
4. Data Transformation: Cleanse and transform the data into a
suitable format for analysis.
5. Load Data (ETL): Extract, Transform, Load (ETL) the data into the
data warehouse.
Usage in Analytical Processing: Once established, the data
warehouse supports complex queries and data analysis, enabling
decision-making through tools like OLAP. It enhances reporting, trend
analysis, and predictive modeling, providing valuable insights.

11. Multidimensional Data Model and Multidimensional Data Mining in Data Warehousing.
Multidimensional Data Model - This is a way of organizing and storing
data in the form of a data cube.
It is mostly used in OLAP systems.
• It lets us view data across multiple dimensions (like time, location,
product, etc.).
• It’s good for reporting, summarizing, and quick analysis.
• Example: we can view total sales by product, by month, and by
region.
Purpose: Mainly for fast analysis and business reporting.
Multidimensional Data Mining - helps find patterns and insights from
data organized in various directions. Here’s why it’s important:
• Better Analysis: It lets businesses look at data in different ways (like
time, place, and products) to spot trends and connections.
• Smart Decisions: The insights gained help companies make better
plans and choices.
• Understanding Customers: Businesses can learn more about what
customers like and how they behave.
• Increased Efficiency: This method makes it quicker and easier to
analyze data compared to older techniques.
Example: A retail store might analyze customer buying habits by looking
at sales data by time, location, and product type. This helps them manage
stock and improve marketing efforts.

1. Top-Down Approach in DW
• Proposed by Bill Inmon (Father of Data Warehousing)
• Build the main data warehouse first (central storage), then create
data marts (smaller parts) for specific departments like sales, HR,
finance, etc.
Features:
• Focus on the entire organization
• Data is cleaned and integrated in the warehouse
• Good for long-term, large systems
Example:
Build the full warehouse, then create small sections for sales reports or
customer info.
2. Bottom-Up Approach in DW
• Proposed by Ralph Kimball
• Start by building data marts first for each department, then
combine them to form the complete data warehouse.
Features:
• Faster to implement
• Good for quick business needs
• Easier to manage smaller sections
Example:
First create a sales data mart, then later combine with HR and finance
marts to form the full warehouse.

UNIT 2: INTRODUCTION TO DATA MINING


12. Data Mining and the KDD (Knowledge Discovery in Databases) steps.
Data Mining is finding patterns and connections in large amounts of data
to help make smart decisions or predictions. It uses statistical methods
and machine learning to analyze data and get useful information.
The Knowledge Discovery Process includes several simple steps:
1. Data Selection: Choosing the relevant data from various sources for
analysis.
Example: Collecting customer transaction data for analysis.
2. Data Preprocessing: Cleaning and transforming data to remove
noise and deal with missing values.
Example: Deleting duplicate records from the dataset.
3. Data Transformation: Converting data into a suitable format for
mining (e.g., normalization, aggregation).
Example: Summing up sales data by month instead of looking at
daily sales.
4. Data Mining: Applying algorithms to extract patterns and
knowledge from the pre-processed data.
Example: Grouping customers based on their buying habits.
5. Interpretation and Evaluation: Study the patterns found and check
that they are useful and relevant.
Example: Understanding how the identified customer groups will
affect business decisions.
6. Deployment (Knowledge Presentation): Use the insights in real-life
situations to help with decision-making.
Example: Creating targeted marketing campaigns for different
customer groups.

13. Data Mining Architecture. Different Types of Data Repositories Used in Data Mining.
Main Components of Data Mining Architecture:
1. Data Sources
o Databases, data warehouses, flat files, or web data
o This is where raw data comes from
2. Data Warehouse (or Database Server)
o Stores the collected and integrated data
o Prepares data for mining (cleaned, formatted)
3. Data Mining Engine
o The core part of the system
o Applies mining algorithms like classification, clustering,
association, etc.
4. Pattern Evaluation Module
o Filters useful and interesting patterns from the mined results
o Removes unimportant or repeated patterns
5. Knowledge Base
o Stores rules, past results, user feedback, and domain
knowledge
o Helps guide and improve mining process
6. User Interface
o Allows users to interact with the system
o Users can give queries, set mining tasks, and see results
through graphs or reports

Data mining utilizes several types of data repositories, including:


1. Databases: Organized collections of data stored in systems like
relational databases which use tables or NoSQL databases which are
more flexible.
Example: A SQL database storing sales records.
2. Data Warehouses: Large systems that store historical data collected
from different sources, mainly used for analysis and reporting. They
organize data by categories, making it easier to analyze trends over
time.
Example: A data warehouse that combines data from various stores in
a retail chain.
3. Data Lakes: Places where all kinds of raw data are stored in their
original form. They can hold many types of data, like logs, images,
videos, or text, and are flexible for different analysis methods.
Example: An Apache Hadoop system that stores different types of files
like logs, images, and videos.
4. Flat Files: Simple files like text or CSV files that store structured or
unstructured data. They are easy to use and view, often used for small
projects or data transfer.
Example: A spreadsheet with customer survey answers.
5. Web and Social Media Data: Data collected from online platforms like
websites, social media, or forums. Used to analyse trends, opinions, or
user behaviour.
Example: Analysing Twitter posts to gauge public sentiment about a
new product.

14. Functionalities of Data Mining


Data mining includes different functions that help extract useful
insights from data:

1. Classification: Sorting data into predefined categories based on learned patterns.
Example: Classifying emails as spam or not spam using decision trees.
2. Regression: Predicting a continuous number based on input data.
Example: Estimating house prices based on size, location, and age.
3. Clustering: Grouping similar data points together without predefined labels.
Example: Grouping customers based on their buying habits for better marketing.
4. Association Rule Learning: Finding interesting relationships or rules between items in large data sets.
Example: Noticing that customers who buy bread often also buy butter.
5. Anomaly Detection: Identifying rare or unusual data points that stand out from normal patterns.
Example: Detecting fraud in credit card transactions.
6. Text Mining: Pulling useful information from text data.
Example: Analysing customer feedback for sentiment.
16. Data Mining Tasks and Trends
Data Mining Tasks (functionalities)
1. Classification: Sorting data into defined categories.
Example: Identifying emails as spam or not spam.
2. Regression: Predicting a number based on input data.
Example: Estimating a house price based on its features.
3. Clustering: Grouping similar data points without labels.
Example: Segmenting customers by their buying habits.
4. Association Rule Learning: Finding relationships between items in
large datasets.
Example: Noticing that customers who buy milk usually buy bread.
5. Anomaly Detection: Spotting outliers that are very different from
the rest of the data.
Example: Identifying fraudulent credit card charges.
6. Text Mining: Pulling useful information from text data.
Example: Analyzing customer feedback for sentiment.

Trends in Data Mining


1. Big Data Analytics: Combining large datasets with traditional
mining methods.
2. AI and Machine Learning: More use of AI/ML algorithms for
better data analysis.
3. AutoML: Making machine learning easier and accessible for
everyone.
4. Real-time Data Processing: Focusing on analyzing data instantly
for quick decisions.
17. Major Issues Associated with Data Mining
1. Data Privacy: Concerns about the storage and analysis of personal
data can lead to privacy violations.
o Example: Mining social media data without user consent may
breach privacy laws.
2. Data Quality: Poor-quality data can lead to inaccurate results and
insights; thus, data cleansing is crucial.
o Example: Incomplete or inconsistent transaction records can
skew sales analysis.
3. Overfitting: Creating overly complex models that perform well on
training data but fail on unseen data.
o Example: A model that learns noise in the training dataset
rather than generalizable patterns leading to poor predictive
performance.
4. Interpretability: Complex models, especially in deep learning, can
be difficult to understand, making it hard for users to trust insights.
o Example: A black-box model that offers predictions without
clear explanations can lead to skepticism about its reliability.

18. What is a Decision Tree? Decision Tree-Based Algorithms, Advantages, and Disadvantages.
A Decision Tree is a tree-like model used in machine learning and data
mining to make decisions or predictions based on data.
• Each node represents a question or condition based on a feature
(e.g., "Is age > 18?").
• Each branch represents the answer (Yes/No).
• Each leaf node gives the final result (e.g., "Allow Loan" or "Deny
Loan").
Why Use a Decision Tree?
• Easy to understand and explain
• Works with both numbers and text data
• Can handle missing data
• Used for classification and regression problems
Decision Tree-Based Algorithms:

• ID3 (Iterative Dichotomiser 3): Uses information gain to split data. Simple and fast.
• C4.5: An improved version of ID3. Handles missing values and both continuous & discrete data.
• CART (Classification and Regression Tree): Uses the Gini Index. Can be used for both classification and regression tasks.
• Random Forest: A group (ensemble) of many decision trees. More accurate and stable.
• Gradient Boosted Trees (GBM, XGBoost, etc.): Builds trees one after another, each improving on the previous. Very powerful and accurate.

Advantages of Decision Tree:
• Very easy to understand and explain
• Works with both numbers and categories
• Needs little data cleaning
• Fast to build and use
• Can handle missing values
Disadvantages of Decision Tree:
• Can overfit (too complex for small data)
• Unstable: small data changes can change the tree
• Not always very accurate
• Can be biased toward features with more categories
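An illustrative sketch (assuming scikit-learn is available; the toy loan data is invented): scikit-learn's DecisionTreeClassifier is a CART-style tree that can be trained and inspected like this.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy training data: [age, income_in_thousands] -> loan approved (1) or not (0)
X = [[22, 25], [35, 60], [45, 80], [23, 20], [52, 90], [30, 40]]
y = [0, 1, 1, 0, 1, 0]

# CART-style tree using the Gini index by default; depth limited to avoid overfitting
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(X, y)

# Predict for a new applicant and print the learned if/then rules
print(tree.predict([[40, 70]]))
print(export_text(tree, feature_names=["age", "income"]))
```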
UNIT 3: ASSOCIATION AND CORRELATION ANALYSIS
18. Basic Concepts of Association Rule Learning.
1. Items: The individual products or features in the dataset.
Example: Milk, bread, and eggs.
2. Itemsets: Groups of items that appear together in transactions.
Example: {Milk, Bread} is an itemset.
3. Support: Support shows how often an itemset appears in all
transactions. It tells us the popularity of item combinations.
Example: If out of 100 transactions, 20 include both Milk and
Bread, the support for {Milk, Bread} is 20%.
This helps identify common item combinations.
4. Confidence: Confidence measures the probability that a customer
who buys one item will buy another item as well.
Example: If 80% of customers who buy Bread also buy Milk, the
confidence of {Bread} → {Milk} is 80%.
It shows the strength of the rule "If A, then B".
5. Lift: Lift compares how often two items occur together versus if
they were independent. A lift > 1 means they occur together more
often than by chance.
Example: If the lift for {Bread} and {Milk} is 1.5, it means buying
Bread increases the chance of buying Milk by 50% compared to
random chance.
Lift helps understand if items truly influence each other.
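Support, confidence, and lift can be computed directly from a list of transactions; a minimal sketch with invented toy data:

```python
# Toy transactions (each is a set of purchased items)
transactions = [
    {"Bread", "Milk"}, {"Bread"}, {"Bread", "Milk", "Butter"},
    {"Milk"}, {"Bread", "Milk"},
]
n = len(transactions)

def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / n

# Rule: {Bread} -> {Milk}
sup_bread = support({"Bread"})
sup_both = support({"Bread", "Milk"})
confidence = sup_both / sup_bread          # P(Milk | Bread)
lift = confidence / support({"Milk"})      # > 1 means a positive association

print(f"support={sup_both:.2f}, confidence={confidence:.2f}, lift={lift:.2f}")
```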

19. How Association Rule Learning Works


1. Data Collection: Gather transaction data, like what items are
bought together.
Example: Records from a grocery store.
2. Frequent Itemset Generation: Identify item-sets that meet a
minimum support threshold.
Example: Finding pairs like {Milk, Bread} that appear in at least 20%
of transactions.
3. Rule Generation: Create rules based on frequent item-sets, focusing
on those with high confidence.
Example: If customers frequently buy {Milk, Bread}, a rule might be
"If a customer buys Bread, they are likely to buy Milk."
4. Evaluation: Analyse the rules using metrics like support, confidence,
and lift to find the strongest relationships.
Example: Only keep rules that have high confidence and lift for
better insights.
5. Application: Use the rules to inform marketing strategies, such as
product placement or cross-selling.
Example: Placing milk and bread close to each other in a store to
boost sales.

19. Basic methods for data cleaning. Define data reduction techniques.
Common Data Cleaning Methods:
1. Handling Missing Data
o Fill in with average, most common value, or use prediction
o Or, remove rows with too many missing values
2. Removing Duplicates
o Identify and delete repeated records
3. Correcting Inconsistent Data
o Fix errors like wrong spelling, date formats, or case mismatch
o Example: "Kolkata" vs "kolkata"
4. Filtering Noise
o Remove outliers or irrelevant data that can affect results
5. Standardizing Data
o Convert data into a common format
o Example: Date format as DD/MM/YYYY in all rows
Data Reduction Techniques: Data reduction means reducing the size of
data without losing important information. It helps in faster analysis and
saves storage.
Types of Data Reduction Techniques:
1. Dimensionality Reduction
o Removes unimportant columns/features
o Example: Using PCA (Principal Component Analysis)
2. Numerosity Reduction
o Replaces original data with a smaller form (summary or
model)
o Example: Storing average sales per month instead of daily
records
3. Data Compression
o Uses encoding techniques to store data in less space
o Example: ZIP files, or encoding values with fewer bits
4. Data Aggregation
o Summarizes detailed data
o Example: Convert daily sales data into monthly totals

20. Explain the Apriori Algorithm of association rule mining and its steps, and provide an example. State the Apriori properties.
The Apriori Algorithm is a simple method in data mining to find
common groups of items in large datasets. It helps identify patterns, like
which products are often bought together, for example, bread and butter.
Here are its steps:
Apriori Algorithm: Step-by-Step
Step 1: Find the frequent 1-itemsets
➢ Count the support (occurrence) of each individual item in all
transactions.
➢ Keep only those items whose support ≥ minimum support
threshold.
Step 2: Generate candidate k-itemsets (k > 1)
➢ Use the frequent (k-1)-itemsets to generate new candidate k-
itemsets by joining pairs that share (k-2) items.
Step 3: Prune candidate k-itemsets
➢ Remove candidate k-itemsets if any of their (k-1)-subsets are not
frequent (based on the Apriori property).
Step 4: Count support of candidate k-itemsets
➢ Scan the dataset and count how many transactions contain each
candidate.
➢ Keep only those with support ≥ minimum support.
Step 5: Repeat
➢ Repeat steps 2-4 for larger k until no new frequent itemsets are
found.
Example:
• Transactions: {Bread, Butter, Milk}, {Bread, Milk}, {Bread}, {Butter}.
• Minimum support = 50%.
1. 1-itemsets:
{Bread} → 3/4 = 75% (keep), {Butter} → 2/4 = 50% (keep),
{Milk} → 2/4 = 50% (keep).
2. 2-itemsets:
Check pairs: {Bread, Butter} → 1/4 = 25% (discard),
{Bread, Milk} → 2/4 = 50% (keep), {Butter, Milk} → 1/4 = 25% (discard).
3. Results:
o Frequent itemsets: {Bread}, {Butter}, {Milk}, and {Bread, Milk}.

Apriori Properties:
1. Anti-monotonicity:
o If an itemset is frequent, all its subsets are also frequent.
o If an itemset is infrequent, any superset of it will also be
infrequent.
2. Downward Closure Property:
o Used to reduce the search space in frequent itemset mining.
o Helps in pruning the itemsets that are unlikely to be frequent.
In simple terms:
• Big itemsets can only be frequent if all their smaller parts are
frequent.
• This property makes algorithms like Apriori efficient for finding
frequent itemsets.
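A compact sketch of the level-wise idea using only the standard library (it reproduces the bread/butter/milk example above; not an optimized implementation):

```python
from itertools import combinations

transactions = [
    {"Bread", "Butter", "Milk"}, {"Bread", "Milk"}, {"Bread"}, {"Butter"},
]
min_support = 0.5
n = len(transactions)

def support(itemset):
    return sum(itemset <= t for t in transactions) / n

# Step 1: frequent 1-itemsets
items = sorted({i for t in transactions for i in t})
frequent = [frozenset([i]) for i in items if support(frozenset([i])) >= min_support]
all_frequent = list(frequent)

k = 2
while frequent:
    # Step 2: candidate k-itemsets from unions of frequent (k-1)-itemsets
    candidates = {a | b for a in frequent for b in frequent if len(a | b) == k}
    # Step 3: prune candidates having an infrequent (k-1)-subset (Apriori property)
    candidates = {c for c in candidates
                  if all(frozenset(s) in frequent for s in combinations(c, k - 1))}
    # Step 4: keep candidates that meet the minimum support
    frequent = [c for c in candidates if support(c) >= min_support]
    all_frequent.extend(frequent)
    k += 1

print([set(s) for s in all_frequent])
# Expected, as in the example: {Bread}, {Butter}, {Milk}, {Bread, Milk}
```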

21. Applications of Association Rule Learning in Real-World Scenarios


1. Market Basket Analysis: Retailers analyze purchase patterns to place
related products together and boost sales.
Example: People who buy diapers also tend to buy baby wipes.
2. Recommendation Systems: Online platforms recommend products
based on user behavior and purchase history.
Example: Streaming services suggesting shows based on viewing
habits.
3. Fraud Detection: Financial institutions identify unusual patterns in
transactions to detect fraud.
Example: Detecting unexpected large purchases after a series of small
ones.
4. Customer purchase pattern analysis: Businesses group customers
based on purchasing habits to target marketing efforts.
Example: Identifying loyal customers who frequently buy certain
brands.
5. Healthcare: Analyzing patient data to find associations between
symptoms and diseases for better diagnosis.
Example: Finding that patients with certain symptoms often have a
specific illness.

Applications (short):
1. Market basket analysis (e.g., "if a customer buys bread, they are likely to buy butter").
2. Recommender systems (e.g., Netflix, Amazon).
3. Customer purchase pattern analysis.
4. Website clickstream analysis.
5. Intrusion/fraud detection.
6. Medical diagnosis (e.g., symptom-disease associations).

UNIT 4: CLUSTERING ALGORITHMS AND CLUSTER ANALYSIS


23. Basic Idea of Unsupervised Learning in Clustering. Clustering and
different types of Clustering Methods.
Unsupervised learning is a type of machine learning where the model learns
patterns from data without pre-labeled categories. The main idea of
clustering is to group similar data points together based on their features.
• No Labels: Unlike supervised learning, there are no predefined classes
or labels for the data.
• Natural Groupings: The algorithm finds natural groupings or clusters
within the data.
• Applications: Clustering helps in discovering patterns, customer
segmentation, and organizing large datasets.
Clustering: It is a data mining technique used to group similar data items
together. It finds patterns or similarities in data and divides them into
clusters (groups), where data in the same cluster are more similar to each
other than to those in other clusters.
• To understand patterns in data
• To organize or classify data without using labels
• Used in market research, image recognition, customer segmentation, etc.
Types of Clustering Methods:

1. Partitioning Method: Divides data into non-overlapping groups. You decide the number of clusters (e.g., K-Means).
2. Hierarchical Clustering: Creates a tree-like structure of clusters, either by merging or splitting. No need to pre-define the cluster count.
3. Density-Based Clustering: Forms clusters based on areas of high density in the data. Can find clusters of any shape (e.g., DBSCAN).
4. Grid-Based Clustering: Divides the data space into a grid and then forms clusters based on grid cells (e.g., STING).
5. Model-Based Clustering: Assumes a model for each cluster and fits the data accordingly (e.g., Expectation-Maximization using Gaussian models).

24. What is K-Medoids Clustering (PAM)?


K-Medoids clustering (Partitioning Around Medoids, PAM) is a clustering
technique that chooses actual data points as centers (medoids) instead of
averages used in K-Means.
25. Explain the K-Means Clustering Algorithm
The K-Means algorithm is a popular method used for clustering data
into K number of groups. Here's how it works:
1. Choose K: Select the number of clusters (K) you want to create.
Example: If you want to group data into 3 clusters, K = 3.
2. Initialize Centroids: Randomly select K points from the data as initial
centroids (cluster centers).
3. Assign Clusters: Assign each data point to the nearest centroid,
forming K clusters.
Example: Points closer to one centroid are grouped together.
4. Update Centroids: Calculate the new centroids by finding the average
position of all points in each cluster.
5. Repeat: Continue the process of assigning points to clusters and
updating centroids until the centroids no longer change significantly.
Result: The final clusters show how the data points are grouped based on
similarity.
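A minimal scikit-learn sketch of the algorithm above (library assumed installed; the points are invented):

```python
from sklearn.cluster import KMeans

# Toy 2-D points: two visually separate groups
X = [[1, 2], [1, 4], [2, 3], [8, 8], [9, 10], [8, 9]]

# K = 2 clusters; n_init and random_state make the run reproducible
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

print(kmeans.labels_)           # cluster assigned to each point
print(kmeans.cluster_centers_)  # final centroids (mean of each cluster)
```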

26. Describe Hierarchical Clustering and Its Types


Hierarchical Clustering is a clustering method that builds a hierarchy of
clusters. It can be divided into two main types:
1. Agglomerative Clustering (Bottom-Up)
• Starts with each point as its own cluster
• Merges the most similar clusters step by step
• Ends when all points are merged into one big cluster or until the
desired number of clusters is reached
Steps:
1. Treat each data point as a separate cluster.
2. Measure distance between all clusters (e.g., single linkage, complete
linkage).
3. Merge the two closest clusters.
4. Repeat until you're left with one cluster or desired number of clusters.
2. Divisive Clustering (Top-Down)
• Starts with all data in one cluster
• Splits the cluster into smaller groups recursively
• Continues until each point is its own cluster or a stopping condition
is met
Steps:
1. Put all data points into one large cluster.
2. Split the cluster based on differences among data.
3. Keep dividing until every point is in its own cluster or until you reach
the desired number of clusters.
3. Dendrogram: A dendrogram is a tree-like diagram used to show how
clusters are merged or split during hierarchical clustering.
It helps to visualize the structure and decide how many clusters to
choose.
Example (Agglomerative Clustering):
Let's say we have 4 points: A, B, C, and D.
1. Start: A, B, C, D are all separate clusters
2. Merge the two closest → (A, B), C, D
3. Merge the next closest → (A, B), (C, D)
4. Final merge → (A, B, C, D)
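A small sketch of agglomerative clustering and its dendrogram using SciPy and matplotlib (both assumed installed; the four points are invented and could stand for A, B, C, D above):

```python
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram
import matplotlib.pyplot as plt

# Four toy 2-D points
points = [[1, 1], [1.2, 1.1], [5, 5], [5.1, 4.9]]

# Agglomerative (bottom-up) clustering with single linkage
Z = linkage(points, method="single")

# Cut the tree to obtain 2 flat clusters
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)            # e.g. [1 1 2 2]

# Dendrogram showing the order in which clusters were merged
dendrogram(Z, labels=["A", "B", "C", "D"])
plt.show()
```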

27. What is Graph-Based Clustering, and How is It Implemented?


=> Graph-Based Clustering involves representing data as a graph, where
data points are nodes and edges represent relationships or similarities
between them. Clusters are formed by grouping connected nodes.
Implementation Steps:
1. Create a Graph:
o Represent data points as nodes.
o Define edges based on similarity, often using a distance metric
(e.g., Euclidean distance).
2. Build the Adjacency Matrix:
o Create a matrix to represent which nodes (data points) are
connected. An entry indicates if a connection (edge) exists.
3. Choose a Clustering Algorithm:
o Popular methods include Spectral Clustering, which uses
eigenvalues of the Laplacian matrix, and Community Detection
Algorithms like Louvain method.
4. Cluster Identification:
o Identify clusters by finding connected components in the graph
or by partitioning the graph into subgraphs.
5. Refinement:
o Sometimes, refine clusters based on metrics like modularity to
ensure they are meaningful.
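A toy sketch of steps 1 and 4 using networkx (assumed installed): build a similarity graph by connecting points whose distance falls below a threshold, then read clusters off as connected components. The points and threshold are invented.

```python
import math
import networkx as nx

points = {"A": (0, 0), "B": (0.5, 0.2), "C": (5, 5), "D": (5.3, 4.8)}
threshold = 1.0   # connect points whose Euclidean distance is below this

G = nx.Graph()
G.add_nodes_from(points)
for u in points:
    for v in points:
        if u < v and math.dist(points[u], points[v]) < threshold:
            G.add_edge(u, v)

# Each connected component of the graph is one cluster
clusters = list(nx.connected_components(G))
print(clusters)   # expected: [{'A', 'B'}, {'C', 'D'}]
```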

28. Explain the Basics of Cluster Analysis and Its Importance


Cluster Analysis is a statistical technique used to group a set of objects
such that objects in the same group (cluster) are more similar to each
other than to those in other groups.
Importance:
1. Data Exploration: Helps in understanding the underlying patterns
and structures in the data.
2. Customer Segmentation: Businesses can group customers based on
behavior or preferences for targeted marketing.
3. Anomaly Detection: Identifying outliers can help in fraud detection
or finding data entry errors.
4. Recommendation Systems: Improves recommendations by
grouping similar items or users.
5. Image Segmentation: Useful in computer vision to group pixels with
similar characteristics.

29. How Can Clusters Be Evaluated? Discuss the Metrics Used.


Clusters can be evaluated using several metrics to assess their quality and
validity:
1. Silhouette Score: Measures how similar an object is to its own
cluster compared to other clusters.
o Ranges from -1 to +1: A higher score indicates better-defined
clusters.
2. Davies-Bouldin Index: This metric computes the ratio of the sum of
within-cluster scatter to between-cluster separation.
o Lower values indicate better clustering.
3. Dunn Index: Evaluates the ratio of the smallest inter-cluster
distance to the largest intra-cluster distance.
o Higher values suggest better clustering.
4. Within-Cluster Sum of Squares (WCSS): Measures the compactness
of clusters by calculating the total variance within each cluster.
o Lower values indicate tighter clusters.
5. External Validation Indices: Such as Adjusted Rand Index (ARI),
which compares the clustering with a ground truth label set.
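Several of these metrics are available in scikit-learn; a quick sketch on toy data (scikit-learn assumed installed):

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, davies_bouldin_score

X = [[1, 2], [1, 4], [2, 3], [8, 8], [9, 10], [8, 9]]
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
labels = model.labels_

# Silhouette: higher is better (range -1 to +1)
print("Silhouette:", silhouette_score(X, labels))
# Davies-Bouldin: lower is better
print("Davies-Bouldin:", davies_bouldin_score(X, labels))
# WCSS (within-cluster sum of squares) is exposed as inertia_ by K-Means
print("WCSS:", model.inertia_)
```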

30. What Techniques Are Used for Outlier Detection and Analysis in
Datasets?
Outlier detection identifies data points that significantly differ from
others. Common techniques include:
1. Statistical Methods:
o Use statistical tests (e.g., z-scores) to find points that fall
beyond a certain threshold (e.g., 3 standard deviations from
the mean).
2. Distance-Based Methods:
o Measure the distance of each point from others. Points far
away (e.g., using k-nearest neighbors) can be considered
outliers.
3. Clustering-Based Methods:
o Cluster data and classify points that do not belong to any
cluster or are far from cluster centroids as outliers.
4. Isolation Forest:
o A machine learning method that creates random partitions to
isolate outliers; it’s effective for high-dimensional data.
5. LOF (Local Outlier Factor):
o Measures the local density of data points, flagging those that
have significantly lower density compared to their neighbors.
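Two of these techniques in a short sketch (z-scores with NumPy and scikit-learn's IsolationForest; the data is invented):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

values = np.array([10, 11, 9, 10, 12, 11, 10, 9, 11, 10,
                   12, 9, 10, 11, 10, 95], dtype=float)

# 1. Statistical method: flag points more than 3 standard deviations from the mean
z = (values - values.mean()) / values.std()
print("z-score outliers:", values[np.abs(z) > 3])

# 2. Isolation Forest: isolates anomalies with random partitions (-1 = outlier);
#    contamination is a rough guess of the outlier fraction
iso = IsolationForest(contamination=0.1, random_state=0)
flags = iso.fit_predict(values.reshape(-1, 1))
print("IsolationForest outliers:", values[flags == -1])
```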

UNIT 5: CLASSIFICATION
31. Define Supervised Learning and the Classification Technique
Supervised Learning is a type of machine learning where a model is
trained on a labeled dataset, meaning each training example includes
both the input data and the correct output (label). The goal is to learn a
mapping from inputs to outputs so the model can make predictions on
new, unseen data.
Classification Technique: Classification is a specific type of supervised
learning used to categorize input data into predefined classes or labels.
The model learns from labeled data and then predicts the class for new
data points.
• Example: Classifying emails as "spam" or "not spam" based on
features like keywords and sender information.
32. Discuss the Issues Related to Classification in Data Mining
Classification in data mining faces several challenges:
1. Imbalanced Datasets:
o When one class has significantly more samples than others,
leading to biased predictions.
o Example: In fraud detection, there are many legitimate
transactions and few fraud cases.
2. Overfitting:
o When the model learns the training data too well, capturing
noise instead of the underlying pattern, resulting in poor
performance on new data.
3. Underfitting:
o Occurs when the model is too simple to capture the
underlying trends in the data, leading to low accuracy.
4. Feature Selection:
o Choosing the right features is crucial. Irrelevant or redundant
features can degrade model performance.
5. Noise in Data:
o Inaccurate or noisy data can mislead the model during
training, resulting in incorrect classifications.

34. Explain Classification, Bayesian Classification and its features, and the Naïve Bayes Algorithm.
Classification is a data mining technique used to predict categories or
classes for data based on past information. Example:
• Email: Spam or Not Spam
• Customer: Will Buy or Not Buy
• Patient: Sick or Healthy
Bayesian Classification is a method based on Bayes' Theorem, which uses
probability to classify data. It calculates the likelihood of each class given the
input features and assigns the class with the highest probability.

Features of Bayesian Classification:
• Simple & Fast: Easy to build and works well with large data sets.
• Uses Probabilities: Predicts using probability, not hard rules.
• Works with Text Data: Commonly used for spam detection and sentiment analysis.
• Handles Missing Values: Can still predict even if some data is missing.
• Based on Bayes’ Theorem: Uses mathematical logic to make predictions.
• Assumes Independence: Assumes features don’t affect each other (this is the “naive” part).

Naïve Bayes Algorithm


The Naïve Bayes algorithm is a straightforward implementation of Bayesian
classification that assumes independence among features. Its key steps are:
1. Calculate Prior Probability: Determine the probability of each class based
on the training data.
Example: If 70% of emails are not spam, the prior probability of "Not
Spam" is 0.7.
2. Calculate Likelihood: Compute the probability of each feature given a class.
Example: Find the probability of a specific word appearing in spam versus
non-spam emails.
3. Apply Bayes’ Theorem: Combine the prior probabilities and the likelihoods
to get the posterior probability for each class.
Example: Use Bayes’ Theorem to calculate the probability that an email is
spam based on its features.
4. Classify: Assign the class with the highest probability to the input data.
Example: Naïve Bayes is commonly used for email spam detection,
sentiment analysis, and document classification.
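A tiny spam-style sketch with scikit-learn's MultinomialNB (library assumed installed; the four messages are invented):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

messages = ["win money now", "limited offer win prize",
            "meeting at noon", "project report attached"]
labels = ["spam", "spam", "not spam", "not spam"]

# Turn text into word-count features, then fit Naive Bayes. Internally the
# model combines the class prior P(class) with the word likelihoods
# P(word | class) via Bayes' theorem and picks the most probable class.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(messages)
model = MultinomialNB().fit(X, labels)

test = vectorizer.transform(["win a free prize now"])
print(model.predict(test))          # expected: ['spam']
print(model.predict_proba(test))    # posterior probability for each class
```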
35. Describe Association-Based Classification and Provide Examples
Association-Based Classification is a method that combines association
rule mining with classification. It uses association rules to help classify
new instances based on discovered patterns in the data.
How It Works:
1. Generate Association Rules: Use algorithms (like Apriori) to find
relationships in the data.
o Example: Find that "customers who buy milk often buy bread."
2. Build a Classifier: Create a classifier using these rules to predict
classes for new data.
o Rules might say, "If a customer buys bread and milk, classify
them as 'Frequent Shopper'."
3. Classification: For a new instance, apply the relevant rules to
predict its class.
Examples:
• Market Basket Analysis: Predicting customer behavior based on
their purchase history.
• Healthcare: Classifying patients based on symptoms and treatment
combinations found in previous medical records.

36. What is Rule-Based Classification, and How is It Applied in Data Mining?
Rule-Based Classification is a machine learning approach that uses a set
of "if-then" rules to classify data. Each rule describes a condition that,
when met, leads to a specific classification outcome.
How It Works:
1. Rule Generation: Identify patterns in the training data to form rules.
o Example: "If Age > 50 and Blood Pressure > 140, then classify
as 'High Risk'."
2. Rule Evaluation: Assess rules for their accuracy and usefulness
based on training data.
3. Classification: For new data, evaluate which rules apply and assign
the corresponding class.
Applications:
• Medical Diagnosis: Classifying diseases based on symptoms and
test results.
• Customer Segmentation: Categorizing customers into different
groups for targeted marketing.
Examples: Systems like decision trees and fuzzy logic classifiers often use
rule-based methods to make predictions.

UNIT 6: WEB MINING


37. Define Web Mining and its significance in Data Analysis. Write down the applications of web mining.
Web Mining is the process of applying data mining techniques to extract
useful information and patterns from web data, including web content,
web structure, and web usage.
Significance in Data Analysis:
• Helps understand user behavior.
• Improves website structure and content.
• Enables personalized marketing and recommendations.
• Supports decision-making based on user interaction data.
1. User Behavior Insights: Helps organizations understand how users
interact with their websites and content.
o Example: Analyzing click patterns to improve website
navigation.
2. Personalization: Enables tailored experiences for users by
recommending content or products based on their preferences.
o Example: E-commerce sites suggesting items based on past
purchases.
3. Market Research: Provides data for analyzing trends, customer
preferences, and competitors.
o Example: Social media analysis to gauge public opinion on
products.
4. SEO Optimization: Improves search engine rankings by understanding
which keywords and content work best.
o Example: Analyzing top-ranking websites for keyword
utilization.
5. Data Integration: Combines information from various web sources for
comprehensive data analysis.
o Example: Compiling data from forums, blogs, and research
papers for insights on a specific topic.

Applications of Web Mining:


1. Personalized recommendations (e.g., YouTube, Amazon).
2. Search engine optimization and ranking.
3. Targeted online advertising.
4. Customer behavior analysis on websites.
5. Fraud detection in online systems.
6. Web content classification and clustering.

38. Discuss How the Web Page Layout Structure Can Be Mined
Mining the web page layout structure involves analyzing how information
is organized on a webpage to extract meaningful patterns and insights.
Here’s how it can be done:
1. HTML Structure Analysis: Examine the HTML tags and their
arrangement to understand the page layout.
o Example: Identifying headers, footers, sidebars, and main
content areas.
2. Element Extraction: Use parsers to extract specific elements like
text, images, and links based on their tags.
o Example: Extracting all image tags (<img>) to analyze visual
content.
3. Layout Patterns: Identify common layout patterns across different
web pages.
o Example: Determining if most product pages follow a specific
design template.
4. User Interaction Study: Analyze how users interact with different
layout components.
o Example: Observing which sections of the page users click on
most frequently.
5. Dynamic Content Identification: Detect areas that change
frequently (like news sections) versus static content.
o Example: Marking areas that update daily versus those that
remain unchanged.

Typical sections of a web page layout:
1. Header: Top part of the page. Usually contains the logo, site name, and navigation menu (like Home, About, Contact).
2. Navigation Bar (Navbar): Menu bar for moving between different pages or sections of the site. Often inside the header.
3. Main Content Area: The central part where the main content appears, such as text, images, articles, and videos.
4. Sidebar (Optional): A side section (left or right) used for links, ads, categories, or extra info.
5. Footer: Bottom part of the page. Usually contains contact info, copyright, terms of use, and social media links.
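A small sketch of the HTML structure analysis and element extraction steps above, assuming the third-party beautifulsoup4 package is available; the HTML snippet is made up:

```python
from bs4 import BeautifulSoup

html = """
<html><body>
  <header><h1>My Shop</h1><nav><a href="/home">Home</a></nav></header>
  <main><article><h2>Product A</h2><img src="a.jpg"></article></main>
  <footer>Contact us</footer>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Identify layout sections by their tags
print([tag.name for tag in soup.find_all(["header", "nav", "main", "footer"])])

# Element extraction: pull out all images and links
print([img.get("src") for img in soup.find_all("img")])
print([a.get("href") for a in soup.find_all("a")])
```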

39. Explain the Process of Mining Web Link Structure


Mining the web link structure involves analyzing the relationships
between web pages based on hyperlinks. This process helps understand
the connection and flow of information on the web.
Steps in Mining Web Link Structure:
1. Crawling: Use web crawlers to navigate and collect data from web
pages, including their links.
o Example: A crawler visits a page and records all outbound links.
2. Building a Link Graph: Create a graph where nodes represent web
pages and edges represent hyperlinks between them.
o Example: A graph shows how different articles on a news
website are interconnected through links.
3. Analyzing Link Properties:
o In-Degree: Count how many links point to a page; a higher
count may indicate popularity.
o Out-Degree: Count how many links a page points to; this
reflects how much information it shares.
4. Identifying Communities: Use clustering algorithms to find groups of
pages that are closely linked.
o Example: Identifying a cluster of pages that focus on a specific
topic or theme.
5. Applying Metrics: Calculate important metrics like PageRank, which
assesses the influence of a page based on its links.
o Example: Pages with higher PageRank are considered more
authoritative.
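A short sketch of the link-graph idea using networkx (assumed installed; the pages and links are invented):

```python
import networkx as nx

# Directed graph: nodes are pages, edges are hyperlinks
G = nx.DiGraph()
G.add_edges_from([
    ("home", "news"), ("home", "sports"),
    ("news", "article1"), ("sports", "article1"),
    ("article1", "home"),
])

# In-degree / out-degree of each page
print(dict(G.in_degree()), dict(G.out_degree()))

# PageRank: pages linked to by many (important) pages score higher
print(nx.pagerank(G))
```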

40. How is Multimedia Data Analyzed on the Web?


Multimedia data analysis involves examining various forms of content on
the web, including text, images, audio, and video. The aim is to extract
useful information and insights. Here are the main steps involved:
1. Content Extraction: Use tools to extract data from multimedia
sources.
o Example: Extracting audio transcriptions from podcasts or
captions from videos.
2. Feature Extraction: Identify important features from multimedia
content.
o Text: Analyze keywords and sentiment using natural language
processing (NLP).
o Images: Use techniques like edge detection and color
histograms to analyze images.
o Audio: Extract features like pitch, volume, or frequency
patterns.
3. Metadata Analysis: Analyze metadata (information about data)
associated with multimedia files.
o Example: Analyzing tags, descriptions, or timestamps to
understand content better.
4. Classification and Clustering: Use machine learning techniques to
categorize or group multimedia data.
o Example: Grouping similar images, videos, or audio files based
on content.
5. User Interaction Analysis: Study how users engage with multimedia
content.
o Example: Tracking views, likes, shares, and comments to
determine popularity and engagement levels.

41. What is Distributed Data Mining?


Distributed Data Mining (DDM) is a method of mining data that is stored
across multiple locations or systems, rather than in a single central
database.
Application: DDM is useful in scenarios where data is generated from
multiple sources, such as across different branches of a company or IoT
devices, enabling efficient mining without needing to centralize all data.
➢ Meta Data: Metadata means “data about data”. It provides
extra information that describes, explains, or helps organize
actual data. We don’t need to open the real data to know what it is; metadata tells us the key details.
• Helps to understand the data
• Makes searching and managing data easier
• Useful for data analysis, backup, security, and data warehousing
In Data Warehousing Metadata describes source of the data,
transformation rules, table structures, and data load time. It helps in
ETL (Extract, Transform, Load) processes and in managing the data
warehouse efficiently.
Types of Metadata:

• Descriptive Metadata: Describes the content (e.g., title, author, summary).
• Structural Metadata: Explains how data is organized (e.g., fields, columns, links).
• Administrative Metadata: Technical or management info (e.g., creation date, access rights).
41. Difference Between KNN and K-Means

• KNN is supervised learning used for classification and regression; K-Means is unsupervised learning used for clustering (customer grouping, pattern discovery).
• KNN finds the nearest neighbors of a data point to make a prediction; K-Means divides data into a fixed number of clusters based on similarity.
• KNN has no training phase; K-Means has a training (fitting) phase.
• KNN looks at the 'k' nearest labeled neighbors; K-Means creates 'k' clusters and assigns data to them.
• KNN requires labeled data to predict the category of new data; K-Means works with unlabeled data to group similar items together.
• KNN is sensitive to irrelevant features and requires feature scaling; K-Means is sensitive to the initial choice of centroids and may not work well with non-spherical clusters.

44. Difference Between Data Mining and Web Mining

• Definition: Data mining is discovering useful patterns or insights from large datasets; web mining is extracting useful information from web data (web pages, logs, etc.).
• Focus: Data mining analyzes structured data like databases and data warehouses; web mining analyzes unstructured web data like web pages, user logs, and links.
• Purpose: Data mining finds hidden patterns, trends, or rules in data; web mining aims to understand web content, user behavior, and web structure.
• Applications: Data mining is used for customer segmentation, market analysis, and fraud detection; web mining is used for personalized recommendations, search ranking, ads, and web content analysis.
43. Difference between Supervised and Unsupervised Machine Learning.

• Data Type: Supervised learning uses labeled data (data with answers); unsupervised learning uses unlabeled data (no predefined answers).
• Goal: Supervised learning predicts an output from the input; unsupervised learning finds hidden patterns or groupings.
• Output Type: Known (e.g., class or value) for supervised; unknown (e.g., clusters, patterns) for unsupervised.
• Examples: Classification and Regression (supervised); Clustering and Association (unsupervised).
• Real-Life Example: Email spam detection (Spam/Not Spam) vs. customer segmentation (grouping customers).
• Algorithms Used: Decision Tree, KNN, Naive Bayes, Linear Regression (supervised); K-Means, Hierarchical Clustering, Apriori (unsupervised).

45. Differentiate Traditional Data Mining and Distributed Data Mining.

• Traditional data mining works on centralized data stored in one location; distributed data mining works on data spread across multiple locations or sites.
• Traditional mining has direct access to a single dataset; in distributed mining, data may be located in various databases and access requires dealing with the distribution.
• Traditional mining is limited by the resources of a single server; distributed mining can scale by adding more nodes or systems to the network.
• Centralized environments can expose all the data; in distributed mining, data remains local and only patterns are shared, improving privacy.
• Traditional mining can face bottlenecks due to centralized processing; distributed mining uses parallel processing across multiple nodes to speed up analysis.
46. Differentiate Between Binary Classification and Multiclass
Classification

• Binary classification classifies data into two distinct classes; multiclass classification classifies data into three or more classes.
• Examples: Spam vs. Not Spam or Pass vs. Fail (binary); classifying images as Cat, Dog, or Bird (multiclass).
• Binary classification produces one of two possible outcomes; multiclass produces one of several possible outcomes.
• Binary classification commonly uses Logistic Regression, SVM, and Decision Trees; multiclass uses Softmax Regression, Multi-class SVM, and Random Forest.
• Binary classification is evaluated with Accuracy, Precision, Recall, and F1 Score; multiclass uses Accuracy with macro/micro averaging of Precision and Recall.

47. Compare and contrast Enterprise Data Warehouse, Data Mart, and
Virtual Warehouse.

• Definition: An EDW is a large, centralized data warehouse for an entire organization; a Data Mart is a subset of a data warehouse focused on a specific business area; a Virtual Warehouse is a logical data warehouse built from virtual views, without physical storage.
• Scope: EDW covers organization-wide data; a Data Mart covers a departmental or functional area; a Virtual Warehouse gives real-time access to operational data.
• Size & Complexity: EDW is large and complex; a Data Mart is smaller and simpler; a Virtual Warehouse has no physical storage and is lightweight.
• Implementation: EDW is high-cost and time-consuming; a Data Mart is faster and more cost-effective; a Virtual Warehouse is easy to set up, but performance depends on the source systems.
48. Differentiate Operational Database Systems and Data Warehouses.

• An operational database (OLTP) manages real-time transactional data; a data warehouse (OLAP) stores historical data for analysis.
• OLTP data is current, detailed, and normalized; OLAP data is historical, aggregated, and denormalized.
• OLTP performs transactional processing (INSERT, UPDATE, DELETE); OLAP performs analytical processing (read, query, reporting).
• OLTP provides fast read/write for daily operations; OLAP is optimized for complex queries and reporting.
• OLTP is used by operational staff (e.g., clerks); OLAP is used by business analysts and managers.
• Examples: banking systems and e-commerce transactions (OLTP); sales trend analysis and BI dashboards (OLAP).

49. Comparison of FP-Growth Algorithm to Apriori Algorithm

• Apriori uses a level-wise search and candidate generation; FP-Growth uses a tree structure (FP-tree) to compress the data.
• Apriori is slower due to multiple database scans; FP-Growth is faster with fewer database scans.
• Apriori requires more memory for storing candidate itemsets; FP-Growth is more memory-efficient by using the FP-tree.
• Apriori has higher time complexity, especially with large datasets; FP-Growth has lower time complexity by avoiding candidate generation.
• Apriori is simpler to implement but less efficient overall; FP-Growth is more complex but more scalable and efficient.
50. Differentiate K-Medoids Clustering (PAM) and K-Means.

• K-Means uses the mean of the data points as the centroid; K-Medoids uses actual data points (medoids) as centers.
• K-Means is sensitive to outliers, as outliers affect the mean; K-Medoids is more robust to outliers since it uses medoids.
• K-Means is generally faster, especially with large datasets; K-Medoids is slower and may require more computation due to medoid selection.
• K-Means works well with spherical clusters; K-Medoids is better for non-spherical shapes and different distributions.
• K-Means typically uses Euclidean distance; K-Medoids can use various distance metrics, including Manhattan.

51. Differentiate Partitioning Method and Hierarchical Method


• The partitioning method divides data into k separate clusters; the hierarchical method builds a tree-like structure of clusters.
• Partitioning requires the number of clusters (k) as input; hierarchical does not need the number of clusters in advance.
• Partitioning creates flat, non-overlapping clusters; hierarchical forms clusters step-by-step by merging or splitting.
• Example of partitioning: K-Means; examples of hierarchical: Agglomerative (bottom-up) and Divisive (top-down).
• Partitioning produces a fixed number of clusters; hierarchical produces a dendrogram showing the cluster levels.
• Partitioning is usually faster; hierarchical is slower, especially for large datasets.
• Partitioning is best for large, simple datasets; hierarchical is better for small to medium datasets.
