6th_SEM Data Science Notes
Types of Architectures:
1. Single-Tier Architecture
o Rarely used.
o Combines all functions in one layer; not scalable.
2. Two-Tier Architecture
o Separates data sources from the data warehouse.
o Limited scalability and performance.
3. Three-Tier Architecture (Most Common)
o Bottom Tier: Data sources + ETL tools
o Middle Tier: Data warehouse + OLAP engine
o Top Tier: Front-end tools for data access and analysis
Authentication Checks who is trying to access the system (e.g., login ID & password).
Authorization Gives permission to users based on their role (e.g., analyst can view, admin can edit).
Audit Logs Keeps records of who accessed or changed the data, and when.
5. OLAP & its Types.
OLAP (Online Analytical Processing) is a technology that allows users to
analyze large amounts of data quickly from different angles. It helps in
complex data analysis, like summarizing and exploring data for business
insights.
Types of OLAP:
1. MOLAP (Multidimensional OLAP): Uses specialized storage to
organize data in a cube format, making analysis fast. Good for
complex calculations.
2. ROLAP (Relational OLAP): Uses standard relational databases,
flexible and handles large data volumes, but might be slower.
3. HOLAP (Hybrid OLAP): Combines both MOLAP and ROLAP, offering
fast analysis and handling big data efficiently.
Date Sales
Jan 1 ₹1000
Jan 2 ₹1200
Jan 3 ₹1100
Monthly Aggregate:
January Sales = ₹3300 (sum of Jan 1, Jan 2, Jan 3…)
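A minimal pandas sketch (assuming pandas is installed; the dates are illustrative) of the same daily-to-monthly roll-up:

import pandas as pd

# Daily sales, as in the table above
daily = pd.DataFrame({
    "Date": pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-03"]),
    "Sales": [1000, 1200, 1100],
})

# Roll up daily rows into a monthly aggregate (January total = 3300)
monthly = daily.groupby(daily["Date"].dt.to_period("M"))["Sales"].sum()
print(monthly)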
1. Top-Down Approach in DW
• Proposed by Bill Inmon (Father of Data Warehousing)
• Build the main data warehouse first (central storage), then create
data marts (smaller parts) for specific departments like sales, HR,
finance, etc.
Features:
• Focus on the entire organization
• Data is cleaned and integrated in the warehouse
• Good for long-term, large systems
Example:
Build the full warehouse, then create small sections for sales reports or
customer info.
2. Bottom-Up Approach in DW
• Proposed by Ralph Kimball
• Start by building data marts first for each department, then
combine them to form the complete data warehouse.
Features:
• Faster to implement
• Good for quick business needs
• Easier to manage smaller sections
Example:
First create a sales data mart, then later combine with HR and finance
marts to form the full warehouse.
Algorithm Description
ID3 (Iterative Dichotomiser 3) Uses information gain to split data. Simple and fast.
CART (Classification and Regression Tree) Uses Gini Index. Can be used for both classification and regression tasks.
Gradient Boosted Trees (GBM, XGBoost, etc.) Builds trees one after another, each improving the previous. Very powerful and accurate.
Advantages:
• Very easy to understand and explain
• Works with both numbers and categories
• Fast to build and use
• Can handle missing values
Disadvantages:
• Can overfit (too complex for small data)
• Unstable: small data changes can change the tree
• Can be biased toward features with more categories
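The split criteria in the table above map onto scikit-learn's DecisionTreeClassifier; the sketch below (assuming scikit-learn is installed, and using its bundled iris dataset purely for illustration) shows ID3-style entropy splits next to CART-style Gini splits, with max_depth used to limit the overfitting noted above:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)   # a small labeled dataset

# criterion="entropy" splits by information gain (ID3-style),
# criterion="gini" splits by the Gini Index (CART-style)
for criterion in ("entropy", "gini"):
    tree = DecisionTreeClassifier(criterion=criterion, max_depth=3, random_state=0)
    tree.fit(X, y)
    print(criterion, "training accuracy:", tree.score(X, y))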
UNIT 3: ASSOCIATION AND CORRELATION ANALYSIS
18. Basic Concepts of Association Rule Learning.
1. Items: The individual products or features in the dataset.
Example: Milk, bread, and eggs.
2. Itemsets: Groups of items that appear together in transactions.
Example: {Milk, Bread} is an itemset.
3. Support: Support shows how often an itemset appears in all
transactions. It tells us the popularity of item combinations.
Example: If out of 100 transactions, 20 include both Milk and
Bread, the support for {Milk, Bread} is 20%.
This helps identify common item combinations.
4. Confidence: Confidence measures the probability that a customer
who buys one item will buy another item as well.
Example: If 80% of customers who buy Bread also buy Milk, the
confidence of {Bread} → {Milk} is 80%.
It shows the strength of the rule "If A, then B".
5. Lift: Lift compares how often two items occur together versus if
they were independent. A lift > 1 means they occur together more
often than by chance.
Example: If the lift for {Bread} and {Milk} is 1.5, it means buying
Bread increases the chance of buying Milk by 50% compared to
random chance.
Lift helps understand whether items truly influence each other (a small sketch computing these measures follows below).
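A minimal sketch (plain Python; the tiny transaction list is made up for illustration) computing support, confidence, and lift for the rule {Bread} → {Milk}:

transactions = [
    {"Milk", "Bread"}, {"Bread"}, {"Milk", "Bread", "Eggs"},
    {"Eggs"}, {"Milk", "Bread"},
]
n = len(transactions)

support_bread      = sum("Bread" in t for t in transactions) / n
support_milk       = sum("Milk" in t for t in transactions) / n
support_bread_milk = sum({"Bread", "Milk"} <= t for t in transactions) / n

confidence = support_bread_milk / support_bread   # P(Milk | Bread)
lift       = confidence / support_milk            # > 1 means more often than by chance

print(f"support={support_bread_milk:.2f}, confidence={confidence:.2f}, lift={lift:.2f}")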
20. Explain the Apriori Algorithm of association rule mining and its steps,
and provide an example. State the Apriori properties.
The Apriori Algorithm is a simple method in data mining to find
common groups of items in large datasets. It helps identify patterns, like
which products are often bought together, for example, bread and butter.
Here are its steps:
Apriori Algorithm: Step-by-Step
Step 1: Find the frequent 1-itemsets
➢ Count the support (occurrence) of each individual item in all
transactions.
➢ Keep only those items whose support ≥ minimum support
threshold.
Step 2: Generate candidate k-itemsets (k > 1)
➢ Use the frequent (k-1)-itemsets to generate new candidate k-
itemsets by joining pairs that share (k-2) items.
Step 3: Prune candidate k-itemsets
➢ Remove candidate k-itemsets if any of their (k-1)-subsets are not
frequent (based on the Apriori property).
Step 4: Count support of candidate k-itemsets
➢ Scan the dataset and count how many transactions contain each
candidate.
➢ Keep only those with support ≥ minimum support.
Step 5: Repeat
➢ Repeat steps 2-4 for larger k until no new frequent itemsets are
found.
Example:
• Transactions: {Bread, Butter}, {Bread, Milk}, {Butter, Milk}, {Bread, Milk}.
• Minimum support = 50%.
1. 1-itemsets:
{Bread} → 3/4 = 75% (keep), {Butter} → 2/4 = 50% (keep),
{Milk} → 3/4 = 75% (keep).
2. 2-itemsets:
Check pairs: {Bread, Butter} → 1/4 = 25% (discard),
{Bread, Milk} → 2/4 = 50% (keep), {Butter, Milk} → 1/4 = 25% (discard).
3. Results:
o Frequent itemsets: {Bread}, {Butter}, {Milk}, and {Bread, Milk} (a small sketch of this counting process follows below).
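A minimal plain-Python sketch of this counting-and-pruning loop (the transaction list matches the worked example; this is an illustrative implementation, not an optimized one):

transactions = [{"Bread", "Butter"}, {"Bread", "Milk"},
                {"Butter", "Milk"}, {"Bread", "Milk"}]
min_support = 0.5
n = len(transactions)

def support(itemset):
    return sum(itemset <= t for t in transactions) / n

# Step 1: frequent 1-itemsets
items = {i for t in transactions for i in t}
frequent = {frozenset([i]) for i in items if support({i}) >= min_support}
all_frequent = set(frequent)

# Steps 2-5: join, prune, count support, repeat until nothing new is frequent
k = 2
while frequent:
    candidates = {a | b for a in frequent for b in frequent if len(a | b) == k}
    # Apriori property: drop any candidate with an infrequent (k-1)-subset
    candidates = {c for c in candidates if all(c - {item} in frequent for item in c)}
    frequent = {c for c in candidates if support(c) >= min_support}
    all_frequent |= frequent
    k += 1

print([set(f) for f in sorted(all_frequent, key=len)])   # includes {Bread, Milk}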
Apriori Properties:
1. Anti-monotonicity:
o If an itemset is frequent, all its subsets are also frequent.
o If an itemset is infrequent, any superset of it will also be
infrequent.
2. Downward Closure Property:
o Used to reduce the search space in frequent itemset mining.
o Helps in pruning the itemsets that are unlikely to be frequent.
In simple terms:
• Big itemsets can only be frequent if all their smaller parts are
frequent.
• This property makes algorithms like Apriori efficient for finding
frequent itemsets.
Applications (short):
1. Market basket analysis (e.g., "If buys bread, likely to buy butter").
2. Recommender systems (e.g., Netflix, Amazon).
3. Customer purchase pattern analysis.
4. Website clickstream analysis.
5. Intrusion/fraud detection.
6. Medical diagnosis (e.g., symptom-disease associations).
3. Density-Based Clustering Forms clusters based on areas of high density in data. Can find clusters of any shape (e.g., DBSCAN).
4. Grid-Based Clustering Divides the data space into a grid and then forms clusters based on grid cells (e.g., STING).
5. Model-Based Clustering Assumes a model for each cluster and fits the data accordingly (e.g., Expectation-Maximization using Gaussian models).
30. What Techniques Are Used for Outlier Detection and Analysis in
Datasets?
Outlier detection identifies data points that significantly differ from the
rest of the data. Common techniques include (a small z-score sketch follows this list):
1. Statistical Methods:
o Use statistical tests (e.g., z-scores) to find points that fall
beyond a certain threshold (e.g., 3 standard deviations from
the mean).
2. Distance-Based Methods:
o Measure the distance of each point from others. Points far
away (e.g., using k-nearest neighbors) can be considered
outliers.
3. Clustering-Based Methods:
o Cluster data and classify points that do not belong to any
cluster or are far from cluster centroids as outliers.
4. Isolation Forest:
o A machine learning method that creates random partitions to
isolate outliers; it’s effective for high-dimensional data.
5. LOF (Local Outlier Factor):
o Measures the local density of data points, flagging those that
have significantly lower density compared to their neighbors.
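A minimal sketch of the z-score approach from point 1 (assuming NumPy is installed; the data is synthetic, with one injected outlier):

import numpy as np

rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(loc=50, scale=5, size=200), [120.0]])  # 120 is the outlier

z_scores = (data - data.mean()) / data.std()
outliers = data[np.abs(z_scores) > 3]   # flag points beyond 3 standard deviations

print("outliers found:", outliers)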
UNIT 5: CLASSIFICATION
31. Define Supervised Learning and the Classification Technique
Supervised Learning is a type of machine learning where a model is
trained on a labeled dataset, meaning each training example includes
both the input data and the correct output (label). The goal is to learn a
mapping from inputs to outputs so the model can make predictions on
new, unseen data.
Classification Technique: Classification is a specific type of supervised
learning used to categorize input data into predefined classes or labels.
The model learns from labeled data and then predicts the class for new
data points.
• Example: Classifying emails as "spam" or "not spam" based on
features like keywords and sender information.
32. Discuss the Issues Related to Classification in Data Mining
Classification in data mining faces several challenges:
1. Imbalanced Datasets:
o When one class has significantly more samples than others,
leading to biased predictions.
o Example: In fraud detection, there are many legitimate
transactions and few fraud cases.
2. Overfitting:
o When the model learns the training data too well, capturing
noise instead of the underlying pattern, resulting in poor
performance on new data.
3. Underfitting:
o Occurs when the model is too simple to capture the
underlying trends in the data, leading to low accuracy.
4. Feature Selection:
o Choosing the right features is crucial. Irrelevant or redundant
features can degrade model performance.
5. Noise in Data:
o Inaccurate or noisy data can mislead the model during
training, resulting in incorrect classifications.
34. Explain Classification, Bayesian Classification & its features and the
Naïve Bayes Algorithm.
Classification is a data mining technique used to predict categories or
classes for data based on past information. Example:
• Email: Spam or Not Spam
• Customer: Will Buy or Not Buy
• Patient: Sick or Healthy
Bayesian Classification is a method based on Bayes' Theorem, which uses
probability to classify data. It calculates the likelihood of each class given the
input features and assigns the class with the highest probability.
Feature Explanation
Simple & Fast Easy to build and works well with large data sets
Works with Text Data Commonly used for spam detection, sentiment analysis
Handles Missing Values Can still predict even if some data is missing
Assumes Independence Assumes features don’t affect each other (this is the “naive” part)
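A minimal sketch (assuming scikit-learn is installed; the four training emails are made up) of Naïve Bayes applied to spam detection, matching the features listed above:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

emails = ["win a free prize now", "meeting agenda for monday",
          "free lottery win click now", "project report attached"]
labels = ["spam", "not spam", "spam", "not spam"]

vectorizer = CountVectorizer()        # turn each email into word-count features
X = vectorizer.fit_transform(emails)

model = MultinomialNB()               # Bayes' theorem + the "naive" independence assumption
model.fit(X, labels)

test = vectorizer.transform(["free prize click"])
print(model.predict(test))            # expected to lean toward "spam"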
38. Discuss How the Web Page Layout Structure Can Be Mined
Mining the web page layout structure involves analyzing how information
is organized on a webpage to extract meaningful patterns and insights.
Here’s how it can be done:
1. HTML Structure Analysis: Examine the HTML tags and their
arrangement to understand the page layout.
o Example: Identifying headers, footers, sidebars, and main
content areas.
2. Element Extraction: Use parsers to extract specific elements like
text, images, and links based on their tags.
o Example: Extracting all image tags (<img>) to analyze visual
content.
3. Layout Patterns: Identify common layout patterns across different
web pages.
o Example: Determining if most product pages follow a specific
design template.
4. User Interaction Study: Analyze how users interact with different
layout components.
o Example: Observing which sections of the page users click on
most frequently.
5. Dynamic Content Identification: Detect areas that change
frequently (like news sections) versus static content.
o Example: Marking areas that update daily versus those that
remain unchanged.
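A minimal sketch (assuming the BeautifulSoup library, bs4, is installed; the HTML snippet is made up) of the HTML structure analysis and element extraction described in points 1 and 2:

from bs4 import BeautifulSoup

html = """
<html><body>
  <header><h1>My Shop</h1><nav>Home | About | Contact</nav></header>
  <div id="main"><p>Product description ...</p><img src="shoe.jpg"></div>
  <footer>Copyright 2024</footer>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

print("header text:", soup.header.get_text(" ", strip=True))      # layout structure analysis
print("images:", [img["src"] for img in soup.find_all("img")])    # element extraction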
Section Description
1. Header Top part of the page. Usually contains the logo, site name, and navigation menu (like Home, About, Contact).
3. Main Content Area The central part where the main content appears: text, images, articles, videos, etc.
KNN (K-Nearest Neighbors) vs. K-Means:
• Learning type: KNN is supervised learning used for classification and regression; K-Means is unsupervised learning used for clustering (customer grouping, pattern discovery).
• Working: KNN finds the nearest neighbors of a data point to make a prediction; K-Means divides data into a fixed number of clusters based on similarity.
• Role of 'k': KNN looks at the 'k' nearest labeled neighbors; K-Means creates 'k' clusters and assigns data to them.
• Data: KNN requires labeled data to predict the category of new data; K-Means works with unlabeled data to group similar items together.
• Sensitivity: KNN is sensitive to irrelevant features and requires feature scaling; K-Means is sensitive to the initial choice of centroids and may not work well with non-spherical clusters.
• Output type: KNN's output is known in advance (e.g., a class or value); K-Means' output is not known in advance (e.g., clusters, patterns).
Centralized vs. Distributed Data Mining:
• Data location: centralized mining stores data in one location; distributed mining spreads data across multiple locations or sites.
• Scalability: centralized mining is limited by the resources of a single server; distributed mining can scale by adding more nodes or systems to the network.
• Privacy: centralized environments can expose all data; in distributed mining, data remains local and only patterns are shared, improving privacy.
• Performance: centralized processing can create bottlenecks; distributed mining uses parallel processing across multiple nodes, which speeds up analysis.
46. Differentiate Between Binary Classification and Multiclass
Classification
• Classes: binary classification assigns data to two distinct classes; multiclass classification assigns data to three or more classes.
• Examples: Spam vs. Not Spam, Pass vs. Fail (binary); classifying images as Cat, Dog, or Bird (multiclass).
• Output: binary classification produces one of two possible outcomes; multiclass produces one of several possible outcomes.
• Algorithms: binary classification commonly uses Logistic Regression, SVM, Decision Trees; multiclass uses Softmax Regression, Multi-class SVM, Random Forest.
47. Compare and contrast Enterprise Data Warehouse, Data Mart, and
Virtual Warehouse.
Definition:
• Enterprise Data Warehouse (EDW): A large, centralized data warehouse for an entire organization.
• Data Mart: A subset of a data warehouse focused on a specific business area.
• Virtual Warehouse: A logical data warehouse using virtual views, without physical storage.
Operational (OLTP) Database vs. Data Warehouse:
• Data type: an operational database holds current, detailed, and normalized data; a data warehouse holds historical, aggregated, and denormalized data.
• Workload: an operational database supports fast read/write for daily operations; a data warehouse is optimized for complex queries and reporting.
Apriori vs. FP-Growth:
• Database scans: Apriori is slower due to multiple database scans; FP-Growth is faster with fewer database scans.
• Memory: Apriori requires more memory for storing candidate itemsets; FP-Growth is more memory-efficient by using the FP-tree.
• Complexity: Apriori is simpler to implement but less efficient overall; FP-Growth is more complex but more scalable and efficient.
50. Differentiate Between K-Medoids Clustering (PAM) and K-Means
• Cluster center: K-Means uses the mean of data points as the centroid; K-Medoids uses actual data points (medoids) as centers.
• Outliers: K-Means is sensitive to outliers, as outliers affect the mean; K-Medoids is more robust to outliers since it uses medoids.
• Speed: K-Means is generally faster, especially with large datasets; K-Medoids is slower and may require more computation due to medoid selection.
• Best suited for: K-Means works best for large, simple datasets; K-Medoids is better for small to medium datasets.
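A minimal sketch (assuming scikit-learn is installed; the six 2-D points are made up) of the K-Means side of this comparison (K-Medoids/PAM is not part of core scikit-learn, so only K-Means is shown):

import numpy as np
from sklearn.cluster import KMeans

points = np.array([[1, 2], [1, 4], [1, 0],       # one group of points
                   [10, 2], [10, 4], [10, 0]])   # a second group

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)

print("cluster labels:", kmeans.labels_)             # cluster assigned to each point
print("centroids (means):", kmeans.cluster_centers_)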