DATA SCIENCE May - 2019
TID    Items
T1     Bread, Milk
T2     Bread, Diaper, Beer
T3     Milk, Diaper, Beer
T4     Bread, Milk, Diaper
T5     Bread, Milk, Diaper
Let's assume a minimum support threshold of 40% (0.4), i.e., an itemset must appear in at
least 2 of the 5 transactions.
1. Generate candidate 1-itemsets: Bread, Milk, Diaper, Beer
2. Count support: Bread (4/5), Milk (4/5), Diaper (4/5), Beer (2/5)
3. Prune: all four meet the 40% threshold, so Bread, Milk, Diaper, and Beer are frequent 1-itemsets
4. Generate candidate 2-itemsets: Bread-Milk, Bread-Diaper, Bread-Beer, Milk-Diaper, Milk-Beer, Diaper-Beer
5. Count support: Bread-Milk (3/5), Bread-Diaper (3/5), Bread-Beer (1/5), Milk-Diaper (3/5), Milk-Beer (1/5), Diaper-Beer (2/5)
6. Prune: Bread-Milk, Bread-Diaper, Milk-Diaper, and Diaper-Beer are frequent 2-itemsets; Bread-Beer and Milk-Beer are discarded
7. Generate candidate 3-itemsets: Bread-Milk-Diaper (the only candidate all of whose 2-item subsets are frequent)
8. Count support: Bread-Milk-Diaper (2/5)
9. Prune: Bread-Milk-Diaper is a frequent 3-itemset
The frequent itemsets are: {Bread}, {Milk}, {Diaper}, {Beer}, {Bread, Milk}, {Bread,
Diaper}, {Milk, Diaper}, {Diaper, Beer}, {Bread, Milk, Diaper}. The Apriori algorithm
efficiently generates frequent itemsets by pruning the search space using the
anti-monotone property: if an itemset is infrequent, all of its supersets must also be
infrequent.
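This result can be reproduced in R; the sketch below assumes the arules package (not part of base R) is installed and simply re-states the five transactions from the table above.
# Encode the five transactions and mine frequent itemsets with Apriori
library(arules)
trans <- as(list(
  T1 = c("Bread", "Milk"),
  T2 = c("Bread", "Diaper", "Beer"),
  T3 = c("Milk", "Diaper", "Beer"),
  T4 = c("Bread", "Milk", "Diaper"),
  T5 = c("Bread", "Milk", "Diaper")
), "transactions")
freq <- apriori(trans, parameter = list(supp = 0.4, target = "frequent itemsets"))
inspect(freq)   # lists the frequent itemsets and their supports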
The three main regression techniques can be compared as follows:
Simple regression: one dependent variable, one independent variable.
Example: predicting exam scores based on study hours.
Multiple regression: one dependent variable, two or more independent variables.
Example: predicting house prices based on size, location, and age.
Multivariate regression: two or more dependent variables, one or more independent variables.
Example: analyzing how psychological factors influence multiple academic outcomes.
In conclusion, the choice of regression technique depends on the research
question and the nature of the data. Simple regression is suitable for
straightforward relationships, multiple regression is ideal for assessing the
impact of multiple predictors on a single outcome, and multivariate regression
is used when analyzing the influence of predictors on multiple outcomes
simultaneously.
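The three cases can be expressed directly with R's lm() function. The sketch below uses small made-up data frames and column names chosen only for illustration; they are not part of the original answer.
# Simple regression: one outcome, one predictor
students <- data.frame(study_hours = c(2, 4, 5, 7, 8, 10),
                       exam_score = c(48, 55, 60, 70, 78, 88))
m1 <- lm(exam_score ~ study_hours, data = students)
# Multiple regression: one outcome, several predictors
houses <- data.frame(size = c(70, 85, 100, 120, 140, 160),
                     age = c(30, 25, 18, 12, 8, 3),
                     location = factor(c("A", "B", "A", "B", "A", "B")),
                     price = c(210, 245, 290, 340, 400, 460))
m2 <- lm(price ~ size + location + age, data = houses)
# Multivariate regression: two outcomes modelled jointly
psych <- data.frame(anxiety = c(3, 5, 2, 4, 6, 1),
                    motivation = c(7, 4, 8, 6, 3, 9),
                    gpa = c(3.5, 2.8, 3.9, 3.1, 2.5, 4.0),
                    test_score = c(85, 70, 92, 78, 65, 95))
m3 <- lm(cbind(gpa, test_score) ~ anxiety + motivation, data = psych)
summary(m1)   # coefficients, R-squared, etc. for the simple model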
Q.4. Attempt the following questions
A) Suppose that you are to allocate a number of automatic teller machines
(ATMs) in a given region so as to satisfy a number of constraints.
Households or workplaces may be clustered so that typically one ATM is
assigned per cluster. The clustering, however, may be constrained by two
factors:
i. Obstacles (i.e., bridges, rivers, highways) that can affect
ATM accessibility
ii. Additional user-specified constraints, such as that each
ATM should serve at least 10,000 households.
How can a clustering algorithm such as k-means be modified for quality
clustering under both constraints?
To address the problem of allocating ATMs to clusters of households or
workplaces while considering both obstacles and user-specified constraints, you
can adapt the k-means clustering algorithm. Here’s a step-by-step approach to
modify k-means to accommodate these constraints:
1. Define the Problem and Constraints
i. Obstacles: Geographic barriers like bridges, rivers, and highways that affect
accessibility. This means that the algorithm needs to account for the
connectivity between clusters.
ii. Minimum Service Requirement: Each cluster (or ATM location) must
serve at least a specified number of households (e.g., 10,000).
2. Preprocessing for Obstacles
Graph Representation: Convert the geographical area into a graph
where nodes represent potential cluster centers (e.g., areas with high
density of households), and edges represent possible connections between
nodes that are not obstructed by obstacles.
Graph Weighting: Assign weights to edges based on the ease of
accessibility. For example, direct routes might have lower weights
compared to routes with obstacles.
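This preprocessing can be sketched in R with the igraph package (an assumption, not named in the original); the edge list below is hypothetical, with an edge only where a route is not blocked by an obstacle and weights reflecting travel difficulty.
# Build an obstacle-aware accessibility graph over candidate centres
library(igraph)
edges <- data.frame(from   = c("A", "A", "B", "C"),
                    to     = c("B", "C", "D", "D"),
                    weight = c(1, 4, 2, 1))
g <- graph_from_data_frame(edges, directed = FALSE)
# Shortest-path distances through the unobstructed network, used later in
# place of straight-line distance
distances(g, weights = E(g)$weight)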
3. Modified k-Means Algorithm
Initialization
1. Cluster Initialization: Initialize the k-means algorithm by choosing k
potential cluster centers. Use a method that considers both the density of
households and connectivity in the graph. For example, you could use a
weighted density measure where areas with higher household densities
and fewer obstacles are favored.
Assignment Step
2. Distance Calculation: Instead of using just Euclidean distance, compute
a modified distance metric that includes the cost of accessing each
location. This could be a combination of:
o Euclidean distance (or actual travel distance) between households
and ATM locations.
o Accessibility cost based on obstacles (e.g., additional travel time or
difficulty).
3. Cluster Assignment: Assign each household or workplace to the nearest
cluster center based on this modified distance metric.
Update Step
4. Recalculation of Centroids: Update the cluster centers by recalculating
the centroid based on the assigned points. Ensure that the new centroids
still satisfy the minimum service requirement.
5. Constraint Enforcement: After recalculating the centroids, verify that
each cluster serves at least the minimum number of households. If a
cluster does not meet this requirement, adjust the clustering:
o Reassign Households: Move some households to neighboring
clusters if it helps meet the service requirement.
o Reposition Centroids: Adjust centroids to better balance
household distribution.
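As a rough illustration of the assignment and update steps above (the obstacle cost here is a made-up penalty matrix, and the minimum-size threshold is a toy stand-in for the 10,000-household requirement):
# Toy data: 200 household locations and 3 candidate ATM centres
set.seed(1)
households <- cbind(x = runif(200), y = runif(200))
centres <- households[sample(nrow(households), 3), ]
penalty <- matrix(runif(200 * 3, 0, 0.2), nrow = 200)  # hypothetical accessibility cost
min_size <- 40
# Assignment: modified distance = Euclidean distance + accessibility penalty
d <- sapply(1:nrow(centres), function(j)
  sqrt(rowSums((households - matrix(centres[j, ], 200, 2, byrow = TRUE))^2)) + penalty[, j])
cluster <- max.col(-d)          # nearest centre under the modified metric
# Constraint enforcement: one pass that moves members of undersized clusters to
# their next-best centre (a full implementation would iterate until all
# clusters meet the minimum)
sizes <- tabulate(cluster, nbins = nrow(centres))
for (j in which(sizes < min_size)) {
  members <- which(cluster == j)
  d[members, j] <- Inf
  cluster[members] <- max.col(-d[members, , drop = FALSE])
}
# Update: recompute each centroid from its assigned households
centres_new <- apply(households, 2, function(col) tapply(col, cluster, mean))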
4. Post-Processing
6. Connectivity Check: Verify that the clusters are connected considering
the obstacles. If any cluster is isolated due to obstacles, it may need to be
merged with nearby clusters or adjusted to ensure that all households can
reach the ATM.
7. Optimization: Use an iterative approach to refine the clusters. This might
involve adjusting the number of clusters or re-running the algorithm with
updated constraints to improve the solution.
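The connectivity check can reuse the graph built in step 2; the sketch below assumes each demand point is a vertex of that graph and uses a small ring graph purely as a stand-in.
# Verify that every cluster forms a connected subgraph of the accessibility graph
library(igraph)
g <- make_ring(10)                 # stand-in for the real road network
cluster <- rep(1:2, each = 5)      # example assignment of the 10 vertices
sapply(sort(unique(cluster)), function(k)
  is_connected(induced_subgraph(g, which(cluster == k))))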
5. Heuristic Adjustments
8. Heuristic Methods: Implement heuristic or metaheuristic methods such
as simulated annealing or genetic algorithms to fine-tune the clustering,
especially when exact solutions are computationally infeasible.
Summary
Graph-Based Preprocessing: Represent the geographical area as a graph
and account for obstacles.
Modified Distance Metric: Incorporate accessibility costs in the distance
calculation.
Constraint Handling: Ensure each cluster meets the minimum service
requirement and is connected.
Iterative Refinement: Continuously adjust clusters and constraints to
improve the solution.
By integrating these steps, the k-means algorithm can be effectively modified to
handle the specific constraints of ATM allocation, ensuring both accessibility
and service requirements are met.
B) Define correlation. Given two variables X and Y, define and explain the
formula for the correlation coefficient ‘r’.
If X = {2, 4, 6, 8, 10} and X = Y, then r = ?
If Y = {1, 3, 5, 7, 9}, r = 1; and if Y = {9, 7, 5, 3, 1}, then r = ?
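The specific cases asked above can be checked in R with the built-in cor() function, which computes Pearson's r = Σ(xi − x̄)(yi − ȳ) / √(Σ(xi − x̄)² · Σ(yi − ȳ)²):
x <- c(2, 4, 6, 8, 10)
cor(x, x)                    # X = Y: r = 1 (perfect positive correlation)
cor(x, c(1, 3, 5, 7, 9))     # Y increases linearly with X: r = 1
cor(x, c(9, 7, 5, 3, 1))     # Y decreases linearly with X: r = -1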
Q.5. Attempt the following questions
A) Demonstrate the use of the following plotting systems in R, with their
constraints, if any
i. base graphics
ii. lattice
iii. ggplot2
Answer
Demonstration of Plotting Systems in R
In R, various plotting systems are available for visualizing data. The three
primary systems are base graphics, lattice, and ggplot2. Each system has its own
strengths and constraints.
i. Base Graphics
Description: Base graphics is the default plotting system in R. It provides a
simple and straightforward way to create a variety of plots using functions
like plot(), hist(), and boxplot(). Example:
# Sample data
x <- rnorm(100)
y <- rnorm(100)
# Scatter plot using base graphics
plot(x, y, main = "Scatter Plot (Base Graphics)", xlab = "x", ylab = "y")
Constraints:
Plots are drawn directly to the device, so they are hard to modify or restructure once created.
Complex multi-panel or grouped displays require manual layout work.
ii. Lattice
Description: Lattice is a plotting system built on grid graphics. It uses a formula
interface and is particularly good at multi-panel (trellis) displays of multivariate data,
using functions like xyplot() and barchart(). Example:
# Load the lattice package
library(lattice)
# Sample data
data(iris)
# Scatter plot of Sepal.Length against Sepal.Width, grouped by Species
xyplot(Sepal.Length ~ Sepal.Width, data = iris, groups = Species,
       auto.key = TRUE)
Constraints:
The syntax can be less intuitive for users familiar with base R graphics.
Customizing plots can be less flexible compared to ggplot2.
Lattice plots are not as easily combined with other plot types as ggplot2.
iii. ggplot2
Description: ggplot2 is a powerful and flexible plotting system based on the
grammar of graphics. It allows for layered visualizations and extensive
customization options, making it a popular choice for data visualization in
R. Example:
# Load the ggplot2 package
library(ggplot2)
# Sample data
data(mpg)
# Scatter plot of engine displacement vs highway mileage
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  theme_minimal()
Constraints:
The learning curve can be steep for beginners due to its layered approach and
syntax.
For very large datasets, ggplot2 may be slower compared to base graphics.
Some complex visualizations may require additional packages or custom
functions.
Summary
Base Graphics: Simple and straightforward but limited in customization and
complexity.
Lattice: Great for multi-panel plots and visualizing multivariate data but less
intuitive and flexible than ggplot2.
ggplot2: Highly customizable and powerful, suitable for complex
visualizations, but has a steeper learning curve.
Each of these plotting systems has its unique advantages and constraints, and
the choice of which to use often depends on the specific requirements of the
analysis and the user's familiarity with R.
Graphics",
# Sample data
data <- data.frame(categories = c("A", "B", "C", "D"), values = c(10, 15, 7,
20))
# Sample data
data <- data.frame(categories = c("A", "B", "C", "D"), values = c(10, 15, 7,
20))
geom_bar(stat = "identity") +
theme_minimal()
Constraints:
The learning curve can be steep for beginners due to its layered approach.
For very large datasets, ggplot2 may be slower compared to base graphics.
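The lattice system can produce the same chart with barchart(); a minimal sketch using the data frame defined above:
# Bar chart of values by category using lattice
library(lattice)
barchart(values ~ factor(categories), data = data)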
Summary
Base Graphics: Simple and straightforward for creating basic bar charts but
limited in customization.
Lattice: Good for creating multi-panel plots, but less intuitive for beginners and
less flexible.
ggplot2: Highly customizable and powerful for complex visualizations, but
requires a steeper learning curve.
Each of these plotting systems provides a way to create bar charts in R, and the
choice of which to use often depends on the specific requirements of the
analysis and the user's familiarity with R.