PA 5 UNIT
Example
Imagine you’re deciding whether to pack an umbrella. You ask 100
friends (decision trees) for their opinion. Some might say "yes," and
others "no," based on factors like the weather forecast, humidity, and
wind. Random Forest takes a majority vote to decide whether you
should pack the umbrella.
Steps in Random Forest Classification
1. Build Trees: random subsets of the data and of the features are used to grow each tree.
2. Make Predictions: for a new data point, each tree in the forest predicts a class.
3. Combine Predictions: the class with the most votes becomes the final prediction (see the sketch below).
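A minimal sketch of these steps with scikit-learn (an assumed library choice; the synthetic dataset and parameters are purely illustrative):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Illustrative synthetic dataset (a stand-in for real features
# such as forecast, humidity, and wind)
X, y = make_classification(n_samples=500, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Step 1: grow 100 trees, each on a bootstrap sample of the data,
# considering a random subset of features at every split
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Steps 2-3: every tree votes on a class; the majority vote is returned
print(model.predict(X_test[:5]))
print("accuracy:", model.score(X_test, y_test))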
4. Briefly explain Principal Component Analysis (PCA) and its importance.
Purpose of PCA
The main goal of PCA is to reduce the number of variables
(dimensions) in a dataset while retaining the essential patterns and
trends.
This is particularly useful when dealing with high-dimensional data,
making it easier to visualize and analyze.
Importance of PCA
Reduces Complexity: By decreasing the number of dimensions,
PCA makes it easier to analyze and visualize data without losing
important information.
Improves Performance: Reducing dimensionality can enhance
the performance of machine learning algorithms by decreasing
computation time and avoiding overfitting.
Enhances Visualization: PCA allows high-dimensional data to be
visualized in 2D or 3D, making patterns and relationships easier
to identify.
Handles Multicollinearity: It effectively addresses issues related
to multicollinearity (when features are highly correlated),
providing independent components for analysis.
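A short sketch of dimensionality reduction with PCA, assuming scikit-learn (the 4-feature Iris dataset is used here only as a convenient illustration):

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)             # 4 correlated features
X_scaled = StandardScaler().fit_transform(X)  # PCA is scale-sensitive

# Project onto the 2 directions of greatest variance
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)

print(X_2d.shape)                     # (150, 2) -- ready for a 2D plot
print(pca.explained_variance_ratio_)  # variance retained per component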
5. Define the terms "Support," "Confidence," and "Lift" in Association Rule Mining.
Support: the fraction of all transactions that contain an itemset. For a rule "If A, then B", support = (transactions containing both A and B) / (total transactions).
Confidence: how often the rule holds when its antecedent appears: confidence = (transactions containing both A and B) / (transactions containing A).
Lift: the strength of the rule relative to chance: lift = confidence / support(B). A lift greater than 1 means A and B are bought together more often than independence would predict; a lift of 1 means no association.
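A minimal plain-Python sketch of these three measures (the transaction list below is invented for illustration):

# Hypothetical transactions, each a set of purchased items
transactions = [
    {"bread", "butter", "jam"},
    {"bread", "butter"},
    {"bread"},
    {"butter", "jam"},
    {"bread", "butter"},
    {"milk", "jam"},
]
n = len(transactions)

def support(itemset):
    # Fraction of all transactions containing every item in itemset
    return sum(itemset <= t for t in transactions) / n

def confidence(antecedent, consequent):
    # Of the transactions containing the antecedent, how many also
    # contain the consequent
    return support(antecedent | consequent) / support(antecedent)

def lift(antecedent, consequent):
    # Confidence relative to how common the consequent is overall
    return confidence(antecedent, consequent) / support(consequent)

print(support({"bread", "butter"}))       # 0.5
print(confidence({"bread"}, {"butter"}))  # 0.75
print(lift({"bread"}, {"butter"}))        # 1.125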
Advantages of Random Forests
High Accuracy: Random Forests typically provide very accurate
predictions because they combine the results of multiple
decision trees. This ensemble approach reduces the likelihood
of errors that a single tree might make, leading to better overall
performance, especially on complex datasets.
Robustness to Overfitting: Unlike individual decision trees,
which can easily overfit the training data (meaning they
perform well on training data but poorly on new data), Random
Forests are less prone to this issue. The averaging of predictions
from many trees helps generalize better to unseen data, making
them reliable for various applications.
Feature Importance: Random Forests automatically assess the
importance of different features in making predictions. This
means they can help identify which variables are most
influential in the classification process, aiding in feature
selection and improving model interpretability.
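Feature importances can be read directly off a fitted forest; a quick sketch, assuming scikit-learn and using its bundled breast-cancer dataset purely for illustration:

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

# Real-world-shaped data with named features, for illustration
data = load_breast_cancer()
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(data.data, data.target)

# Rank features by their contribution to the forest's split decisions
ranked = sorted(zip(model.feature_importances_, data.feature_names),
                reverse=True)
for importance, name in ranked[:5]:
    print(f"{name}: {importance:.3f}")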
Disadvantages of Random Forests
Computationally Intensive: Training a Random Forest model
can be resource-intensive, especially with large datasets and
many trees. This can lead to longer training times and require
more computational power compared to simpler models like
single decision trees.
Limited Interpretability: While Random Forests provide insights
into feature importance, they are generally harder to interpret
than single decision trees. Understanding why a specific
prediction was made can be challenging, which may be an issue
in fields where explainability is crucial, such as healthcare or
finance.
Slower Prediction Times: Making predictions with Random
Forests can be slower than with simpler models because each
input must pass through multiple trees before arriving at a final
decision. This can be a drawback in real-time applications
where quick responses are necessary.
9. Explain how cluster analysis can be applied in customer segmentation.
Example Application
A common application of the Apriori algorithm is in retail for market
basket analysis. For example, consider a grocery store analyzing
customer transactions:
Suppose the store's transaction data shows that customers frequently buy bread, butter, and jam together.
Using the Apriori algorithm, the store identifies that the frequent itemset {bread, butter} appears in 40 out of 100 transactions.
The support for this itemset is therefore 40/100 = 0.4, or 40%.
If bread appears in 60 transactions and customers buy butter in 40 of those cases, the confidence for the rule "If bread, then butter" is 40/60 ≈ 0.67, or about 67%.
The store can also calculate lift to measure how much stronger this association is than what random, independent purchases would produce.
By applying these insights, retailers can create targeted marketing
strategies, optimize product placements, and improve inventory
management based on customer buying patterns.
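A compact sketch of this workflow using the third-party mlxtend library (an assumed tool choice; the one-hot basket table below is invented to mirror the bread-and-butter example):

import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

# Hypothetical one-hot encoded baskets: True means the item was bought
baskets = pd.DataFrame({
    "bread":  [True, True, True, False, True, False],
    "butter": [True, True, False, True, True, False],
    "jam":    [False, True, False, True, False, False],
})

# Find itemsets appearing in at least 40% of transactions...
frequent = apriori(baskets, min_support=0.4, use_colnames=True)

# ...then derive rules such as "If bread, then butter"
rules = association_rules(frequent, metric="confidence", min_threshold=0.6)
print(rules[["antecedents", "consequents", "support", "confidence", "lift"]])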
5. Compare and contrast Unsupervised Learning methods (such as clustering) with Supervised Learning. Discuss how methods like Random Forests can also be used for regression tasks.
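As a brief illustration of the last point, a scikit-learn sketch (synthetic data and parameters are assumptions): for regression, a Random Forest averages the numeric predictions of its trees rather than taking a majority vote.

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Illustrative synthetic regression data
X, y = make_regression(n_samples=400, n_features=6, noise=10.0,
                       random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Same ensemble idea as classification, but each tree predicts a number
# and the forest returns the average of those predictions
reg = RandomForestRegressor(n_estimators=100, random_state=0)
reg.fit(X_train, y_train)

print(reg.predict(X_test[:3]))        # averaged numeric predictions
print("R^2:", reg.score(X_test, y_test))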