Ds 5 Marks Final

The document discusses various statistical concepts including outliers, Exploratory Data Analysis (EDA), probability, and data types. It emphasizes the importance of identifying outliers, the role of EDA in understanding data, and compares different methods for detecting outliers. Additionally, it covers data preprocessing, machine learning, and the significance of statistical tests such as the one-sample t-test, along with the impact of imputation techniques on model performance.

Module -1

1. Define an outlier. Explain its significance in statistics with an example.


Definition:
An outlier is a data point that is very different from other observations in a dataset. It lies far away from the mean or typical
value. Outliers can be unusually high or low and often indicate errors, variability, or important discoveries. Identifying outliers
is key for accurate analysis.
Significance and Explanation:
1. Outliers may indicate mistakes in data entry or measurement (e.g., a person’s age listed as 200).
2. One extreme value can change the average and variability in the dataset.
3. Many models assume normal data distribution; outliers can lead to incorrect predictions.
4. In medical or scientific data, outliers may indicate a new discovery or rare condition.
5. Helps in deciding which data to remove or correct.
6. Outliers can skew test results and lead to false conclusions.
7. In finance, outliers can show big losses or profits, which are important for risk management.
Example:
In a dataset of students’ test scores: [75, 78, 80, 82, 85, 90, 100, 2], the score “2” is an outlier. It might be a typo or a real case of
someone who didn't attend the test. Identifying it is important to make accurate decisions.

2. Break down the concept of Exploratory Data Analysis (EDA) and illustrate its significance in statistics.
Definition:
Exploratory Data Analysis (EDA) is the first step in data analysis where we explore data using charts and statistics. It helps in
understanding data patterns, spotting errors, and forming hypotheses. EDA uses visual tools like histograms and box plots.
Importance:
1. Detects Patterns: Shows trends, clusters, or relationships in data.
2. Finds Outliers: Visual tools easily highlight unusual values.
3. Tests Assumptions: Helps check if data meets conditions for modeling.
4. Summarizes Data: Gives summary statistics (mean, median, etc.).
5. Guides Model Selection: Suggests suitable statistical techniques.
6. Prepares Data: Identifies missing or incorrect data.
7. Improves Insight: Offers a deeper, intuitive feel of the data.
Graphical Tools: Histograms, box plots, scatter plots
Statistical Tools: Mean, median, variance, skewness

3. Analyze how outliers can be identified in a dataset by comparing Z-score and IQR methods.
Definition:
Outliers are extreme values that differ greatly from others. Two common methods to detect them are the Z-score method
and the IQR method.
Comparison Table:
Feature Z-Score Method IQR Method
Basis Standard deviation Spread between Q1 and Q3 (middle 50%)
Formula Z = (X − Mean)/SD IQR = Q3 − Q1
Threshold Z > 3 or Z < -3 Outside Q1 − 1.5×IQR or Q3 + 1.5×IQR
Suitable for Normally distributed data Non-normal or skewed data
Easy to interpret? Yes (if mean and SD are known) Yes (uses percentiles)
Sensitive to outliers? Yes Less sensitive
Visual support Z-scores, bell curve Box plots
Conclusion:
Use Z-score for normal data, and IQR for skewed data.
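Illustrative code, a minimal sketch of both rules (assumes NumPy; the score list is the illustrative one from Question 1):
import numpy as np

scores = np.array([75, 78, 80, 82, 85, 90, 100, 2])

# Z-score rule: flag values more than 3 standard deviations from the mean
z = (scores - scores.mean()) / scores.std()
z_outliers = scores[np.abs(z) > 3]

# IQR rule: flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = np.percentile(scores, [25, 75])
iqr = q3 - q1
iqr_outliers = scores[(scores < q1 - 1.5 * iqr) | (scores > q3 + 1.5 * iqr)]

# With only 8 points, no |z| exceeds 3, while the IQR fences do flag the extreme score(s).
print("Z-score outliers:", z_outliers)
print("IQR outliers:", iqr_outliers)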

4. Find the probability of throwing two fair dice when the sum is 5 and when the sum is 8.
Definition:
Probability is the chance of an event happening, calculated by dividing the number of favorable outcomes by total outcomes.
Total outcomes when rolling 2 dice = 6 × 6 = 36
Sum = 5:
Favorable pairs: (1,4), (2,3), (3,2), (4,1) → 4 outcomes
P(Sum = 5) = 4 / 36 = 1 / 9
Sum = 8:
Favorable pairs: (2,6), (3,5), (4,4), (5,3), (6,2) → 5 outcomes
P(Sum = 8) = 5 / 36
Final Answer:
• Probability (Sum = 5) = 1/9
• Probability (Sum = 8) = 5/36
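The counts can be checked by enumerating all 36 outcomes; a minimal sketch in Python:
from itertools import product

outcomes = list(product(range(1, 7), repeat=2))  # all 36 (die1, die2) pairs
p_sum_5 = sum(1 for a, b in outcomes if a + b == 5) / len(outcomes)
p_sum_8 = sum(1 for a, b in outcomes if a + b == 8) / len(outcomes)
print(p_sum_5, p_sum_8)  # 0.111... (= 1/9) and 0.1388... (= 5/36)
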
5. Compare quantitative data and qualitative data.
Definition:
Quantitative data is numerical and represents measurable values (e.g., height). Qualitative data describes categories or qualities (e.g., color).
Comparison Table:
Feature Quantitative Data Qualitative Data
Type Numeric Categorical
Example Age = 25, Weight = 60 kg Gender = Male, Color = Red
Subtypes Discrete, Continuous Nominal, Ordinal
Analysis Mean, Median, Range Mode, Frequency
Graphs Used Histogram, Boxplot Bar chart, Pie chart
Mathematical Ops Possible (add, subtract, etc.) Not possible
Use in Stats Used in regression, correlation Used in classification, grouping

6. Define covariance. Analyze how it helps in understanding the relationship between two variables.
Definition:
Covariance measures how two variables change together. A positive covariance means both increase together, while negative
means one increases as the other decreases.
Importance:
1. Shows Direction: Positive = same direction; Negative = opposite direction.
2. Measures Relationship: How one variable affects the other.
3. Supports Correlation: Covariance is used to calculate correlation.
4. Used in Finance: Helps in portfolio management to reduce risk.
5. Identifies Trends: Reveals linked behaviors (e.g., study time & scores).
6. Basis for PCA: Important in dimensionality reduction techniques.
7. Helps in Multivariate Analysis: Understands variable interaction.
Example:
If X = hours studied and Y = marks scored, high covariance shows a strong study-performance link.
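A minimal sketch using NumPy (the study-hours and marks values are illustrative):
import numpy as np

hours = np.array([2, 4, 6, 8, 10])       # X: hours studied (illustrative)
marks = np.array([50, 60, 70, 85, 95])   # Y: marks scored (illustrative)

cov_xy = np.cov(hours, marks)[0, 1]      # off-diagonal entry = Cov(X, Y)
corr_xy = np.corrcoef(hours, marks)[0, 1]
print(cov_xy, corr_xy)                   # positive covariance; correlation close to 1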

7. A distribution is skewed to the right and has a median of 20. Will the mean be greater than or less than 20? Explain.
Definition:
In a right-skewed distribution, most values are on the left, with a long tail on the right. The mean is pulled in the direction of the
skew.
Explanation:
1. In right-skewed data: Mean > Median
2. The median (middle value) = 20
3. Since the tail is on the right, some high values exist.
4. These high values increase the average (mean).
5. So, the mean will be greater than 20.
6. Typically, mode < median < mean in right-skewed data.
7. This affects statistical measures and modeling.
Conclusion:
Mean > 20 because of the skew towards higher values.

8. Define one-sample t-test. Explain when it is used in statistical analysis.


Definition:
A one-sample t-test checks if the mean of a sample is significantly different from a known or assumed population mean. It is
used when population standard deviation is unknown.
When used:
1. When comparing sample mean to a known population mean.
2. Sample size is small (usually < 30).
3. Population standard deviation is unknown.
4. Data should be approximately normally distributed.
5. Used in quality control (e.g., is average bottle fill = 500ml?).
6. Helps in hypothesis testing.
7. Useful in medical trials, education studies, etc.
Formula:
t = (x̄ − μ) / (s / √n)
Where x̄ = sample mean, μ = population mean, s = sample SD, n = sample size
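A minimal sketch using scipy.stats (the bottle-fill measurements are illustrative; H0 is that the mean fill is 500 ml):
import numpy as np
from scipy import stats

fills = np.array([498.2, 500.1, 499.5, 501.0, 497.8, 499.9, 500.4, 498.7])  # illustrative

t_stat, p_value = stats.ttest_1samp(fills, popmean=500)
print(t_stat, p_value)  # if p_value < 0.05, reject H0 that the mean fill is 500 ml
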
9. Apply your understanding of outliers to identify situations where retaining them is appropriate.
Definition:
Outliers are extreme values that lie outside the general trend of data. Sometimes they are not errors but important information.
When to retain outliers:
1. When they are genuine: Real-life rare but valid cases (e.g., record-breaking sports data).
2. In risk analysis: Outliers may show rare but risky situations (e.g., financial losses).
3. Medical Research: Rare side effects or symptoms can be crucial.
4. Business Insights: Unique customer behavior may uncover new opportunities.
5. Scientific Discoveries: May point to new theories or phenomena.
6. No error is found: If the data is correct, there's no reason to remove it.
7. Modeling needs: Some models need full range of variation, including extremes.
Example:
A temperature reading of 50°C in a city may be rare but real during a heatwave.

10. List key properties of a normal distribution. Analyze any two.


Definition:
A normal distribution is a symmetric bell-shaped curve where most values are near the mean. It is used in many statistical
models and tests.
Key Properties:
1. Symmetrical shape.
2. Mean = Median = Mode.
3. Bell-shaped curve.
4. Total area under curve = 1.
5. 68%-95%-99.7% rule.
6. Defined by mean (μ) and standard deviation (σ).
7. Tails approach but never touch the x-axis.
Analysis of Two Properties:
• Symmetry: Equal distribution on both sides of the mean, important for balanced datasets.
• 68%-95%-99.7% Rule: Helps in confidence intervals and understanding data spread.

11. Explain different stages of Data Science.


Definition:
Data science involves extracting knowledge from data using statistics, programming, and domain knowledge. It follows a
structured process to solve real-world problems.
Stages:
1. Problem Definition: Understand the business or research question.
2. Data Collection: Gather data from sources (databases, APIs, sensors).
3. Data Cleaning: Remove errors, duplicates, and missing values.
4. EDA (Exploration): Understand data using visuals and statistics.
5. Feature Engineering: Select and create variables for modeling.
6. Modeling: Apply algorithms to build predictive or descriptive models.
7. Evaluation: Test model accuracy and performance.
8. Deployment: Use the model in real systems.
9. Monitoring & Updating: Track model performance over time.

12. What is machine learning? Justify its role and importance in data science.
Definition:
Machine Learning (ML) is a branch of AI where computers learn from data to make decisions or predictions without being
explicitly programmed.
Importance in Data Science:
1. Automates Decision Making: Learns patterns and predicts future outcomes.
2. Handles Large Data: Works well with big datasets.
3. Used in Various Fields: Health, finance, marketing, etc.
4. Improves Accuracy: Often better than manual rules.
5. Learns Continuously: Models improve with new data.
6. Enables Personalization: E.g., Netflix recommendations.
7. Powers AI Applications: Like chatbots, voice assistants.
Example:
A spam filter uses machine learning to detect junk emails based on past patterns.
13. What is data preprocessing? Explain its role and common methods.
Definition:
Data preprocessing is the step of preparing raw data for analysis. It includes cleaning, transforming, and organizing data to
make it suitable for models.
Role in Data Analysis:
1. Removes noise and errors.
2. Handles missing or duplicate values.
3. Normalizes/standardizes data for algorithms.
4. Converts data types and encodes categories.
5. Improves model performance and accuracy.
6. Makes data compatible with ML algorithms.
7. Saves time and ensures consistency.
Common Methods:
• Data Cleaning: Remove/fix incorrect values.
• Normalization/Scaling: Adjust ranges (0 to 1).
• Encoding Categorical Data: Use one-hot or label encoding.
• Missing Value Treatment: Impute or drop.
• Data Transformation: Log, square root, etc.

Module 3
1. Evaluate the effectiveness of the median over the mean in specific data scenarios.
Definition:
The median is the middle value of an ordered dataset. The mean is the arithmetic average. The median is more effective than
the mean when data contains outliers or is skewed, as it is not affected by extreme values.
Why median is better in some cases:
1. Unaffected by outliers: Median ignores extreme values.
2. Better for skewed data: Accurately represents center.
3. More realistic average: In income data, median gives a fairer picture.
4. Preferred in small datasets: Especially when uneven values exist.
5. Useful in ordinal data: Like survey ranks (e.g., poor, fair, good).
6. More robust: Not easily distorted by extreme values.
7. Easier interpretation in real-world contexts.
Example:
Salaries: [20k, 25k, 27k, 30k, 1 million]
• Mean = 220.4k (misleading)
• Median = 27k (realistic)
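A minimal sketch checking these figures with NumPy:
import numpy as np

salaries = np.array([20_000, 25_000, 27_000, 30_000, 1_000_000])
print(np.mean(salaries))    # 220400.0 -> pulled up by the extreme salary
print(np.median(salaries))  # 27000.0  -> unaffected by the extreme value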

2. Analyze the impact of different imputation techniques on a dataset with missing values by comparing their outcomes.
Definition:
Imputation is filling in missing data to avoid bias or errors in analysis. Different techniques affect the dataset and model
outcomes differently.
Key points:
1. Mean Imputation: Replaces missing values with mean of feature. Simple but can reduce variance.
2. Median Imputation: Uses median; better for skewed data as it’s less affected by outliers.
3. Mode Imputation: Replaces missing categorical data with most frequent value.
4. KNN Imputation: Uses nearest neighbors to estimate missing values; preserves data patterns.
5. Impact on Data: Simple methods may bias data; advanced methods keep distribution.
6. Effect on Models: Better imputation improves model accuracy and reduces bias.
7. Trade-offs: More complex methods require more computation but improve quality.

3. Analyze the effect of mean, median, and KNN imputation on model performance when applied to missing values.
Definition:
Imputation methods affect how models learn and predict. Choosing the right method influences accuracy, bias, and variance.
Comparison:
Imputation Method Effect on Model Performance When to Use
Mean Can bias model if outliers exist When data is symmetric and clean
Median Robust to outliers, better accuracy Skewed data or outlier presence
KNN Best preserves data structure; improves accuracy Complex data with patterns
Example:
Using KNN might improve precision by 5% over mean imputation in a health dataset.
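A minimal sketch of mean, median, and KNN imputation with scikit-learn (the small matrix with missing values is illustrative):
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer

X = np.array([[1.0, 2.0], [3.0, np.nan], [5.0, 6.0], [np.nan, 8.0]])  # illustrative

mean_imp = SimpleImputer(strategy="mean").fit_transform(X)
median_imp = SimpleImputer(strategy="median").fit_transform(X)
knn_imp = KNNImputer(n_neighbors=2).fit_transform(X)

print(mean_imp)
print(median_imp)
print(knn_imp)  # KNN fills each gap from the 2 nearest rows, preserving local structure
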
4. Develop a real-world scenario illustrating an alternative hypothesis in hypothesis testing.
Definition:
The alternative hypothesis (H₁) is a statement that shows there is an effect or difference, opposite of the null hypothesis (H₀).
Example scenario:
In a drug trial,
• H₀: The new drug has no effect on blood pressure.
• H₁: The new drug reduces blood pressure significantly.
Justification:
1. The alternative reflects the research goal.
2. Based on preliminary data or theory.
3. It is tested to confirm the drug’s efficacy.
4. Helps decide if observed effects are real or by chance.
5. Supports decision-making in medicine.
6. Requires careful data collection.
7. Provides direction for statistical tests.

5. Critically evaluate different types of sampling bias and their impact on research validity with examples.
Definition:
Sampling bias occurs when sample data does not represent the population, leading to invalid conclusions.
Types of bias and examples:
Bias Type Description Impact Real-Life Example
Selection Bias Non-random sampling Misleading results Survey only urban population
Response Bias Respondents answer untruthfully Inaccurate data Sensitive questions in surveys
Non-response Bias Certain groups don’t respond Missing views Low participation by elderly
Sampling Frame Bias Wrong population frame used Skews data Using phone book for all residents
Impact:
• Reduces validity and reliability.
• Leads to wrong conclusions.
• Affects generalizability.
• Needs mitigation by proper sampling.

6. Critically evaluate degrees of freedom (DF) and its significance in statistical tests with an example.
Definition:
Degrees of freedom (DF) refer to the number of independent values that can vary in an analysis without violating constraints.
Significance:
1. Influences critical values in hypothesis tests.
2. Affects test statistic distributions (e.g., t-distribution).
3. Lower DF leads to wider confidence intervals.
4. Ensures proper estimation of variability.
5. Used in tests like t-test, chi-square.
6. Important for sample size considerations.
7. Example: For a t-test with n=10 samples, DF = 9.
Example:
In a one-sample t-test with 10 data points, DF = 10 - 1 = 9, influencing the t critical value.

7. Analyze normalization vs standardization in data preprocessing.


Definition:
Normalization and standardization are scaling techniques that adjust data values for better model performance.
Feature Normalization Standardization
Method Rescales data to [0,1] range Centers data to mean=0, SD=1
Formula (x − min) / (max − min) (x − μ) / σ
Effect Changes range Changes mean and variance
Use case When data is not normally distributed When data is normally distributed
Sensitive to outliers Yes Less sensitive
Application Neural networks, image processing Algorithms assuming normality (e.g., SVM)
Example:
Normalization is good for pixel values; standardization suits height/weight data.
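A minimal sketch contrasting the two scalers in scikit-learn (the height values are illustrative):
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

heights = np.array([[150.0], [160.0], [170.0], [180.0], [190.0]])  # illustrative, in cm

normalized = MinMaxScaler().fit_transform(heights)      # rescaled to the [0, 1] range
standardized = StandardScaler().fit_transform(heights)  # mean 0, SD 1

print(normalized.ravel())
print(standardized.ravel())
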
8. Generate and analyze a histogram for a numerical feature in a dataset using Python.
Definition:
A histogram is a graphical representation showing the distribution of numerical data by grouping values into bins.
Steps:
1. Use Python libraries like Matplotlib or Seaborn.
2. Plot feature’s value frequency.
3. Examine shape (symmetry, skewness).
4. Check central tendency (peak location).
5. Assess spread (width of bars).
6. Identify outliers or gaps.
7. Helps decide preprocessing or modeling methods.
Example:
import seaborn as sns
import matplotlib.pyplot as plt

# 'data' is assumed to be a pandas DataFrame containing a numeric 'age' column.
sns.histplot(data['age'], bins=10, kde=True)
plt.title('Age Distribution')
plt.show()
Interpretation:
If histogram is right-skewed, mean > median; spread indicates variability.

9. Analyze overfitting vs underfitting in machine learning models.


Definition:
• Overfitting: Model learns training data too well, including noise, leading to poor generalization.
• Underfitting: Model is too simple to capture underlying pattern, resulting in poor performance.
Feature Overfitting Underfitting
Model Complexity Too high Too low
Training Accuracy Very high Low
Test Accuracy Low due to poor generalization Low due to weak model
Cause Excessive features, deep trees Too few features, shallow models
Solution Regularization, pruning, more data More features, complex models
Example Memorizing training set Linear model for non-linear data
Impact:
Both reduce model usefulness; balancing bias and variance is crucial.

Module 4

1. Analyze how Logistic Regression is applied to classify binary outcomes. Compare it with Linear Regression.
Definition:
Logistic Regression is a supervised classification algorithm used to predict binary outcomes (0 or 1). It models the probability
of an event occurring using the logistic (sigmoid) function, unlike Linear Regression which predicts continuous values.
Key points:
1. Logistic Regression outputs probabilities between 0 and 1, which are converted to classes using a threshold (usually
0.5).
2. It uses the sigmoid function: σ(z) = 1 / (1 + e^(−z)).
3. Assumes a linear relationship between features and the log-odds of the outcome.
4. Suitable for classification tasks, unlike Linear Regression which predicts continuous outcomes.
5. Linear Regression can predict values outside [0,1], making it unsuitable for classification.
6. Logistic Regression estimates parameters using Maximum Likelihood Estimation (MLE).
7. Outputs can be interpreted as the likelihood of belonging to a class.
Aspect Logistic Regression Linear Regression
Output Probability (0 to 1) Continuous values
Function Sigmoid function Linear function
Use case Classification (binary) Regression (continuous outcomes)
Error metric Log loss Mean Squared Error
Assumptions Linear in log-odds Linear relationship between X and Y
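A minimal sketch with scikit-learn on a toy dataset; predict_proba returns probabilities and predict applies the default 0.5 threshold:
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=4, random_state=0)  # toy data
model = LogisticRegression().fit(X, y)

probs = model.predict_proba(X[:5])[:, 1]  # P(class = 1) for the first 5 rows
labels = model.predict(X[:5])             # class labels after the 0.5 threshold
print(probs, labels)
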
2. Analyze how varying 'k' affects K-Nearest Neighbors (KNN) accuracy and decision boundaries.
Definition:
KNN classifies a data point based on the majority class among its ‘k’ closest neighbors. The choice of ‘k’ directly affects model
bias and variance.
Key points:
1. Small k (e.g., k=1): Low bias, high variance; model fits training data tightly, prone to noise.
2. Large k: Higher bias, lower variance; smoother decision boundaries but may overlook local patterns.
3. Small k creates complex, jagged boundaries; large k creates simpler, smoother boundaries.
4. Very large k approaches majority class baseline (poor local accuracy).
5. Optimal k balances bias-variance trade-off, often found via cross-validation.
6. Model performance can decrease if k is too small or too large.
7. Choice of k impacts computation cost (larger k means more neighbors checked).
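A minimal sketch comparing cross-validated accuracy for several k values on a toy dataset (the candidate k list is illustrative):
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=300, n_features=6, random_state=0)

for k in [1, 3, 5, 11, 25, 75]:  # small k -> high variance; large k -> high bias
    acc = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
    print(k, round(acc, 3))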

3. Evaluate K-Means clustering, optimal cluster determination methods (Elbow, Silhouette).


Definition:
K-Means partitions data into k clusters by minimizing the variance within each cluster through iterative centroid updates.
Key points:
1. Initial centroids chosen randomly or via heuristics.
2. Assigns points to nearest centroid, then recalculates centroids.
3. Repeats until convergence (no change in clusters).
4. Elbow Method: Plots total within-cluster sum of squares vs k; “elbow” point suggests optimal k.
5. Silhouette Score: Measures how similar an object is to its cluster vs others; closer to 1 means better clustering.
6. Elbow method may be ambiguous; Silhouette provides quantitative validation.
7. Both methods help avoid arbitrary k choice and improve cluster validity.
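A minimal sketch computing inertia (within-cluster sum of squares) and the silhouette score for a range of k on toy blobs:
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)  # toy data

for k in range(2, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, round(km.inertia_, 1), round(silhouette_score(X, km.labels_), 3))
# Look for the "elbow" in inertia and for the k with the highest silhouette score.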

4. Analyze how SVM separates classes using hyperplanes and kernel functions.
Definition:
Support Vector Machines (SVM) classify data by finding the hyperplane that best separates classes with the maximum margin.
Key points:
1. Finds the hyperplane maximizing margin between closest points (support vectors).
2. Works well for linearly separable data.
3. For non-linear data, kernel functions map data to higher-dimensional space.
4. Common kernels: linear, polynomial, radial basis function (RBF).
5. Kernels allow SVM to find a linear separator in transformed space.
6. SVM is robust to high-dimensional spaces.
7. Regularization controls overfitting by soft margin.
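A minimal sketch comparing a linear and an RBF kernel on a toy dataset of concentric circles, which is not linearly separable:
from sklearn.datasets import make_circles
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_circles(n_samples=300, factor=0.4, noise=0.1, random_state=0)  # non-linear data

for kernel in ["linear", "rbf"]:
    acc = cross_val_score(SVC(kernel=kernel, C=1.0), X, y, cv=5).mean()
    print(kernel, round(acc, 3))  # the RBF kernel should clearly outperform the linear one here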

5. Analyze Decision Tree splits based on Information Gain and Gini Impurity.
Definition:
Decision Trees split nodes by selecting features that best separate classes using metrics like Information Gain or Gini Impurity.
Key points:
1. Information Gain: Measures reduction in entropy after split; higher gain preferred.
2. Gini Impurity: Measures probability of misclassification; lower impurity preferred.
3. Choice of split impacts tree accuracy and complexity.
4. Splits that maximize purity improve model predictions.
5. More splits increase tree depth and risk of overfitting.
6. Balanced splits improve generalization.
7. Feature selection at nodes influences interpretability.

6. Evaluate overfitting in Decision Trees and pruning techniques.


Definition:
Overfitting happens when a decision tree models noise in training data, reducing its ability to generalize.
Key points:
1. Overfitting leads to very deep, complex trees.
2. Pruning reduces tree size by removing branches.
3. Pre-pruning: Stops splitting early based on criteria like minimum samples.
4. Post-pruning: Removes branches after full tree is grown.
5. Pruning improves accuracy on unseen data.
6. Reduces variance while increasing bias slightly.
7. Helps avoid memorizing noise and improves model robustness.
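A minimal sketch contrasting an unpruned tree, pre-pruning (max_depth, min_samples_leaf), and cost-complexity post-pruning (ccp_alpha) in scikit-learn; the parameter values are illustrative:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

trees = {
    "unpruned": DecisionTreeClassifier(random_state=0),
    "pre-pruned": DecisionTreeClassifier(max_depth=3, min_samples_leaf=5, random_state=0),
    "post-pruned": DecisionTreeClassifier(ccp_alpha=0.01, random_state=0),  # cost-complexity pruning
}
for name, tree in trees.items():
    print(name, round(cross_val_score(tree, X, y, cv=5).mean(), 3))
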
7. Analyze the role of distance metrics in KNN and importance of feature scaling.
Definition:
Distance metrics measure similarity between points in KNN, affecting classification/regression outcomes.
Key points:
1. Common metrics: Euclidean, Manhattan, Minkowski.
2. Choice affects neighborhood and thus predictions.
3. Features with larger scales dominate distance if not scaled.
4. Feature scaling (normalization/standardization) equalizes ranges.
5. Without scaling, features with big ranges bias model.
6. Scaling improves accuracy and stability of KNN.
7. Normalization used when data bounded; standardization when normally distributed.
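A minimal sketch showing how scaling changes KNN accuracy on the wine dataset, whose features sit on very different scales:
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

raw = KNeighborsClassifier(n_neighbors=5, metric="euclidean")
scaled = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5, metric="euclidean"))

print(round(cross_val_score(raw, X, y, cv=5).mean(), 3))     # unscaled: large-range features dominate
print(round(cross_val_score(scaled, X, y, cv=5).mean(), 3))  # scaled: usually noticeably higher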

8. Analyze how Logistic Regression models probabilities and effects of decision threshold adjustments.
Definition:
Logistic Regression models the probability of binary outcomes using the sigmoid function. The decision threshold determines
classification.
Key points:
1. Default threshold is 0.5 (probability above means positive class).
2. Adjusting threshold balances sensitivity (recall) and specificity.
3. Lower threshold increases recall but may increase false positives.
4. Higher threshold increases precision but may miss positives.
5. Important for imbalanced datasets (e.g., fraud detection).
6. ROC curve and Precision-Recall curve help select thresholds.
7. Threshold tuning improves model usefulness depending on problem context.
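A minimal sketch of threshold tuning with predict_proba on an imbalanced toy dataset (the thresholds are chosen for illustration):
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)  # imbalanced classes
model = LogisticRegression().fit(X, y)
probs = model.predict_proba(X)[:, 1]

for threshold in [0.5, 0.3, 0.1]:  # lowering the threshold raises recall, usually at the cost of precision
    preds = (probs >= threshold).astype(int)
    print(threshold, round(precision_score(y, preds), 2), round(recall_score(y, preds), 2))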

Module 4: Detailed Answers (Questions 1 to 8)

1. Analyze how Logistic Regression is applied to classify binary outcomes. Examine how its approach, underlying
assumptions, and output differ from those of Linear Regression.
Logistic Regression is a statistical method used to predict the probability of two possible outcomes (binary classification). It
uses the logistic (sigmoid) function to convert linear combinations of input features into probabilities between 0 and 1. Unlike
Linear Regression, it predicts class probabilities, not continuous values. It assumes a linear relationship between inputs and the
log-odds of the output.
• Predicts probabilities for binary outcomes (e.g., yes/no).
• Uses the sigmoid function to map results between 0 and 1.
• Assumes the log-odds of the outcome are a linear function of inputs.
• Outputs probabilities, unlike Linear Regression’s continuous predictions.
• Suitable for classification, not regression tasks.
• Uses maximum likelihood estimation to find model parameters.
• Can adjust decision thresholds for classification sensitivity.

2. Analyze how varying the value of 'k' influences the accuracy and decision boundaries of the K-Nearest Neighbors
(KNN) algorithm. Examine the consequences of choosing a 'k' that is too small or too large, and how it impacts model
bias, variance, and overall performance.
In KNN, 'k' is the number of neighbors considered to classify a point. A small 'k' results in a model sensitive to noise (high
variance), while a large 'k' smooths decision boundaries (high bias). Selecting the right 'k' balances accuracy by controlling bias-
variance tradeoff.
• Small 'k' (like 1) can cause overfitting and noisy boundaries.
• Large 'k' creates smoother, more generalized boundaries but may underfit.
• Small 'k' has low bias but high variance.
• Large 'k' has high bias but low variance.
• Optimal 'k' depends on data complexity and size.
• Cross-validation helps find the best 'k'.
• Too small or too large 'k' reduces model accuracy.

3. Critically evaluate how the K-Means clustering algorithm partitions data points into clusters. Justify the
methods used to determine the optimal number of clusters, such as the Elbow Method or Silhouette Score, and
assess their effectiveness in different scenarios.
K-Means divides data into k clusters by assigning points to the nearest centroid and updating centroids iteratively. To
choose the best k, methods like Elbow and Silhouette Score are used to balance cluster quality and simplicity.

Method Effectiveness
Elbow Method 1. Visualizes within-cluster variance. 2. Finds the "elbow" where adding clusters adds little improvement. 3. Simple and intuitive. 4. Subjective and sometimes unclear. 5. Works well with well-separated clusters.
Silhouette Score 1. Gives a numeric value for cluster quality. 2. Measures cohesion and separation. 3. More objective than the Elbow Method. 4. Useful for comparing clusterings. 5. Detects overlapping or unclear clusters.

4. Analyze how Support Vector Machines (SVM) separate data points into distinct classes by identifying optimal
hyperplanes. Examine the role of kernel functions in transforming non-linearly separable data into a higher-dimensional
space for effective classification.
SVM separates classes by finding the hyperplane that maximizes margin between them. If data isn’t linearly separable, kernels
transform data into higher dimensions to find a linear separator.
• Finds hyperplane with maximum margin between classes.
• Works well for binary classification problems.
• Maximizes distance between closest points (support vectors).
• Kernels (linear, polynomial, RBF) map data into higher dimensions.
• Helps separate complex, non-linear data.
• Prevents overfitting by maximizing margin.
• Effective in many practical applications.

5. Analyze how Decision Trees determine splits at each node based on feature selection criteria such as Information Gain
or Gini Impurity. Examine how the choice of splitting feature at each node impacts the tree’s accuracy, complexity, and
potential for overfitting.
Decision Trees split nodes using criteria that measure how well a feature separates classes. Information Gain and Gini Impurity
assess the purity improvement. Good splits increase accuracy but too many splits cause overfitting.
• Information Gain measures entropy reduction after a split.
• Gini Impurity measures probability of misclassification.
• Features with highest gain or lowest impurity are chosen.
• Early splits affect overall tree structure heavily.
• More splits improve training accuracy but may overfit.
• Selecting relevant features balances accuracy and complexity.
• Proper splitting reduces bias and variance.

6. Critically evaluate the problem of overfitting in Decision Trees and its impact on model generalization. Justify how
pruning techniques, such as pre-pruning and post-pruning, can effectively reduce overfitting and improve the model’s
performance on unseen data.
Overfitting occurs when the tree learns noise and specific patterns in training data, harming performance on new data. Pruning
removes unnecessary branches to reduce overfitting. Pre-pruning stops growth early; post-pruning trims grown trees.
Pre-pruning Post-pruning
1. Stops tree growth early 1. Grows full tree before pruning
2. Decides not to split if criteria fail 2. Removes branches after full growth
3. Faster, less complex trees 3. Simplifies complex trees
4. May miss important splits 4. Usually more accurate
5. Controls overfitting upfront 5. Requires additional validation

7. Analyze the role of distance metrics in the K-Nearest Neighbors (KNN) algorithm and how they influence the
classification or regression outcomes. Examine why feature scaling through normalization or standardization is crucial
when using KNN, particularly in datasets with varying feature ranges.
Distance metrics (like Euclidean) measure closeness of points in KNN. If features vary widely in scale, unscaled features
dominate distance calculations. Scaling ensures each feature contributes fairly.
• Distance metrics compute similarity between points.
• Euclidean distance is common for continuous data.
• Larger-scale features can bias distances without scaling.
• Normalization scales data to 0–1 range.
• Standardization rescales to mean 0, variance 1.
• Scaling improves KNN classification and regression accuracy.
• Essential when features have different units or ranges.

8. Analyze how Logistic Regression models probabilities to predict binary outcomes. Examine how adjusting the decision
threshold influences model performance, particularly in handling imbalanced datasets.
Logistic Regression outputs probabilities using the sigmoid function. The decision threshold determines how probabilities map
to class labels. Adjusting threshold affects false positives and negatives, crucial for imbalanced classes.
• Outputs class probabilities between 0 and 1.
• Default threshold is usually 0.5 for classification.
• Lowering threshold increases sensitivity (recall), catches more positives.
• Raising threshold increases specificity, reduces false positives.
• Threshold tuning helps balance errors for different applications.
• Important for imbalanced datasets where minority class matters more.
• Threshold can be optimized based on ROC or business needs.

Module-5

✅ Q1. Compare and contrast the advantages and disadvantages of ensemble methods like Bagging, Boosting, and
Stacking.
Definition:
Ensemble methods combine multiple models to improve prediction accuracy. Bagging, Boosting, and Stacking each work
differently but aim to reduce errors by using the strengths of several models together.
Bagging
Advantages: 1. Reduces variance. 2. Works well with unstable models like decision trees.
Disadvantages: 1. Less effective on high-bias models. 2. May not improve accuracy for all tasks.
Boosting
Advantages: 1. Reduces bias. 2. Focuses on difficult samples to improve accuracy.
Disadvantages: 1. Prone to overfitting. 2. Computationally expensive due to sequential learning.
Stacking
Advantages: 1. Combines strengths of different models. 2. Can outperform individual models.
Disadvantages: 1. Complex to implement. 2. Risk of overfitting if not tuned properly.
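A minimal sketch of the three ensembles in scikit-learn on a toy dataset (Boosting is shown with GradientBoostingClassifier; the model choices are illustrative):
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, GradientBoostingClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

models = {
    "bagging": BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=0),
    "boosting": GradientBoostingClassifier(random_state=0),
    "stacking": StackingClassifier(
        estimators=[("tree", DecisionTreeClassifier()), ("lr", LogisticRegression())],
        final_estimator=LogisticRegression(),
    ),
}
for name, model in models.items():
    print(name, round(cross_val_score(model, X, y, cv=5).mean(), 3))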

✅ Q2. Examine the trade-offs of different imputation techniques on a dataset with missing values.
Definition:
Imputation techniques replace missing data to allow complete analysis. Common methods include Mean, Median, and KNN
imputation. Each method has strengths and weaknesses depending on the data type.
Mean Imputation
Advantages: 1. Fast and simple. 2. Works well with a normal distribution.
Disadvantages: 1. Affected by outliers. 2. Reduces data variability.
Median Imputation
Advantages: 1. Robust to outliers. 2. Best for skewed data.
Disadvantages: 1. May ignore the variable's distribution. 2. Less informative for normal data.
KNN Imputation
Advantages: 1. Preserves relationships. 2. Provides more accurate results.
Disadvantages: 1. Slow on large datasets. 2. Needs feature scaling.

✅ Q3. Evaluate the performance of classification models using precision, recall, and F1-score.
Definition:
Precision, recall, and F1-score are evaluation metrics used to measure model performance, especially when classes are
imbalanced. They help us understand how well the model predicts the positive class.
Model Precision Recall F1-Score Best For
Logistic Regression Moderate Moderate Moderate Simple linear classification
Random Forest High High High Non-linear, complex datasets
SVM High (kernel) Moderate High if tuned High-dimensional and smaller datasets
Detailed Points:
1. Precision = TP / (TP + FP): Good when false positives are costly.
2. Recall = TP / (TP + FN): Important when missing positives is risky.
3. F1-Score balances both; used in medical or fraud detection.
4. Logistic Regression: Easy to interpret but limited to linear problems.
5. Random Forest: Excellent general performance, handles outliers and non-linearities.
6. SVM: Performs well with proper kernels but needs more computation.
7. Choose metrics based on the problem type and data imbalance.
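A minimal sketch computing precision, recall, and F1 for a fitted classifier on a toy, mildly imbalanced dataset:
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, weights=[0.8, 0.2], random_state=0)  # mildly imbalanced
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test)))  # precision, recall, F1 per class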

✅ Q4. Analyze the impact of hyperparameter tuning in Random Search in deep learning models.
Definition:
Random Search is a hyperparameter tuning method where random combinations of hyperparameters are selected and tested to
find the best configuration for a model.
Key Points:
1. Hyperparameters include learning rate, batch size, number of layers, etc.
2. Random Search randomly samples values instead of checking every possibility.
3. It often finds better models faster than grid search.
4. It saves computation time by skipping unimportant combinations.
5. Can explore wide spaces, helpful for deep learning where tuning is complex.
6. Needs enough trials to find good settings; performance improves with iterations.
7. It is especially useful when some parameters are more important than others.
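A minimal sketch of Random Search using scikit-learn's RandomizedSearchCV around a small MLPClassifier network (the search space and trial count are illustrative):
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

param_distributions = {                      # illustrative search space
    "hidden_layer_sizes": [(32,), (64,), (64, 32)],
    "learning_rate_init": loguniform(1e-4, 1e-1),
    "alpha": loguniform(1e-5, 1e-2),
}
search = RandomizedSearchCV(
    MLPClassifier(max_iter=500, random_state=0),
    param_distributions, n_iter=10, cv=3, random_state=0,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))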

✅ Q5. A regression analysis between apples (y) and oranges (x) resulted in the line:
y = 100 + 2x.
Predict the implication if oranges are increased by 1.
Definition:
In a linear regression equation y = a + bx, 'a' is the intercept and 'b' is the slope. The slope tells us how much y (apples) will
change when x (oranges) increases by one unit.
Key Points:
1. The slope b = 2, so for each 1 unit increase in oranges, apples increase by 2 units.
2. If x increases by 1, then:
y = 100 + 2(x + 1) = 100 + 2x + 2 = y (original) + 2.
3. So, apples increase by 2 units when oranges increase by 1 unit.
4. The intercept 100 shows apple value when orange count is zero.
5. The model shows a positive linear relationship between apples and oranges.
6. It's a predictive model, meaning change in input (oranges) helps estimate output (apples).
7. Useful in market analysis or pricing strategy when variables are related.
