Ds 5 Marks Final
2. Break down the concept of Exploratory Data Analysis (EDA) and illustrate its significance in statistics.
Definition:
Exploratory Data Analysis (EDA) is the first step in data analysis where we explore data using charts and statistics. It helps in
understanding data patterns, spotting errors, and forming hypotheses. EDA uses visual tools like histograms and box plots.
Importance:
1. Detects Patterns: Shows trends, clusters, or relationships in data.
2. Finds Outliers: Visual tools easily highlight unusual values.
3. Tests Assumptions: Helps check if data meets conditions for modeling.
4. Summarizes Data: Gives summary statistics (mean, median, etc.).
5. Guides Model Selection: Suggests suitable statistical techniques.
6. Prepares Data: Identifies missing or incorrect data.
7. Improves Insight: Offers a deeper, intuitive feel of the data.
Graphical Tools: Histograms, box plots, scatter plots
Statistical Tools: Mean, median, variance, skewness
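Code sketch (Python, illustrative only — the file name "sales.csv" and the column "price" are assumed placeholders; pandas and matplotlib are assumed to be installed):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Load a dataset (file name is a placeholder)
df = pd.read_csv("sales.csv")

# Statistical tools: summary statistics and skewness of every numeric column
print(df.describe())                 # mean, std, quartiles, min/max
print(df.skew(numeric_only=True))    # skewness per numeric column

# Graphical tools: histogram and box plot of one assumed column "price"
df["price"].hist(bins=30)
plt.title("Histogram of price")
plt.show()

df.boxplot(column="price")
plt.title("Box plot of price (outliers appear as separate points)")
plt.show()
```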
3. Analyze how outliers can be identified in a dataset by comparing Z-score and IQR methods.
Definition:
Outliers are extreme values that differ greatly from others. Two common methods to detect them are the Z-score method
and the IQR method.
Comparison Table:
Feature | Z-Score Method | IQR Method
Basis | Standard deviation | Spread between Q1 and Q3 (middle 50%)
Formula | Z = (X − Mean) / SD | Outlier if value < Q1 − 1.5×IQR or > Q3 + 1.5×IQR
Threshold | Z > 3 or Z < −3 | Outside Q1 − 1.5×IQR or Q3 + 1.5×IQR
Suitable for | Normally distributed data | Non-normal or skewed data
Easy to interpret? | Yes (if mean and SD are known) | Yes (uses percentiles)
Sensitive to outliers? | Yes | Less sensitive
Visual support | Z-scores, bell curve | Box plots
Conclusion:
Use Z-score for normal data, and IQR for skewed data.
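Code sketch (Python/NumPy, illustrative only — the data list is invented): both detection rules applied to the same small sample.

```python
import numpy as np

data = np.array([10, 12, 11, 13, 12, 11, 95])   # invented sample with one obvious outlier

# Z-score method: flag values more than 3 standard deviations from the mean
z = (data - data.mean()) / data.std()
z_outliers = data[np.abs(z) > 3]

# IQR method: flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
iqr_outliers = data[(data < q1 - 1.5 * iqr) | (data > q3 + 1.5 * iqr)]

print("Z-score outliers:", z_outliers)   # empty here: the outlier inflates the SD itself
print("IQR outliers:", iqr_outliers)     # [95]: quartiles are not distorted by the outlier
```

This small case also illustrates the table above: the Z-score rule is itself sensitive to the outlier, while the IQR rule is not.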
4. Find the probability of throwing two fair dice when the sum is 5 and when the sum is 8.
Definition:
Probability is the chance of an event happening, calculated by dividing the number of favorable outcomes by total outcomes.
Total outcomes when rolling 2 dice = 6 × 6 = 36
Sum = 5:
Favorable pairs: (1,4), (2,3), (3,2), (4,1) → 4 outcomes
P(Sum = 5) = 4 / 36 = 1 / 9
Sum = 8:
Favorable pairs: (2,6), (3,5), (4,4), (5,3), (6,2) → 5 outcomes
P(Sum = 8) = 5 / 36
Final Answer:
• Probability (Sum = 5) = 1/9
• Probability (Sum = 8) = 5/36
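Quick Python check by enumerating all 36 outcomes:

```python
from itertools import product
from fractions import Fraction

outcomes = list(product(range(1, 7), repeat=2))   # all 36 (die1, die2) pairs

for target in (5, 8):
    favorable = [p for p in outcomes if sum(p) == target]
    print(f"Sum = {target}: {len(favorable)} outcomes, "
          f"P = {Fraction(len(favorable), len(outcomes))}")
# Sum = 5: 4 outcomes, P = 1/9
# Sum = 8: 5 outcomes, P = 5/36
```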
5. Compare quantitative data and qualitative data.
Definition:
Quantitative data is numerical and represents measurable values (e.g., height). Qualitative data describes categories or qualities (e.g., colour).
Comparison Table:
Feature | Quantitative Data | Qualitative Data
Type | Numeric | Categorical
Example | Age = 25, Weight = 60 kg | Gender = Male, Color = Red
Subtypes | Discrete, Continuous | Nominal, Ordinal
Analysis | Mean, Median, Range | Mode, Frequency
Graphs Used | Histogram, Boxplot | Bar chart, Pie chart
Mathematical Ops | Possible (add, subtract, etc.) | Not possible
Use in Stats | Used in regression, correlation | Used in classification, grouping
6. Define covariance. Analyze how it helps in understanding the relationship between two variables.
Definition:
Covariance measures how two variables change together. A positive covariance means both increase together, while negative
means one increases as the other decreases.
Importance:
1. Shows Direction: Positive = same direction; Negative = opposite direction.
2. Measures Relationship: How one variable affects the other.
3. Supports Correlation: Covariance is used to calculate correlation.
4. Used in Finance: Helps in portfolio management to reduce risk.
5. Identifies Trends: Reveals linked behaviors (e.g., study time & scores).
6. Basis for PCA: Important in dimensionality reduction techniques.
7. Helps in Multivariate Analysis: Understands variable interaction.
Example:
If X = hours studied and Y = marks scored, high covariance shows a strong study-performance link.
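Code sketch (Python/NumPy, illustrative only — the hours and marks values are invented):

```python
import numpy as np

hours = np.array([1, 2, 3, 4, 5])        # X: hours studied (invented)
marks = np.array([35, 45, 50, 65, 75])   # Y: marks scored (invented)

cov_xy = np.cov(hours, marks)[0, 1]      # sample covariance of X and Y
corr = np.corrcoef(hours, marks)[0, 1]   # covariance scaled by the two SDs

print("Covariance:", cov_xy)   # positive -> the two variables move together
print("Correlation:", corr)    # same sign as covariance, but bounded in [-1, 1]
```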
7. A distribution is skewed to the right and has a median of 20. Will the mean be greater than or less than 20? Explain.
Definition:
In a right-skewed distribution, most values are on the left, with a long tail on the right. The mean is pulled in the direction of the
skew.
Explanation:
1. In right-skewed data: Mean > Median
2. The median (middle value) = 20
3. Since the tail is on the right, some high values exist.
4. These high values increase the average (mean).
5. So, the mean will be greater than 20.
6. The mode < median < mean in right-skewed data.
7. This affects statistical measures and modeling.
Conclusion:
Mean > 20 because of the skew towards higher values.
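Tiny numeric illustration (values invented) of the mean being pulled above the median by a right tail:

```python
import numpy as np

data = [10, 15, 20, 25, 120]        # median 20, long right tail
print("Median:", np.median(data))   # 20.0
print("Mean:  ", np.mean(data))     # 38.0 -> mean > median, as expected for right skew
```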
12. What is machine learning? Justify its role and importance in data science.
Definition:
Machine Learning (ML) is a branch of AI where computers learn from data to make decisions or predictions without being
explicitly programmed.
Importance in Data Science:
1. Automates Decision Making: Learns patterns and predicts future outcomes.
2. Handles Large Data: Works well with big datasets.
3. Used in Various Fields: Health, finance, marketing, etc.
4. Improves Accuracy: Often better than manual rules.
5. Learns Continuously: Models improve with new data.
6. Enables Personalization: E.g., Netflix recommendations.
7. Powers AI Applications: Like chatbots, voice assistants.
Example:
A spam filter uses machine learning to detect junk emails based on past patterns.
13. What is data preprocessing? Explain its role and common methods.
Definition:
Data preprocessing is the step of preparing raw data for analysis. It includes cleaning, transforming, and organizing data to
make it suitable for models.
Role in Data Analysis:
1. Removes noise and errors.
2. Handles missing or duplicate values.
3. Normalizes/standardizes data for algorithms.
4. Converts data types and encodes categories.
5. Improves model performance and accuracy.
6. Makes data compatible with ML algorithms.
7. Saves time and ensures consistency.
Common Methods:
• Data Cleaning: Remove/fix incorrect values.
• Normalization/Scaling: Adjust ranges (0 to 1).
• Encoding Categorical Data: Use one-hot or label encoding.
• Missing Value Treatment: Impute or drop.
• Data Transformation: Log, square root, etc.
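Code sketch (Python/scikit-learn, illustrative only — the DataFrame and column names are invented) combining several of these methods in one pipeline:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Invented raw data with a missing value and a categorical column
df = pd.DataFrame({
    "age": [25, 32, None, 41],
    "income": [30000, 52000, 47000, 61000],
    "city": ["Delhi", "Mumbai", "Delhi", "Chennai"],
})

preprocess = ColumnTransformer([
    # Missing value treatment + scaling to the 0-1 range for numeric columns
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", MinMaxScaler())]), ["age", "income"]),
    # One-hot encoding for the categorical column
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"]),
])

X = preprocess.fit_transform(df)
print(X.shape)   # 4 rows x (2 scaled numeric + 3 one-hot city columns)
```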
Module 3
1. Evaluate the effectiveness of the median over the mean in specific data scenarios.
Definition:
The median is the middle value of an ordered dataset. The mean is the arithmetic average. The median is more effective than
the mean when data contains outliers or is skewed, as it is not affected by extreme values.
Why median is better in some cases:
1. Unaffected by outliers: Median ignores extreme values.
2. Better for skewed data: Accurately represents center.
3. More realistic average: In income data, median gives a fairer picture.
4. Preferred in small datasets: Especially when uneven values exist.
5. Useful in ordinal data: Like survey ranks (e.g., poor, fair, good).
6. More robust: Not easily distorted by extreme values.
7. Easier interpretation in real-world contexts.
Example:
Salaries: [20k, 25k, 27k, 30k, 1 million]
• Mean = 220.4k (misleading)
• Median = 27k (realistic)
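Verifying the salary example in Python:

```python
import numpy as np

salaries = [20_000, 25_000, 27_000, 30_000, 1_000_000]
print("Mean:  ", np.mean(salaries))    # 220400.0 -> dragged up by the single huge salary
print("Median:", np.median(salaries))  # 27000.0  -> a fairer "typical" salary
```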
2. Analyze the impact of different imputation techniques on a dataset with missing values by comparing their outcomes.
Definition:
Imputation is filling in missing data to avoid bias or errors in analysis. Different techniques affect the dataset and model
outcomes differently.
Key points:
1. Mean Imputation: Replaces missing values with mean of feature. Simple but can reduce variance.
2. Median Imputation: Uses median; better for skewed data as it’s less affected by outliers.
3. Mode Imputation: Replaces missing categorical data with most frequent value.
4. KNN Imputation: Uses nearest neighbors to estimate missing values; preserves data patterns.
5. Impact on Data: Simple methods may bias data; advanced methods keep distribution.
6. Effect on Models: Better imputation improves model accuracy and reduces bias.
7. Trade-offs: More complex methods require more computation but improve quality.
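Code sketch (Python/scikit-learn, illustrative only — the array is invented) comparing how the three techniques fill the same missing cell:

```python
import numpy as np
from sklearn.impute import KNNImputer, SimpleImputer

X = np.array([[1.0, 10.0],
              [2.0, 12.0],
              [3.0, np.nan],    # missing value to be filled
              [4.0, 16.0],
              [5.0, 100.0]])    # outlier in the second column

for name, imputer in [("mean", SimpleImputer(strategy="mean")),
                      ("median", SimpleImputer(strategy="median")),
                      ("knn", KNNImputer(n_neighbors=2))]:
    filled = imputer.fit_transform(X)
    print(name, "->", filled[2, 1])
# mean   -> 34.5 (pulled toward the outlier)
# median -> 14.0 (robust to the outlier)
# knn    -> 14.0 (average of the two nearest rows, preserving the local pattern)
```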
3. Analyze the effect of mean, median, and KNN imputation on model performance when applied to missing values.
Definition:
Imputation methods affect how models learn and predict. Choosing the right method influences accuracy, bias, and variance.
Comparison:
Imputation Method | Effect on Model Performance | When to Use
Mean | Can bias model if outliers exist | When data is symmetric and clean
Median | Robust to outliers, better accuracy | Skewed data or outlier presence
KNN | Best preserves data structure; improves accuracy | Complex data with patterns
Example:
Using KNN might improve precision by 5% over mean imputation in a health dataset.
4. Develop a real-world scenario illustrating an alternative hypothesis in hypothesis testing.
Definition:
The alternative hypothesis (H₁) is a statement that shows there is an effect or difference, opposite of the null hypothesis (H₀).
Example scenario:
In a drug trial,
• H₀: The new drug has no effect on blood pressure.
• H₁: The new drug reduces blood pressure significantly.
Justification:
1. The alternative reflects the research goal.
2. Based on preliminary data or theory.
3. It is tested to confirm the drug’s efficacy.
4. Helps decide if observed effects are real or by chance.
5. Supports decision-making in medicine.
6. Requires careful data collection.
7. Provides direction for statistical tests.
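Code sketch (Python/SciPy, illustrative only — the blood-pressure readings are simulated, not real trial data) of how H₀ vs H₁ could be tested:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Simulated systolic blood pressure after treatment vs. placebo (illustration only)
drug = rng.normal(loc=128, scale=10, size=40)
placebo = rng.normal(loc=135, scale=10, size=40)

# H0: no difference in means; H1: the drug group has lower blood pressure (one-sided)
t_stat, p_value = stats.ttest_ind(drug, placebo, alternative="less")
print(t_stat, p_value)   # if p_value < 0.05, reject H0 in favour of H1
```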
5. Critically evaluate different types of sampling bias and their impact on research validity with examples.
Definition:
Sampling bias occurs when sample data does not represent the population, leading to invalid conclusions.
Types of bias and examples:
Bias Type | Description | Impact | Real-Life Example
Selection Bias | Non-random sampling | Misleading results | Survey only urban population
Response Bias | Respondents answer untruthfully | Inaccurate data | Sensitive questions in surveys
Non-response Bias | Certain groups don’t respond | Missing views | Low participation by elderly
Sampling Frame Bias | Wrong population frame used | Skews data | Using phone book for all residents
Impact:
• Reduces validity and reliability.
• Leads to wrong conclusions.
• Affects generalizability.
• Needs mitigation by proper sampling.
6. Critically evaluate degrees of freedom (DF) and its significance in statistical tests with an example.
Definition:
Degrees of freedom (DF) refer to the number of independent values that can vary in an analysis without violating constraints.
Significance:
1. Influences critical values in hypothesis tests.
2. Affects test statistic distributions (e.g., t-distribution).
3. Lower DF leads to wider confidence intervals.
4. Ensures proper estimation of variability.
5. Used in tests like t-test, chi-square.
6. Important for sample size considerations.
7. Example: For a t-test with n=10 samples, DF = 9.
Example:
In a one-sample t-test with 10 data points, DF = 10 - 1 = 9, influencing the t critical value.
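Checking the critical value with SciPy:

```python
from scipy import stats

df = 10 - 1                          # one-sample t-test with n = 10
t_crit = stats.t.ppf(0.975, df)      # two-sided test at the 5% level
print(round(t_crit, 3))              # about 2.262; with df = 29 it drops to about 2.045
```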
Module 4
1. Analyze how Logistic Regression is applied to classify binary outcomes. Compare it with Linear Regression.
Definition:
Logistic Regression is a supervised classification algorithm used to predict binary outcomes (0 or 1). It models the probability
of an event occurring using the logistic (sigmoid) function, unlike Linear Regression which predicts continuous values.
Key points:
1. Logistic Regression outputs probabilities between 0 and 1, which are converted to classes using a threshold (usually
0.5).
2. It uses the sigmoid function: σ(z) = 1 / (1 + e^(−z)).
3. Assumes a linear relationship between features and the log-odds of the outcome.
4. Suitable for classification tasks, unlike Linear Regression which predicts continuous outcomes.
5. Linear Regression can predict values outside [0,1], making it unsuitable for classification.
6. Logistic Regression estimates parameters using Maximum Likelihood Estimation (MLE).
7. Outputs can be interpreted as the likelihood of belonging to a class.
Aspect | Logistic Regression | Linear Regression
Output | Probability (0 to 1) | Continuous values
Function | Sigmoid function | Linear function
Use case | Classification (binary) | Regression (continuous outcomes)
Error metric | Log loss | Mean Squared Error
Assumptions | Linear in log-odds | Linear relationship between X and Y
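Code sketch (Python/scikit-learn, illustrative only — the hours-vs-pass data is invented) contrasting the two models:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Invented data: hours studied vs. pass (1) / fail (0)
X = np.array([[1], [2], [3], [4], [5], [6], [7], [8]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

lin = LinearRegression().fit(X, y)
log = LogisticRegression().fit(X, y)

print(lin.predict([[10]]))        # goes above 1 -> unsuitable as a class probability
print(log.predict_proba([[10]]))  # probabilities for class 0 and 1, always inside [0, 1]
print(log.predict([[10]]))        # class label after applying the default 0.5 threshold
```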
2. Analyze how varying 'k' affects K-Nearest Neighbors (KNN) accuracy and decision boundaries.
Definition:
KNN classifies a data point based on the majority class among its ‘k’ closest neighbors. The choice of ‘k’ directly affects model
bias and variance.
Key points:
1. Small k (e.g., k=1): Low bias, high variance; model fits training data tightly, prone to noise.
2. Large k: Higher bias, lower variance; smoother decision boundaries but may overlook local patterns.
3. Small k creates complex, jagged boundaries; large k creates simpler, smoother boundaries.
4. Very large k approaches majority class baseline (poor local accuracy).
5. Optimal k balances bias-variance trade-off, often found via cross-validation.
6. Model performance can decrease if k is too small or too large.
7. Choice of k impacts computation cost (larger k means more neighbors checked).
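Code sketch (Python/scikit-learn; uses the built-in iris dataset so it runs as-is) showing how cross-validation exposes the effect of k:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Accuracy for a range of k values; very small or very large k usually scores worse
for k in (1, 5, 15, 50):
    knn = KNeighborsClassifier(n_neighbors=k)
    print(f"k={k:>2}: mean CV accuracy = {cross_val_score(knn, X, y, cv=5).mean():.3f}")
```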
4. Analyze how SVM separates classes using hyperplanes and kernel functions.
Definition:
Support Vector Machines (SVM) classify data by finding the hyperplane that best separates classes with the maximum margin.
Key points:
1. Finds the hyperplane maximizing margin between closest points (support vectors).
2. Works well for linearly separable data.
3. For non-linear data, kernel functions map data to higher-dimensional space.
4. Common kernels: linear, polynomial, radial basis function (RBF).
5. Kernels allow SVM to find a linear separator in transformed space.
6. SVM is robust to high-dimensional spaces.
7. Regularization controls overfitting by soft margin.
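Code sketch (Python/scikit-learn; make_moons generates synthetic non-linear data, so the example is self-contained):

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=400, noise=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for kernel in ("linear", "rbf"):
    clf = SVC(kernel=kernel, C=1.0)    # C controls the soft margin (regularization)
    clf.fit(X_train, y_train)
    print(kernel, "test accuracy:", round(clf.score(X_test, y_test), 3))
# The RBF kernel typically separates the two "moons" far better than the linear kernel.
```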
5. Analyze Decision Tree splits based on Information Gain and Gini Impurity.
Definition:
Decision Trees split nodes by selecting features that best separate classes using metrics like Information Gain or Gini Impurity.
Key points:
1. Information Gain: Measures reduction in entropy after split; higher gain preferred.
2. Gini Impurity: Measures probability of misclassification; lower impurity preferred.
3. Choice of split impacts tree accuracy and complexity.
4. Splits that maximize purity improve model predictions.
5. More splits increase tree depth and risk of overfitting.
6. Balanced splits improve generalization.
7. Feature selection at nodes influences interpretability.
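Code sketch (Python, illustrative only — the class counts are invented) computing both criteria by hand for one candidate split:

```python
import numpy as np

def entropy(counts):
    p = np.array(counts) / sum(counts)
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

def gini(counts):
    p = np.array(counts) / sum(counts)
    return 1 - (p ** 2).sum()

parent = [10, 10]                 # invented node: 10 of class A, 10 of class B
left, right = [9, 1], [1, 9]      # a fairly pure candidate split (half the samples each side)

info_gain = entropy(parent) - (0.5 * entropy(left) + 0.5 * entropy(right))
gini_after = 0.5 * gini(left) + 0.5 * gini(right)

print("Information gain:", round(info_gain, 3))   # about 0.531 (higher is better)
print("Gini after split:", round(gini_after, 3))  # 0.18 vs 0.5 at the parent (lower is better)
```

In scikit-learn the same choice is made by setting criterion="entropy" or criterion="gini" on DecisionTreeClassifier.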
8. Analyze how Logistic Regression models probabilities and effects of decision threshold adjustments.
Definition:
Logistic Regression models the probability of binary outcomes using the sigmoid function. The decision threshold determines
classification.
Key points:
1. Default threshold is 0.5 (probability above means positive class).
2. Adjusting threshold balances sensitivity (recall) and specificity.
3. Lower threshold increases recall but may increase false positives.
4. Higher threshold increases precision but may miss positives.
5. Important for imbalanced datasets (e.g., fraud detection).
6. ROC curve and Precision-Recall curve help select thresholds.
7. Threshold tuning improves model usefulness depending on problem context.
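Code sketch (Python/scikit-learn; make_classification creates a synthetic imbalanced dataset, so the example is self-contained) showing the precision/recall trade-off as the threshold moves:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

model = LogisticRegression().fit(X_train, y_train)
proba = model.predict_proba(X_test)[:, 1]     # probability of the positive (minority) class

for threshold in (0.5, 0.3, 0.1):
    pred = (proba >= threshold).astype(int)
    print(f"threshold={threshold}: precision={precision_score(y_test, pred):.2f}, "
          f"recall={recall_score(y_test, pred):.2f}")
# Lowering the threshold generally raises recall and lowers precision.
```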
1. Analyze how Logistic Regression is applied to classify binary outcomes. Examine how its approach, underlying
assumptions, and output differ from those of Linear Regression.
Logistic Regression is a statistical method used to predict the probability of two possible outcomes (binary classification). It
uses the logistic (sigmoid) function to convert linear combinations of input features into probabilities between 0 and 1. Unlike
Linear Regression, it predicts class probabilities, not continuous values. It assumes a linear relationship between inputs and the
log-odds of the output.
• Predicts probabilities for binary outcomes (e.g., yes/no).
• Uses the sigmoid function to map results between 0 and 1.
• Assumes the log-odds of the outcome are a linear function of inputs.
• Outputs probabilities, unlike Linear Regression’s continuous predictions.
• Suitable for classification, not regression tasks.
• Uses maximum likelihood estimation to find model parameters.
• Can adjust decision thresholds for classification sensitivity.
2. Analyze how varying the value of 'k' influences the accuracy and decision boundaries of the K-Nearest Neighbors
(KNN) algorithm. Examine the consequences of choosing a 'k' that is too small or too large, and how it impacts model
bias, variance, and overall performance.
In KNN, 'k' is the number of neighbors considered to classify a point. A small 'k' results in a model sensitive to noise (high
variance), while a large 'k' smooths decision boundaries (high bias). Selecting the right 'k' balances accuracy by controlling bias-
variance tradeoff.
• Small 'k' (like 1) can cause overfitting and noisy boundaries.
• Large 'k' creates smoother, more generalized boundaries but may underfit.
• Small 'k' has low bias but high variance.
• Large 'k' has high bias but low variance.
• Optimal 'k' depends on data complexity and size.
• Cross-validation helps find the best 'k'.
• Too small or too large 'k' reduces model accuracy.
3. Critically evaluate how the K-Means clustering algorithm partitions data points into clusters. Justify the
methods used to determine the optimal number of clusters, such as the Elbow Method or Silhouette Score, and
assess their effectiveness in different scenarios.
K-Means divides data into k clusters by assigning points to the nearest centroid and updating centroids iteratively. To
choose the best k, methods like Elbow and Silhouette Score are used to balance cluster quality and simplicity.
Method | Effectiveness
Elbow Method | 1. Visualizes within-cluster variance. 2. Finds “elbow” where adding clusters adds little improvement. 3. Simple and intuitive. 4. Subjective and sometimes unclear. 5. Works well with well-separated clusters.
Silhouette Score | 1. Gives numeric value for cluster quality. 2. Measures cohesion and separation. 3. More objective than Elbow. 4. Useful for comparing clusterings. 5. Detects overlapping or unclear clusters.
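Code sketch (Python/scikit-learn; make_blobs creates synthetic clusters, so the example is self-contained) showing both selection methods:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)

for k in range(2, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(f"k={k}: inertia={km.inertia_:.0f} (elbow plot uses this), "
          f"silhouette={silhouette_score(X, km.labels_):.3f}")
# Inertia keeps falling as k grows; the "elbow" and the highest silhouette score
# here typically both point to k = 4, the true number of blobs.
```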
4. Analyze how Support Vector Machines (SVM) separate data points into distinct classes by identifying optimal
hyperplanes. Examine the role of kernel functions in transforming non-linearly separable data into a higher-dimensional
space for effective classification.
SVM separates classes by finding the hyperplane that maximizes margin between them. If data isn’t linearly separable, kernels
transform data into higher dimensions to find a linear separator.
• Finds hyperplane with maximum margin between classes.
• Works well for binary classification problems.
• Maximizes distance between closest points (support vectors).
• Kernels (linear, polynomial, RBF) map data into higher dimensions.
• Helps separate complex, non-linear data.
• Prevents overfitting by maximizing margin.
• Effective in many practical applications.
5. Analyze how Decision Trees determine splits at each node based on feature selection criteria such as Information Gain
or Gini Impurity. Examine how the choice of splitting feature at each node impacts the tree’s accuracy, complexity, and
potential for overfitting.
Decision Trees split nodes using criteria that measure how well a feature separates classes. Information Gain and Gini Impurity
assess the purity improvement. Good splits increase accuracy but too many splits cause overfitting.
• Information Gain measures entropy reduction after a split.
• Gini Impurity measures probability of misclassification.
• Features with highest gain or lowest impurity are chosen.
• Early splits affect overall tree structure heavily.
• More splits improve training accuracy but may overfit.
• Selecting relevant features balances accuracy and complexity.
• Proper splitting reduces bias and variance.
6. Critically evaluate the problem of overfitting in Decision Trees and its impact on model generalization. Justify how
pruning techniques, such as pre-pruning and post-pruning, can effectively reduce overfitting and improve the model’s
performance on unseen data.
Overfitting occurs when the tree learns noise and specific patterns in training data, harming performance on new data. Pruning
removes unnecessary branches to reduce overfitting. Pre-pruning stops growth early; post-pruning trims grown trees.
Pre-pruning | Post-pruning
1. Stops tree growth early | 1. Grows full tree before pruning
2. Decides not to split if criteria fail | 2. Removes branches after full growth
3. Faster, less complex trees | 3. Simplifies complex trees
4. May miss important splits | 4. Usually more accurate
5. Controls overfitting upfront | 5. Requires additional validation
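Code sketch (Python/scikit-learn; uses the built-in breast-cancer data) comparing an unpruned tree with pre-pruning (max_depth, min_samples_leaf) and post-pruning (cost-complexity pruning via ccp_alpha):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

full = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
pre = DecisionTreeClassifier(max_depth=3, min_samples_leaf=10,   # pre-pruning: stop growth early
                             random_state=0).fit(X_train, y_train)
post = DecisionTreeClassifier(ccp_alpha=0.01,                    # post-pruning: trim the grown tree
                              random_state=0).fit(X_train, y_train)

for name, tree in [("full", full), ("pre-pruned", pre), ("post-pruned", post)]:
    print(f"{name}: depth={tree.get_depth()}, "
          f"train acc={tree.score(X_train, y_train):.3f}, "
          f"test acc={tree.score(X_test, y_test):.3f}")
# The unpruned tree usually fits training data perfectly but generalizes worse.
```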
7. Analyze the role of distance metrics in the K-Nearest Neighbors (KNN) algorithm and how they influence the
classification or regression outcomes. Examine why feature scaling through normalization or standardization is crucial
when using KNN, particularly in datasets with varying feature ranges.
Distance metrics (like Euclidean) measure closeness of points in KNN. If features vary widely in scale, unscaled features
dominate distance calculations. Scaling ensures each feature contributes fairly.
• Distance metrics compute similarity between points.
• Euclidean distance is common for continuous data.
• Larger-scale features can bias distances without scaling.
• Normalization scales data to 0–1 range.
• Standardization rescales to mean 0, variance 1.
• Scaling improves KNN classification and regression accuracy.
• Essential when features have different units or ranges.
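Code sketch (Python/scikit-learn; the built-in wine dataset has features with very different ranges) showing why scaling matters for KNN:

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

raw = KNeighborsClassifier(n_neighbors=5)        # Euclidean distance on unscaled features
scaled = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))

print("Without scaling:", cross_val_score(raw, X, y, cv=5).mean().round(3))
print("With scaling:   ", cross_val_score(scaled, X, y, cv=5).mean().round(3))
# Accuracy typically rises noticeably once features are standardized, because
# large-range features no longer dominate the distance calculation.
```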
8. Analyze how Logistic Regression models probabilities to predict binary outcomes. Examine how adjusting the decision
threshold influences model performance, particularly in handling imbalanced datasets.
Logistic Regression outputs probabilities using the sigmoid function. The decision threshold determines how probabilities map
to class labels. Adjusting threshold affects false positives and negatives, crucial for imbalanced classes.
• Outputs class probabilities between 0 and 1.
• Default threshold is usually 0.5 for classification.
• Lowering threshold increases sensitivity (recall), catches more positives.
• Raising threshold increases specificity, reduces false positives.
• Threshold tuning helps balance errors for different applications.
• Important for imbalanced datasets where minority class matters more.
• Threshold can be optimized based on ROC or business needs.
Module-5
✅ Q1. Compare and contrast the advantages and disadvantages of ensemble methods like Bagging, Boosting, and
Stacking.
Definition:
Ensemble methods combine multiple models to improve prediction accuracy. Bagging, Boosting, and Stacking each work
differently but aim to reduce errors by using the strengths of several models together.
Method | Advantages | Disadvantages
Bagging | 1. Reduces variance 2. Works well with unstable models like decision trees | 1. Less effective on high-bias models 2. May not improve accuracy for all tasks
Boosting | 1. Reduces bias 2. Focuses on difficult samples to improve accuracy | 1. Prone to overfitting 2. Computationally expensive due to sequential learning
Stacking | 1. Combines strengths of different models 2. Can outperform individual models | 1. Complex to implement 2. Risk of overfitting if not tuned properly
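Code sketch (Python/scikit-learn; uses the built-in breast-cancer data) instantiating all three ensemble types — the particular base models are assumptions chosen for illustration:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import (BaggingClassifier, GradientBoostingClassifier,
                              StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

models = {
    # Bagging: many trees on bootstrap samples, predictions averaged (reduces variance)
    "Bagging": BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=0),
    # Boosting: trees built sequentially, each focusing on previous errors (reduces bias)
    "Boosting": GradientBoostingClassifier(random_state=0),
    # Stacking: a meta-model combines the base models' predictions
    "Stacking": StackingClassifier(
        estimators=[("tree", DecisionTreeClassifier(random_state=0)),
                    ("boost", GradientBoostingClassifier(random_state=0))],
        final_estimator=LogisticRegression(max_iter=1000)),
}

for name, model in models.items():
    print(name, cross_val_score(model, X, y, cv=5).mean().round(3))
```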
✅ Q2. Examine the trade-offs of different imputation techniques on a dataset with missing values.
Definition:
Imputation techniques replace missing data to allow complete analysis. Common methods include Mean, Median, and KNN
imputation. Each method has strengths and weaknesses depending on the data type.
Technique | Advantages | Disadvantages
Mean Imputation | 1. Fast and simple 2. Works well with normal distribution | 1. Affected by outliers 2. Reduces data variability
Median Imputation | 1. Robust to outliers 2. Best for skewed data | 1. May ignore variable distribution 2. Less informative for normal data
KNN Imputation | 1. Preserves relationships 2. Provides more accurate results | 1. Slow on large datasets 2. Needs feature scaling
✅ Q3. Evaluate the performance of classification models using precision, recall, and F1-score.
Definition:
Precision, recall, and F1-score are evaluation metrics used to measure model performance, especially when classes are
imbalanced. They help us understand how well the model predicts the positive class.
Model | Precision | Recall | F1-Score | Best For
Logistic Regression | Moderate | Moderate | Moderate | Simple linear classification
Random Forest | High | High | High | Non-linear, complex datasets
SVM | High (with kernel) | Moderate | High if tuned | High-dimensional and smaller datasets
Detailed Points:
1. Precision = TP / (TP + FP): Good when false positives are costly.
2. Recall = TP / (TP + FN): Important when missing positives is risky.
3. F1-Score balances both; used in medical or fraud detection.
4. Logistic Regression: Easy to interpret but limited to linear problems.
5. Random Forest: Excellent general performance, handles outliers and non-linearities.
6. SVM: Performs well with proper kernels but needs more computation.
7. Choose metrics based on the problem type and data imbalance.
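Computing the three metrics with scikit-learn (the label lists are invented to keep the example tiny):

```python
from sklearn.metrics import classification_report, f1_score, precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]   # invented ground-truth labels
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]   # invented model predictions

print("Precision:", precision_score(y_true, y_pred))  # TP / (TP + FP) = 4/5 = 0.8
print("Recall:   ", recall_score(y_true, y_pred))     # TP / (TP + FN) = 4/5 = 0.8
print("F1-score: ", f1_score(y_true, y_pred))         # harmonic mean of the two = 0.8
print(classification_report(y_true, y_pred))          # per-class summary table
```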
✅ Q4. Analyze the impact of hyperparameter tuning in Random Search in deep learning models.
Definition:
Random Search is a hyperparameter tuning method where random combinations of hyperparameters are selected and tested to
find the best configuration for a model.
Key Points:
1. Hyperparameters include learning rate, batch size, number of layers, etc.
2. Random Search randomly samples values instead of checking every possibility.
3. It often finds better models faster than grid search.
4. It saves computation time by skipping unimportant combinations.
5. Can explore wide spaces, helpful for deep learning where tuning is complex.
6. Needs enough trials to find good settings; performance improves with iterations.
7. It is especially useful when some parameters are more important than others.
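Code sketch (Python/scikit-learn, illustrative only — it uses RandomizedSearchCV with a small MLP neural network as a stand-in for a full deep-learning framework; the parameter ranges are assumptions):

```python
from scipy.stats import loguniform
from sklearn.datasets import load_digits
from sklearn.model_selection import RandomizedSearchCV
from sklearn.neural_network import MLPClassifier

X, y = load_digits(return_X_y=True)

# Hyperparameter space: random combinations are sampled instead of a full grid
param_dist = {
    "hidden_layer_sizes": [(32,), (64,), (64, 32)],
    "learning_rate_init": loguniform(1e-4, 1e-1),
    "alpha": loguniform(1e-5, 1e-2),
}

search = RandomizedSearchCV(
    MLPClassifier(max_iter=300, random_state=0),
    param_distributions=param_dist,
    n_iter=10,          # only 10 random trials instead of every combination
    cv=3,
    random_state=0,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```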
✅ Q5. A regression analysis between apples (y) and oranges (x) resulted in the line:
y = 100 + 2x.
Predict the implication if oranges are increased by 1.
Definition:
In a linear regression equation y = a + bx, 'a' is the intercept and 'b' is the slope. The slope tells us how much y (apples) will
change when x (oranges) increases by one unit.
Key Points:
1. The slope b = 2, so for each 1 unit increase in oranges, apples increase by 2 units.
2. If x increases by 1, the new value is y = 100 + 2(x + 1) = (100 + 2x) + 2, i.e. the original y plus 2.
3. So, apples increase by 2 units when oranges increase by 1 unit.
4. The intercept 100 shows apple value when orange count is zero.
5. The model shows a positive linear relationship between apples and oranges.
6. It's a predictive model, meaning change in input (oranges) helps estimate output (apples).
7. Useful in market analysis or pricing strategy when variables are related.
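Tiny Python check of the slope interpretation:

```python
def apples(oranges):
    return 100 + 2 * oranges        # fitted line: y = 100 + 2x

x = 30                              # any starting number of oranges
print(apples(x + 1) - apples(x))    # 2 -> apples rise by 2 for every extra orange
```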