B Tech-AIML-question bank-2 Answer Key
Question Bank-2
1. How does constraint propagation contribute to solving Constraint Satisfaction Problems (CSPs)?
Concept: Reduces the search space by enforcing constraints across variables to simplify the problem.
Example: Sudoku—if a number is assigned to a cell, it cannot appear in the same row, column, or box.
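A minimal sketch of this propagation step for Sudoku, assuming candidate digits are tracked as a set per cell (the helper names are illustrative):
# Constraint propagation sketch for Sudoku: assigning a digit to a cell
# removes that digit from the candidate sets of every peer cell.
def peers(cell):
    """Cells sharing a row, column, or 3x3 box with the given cell."""
    r, c = cell
    row = {(r, j) for j in range(9)}
    col = {(i, c) for i in range(9)}
    box = {(3 * (r // 3) + i, 3 * (c // 3) + j) for i in range(3) for j in range(3)}
    return (row | col | box) - {cell}

def assign(domains, cell, value):
    """Assign value to cell and propagate the constraint to all peers."""
    domains[cell] = {value}
    for p in peers(cell):
        domains[p].discard(value)      # value can no longer appear in this row/column/box
        if not domains[p]:
            return False               # a peer has no candidates left: contradiction
    return True

# Usage: start with all digits possible in every cell, then assign one value.
domains = {(i, j): set(range(1, 10)) for i in range(9) for j in range(9)}
assign(domains, (0, 0), 5)
Repeated application of this rule, combined with search, is what shrinks the search space.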
2. Explain the importance of handling missing data in data preprocessing. Discuss two common imputation techniques.
Importance: Unhandled missing values can bias estimates, shrink the usable dataset, and break algorithms that cannot accept NaNs, so handling them preserves model accuracy.
Techniques: Mean imputation, KNN imputation.
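A brief sketch of both techniques, assuming scikit-learn is available (any equivalent implementation would do):
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer

X = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, np.nan], [4.0, 5.0]])

# Mean imputation: each NaN is replaced by its column mean.
mean_filled = SimpleImputer(strategy='mean').fit_transform(X)

# KNN imputation: each NaN is replaced using the k most similar rows.
knn_filled = KNNImputer(n_neighbors=2).fit_transform(X)

print(mean_filled)
print(knn_filled)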
3. What is the difference between supervised and unsupervised learning? Give an example of a problem that would be best addressed by each type of learning.
Supervised learning trains on labelled data; e.g., classifying emails as spam or not spam. Unsupervised learning finds structure in unlabelled data; e.g., segmenting customers into groups. In either setting, very high-dimensional feature spaces reduce model efficiency; PCA and feature selection are common remedies.
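An illustrative sketch, assuming scikit-learn, contrasting the two settings on toy data:
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression

X, y = make_blobs(n_samples=150, centers=3, random_state=0)

# Supervised: the labels y are used during training (e.g., spam vs. not spam).
clf = LogisticRegression(max_iter=1000).fit(X, y)

# Unsupervised: only X is used; the algorithm discovers groups on its own.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

print(clf.predict(X[:5]))
print(km.labels_[:5])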
Explain the bias-variance trade-off in machine learning. How does it relate to model complexity?
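Concept: Bias is error from a model that is too simple to capture the underlying pattern (underfitting); variance is error from a model so complex that it fits noise in the training data (overfitting). As model complexity increases, bias falls while variance rises, so the best model balances the two to minimize total error. A small sketch, assuming scikit-learn, showing the effect of tree depth:
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for depth in (1, 3, 20):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    # Depth 1: high bias (underfits). Depth 20: high variance (train score >> test score).
    print(depth, round(tree.score(X_tr, y_tr), 3), round(tree.score(X_te, y_te), 3))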
| Issue | Description | Affected Records | Cleaning Suggestions |
| --- | --- | --- | --- |
| Missing Values | Missing movie rating | MovieID 10 | Impute the missing value using the genre average or median (Comedy = 3.6). |
| Inconsistent Data | Unusual release year for Sci-Fi movie (2017) | MovieID 8 | Check movie title for authenticity; adjust the year if metadata is available. |
| Duplicate Data | None detected | N/A | N/A |
| Outliers | None evident based on ratings (range 0 to 5) | N/A | N/A |
| Data Type Issues | None detected | N/A | N/A |
Cleaning Techniques
1. Handle Missing Values:
o MovieID 10 (Missing Rating):
Use the genre average (Comedy = 3.6) or assign a default like 3.5 as the likely
median for a comedy.
df.loc[df['MovieID'] == 10, 'Rating'] = 3.6
2. Fix Inconsistent Release Year:
o MovieID 8 (Release Year 2017):
Cross-check with the metadata or movie database.
Likely correction: Set it to a year matching the genre or mark it as 2022 for
consistency.
3. Ensure Data Types are Correct:
o Confirm that Rating is a floating-point column and ReleaseYear is an integer.
4. Data Validation:
o Ensure that Ratings are within valid bounds (0 to 5).
o Check for outliers in ReleaseYear (a combined pandas sketch of steps 1-4 follows).
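A combined pandas sketch of steps 1-4 above; the column names (MovieID, Genre, Rating, ReleaseYear) and the source file name are assumptions:
import pandas as pd

df = pd.read_csv('movies.csv')   # hypothetical file holding the movie dataset

# 1. Missing rating for MovieID 10: impute with the Comedy genre average.
comedy_avg = df.loc[df['Genre'] == 'Comedy', 'Rating'].mean()
df.loc[df['MovieID'] == 10, 'Rating'] = df.loc[df['MovieID'] == 10, 'Rating'].fillna(comedy_avg)

# 2. Inconsistent release year for MovieID 8: correct after cross-checking metadata.
df.loc[df['MovieID'] == 8, 'ReleaseYear'] = 2022

# 3. Ensure data types are correct.
df['Rating'] = df['Rating'].astype(float)
df['ReleaseYear'] = df['ReleaseYear'].astype(int)

# 4. Validate that ratings stay within the 0-5 range and inspect ReleaseYear outliers.
assert df['Rating'].between(0, 5).all()
print(df['ReleaseYear'].describe())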
Explain the importance of feature scaling in machine learning. Compare and contrast normalization and
standardization. Provide an example where normalization would be preferred over standardization.
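Feature scaling keeps features with large numeric ranges from dominating distance-based and gradient-based learners. Normalization (min-max scaling) maps each feature to a fixed range such as [0, 1]; standardization (z-score scaling) rescales to zero mean and unit variance. Normalization is preferred when the data has known, hard bounds and no Gaussian assumption is needed, e.g., pixel intensities in 0-255. A minimal sketch, assuming scikit-learn:
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

pixels = np.array([[0.0], [64.0], [128.0], [255.0]])      # bounded 0-255 intensities

normalized = MinMaxScaler().fit_transform(pixels)          # mapped to [0, 1]
standardized = StandardScaler().fit_transform(pixels)      # zero mean, unit variance

print(normalized.ravel())
print(standardized.ravel())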
import pandas as pd
# Sample dataset
data = {
'CustomerID': [101, 102, 101, 103, 102, 104, 101, 105, 104, 102],
'ProductCategory': ['Electronics', 'Clothing', 'Electronics', 'Furniture', 'Clothing',
'Electronics', 'Furniture', 'Electronics', 'Clothing', 'Furniture'],
'Price': [100.0, 50.0, 120.0, 250.0, 60.0, 110.0, 200.0, 90.0, 55.0, 300.0]
}
# Creating a DataFrame
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)
# (c) Group the transactions by product category and calculate the average price for each category
average_price_per_category = df.groupby('ProductCategory')['Price'].mean().reset_index()
average_price_per_category.columns = ['ProductCategory', 'AveragePrice']
print("\nAverage Price per Product Category:")
print(average_price_per_category)
You are given a dataset containing customer information, including age, income, and email addresses.
Some age values are missing, some income values are negative (representing errors), and some email
addresses are invalid. Describe a step-by-step process for cleaning this data using Python libraries like
Pandas and NumPy. Include specific examples of code snippets you would use for imputation (for
missing ages), handling negative incomes, and validating email addresses.
Approach (using pandas, NumPy, and the re module):
1. Missing ages: impute with the median of the age column.
2. Negative incomes: set negative values to NaN and impute with the median income.
3. Invalid emails: use a regex to validate email formats and drop invalid entries.
A code sketch of these steps follows.
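A hedged sketch of the three steps; the column names (Age, Income, Email) and the file name are assumptions:
import re
import numpy as np
import pandas as pd

df = pd.read_csv('customers.csv')   # hypothetical customer file

# 1. Impute missing ages with the median age.
df['Age'] = df['Age'].fillna(df['Age'].median())

# 2. Negative incomes are errors: set them to NaN, then impute with the median income.
df.loc[df['Income'] < 0, 'Income'] = np.nan
df['Income'] = df['Income'].fillna(df['Income'].median())

# 3. Validate email addresses with a simple regex and drop invalid rows.
email_pattern = r'^[\w\.\+\-]+@[\w\-]+\.[\w\.\-]+$'
valid = df['Email'].astype(str).apply(lambda e: bool(re.match(email_pattern, e)))
df = df[valid]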
You are given a dataset of sales transactions (you would provide a sample dataset). Develop a Python
script using Pandas, NumPy, Matplotlib, and Seaborn to perform a comprehensive exploratory data
analysis. Your script should include:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Load the sales transactions (file name is a placeholder for the provided sample).
df = pd.read_csv('sales_transactions.csv')
df['Date'] = pd.to_datetime(df['Date'])
print("Initial Dataset:")
print(df.head())

# Missing values: inspect, then forward-fill CustomerID and fill Quantity with its median.
print(df.isna().sum())
df['CustomerID'] = df['CustomerID'].ffill()
df['Quantity'] = df['Quantity'].fillna(df['Quantity'].median())
print(df.isna().sum())

# Box plots to spot outliers in Quantity and Revenue.
plt.figure(figsize=(10, 5))
sns.boxplot(data=df[['Quantity', 'Revenue']])
plt.show()

# IQR rule: keep Revenue values within 1.5*IQR of the quartiles.
Q1 = df['Revenue'].quantile(0.25)
Q3 = df['Revenue'].quantile(0.75)
IQR = Q3 - Q1
df = df[(df['Revenue'] >= Q1 - 1.5 * IQR) & (df['Revenue'] <= Q3 + 1.5 * IQR)]
print(df)
11. You are given a dataset of sales transactions (you would provide a sample dataset). Develop a Python script using Pandas, NumPy, Matplotlib, and Seaborn to perform a comprehensive exploratory data analysis. Your script should include:
Data transformation (if necessary).
Visualizations to explore relationships between variables and identify trends.
A brief report summarizing your findings and insights from the data.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Load the sales transactions (file name is a placeholder for the provided sample);
# assumed columns: TransactionID, Date, Product, Quantity, Revenue.
df = pd.read_csv('sales_transactions.csv')
df['Date'] = pd.to_datetime(df['Date'])

# ------------------------------
# Data Transformation
# ------------------------------
# Example: Create a new column for Revenue per Quantity
df['Revenue_per_Quantity'] = df['Revenue'] / df['Quantity']
print(df[['TransactionID', 'Revenue_per_Quantity']])

# ------------------------------
# Visualization
# ------------------------------
# Revenue trend over time
plt.figure(figsize=(12, 6))
sns.lineplot(data=df, x='Date', y='Revenue')
plt.title("Revenue over Time")
plt.show()

# Bar plot for product-wise revenue
plt.figure(figsize=(10, 5))
sns.barplot(data=df, x='Product', y='Revenue', estimator=np.sum)
plt.title("Total Revenue by Product")
plt.show()

# Correlation heatmap of the numeric columns
plt.figure(figsize=(8, 6))
sns.heatmap(df[['Quantity', 'Revenue', 'Revenue_per_Quantity']].corr(), annot=True, cmap='coolwarm')
plt.title("Correlation Heatmap")
plt.show()

# ------------------------------
# Summary of Findings
# ------------------------------
print("4. Product A appears to generate higher overall revenue compared to other products.")
12. Compare and contrast the performance of Decision Trees and Support Vector Machines on a given classification problem (provide a small dataset). Discuss the advantages and disadvantages of each algorithm.
Performance Comparison

| Aspect | Decision Tree | Support Vector Machine (SVM) |
| --- | --- | --- |
| Training Speed | Faster for small datasets | Slower for large datasets |
| Model Complexity | Easy to interpret and visualize | Complex, harder to visualize |
| Decision Boundaries | Non-linear decision boundaries possible | Linear or non-linear (using kernels) |
| Overfitting Risk | High if not pruned | Less prone to overfitting |
| Sensitivity to Noise | Sensitive to noise | More robust due to margin optimization |
| Efficiency on Large Datasets | Good with large datasets | Slower for large datasets |
Advantages and Disadvantages
Decision Trees
Advantages:
Easy to understand and interpret.
Fast for small to medium datasets.
Supports non-linear decision boundaries.
Disadvantages:
Prone to overfitting if not pruned.
Sensitive to small data variations.
Can be unstable with changes in data.
Support Vector Machines (SVM)
Advantages:
Effective for high-dimensional spaces.
Robust to outliers due to margin optimization.
Can model complex decision boundaries using kernels.
Disadvantages:
Slower training for large datasets.
Requires careful kernel selection and parameter tuning.
Less interpretable compared to decision trees.
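A small illustrative comparison, assuming scikit-learn and using the built-in Iris data in place of the unspecified small dataset:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)   # interpretable, fast on small data
svm = SVC(kernel='rbf').fit(X_tr, y_tr)                         # margin-based, kernelized boundary

print("Decision Tree accuracy:", tree.score(X_te, y_te))
print("SVM (RBF) accuracy:", svm.score(X_te, y_te))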
13. Explain the concept of hierarchical clustering. How does it differ from k-means clustering?
Hierarchical clustering is a method of clustering that builds a hierarchy of clusters by either starting
from individual data points and merging them (agglomerative approach) or starting with a single cluster
and dividing it into smaller clusters (divisive approach).
Types of Hierarchical Clustering
1. Agglomerative Clustering (Bottom-Up):
o Starts with each data point as an individual cluster.
o Iteratively merges the closest clusters until a single cluster is formed.
2. Divisive Clustering (Top-Down):
o Starts with all points in a single cluster.
o Splits clusters recursively until each data point is its own cluster.
Key Steps in Agglomerative Clustering
1. Compute a distance matrix between all points (e.g., Euclidean distance).
2. Merge the two closest clusters.
3. Update the distance matrix to reflect the new cluster.
4. Repeat steps 2 and 3 until a single cluster remains (see the sketch below).
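A short sketch of these steps using SciPy's hierarchical-clustering utilities (the choice of library is an assumption):
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

X = np.array([[1, 2], [1, 4], [5, 8], [6, 8], [9, 1]])

# linkage() runs agglomerative clustering: it repeatedly merges the two closest
# clusters (Ward's criterion here) and records every merge in Z.
Z = linkage(X, method='ward')

# Cut the resulting hierarchy into a flat assignment of 2 clusters.
labels = fcluster(Z, t=2, criterion='maxclust')
print(labels)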
Differences Between Hierarchical Clustering and K-Means

| Aspect | Hierarchical Clustering | K-Means Clustering |
| --- | --- | --- |
| Cluster Formation | Builds a hierarchy | Partitions data into k clusters |
| Input Parameters | No need for predefined k | Requires k beforehand |
| Algorithm Type | Deterministic | Non-deterministic (depends on initialization) |
| Data Structure | Dendrogram | Centroid-based clustering |
| Performance | Slower for large datasets | Faster for large datasets |
| Scalability | Poor for large datasets | Good for large datasets |
14. Explain the concept of boosting. How does it differ from bagging? Describe the AdaBoost algorithm and its key steps.
Concept of Boosting
Boosting is an ensemble learning technique that combines weak learners (usually simple models like
decision trees) sequentially to create a strong predictive model. Each weak learner focuses on improving
the mistakes made by the previous ones.
Key Idea of Boosting
Assign higher weights to data points that were misclassified by previous models.
Train subsequent models to correct these errors.
Aggregate the predictions of all weak learners for the final decision.
Boosting vs. Bagging

| Aspect | Boosting | Bagging |
| --- | --- | --- |
| Model Combination | Sequential | Parallel |
| Focus | Correcting errors of previous learners | Reducing variance by averaging predictions |
| Complexity | Higher due to sequential training | Lower due to parallel training |
| Overfitting Risk | Higher if not regularized | Lower due to averaging |
| Example Algorithms | AdaBoost, Gradient Boosting | Random Forest, Bootstrap Aggregation |
AdaBoost (Adaptive Boosting) Algorithm
AdaBoost is one of the earliest and most popular boosting algorithms. It focuses on improving the
classification performance by adjusting weights based on misclassification.
Key Steps of the AdaBoost Algorithm
1. Initialize Weights:
Assign equal weights to all data points:
w_i = 1/N, where N is the total number of training examples.
2. Train Weak Learner:
Fit a weak learner (like a decision stump) to the weighted dataset.
3. Compute Error:
Calculate the weighted error of the weak learner: ε_t = Σ_i w_i · 1[h_t(x_i) ≠ y_i], i.e., the total weight of the misclassified points (with weights normalized to sum to 1).
4. Compute Alpha (Model Weight):
Calculate the contribution of the weak learner: α_t = (1/2) · ln((1 − ε_t) / ε_t).
5. Update Weights:
Update the weights of the data points: w_i ← w_i · exp(−α_t · y_i · h_t(x_i)), so misclassified points gain weight and correctly classified points lose weight.
Normalize the weights to sum up to 1.
6. Repeat:
Repeat steps 2 to 5 for a specified number of iterations or until error converges.
Final Model:
Combine the weak learners' predictions using their weights α_t: H(x) = sign(Σ_t α_t · h_t(x)). A usage sketch follows.
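A minimal usage sketch, assuming scikit-learn, whose AdaBoostClassifier uses decision stumps as its default weak learners:
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# 50 weak learners trained sequentially, each reweighting the data toward
# the examples the previous learners misclassified.
model = AdaBoostClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)
print("Test accuracy:", model.score(X_te, y_te))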