
Artificial Intelligence and Machine Learning

Question Bank-2
1. How does constraint propagation contribute to solving Constraint Satisfaction Problems (CSPs)?
Concept: Constraint propagation reduces the search space by enforcing constraints across variables, pruning candidate values that cannot appear in any solution.
Example: Sudoku—once a number is assigned to a cell, it is removed from the candidates of every other cell in the same row, column, and 3x3 box.
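A minimal Python sketch of this idea for Sudoku, assuming each cell keeps a set of candidate values (the grid representation and helper names below are illustrative, not part of the original answer):

def peers(r, c):
    """Return the coordinates that share a row, column, or box with (r, c)."""
    same_row = {(r, j) for j in range(9)}
    same_col = {(i, c) for i in range(9)}
    br, bc = 3 * (r // 3), 3 * (c // 3)
    same_box = {(i, j) for i in range(br, br + 3) for j in range(bc, bc + 3)}
    return (same_row | same_col | same_box) - {(r, c)}

def assign(domains, r, c, value):
    """Assign `value` to cell (r, c) and propagate the elimination to its peers."""
    domains[(r, c)] = {value}
    queue = [(r, c)]
    while queue:
        cr, cc = queue.pop()
        fixed = next(iter(domains[(cr, cc)]))
        for p in peers(cr, cc):
            if fixed in domains[p] and len(domains[p]) > 1:
                domains[p].discard(fixed)
                if len(domains[p]) == 1:  # a new singleton triggers further propagation
                    queue.append(p)
    return domains

# Usage: every cell starts with candidates 1-9; assigning 5 to (0, 0)
# removes 5 from the candidate sets of all 20 peers of that cell.
domains = {(i, j): set(range(1, 10)) for i in range(9) for j in range(9)}
assign(domains, 0, 0, 5)
print(5 in domains[(0, 1)])  # False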
2. Explain the importance of handling missing data in data preprocessing. Discuss two common imputation techniques.
 Importance: Missing values can bias estimates and degrade model accuracy; many algorithms cannot handle them at all.
 Techniques: Mean imputation (replace missing values with the column mean) and KNN imputation (fill a missing value from the values of the k most similar rows).
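A short sketch of both techniques using scikit-learn imputers (the toy array below is made up for illustration):

import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer

X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [7.0, 6.0],
              [4.0, np.nan]])

# Mean imputation: each NaN is replaced by its column mean
mean_imputed = SimpleImputer(strategy='mean').fit_transform(X)

# KNN imputation: each NaN is replaced using the k most similar rows
knn_imputed = KNNImputer(n_neighbors=2).fit_transform(X)

print(mean_imputed)
print(knn_imputed)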
3. What is the difference between supervised and unsupervised learning? Give an example of a problem that would be best addressed by each type of learning.
 Supervised learning: The model learns from labelled data (input-output pairs). Example: classifying emails as spam or not spam.
 Unsupervised learning: The model finds structure in unlabelled data. Example: grouping customers into segments based on purchasing behaviour.
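A brief contrast of the two settings in scikit-learn (synthetic data, not from the question bank):

from sklearn.datasets import make_classification, make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

# Supervised: features X come with labels y, and the model learns the mapping X -> y
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
clf = LogisticRegression().fit(X, y)
print("Predicted labels:", clf.predict(X[:5]))

# Unsupervised: only features are available; the model discovers groups on its own
X_unlabelled, _ = make_blobs(n_samples=200, centers=3, random_state=0)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_unlabelled)
print("Cluster assignments:", km.labels_[:5])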
4. Explain the bias-variance trade-off in machine learning. How does it relate to model complexity?
Bias: Error from overly simple assumptions in the model (underfitting).
Variance: Error from the model's sensitivity to fluctuations in the training data (overfitting).
Relation to model complexity: Complex models tend to have low bias but high variance, while simple models have high bias but low variance; the goal is to choose a complexity that balances the two.
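An illustrative sketch (synthetic data; the polynomial degrees are chosen arbitrarily) of how training and test error move with model complexity:

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=200)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for degree in (1, 4, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    train_err = mean_squared_error(y_train, model.predict(X_train))
    test_err = mean_squared_error(y_test, model.predict(X_test))
    # a low degree underfits (high bias); a very high degree can chase noise
    # in the training set (high variance), so its test error tends to grow
    print(f"degree={degree:2d}  train MSE={train_err:.3f}  test MSE={test_err:.3f}")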
5. How does k-means clustering differ from hierarchical clustering? Discuss a scenario where k-means would be more appropriate.
K-Means: Partition-based and fast, but requires the number of clusters to be specified in advance.
Hierarchical: Builds a tree (dendrogram) of clusters and does not require a predefined number of clusters.
Scenario for K-Means: Large datasets where the number of clusters is known or easy to estimate, such as segmenting customers into a fixed number of groups.
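A minimal sketch of the k-means scenario (synthetic "customer" features generated with make_blobs; the cluster count is assumed to be known):

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

# Synthetic customer features (e.g., income vs. spending) with 3 natural groups
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=1.0, random_state=42)

# k-means needs k up front, but is fast and scales well to large datasets
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
print("Cluster sizes:", np.bincount(kmeans.labels_))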
6. Given a dataset, perform a basic EDA. Identify potential data quality issues and suggest appropriate cleaning techniques.
Movie rating Dataset
MovieID,Title,Genre,Rating,ReleaseYear
1,Action Movie 1,Action,4.5,2020
2,Comedy Nights,Comedy,3.8,2021
3,Drama Story,Drama,4.2,2019
4,Sci-Fi World,Sci-Fi,4.0,2022
5,Action 2,Action,4.7,2023
6,Comedy Club,Comedy,3.5,2020
7,Drama Time,Drama,4.5,2021
8,Sci-Fi 2049,Sci-Fi,4.8,2017 <-- Inconsistent Release Year
9,Action 3,Action,3.9,2022
10,Comedy Central,Comedy, ,2023 <-- Missing Rating
11,Drama Life,Drama,4.3,2020
12,Sci-Fi X,Sci-Fi,4.1,2021

Issue | Description | Affected Records | Cleaning Suggestions
Missing Values | Missing movie rating | MovieID 10 | Impute the missing value using the genre average or median (Comedy ≈ 3.6).
Inconsistent Data | Unusual release year for Sci-Fi movie (2017) | MovieID 8 | Check the movie title for authenticity; adjust the year if metadata is available.
Duplicate Data | None detected | N/A | N/A
Outliers | None evident based on ratings (range 0 to 5) | N/A | N/A
Data Type Issues | None detected | N/A | N/A

Cleaning Techniques
1. Handle Missing Values:
   o MovieID 10 (Missing Rating): impute with the genre average (Comedy ≈ 3.6) or assign a default such as 3.5 as a likely median for a comedy.
     df.loc[df['MovieID'] == 10, 'Rating'] = 3.6
2. Fix Inconsistent Release Year:
   o MovieID 8 (Release Year 2017): cross-check with metadata or a movie database; if the year is confirmed to be wrong, correct it to a value consistent with the rest of the catalogue (e.g., 2022).
3. Ensure Data Types are Correct:
   o Confirm that Rating is a floating-point column and ReleaseYear is an integer.
4. Data Validation (a consolidated Pandas sketch follows this list):
   o Ensure that Ratings are within valid bounds (0 to 5).
   o Check for outliers in ReleaseYear.
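A possible consolidation of these steps in Pandas (the CSV filename is illustrative; the column names follow the sample dataset above):

import pandas as pd

df = pd.read_csv("movie_ratings.csv")  # hypothetical export of the table above

# Step 1: impute the missing rating with its genre average
df['Rating'] = pd.to_numeric(df['Rating'], errors='coerce')
df['Rating'] = df['Rating'].fillna(df.groupby('Genre')['Rating'].transform('mean'))

# Step 3: enforce the expected data types
df['Rating'] = df['Rating'].astype(float)
df['ReleaseYear'] = df['ReleaseYear'].astype(int)

# Step 4: validate rating bounds and flag suspicious release years
# (the year range below is illustrative, based on the sample data)
assert df['Rating'].between(0, 5).all(), "Rating outside the valid [0, 5] range"
print(df[~df['ReleaseYear'].between(2019, 2023)])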

7. Explain the importance of feature scaling in machine learning. Compare and contrast normalization and standardization. Provide an example where normalization would be preferred over standardization.

Feature scaling is crucial in machine learning for the following reasons:


1. Improves Model Performance: Many machine learning models rely on the distance between
data points (e.g., k-NN, SVMs). If features are on different scales, the model may give undue
importance to features with larger scales.
2. Speeds Up Convergence: Gradient descent converges faster when features are scaled because
it avoids the problem of zig-zagging during optimization.
3. Ensures Equal Weighting: Without scaling, features with larger numerical ranges may
dominate the training process.
4. Required for Distance-Based Models: Algorithms such as k-means and hierarchical clustering
are highly sensitive to the scale of features.
Normalization vs. Standardization

Aspect | Normalization (Min-Max Scaling) | Standardization (Z-score Scaling)
Formula | x′ = (x − x_min) / (x_max − x_min) | x′ = (x − μ) / σ
Range | Scales data to [0, 1] | Centers data to mean 0 and standard deviation 1
Effect on Outliers | Sensitive to outliers | Less sensitive, since scaling uses the standard deviation
Use Case | When data is bounded or needs to fit within a specific range | When data follows a Gaussian (normal) distribution or the model assumes normally distributed data
Examples | Image pixel scaling | Linear Regression, Logistic Regression

Example where normalization is preferred: image pixel intensities are naturally bounded (0-255), so min-max scaling to [0, 1] preserves that bounded range.
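As a hedged illustration of the two transforms (assuming scikit-learn; the toy column below is made up):

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[10.0], [20.0], [30.0], [1000.0]])  # note the large value

print(MinMaxScaler().fit_transform(X).ravel())    # values rescaled to the [0, 1] range
print(StandardScaler().fit_transform(X).ravel())  # values centered to mean 0 with unit variance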
8. Write Python code using Pandas to perform the following tasks: (a) Calculate the total spending for each customer. (b) Identify the top 10 most frequent customers. (c) Group the transactions by product category and calculate the average price for each category.

import pandas as pd

# Sample dataset
data = {
    'CustomerID': [101, 102, 101, 103, 102, 104, 101, 105, 104, 102],
    'ProductCategory': ['Electronics', 'Clothing', 'Electronics', 'Furniture', 'Clothing',
                        'Electronics', 'Furniture', 'Electronics', 'Clothing', 'Furniture'],
    'Price': [100.0, 50.0, 120.0, 250.0, 60.0, 110.0, 200.0, 90.0, 55.0, 300.0]
}

# Creating a DataFrame
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)

# (a) Calculate the total spending for each customer


total_spending = df.groupby('CustomerID')['Price'].sum().reset_index()
total_spending.columns = ['CustomerID', 'TotalSpending']
print("\nTotal Spending for Each Customer:")
print(total_spending)

# (b) Identify the top 10 most frequent customers


customer_frequency = df['CustomerID'].value_counts().head(10).reset_index()
customer_frequency.columns = ['CustomerID', 'Frequency']
print("\nTop 10 Most Frequent Customers:")
print(customer_frequency)

# (c) Group the transactions by product category and calculate the average price for each category
average_price_per_category = df.groupby('ProductCategory')['Price'].mean().reset_index()
average_price_per_category.columns = ['ProductCategory', 'AveragePrice']
print("\nAverage Price per Product Category:")
print(average_price_per_category)
9. You are given a dataset containing customer information, including age, income, and email addresses. Some age values are missing, some income values are negative (representing errors), and some email addresses are invalid. Describe a step-by-step process for cleaning this data using Python libraries like Pandas and NumPy. Include specific examples of code snippets you would use for imputation (for missing ages), handling negative incomes, and validating email addresses.

Step-by-Step Data Cleaning Process

1. Load the Dataset

Start by reading the dataset into a Pandas DataFrame.

import pandas as pd

# Load the dataset


df = pd.read_csv("customer_data.csv")

# Check the first few rows


print(df.head())

2. Handling Missing Age Values

Strategy: Impute missing ages using the median of the age column.
import numpy as np

# Check for missing values in the 'age' column


print("Missing Age Values:", df['age'].isna().sum())

# Impute missing ages with the median


df['age'] = df['age'].fillna(df['age'].median())

3. Handling Negative Income Values

Strategy: Set negative values to NaN and impute with the median income.

Ensure all values are positive afterward.

# Replace negative income values with NaN


df.loc[df['income'] < 0, 'income'] = np.nan
# Impute missing income with the median
df['income'] = df['income'].fillna(df['income'].median())

4. Validating Email Addresses

Strategy: Use regex to validate email formats and drop invalid entries.
import re

# Define a function to validate email addresses


def is_valid_email(email):
    pattern = r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'
    return bool(re.match(pattern, str(email)))

# Create a mask for valid emails


df['valid_email'] = df['email'].apply(is_valid_email)

# Remove rows with invalid emails


df = df[df['valid_email']].drop(columns=['valid_email'])

5. Detect and Handle Outliers (Optional)

Use the IQR method for further cleanup.

# Compute IQR for income


Q1 = df['income'].quantile(0.25)
Q3 = df['income'].quantile(0.75)
IQR = Q3 - Q1

# Remove outliers outside 1.5 * IQR


df = df[~((df['income'] < (Q1 - 1.5 * IQR)) | (df['income'] > (Q3 + 1.5 * IQR)))]

6. Save the Cleaned Data


# Save the cleaned DataFrame
df.to_csv("cleaned_customer_data.csv", index=False)

10. You are given a dataset of sales transactions (you would provide a sample dataset). Develop a Python script using Pandas, NumPy, Matplotlib, and Seaborn to perform a comprehensive exploratory data analysis. Your script should include:
 Data loading and initial inspection.
 Handling missing values using appropriate imputation techniques.
 Identifying and handling outliers using suitable methods.

import pandas as pd

import numpy as np

import matplotlib.pyplot as plt

import seaborn as sns

# Illustrative sample data (the real transactions file would be provided;
# the column names and values below are assumptions for demonstration)
sample_data = {
    'TransactionID': [1, 2, 3, 4, 5, 6, 7, 8],
    'Date': ['2023-01-05', '2023-01-06', '2023-01-07', '2023-01-08',
             '2023-01-09', '2023-01-10', '2023-01-11', '2023-01-12'],
    'CustomerID': [101, 102, None, 103, 101, 104, 105, 102],
    'Product': ['A', 'B', 'A', 'C', 'A', 'B', 'A', 'C'],
    'Quantity': [2, 1, None, 5, 3, 2, 4, 1],
    'Revenue': [200.0, 150.0, 220.0, 250.0, 300.0, 180.0, 350.0, 5000.0],
}

# Load the dataset
df = pd.DataFrame(sample_data)

# Convert Date column to datetime
df['Date'] = pd.to_datetime(df['Date'])

print("Initial Dataset:")

print(df.head())

# Handling Missing Values

print("\nMissing Values Before Handling:")

print(df.isna().sum())

# Impute missing CustomerID with forward fill
df['CustomerID'] = df['CustomerID'].ffill()

# Impute missing Quantity with the median
df['Quantity'] = df['Quantity'].fillna(df['Quantity'].median())

print("\nMissing Values After Handling:")

print(df.isna().sum())

# Identifying and Handling Outliers

plt.figure(figsize=(10, 5))

sns.boxplot(data=df[['Quantity', 'Revenue']])

plt.title("Boxplot for Quantity and Revenue")

plt.show()

# Handle outliers using the IQR method for Revenue

Q1 = df['Revenue'].quantile(0.25)

Q3 = df['Revenue'].quantile(0.75)

IQR = Q3 - Q1

lower_bound = Q1 - 1.5 * IQR

upper_bound = Q3 + 1.5 * IQR

# Filter out outliers


df = df[(df['Revenue'] >= lower_bound) & (df['Revenue'] <= upper_bound)]

print("\nDataset after outlier handling:")

print(df)

11. You are given a dataset of sales transactions (you would provide a sample dataset). Develop a Python script using Pandas, NumPy, Matplotlib, and Seaborn to perform a comprehensive exploratory data analysis. Your script should include:
 Data transformation (if necessary).
 Visualizations to explore relationships between variables and identify trends.
 A brief report summarizing your findings and insights from the data.

import pandas as pd

import numpy as np

import matplotlib.pyplot as plt

import seaborn as sns


# ------------------------------

# Data Transformation

# ------------------------------

# Example: Create a new column for Revenue per Quantity

print("\nData Transformation: Revenue per Quantity")

df['Revenue_per_Quantity'] = df['Revenue'] / df['Quantity']

print(df[['TransactionID', 'Revenue_per_Quantity']])

# ------------------------------

# Visualization

# ------------------------------

# Line plot for revenue trend over time

plt.figure(figsize=(12, 6))

sns.lineplot(x='Date', y='Revenue', data=df, marker='o')

plt.title("Revenue Trend Over Time")

plt.show()
# Bar plot for product-wise revenue

plt.figure(figsize=(10, 5))

sns.barplot(x='Product', y='Revenue', data=df, estimator=np.sum)

plt.title("Total Revenue by Product")

plt.show()

# Heatmap for correlation between numerical features

plt.figure(figsize=(8, 6))

sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='coolwarm', fmt='.2f')

plt.title("Correlation Heatmap")

plt.show()

# ------------------------------

# Report Findings and Insights

# ------------------------------

print("\nInsights from the Analysis:")

print("1. No more missing values in the dataset after appropriate imputation.")

print("2. Outliers in Revenue were detected and removed based on IQR.")

print("3. Quantity and Revenue show a positive correlation.")

print("4. Product A appears to generate higher overall revenue compared to other products.")

print("5. Revenue trend shows a steady increase over time.")

12. Compare and contrast the performance of Decision Trees and Support Vector Machines on a given classification problem (provide a small dataset). Discuss the advantages and disadvantages of each algorithm.

Performance Comparison

Aspect | Decision Tree | Support Vector Machine (SVM)
Training Speed | Faster for small datasets | Slower for large datasets
Model Complexity | Easy to interpret and visualize | Complex, harder to visualize
Decision Boundaries | Non-linear decision boundaries possible | Linear or non-linear (using kernels)
Overfitting Risk | High if not pruned | Less prone to overfitting
Sensitivity to Noise | Sensitive to noise | More robust due to margin optimization
Efficiency on Large Datasets | Good with large datasets | Slower for large datasets
Advantages and Disadvantages
Decision Trees
Advantages:
 Easy to understand and interpret.
 Fast for small to medium datasets.
 Supports non-linear decision boundaries.
Disadvantages:
 Prone to overfitting if not pruned.
 Sensitive to small data variations.
 Can be unstable with changes in data.
Support Vector Machines (SVM)
Advantages:
 Effective for high-dimensional spaces.
 Robust to outliers due to margin optimization.
 Can model complex decision boundaries using kernels.
Disadvantages:
 Slower training for large datasets.
 Requires careful kernel selection and parameter tuning.
 Less interpretable compared to decision trees.
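Since the question leaves the dataset open, the sketch below uses scikit-learn's built-in Iris data as a stand-in small dataset; the hyperparameters are illustrative, not prescribed by the answer key.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Train both classifiers on the same split and compare held-out accuracy
for name, model in [("Decision Tree", DecisionTreeClassifier(max_depth=3, random_state=0)),
                    ("SVM (RBF kernel)", SVC(kernel='rbf', C=1.0))]:
    model.fit(X_train, y_train)
    acc = accuracy_score(y_test, model.predict(X_test))
    print(f"{name}: test accuracy = {acc:.3f}")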
13. Explain the concept of hierarchical clustering. How does it differ from k-means clustering?

Hierarchical clustering is a method of clustering that builds a hierarchy of clusters by either starting
from individual data points and merging them (agglomerative approach) or starting with a single cluster
and dividing it into smaller clusters (divisive approach).
Types of Hierarchical Clustering
1. Agglomerative Clustering (Bottom-Up):
o Starts with each data point as an individual cluster.
o Iteratively merges the closest clusters until a single cluster is formed.
2. Divisive Clustering (Top-Down):
o Starts with all points in a single cluster.
o Splits clusters recursively until each data point is its own cluster.
Key Steps in Agglomerative Clustering
1. Compute a distance matrix between all points (e.g., Euclidean distance).
2. Merge the two closest clusters.
3. Update the distance matrix to reflect the new cluster.
4. Repeat steps 2 and 3 until a single cluster remains.
Differences Between Hierarchical Clustering and K-Means

Aspect | Hierarchical Clustering | K-Means Clustering
Cluster Formation | Builds a hierarchy | Partitions data into k clusters
Input Parameters | No need for a predefined k | Requires k beforehand
Algorithm Type | Deterministic | Non-deterministic (depends on initialization)
Data Structure | Dendrogram | Centroid-based clustering
Performance | Slower for large datasets | Faster for large datasets
Scalability | Poor for large datasets | Good for large datasets
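A small illustrative sketch of agglomerative clustering and its dendrogram using SciPy (synthetic two-cluster data; Ward linkage is chosen just for the example):

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=0, scale=0.5, size=(20, 2)),
               rng.normal(loc=5, scale=0.5, size=(20, 2))])

# Build the merge tree bottom-up (agglomerative) using Ward linkage
Z = linkage(X, method='ward')

# Cut the tree to obtain flat cluster labels (here: 2 clusters)
labels = fcluster(Z, t=2, criterion='maxclust')
print("Cluster sizes:", np.bincount(labels)[1:])

# The dendrogram visualizes the full hierarchy of merges
dendrogram(Z)
plt.title("Agglomerative Clustering Dendrogram")
plt.show()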
14. Explain the concept of boosting. How does it differ from bagging? Describe the AdaBoost algorithm and its key steps.
Concept of Boosting
Boosting is an ensemble learning technique that combines weak learners (usually simple models like
decision trees) sequentially to create a strong predictive model. Each weak learner focuses on improving
the mistakes made by the previous ones.
Key Idea of Boosting
 Assign higher weights to data points that were misclassified by previous models.
 Train subsequent models to correct these errors.
 Aggregate the predictions of all weak learners for the final decision.
Boosting vs. Bagging

Aspect | Boosting | Bagging
Model Combination | Sequential | Parallel
Focus | Correcting errors of previous learners | Reducing variance by averaging predictions
Complexity | Higher due to sequential training | Lower due to parallel training
Overfitting Risk | Higher if not regularized | Lower due to averaging
Example Algorithms | AdaBoost, Gradient Boosting | Random Forest, Bootstrap Aggregation
AdaBoost (Adaptive Boosting) Algorithm
AdaBoost is one of the earliest and most popular boosting algorithms. It focuses on improving the
classification performance by adjusting weights based on misclassification.
Key Steps of the AdaBoost Algorithm (labels taken as y_i ∈ {−1, +1})
1. Initialize Weights:
   Assign equal weights to all data points: w_i = 1/N, where N is the total number of training examples.
2. Train Weak Learner:
   Fit a weak learner h_t (such as a decision stump) to the weighted dataset.
3. Compute Error:
   Calculate the weighted error of the weak learner: ε_t = Σ_i w_i · 1(h_t(x_i) ≠ y_i).
4. Compute Alpha (Model Weight):
   Calculate the contribution of the weak learner: α_t = ½ · ln((1 − ε_t) / ε_t).
5. Update Weights:
   Increase the weights of misclassified points and decrease those of correctly classified ones: w_i ← w_i · exp(−α_t · y_i · h_t(x_i)), then normalize the weights to sum to 1.
6. Repeat:
   Repeat steps 2 to 5 for a specified number of iterations or until the error converges.
Final Model:
Combine the weak learners' predictions using their weights: H(x) = sign(Σ_t α_t · h_t(x)).
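A brief sketch of these steps using scikit-learn's AdaBoostClassifier (synthetic data; the number of estimators is illustrative, and the default weak learner is already a depth-1 decision stump):

from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 100 boosting rounds; each round reweights the data toward previous mistakes
ada = AdaBoostClassifier(n_estimators=100, random_state=0)
ada.fit(X_train, y_train)
print("Test accuracy:", ada.score(X_test, y_test))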

You might also like