AI QA
Give examples of the different types of AI.
BASED ON CAPABILITIES:
1. Artificial Narrow AI
Artificial Narrow Intelligence, also known as Weak AI (what we refer to as Narrow AI), is the
only type of AI that exists today. Any other form of AI is theoretical. It can be trained to
perform a single or narrow task, often far faster and better than a human mind can.
However, it can’t perform outside of its defined task. Instead, it targets a single subset of
cognitive abilities and advances in that spectrum. Siri, Amazon’s Alexa and IBM Watson® are
examples of Narrow AI. Even OpenAI’s ChatGPT is considered a form of Narrow AI because
it’s limited to the single task of text-based chat.
2. General AI
Artificial General Intelligence (AGI), also known as Strong AI, is today nothing more than a
theoretical concept. AGI can use previous learnings and skills to accomplish new tasks in a
different context without the need for human beings to train the underlying models. This
ability allows AGI to learn and perform any intellectual task that a human being can.
3. Super AI
Super AI is commonly referred to as artificial superintelligence and, like AGI, is strictly
theoretical. If ever realized, Super AI would think, reason, learn, make judgements and
possess cognitive abilities that surpass those of human beings.
Applications possessing Super AI capabilities would have evolved beyond merely understanding human sentiments and experiences: they would feel emotions, have needs, and possess beliefs and desires of their own.
BASED ON FUNCTIONALITIES:
1. Reactive Machine AI
Reactive machines are AI systems with no memory and are designed to perform a very
specific task. Since they can’t recollect previous outcomes or decisions, they only work with
presently available data. Reactive AI stems from statistical math and can analyze vast
amounts of data to produce a seemingly intelligent output.
2. Limited Memory AI
Unlike Reactive Machine AI, this form of AI can recall past events and outcomes and monitor
specific objects or situations over time. Limited Memory AI can use past- and present-
moment data to decide on a course of action most likely to help achieve a desired outcome.
However, while Limited Memory AI can use past data for a specific amount of time, it can’t
retain that data in a library of past experiences to use over a long-term period. As it’s trained
on more data over time, Limited Memory AI can improve in performance.
3. Theory of Mind AI
Theory of Mind AI is a functional class of AI that falls under General AI. Though an
unrealized form of AI today, AI with Theory of Mind functionality would understand the
thoughts and emotions of other entities. This understanding can affect how the AI interacts
with those around them. In theory, this would allow the AI to simulate human-like
relationships.
Because Theory of Mind AI could infer human motives and reasoning, it would personalize its
interactions with individuals based on their unique emotional needs and intentions. Theory
of Mind AI would also be able to understand and contextualize artwork and essays, which
today’s generative AI tools are unable to do.
4. Self-Aware AI
Self-Aware AI is the functional class for applications that would possess Super AI
capabilities. Like Theory of Mind AI, Self-Aware AI is strictly theoretical. If ever achieved, it
would have the ability to understand its own internal conditions and traits along with human
emotions and thoughts. It would also have its own set of emotions, needs and beliefs.
Model-Based Agents
1. A model-based agent utilizes condition-action rules: it works by finding a rule whose condition, based on the current situation, is satisfied.
2. Unlike the first type (the simple reflex agent), it can handle partially observable environments by tracking the situation and using a model of the world.
3. It consists of two important factors, which are Model and Internal State.
4. The model provides knowledge of how different things occur in the surroundings, so that the current situation can be understood and a condition can be formed. The agent performs actions based on this model.
5. The internal state uses the perceptual history to represent the current percept. The agent keeps track of this internal state, adjusting it with each percept, and stores it internally to maintain a structure that describes the unobserved parts of the world.
6. The state of the agent can be updated by gaining information about how the world evolves
and how the agent's action affects the world.
7. Example: A vacuum cleaner that uses sensors to detect dirt and obstacles and moves and
cleans based on a model.
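The sketch below is a minimal, hypothetical illustration of these ideas in Python (not any standard library): the agent keeps an internal state updated by percepts and applies a condition-action rule against that state.

class ModelBasedVacuum:
    def __init__(self):
        self.state = {}  # internal state: location -> believed status

    def update_state(self, location, percept):
        # Percepts adjust the internal model of the world.
        self.state[location] = percept  # e.g. "dirty" or "clean"

    def act(self, location, percept):
        self.update_state(location, percept)
        # Condition-action rule applied to the tracked state.
        if self.state[location] == "dirty":
            return "suck"
        return "move_right" if location == "A" else "move_left"

agent = ModelBasedVacuum()
print(agent.act("A", "dirty"))  # suck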
Goal-Based Agents
1. This type makes decisions on the basis of its goal, or desirable situations, so that it can choose actions that achieve the required goal.
2. It is an improvement over the model-based agent, because information about the goal is also included: knowing only the current state is not always sufficient, and knowledge of the goal makes the approach more effective.
3. The aim is to reduce the distance between the current state and the goal so that the best option can be chosen from multiple possibilities. Once the best way is found, the decision is represented explicitly, which makes the agent more flexible.
4. It considers different situations, called searching and planning, by evaluating long sequences of possible actions to confirm its ability to achieve the goal. This makes the agent proactive.
5. It can easily change its behavior if required.
6. Example: A chess-playing AI whose goal is winning the game.
Utility-Based Agents
1. A utility-based agent has utilities as its building blocks and is used when the best action or decision must be chosen from multiple alternatives.
2. It is an improvement over the goal-based agent: it considers not only the goal but also the way the goal is achieved, so that it can be reached in a quicker, safer, or cheaper way.
3. The extra component of utility, a measure of success at a particular state, is what distinguishes the utility-based agent.
4. It takes the agent's "happiness" into account: the utility indicates how desirable each outcome is, and the action with maximum utility is chosen. This degree of happiness is calculated by mapping a state onto a real number.
5. Mapping a state onto a real number via the utility function gives the efficiency of an action in achieving the goal.
6. Example: A delivery drone that delivers packages to customers efficiently while optimizing
factors like delivery time, energy consumption, and customer satisfaction.
Learning Agents
1. Learning agent, as the name suggests, has the capability to learn from past experiences and
takes actions or decisions based on learning capabilities. Example: A spam filter that learns
from user feedback.
2. It gains basic knowledge from past experience and uses that learning to act and adapt automatically.
3. It comprises four conceptual components, given as follows (a minimal sketch follows the list):
• Learning element: It makes improvements by learning from the environment.
• Critic: Critic provides feedback to the learning agent giving the performance measure of the
agent with respect to the fixed performance standard.
• Performance element: It selects the external action.
• Problem generator: This suggests actions that lead to new and informative experiences.
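A minimal sketch of how the four components might fit together, using the spam-filter example above (the class, names, and threshold-update rule are illustrative assumptions):

class LearningSpamAgent:
    def __init__(self):
        self.threshold = 0.5  # knowledge tuned by the learning element

    def performance_element(self, spam_score):
        # Selects the external action from current knowledge.
        return "spam" if spam_score > self.threshold else "inbox"

    def critic(self, action, user_feedback):
        # Measures performance against the fixed standard (the user's label).
        return 0 if action == user_feedback else 1

    def learning_element(self, spam_score, error):
        # Improves future behaviour using the critic's feedback.
        if error:
            self.threshold += 0.05 if spam_score > self.threshold else -0.05

    def problem_generator(self):
        # Suggests actions that lead to new, informative experiences.
        return "flag_borderline_mail_for_review"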
3. What is heuristic search?
Heuristic search techniques were conceived in the realm of artificial intelligence (AI) and are used to hunt for good, often near-optimal, solutions among multiple possible options. They are algorithms used for problem-solving and decision-making, particularly in
large and complex search spaces where exhaustively searching the entire space is
computationally infeasible.
Advantages of heuristic search techniques include:
• Speed: Owing to their nature of bypassing the need to evaluate all possible outcomes,
heuristic algorithms generally provide solutions much faster than other methods.
• Simplicity: The implementation of basic heuristic algorithms can be quite straightforward.
This reduces the amount of resources necessary in terms of memory and computational
power.
• Adaptive: In scenarios where problem constraints or requirements change, heuristic
algorithms can be more easily adapted and recalibrated.
• Cost-Effective: Considering the fact that these techniques do not require massive
computational resources or complex systems, they tend to be very cost-effective.
• Scalability: Heuristic search algorithms can effectively handle problems with large search
spaces by intelligently navigating through the problem space.
Despite their myriad advantages, the following limitations associated with heuristic search techniques must be considered:
• No Guaranteed Optimal Solution: While they are designed to find high-quality solutions,
heuristic search techniques do not guarantee that the most optimal solution will be found.
• Lack of Predictability: The nature of heuristic techniques is such that they do not provide
predictability in terms of the outcome; the same algorithm could produce different results
when run multiple times.
• Complexity: While basic heuristics are typically simple to apply, some complex heuristic
techniques involve sophisticated processes that may necessitate experienced professionals
for implementation.
• Overfitting: There is a risk that the heuristic might excessively tailor itself to the specifics of a
problem, thus performing poorly when applied to new but similar problems.
• Tuning Issues: The success of heuristic methods can heavily depend on the correctness of
predefined parameters. Inaccurate tuning can lead to suboptimal solutions or unexpected
results.
Best First Search (BFS) searches a graph using a priority queue and a heuristic. To track the traversal it maintains two lists: an 'OPEN' list of the immediate nodes currently available for expansion and a 'CLOSED' list of the nodes already traversed. Here's how it operates:
• We start from source S and search for goal I using the given costs and Best First Search.
• The priority queue pq initially contains S.
o We remove S from pq and push the unvisited neighbors of S into pq.
o pq now contains {A, C, B} (C is placed before B because C has the lower cost).
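A compact sketch of this procedure using Python's heapq as the OPEN priority queue; the graph and heuristic values below are assumed for illustration, since the example's full cost table is not reproduced here.

import heapq

graph = {"S": ["A", "B", "C"], "A": ["I"], "B": [], "C": ["I"]}
h = {"S": 5, "A": 2, "B": 4, "C": 3, "I": 0}  # assumed heuristic costs to goal I

def best_first_search(start, goal):
    open_list = [(h[start], start)]  # OPEN: nodes awaiting expansion
    closed = set()                   # CLOSED: nodes already expanded
    while open_list:
        _, node = heapq.heappop(open_list)
        if node == goal:
            return True
        if node in closed:
            continue
        closed.add(node)
        for nbr in graph[node]:
            if nbr not in closed:
                heapq.heappush(open_list, (h[nbr], nbr))
    return False

print(best_first_search("S", "I"))  # True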
6. What is meant by Means-Ends Analysis (MEA)?
o Means-Ends Analysis is a problem-solving technique used in Artificial Intelligence to limit search in AI programs.
o It is a mixture of backward and forward search techniques.
o The MEA technique was first introduced in 1961 by Allen Newell and Herbert A. Simon in their problem-solving computer program, named the General Problem Solver (GPS).
o The MEA process centers on evaluating the difference between the current state and the goal state.
How means-ends analysis Works:
The means-ends analysis process can be applied recursively to a problem. It is a strategy to control search in problem-solving. The following are the main steps that describe the working of the MEA technique for solving a problem.
1. First, evaluate the difference between Initial State and final State.
2. Select the various operators which can be applied for each difference.
3. Apply the operator at each difference, which reduces the difference between the current
state and goal state.
Operator Subgoaling
In the MEA process, we detect the differences between the current state and the goal state. Once these differences are found, we can apply an operator to reduce them. Sometimes, however, an operator cannot be applied to the current state. In that case we create a subproblem of the current state in which the operator can be applied. This type of backward chaining, in which operators are selected and then subgoals are set up to establish the preconditions of the operator, is called Operator Subgoaling.
Algorithm for Means-Ends Analysis:
Let's take the current state as CURRENT and the goal state as GOAL. The following are the steps of the MEA algorithm.
o Step 1: Compare CURRENT to GOAL; if there are no differences between them, return Success and exit.
o Step 2: Otherwise, select the most significant difference and reduce it by performing the following steps until success or failure occurs:
o Select a new operator O that is applicable to the current difference; if there is no such operator, signal failure.
o Attempt to apply operator O to CURRENT, and describe two states:
i) O-START, a state in which O's preconditions are satisfied.
ii) O-RESULT, the state that would result if O were applied in O-START.
o If
FIRST-PART ← MEA(CURRENT, O-START)
and
LAST-PART ← MEA(O-RESULT, GOAL)
are successful, then signal Success and return the result of combining FIRST-PART, O, and LAST-PART.
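Below is a minimal recursive sketch of this algorithm on a toy state represented as a set of feature flags; the operators and the set-difference measure are illustrative assumptions, chosen to mirror the Delete/Move/Expand example that follows.

operators = [
    # (name, precondition, effect): each operator removes one difference
    ("Delete", lambda s: "dot" in s,     lambda s: s - {"dot"}),
    ("Move",   lambda s: "outside" in s, lambda s: s - {"outside"}),
    ("Expand", lambda s: "small" in s,   lambda s: s - {"small"}),
]

def mea(current, goal):
    if current == goal:                    # Step 1: no differences -> success
        return []
    diff = current ^ goal                  # symmetric difference of features
    for name, pre, apply_op in operators:  # Step 2: choose an applicable operator
        if pre(current):
            new = apply_op(current)
            if (new ^ goal) < diff:        # it reduces the difference
                rest = mea(new, goal)      # recurse on the remaining differences
                if rest is not None:
                    return [name] + rest
    return None                            # failure: no operator helps

print(mea({"dot", "outside", "small"}, set()))  # ['Delete', 'Move', 'Expand']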
Let's take an example where we know the initial state and goal state as given below. In this problem,
we need to get the goal state by finding differences between the initial state and goal state and
applying operators.
Solution:
To solve the above problem, we will first find the differences between initial states and goal states,
and for each difference, we will generate a new state and will apply the operators. The operators we
have for this problem are:
o Move
o Delete
o Expand
1. Evaluating the initial state: First, we evaluate the initial state and compare it with the goal state to find the differences between the two.
2. Applying the Delete operator: The first difference is that the dot symbol present in the initial state is absent from the goal state, so we first apply the Delete operator to remove the dot.
3. Applying the Move operator: After applying the Delete operator, a new state results, which we again compare with the goal state. The next difference is that the square is outside the circle, so we apply the Move operator.
4. Applying the Expand operator: A new state is generated in the third step, and we compare it with the goal state. One difference remains, the size of the square, so we apply the Expand operator, which finally generates the goal state.
DEEP LEARNING:
MODULE 6:
• Current Capabilities: Generative AI, such as GPT models, can handle customer
queries by understanding the intent, fetching relevant information, and providing
responses in a conversational style. They are capable of automating tasks like
answering frequently asked questions, processing form submissions, and resolving
simple troubleshooting issues.
• Limitations: However, generative AI models are limited in their ability to handle
tasks that involve deep domain expertise, complex problem-solving, or emotional
intelligence. They may struggle with unique or nuanced customer issues that require
empathy, ethical judgment, or in-depth industry-specific knowledge.
• Hybrid Approaches: A practical approach is to use generative AI for first-line
support and automated responses, while routing more complex cases to human agents.
This can help reduce the workload on support teams, allowing them to focus on more
challenging tasks.
• Conclusion: While generative AI can significantly enhance support services by
automating routine tasks and improving efficiency, it cannot fully replace human
support agents. Human agents provide nuanced understanding, empathy, and ethical
considerations, which are essential in complex or sensitive situations.
• Architecture Features:
o Multi-Scale Discriminator: Incorporates multiple discriminators that operate
at different scales (e.g., low, medium, and high resolution). This enables the
GAN to capture both coarse and fine-grained details, improving the overall
quality of the generated outputs.
o Feature Matching Loss: Aims to prevent mode collapse by encouraging the
generator to produce diverse outputs that cover the real data distribution. It
works by matching the intermediate feature activations of the discriminator for
real and generated samples.
o Progressive Growing of GANs (PGGAN): The generator starts with a low
resolution and progressively adds layers to increase the resolution during
training. This incremental approach stabilizes the training process, helping the
GAN converge more effectively.
o Spectral Normalization: Applied to the weights of the discriminator to
control the Lipschitz constant, helping to stabilize training and address issues
related to convergence.
• Justification: These design choices address common issues in GAN training:
o Mode Collapse: Multi-scale discriminators and feature matching loss help
prevent the generator from producing limited output varieties.
o Convergence Stability: Progressive growing and spectral normalization
ensure the GAN learns gradually and avoids instability during training.
• Applications: This architecture can be applied in high-quality image synthesis, data
augmentation for rare medical conditions, or generating realistic textures in computer
graphics.
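As a small concrete illustration, spectral normalization is available as a PyTorch utility and is simply wrapped around each discriminator layer; the layer sizes below are illustrative, not taken from a specific paper.

import torch.nn as nn
from torch.nn.utils import spectral_norm

discriminator = nn.Sequential(
    spectral_norm(nn.Conv2d(3, 64, kernel_size=4, stride=2, padding=1)),
    nn.LeakyReLU(0.2),
    spectral_norm(nn.Conv2d(64, 128, kernel_size=4, stride=2, padding=1)),
    nn.LeakyReLU(0.2),
    nn.Flatten(),
    spectral_norm(nn.Linear(128 * 16 * 16, 1)),  # assumes 64x64 input images
)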
• Advantages:
o GANs are known for their ability to generate highly realistic and diverse
synthetic data by learning the data distribution through adversarial training.
They do not rely on explicit reconstruction objectives, allowing them to focus
on producing outputs that look "real" to the discriminator.
o Traditional Models (e.g., Autoencoders) use reconstruction loss, which may
lead to blurry outputs because the model aims to minimize the difference
between the input and output rather than generating a wide variety of realistic
data points.
• Examples in Medical Imaging:
o Data Augmentation: GANs can generate synthetic images to augment small
datasets, helping improve the performance of deep learning models in medical
image analysis.
o Image Super-Resolution: GANs can enhance the resolution of medical scans
(e.g., MRI or CT images) to reveal finer details that might be missed in
standard imaging.
• Underlying Mechanism: The adversarial loss drives the generator to produce
realistic samples by "fooling" the discriminator into classifying them as real. This
adversarial training setup allows GANs to generate data that closely mimics the
distribution of the original data.
• Standard RNNs: Have limited capability for long-term dependencies due to the
vanishing gradient problem. They can be used for simpler temporal tasks like short-
term signal classification but struggle with sequences that require long memory.
• LSTMs: Utilize memory cells and gating mechanisms to retain and update
information over time, making them suitable for tasks like ECG analysis, where long-
term dependencies are essential for detecting abnormalities.
• GRUs: Offer similar benefits as LSTMs with a simpler architecture and fewer
parameters. They are often used in tasks like wearable sensor data analysis, where
computational resources may be limited.
• Performance Analysis: For biomedical applications requiring detailed temporal
modeling (e.g., genome sequencing), LSTMs are often preferred for their ability to
capture long-term dependencies. For real-time processing, GRUs provide a balance
between performance and computational efficiency.
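The parameter difference is easy to verify in PyTorch; with the illustrative sizes below, the GRU has roughly three-quarters of the LSTM's parameters (three gates versus four).

import torch.nn as nn

lstm = nn.LSTM(input_size=64, hidden_size=128, num_layers=1)
gru = nn.GRU(input_size=64, hidden_size=128, num_layers=1)

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(lstm), count(gru))  # 99328 vs 74496 parameters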
• Approaches:
o Attention Mechanisms: Enhance the interpretability of RNNs by highlighting
important time steps or features that the model focuses on for its predictions.
For example, in medical diagnosis, attention scores can help identify which
symptoms or medical history records were most influential in the prediction.
o Gradient-Based Methods (e.g., Saliency Maps): Use gradient information to
determine the contribution of input features to the model's output. This can
help understand how individual data points influence the prediction.
o Layer-Wise Relevance Propagation: Decomposes the prediction into the
contributions of each input feature, providing insight into the model's
decision-making process.
• Contribution to Trustworthiness: By making model predictions more interpretable,
these advancements help build confidence among stakeholders, especially in high-
stakes applications like healthcare and finance. This transparency can also aid in
identifying biases and improving model fairness.
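A minimal saliency-map sketch in PyTorch (the model and input shapes are placeholders): the gradient of the output with respect to the input marks the influential time steps and features.

import torch

x = torch.randn(1, 50, 16, requires_grad=True)  # (batch, time, features)
model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(50 * 16, 1))

score = model(x).sum()
score.backward()         # gradients of the output w.r.t. the input
saliency = x.grad.abs()  # large values = influential inputs
print(saliency.shape)    # torch.Size([1, 50, 16])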
• Design Choices:
o Separate LSTM Layers for Each Modality: This approach allows each
modality (e.g., audio, video, and sensor data) to have dedicated LSTM layers
that learn specific temporal patterns associated with that type of data. This
modularity facilitates effective learning from heterogeneous data sources.
o Fusion Layer: A subsequent layer that combines the outputs of the separate
LSTM layers into a unified representation, allowing the model to integrate
information from all modalities before making predictions.
o Real-Time Constraints: Incorporating batch normalization and low-latency
techniques can help ensure timely predictions without compromising
accuracy.
• Justification: Processing each modality independently before fusion ensures that the
unique characteristics of each input type are captured. This is crucial for applications
such as real-time emotion detection, where both audio and visual cues provide
complementary information.
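A minimal sketch of this design in PyTorch, with two modalities and illustrative dimensions (a real system would add the audio/video front-ends and low-latency handling):

import torch
import torch.nn as nn

class MultimodalLSTM(nn.Module):
    def __init__(self):
        super().__init__()
        self.audio_lstm = nn.LSTM(40, 64, batch_first=True)   # per-modality layer
        self.video_lstm = nn.LSTM(512, 64, batch_first=True)  # per-modality layer
        self.fusion = nn.Linear(64 + 64, 7)  # fusion layer -> e.g. 7 emotion classes

    def forward(self, audio, video):
        _, (ha, _) = self.audio_lstm(audio)  # last hidden state per modality
        _, (hv, _) = self.video_lstm(video)
        joint = torch.cat([ha[-1], hv[-1]], dim=-1)
        return self.fusion(joint)

model = MultimodalLSTM()
out = model(torch.randn(2, 100, 40), torch.randn(2, 100, 512))
print(out.shape)  # torch.Size([2, 7])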
• Challenges:
o Bias: Predictive policing models can perpetuate existing biases in historical
crime data, potentially leading to unfair targeting of specific communities.
o Transparency and Accountability: The black-box nature of LSTM models
can make it difficult to understand why certain predictions were made,
complicating efforts to hold systems accountable.
• Opportunities: When used responsibly, LSTM-based systems can help law
enforcement allocate resources more effectively by predicting high-risk areas or times
for certain types of crime.
• Mitigation Strategies: Implementing fairness-aware algorithms, using diverse and
representative datasets, and applying explainable AI techniques can help address these
ethical concerns. Regular audits of the model's performance and impact on different
communities are also essential.
• Design Elements:
o Hierarchical Latent Variables: Multiple layers of latent variables capture
complex relationships within the data. This hierarchical approach allows the
model to represent data at different levels of abstraction, enabling it to learn
fine-grained details and high-level concepts simultaneously.
o Skip Connections: Facilitate information flow from input to output layers,
improving gradient propagation and helping the model learn better
representations.
o Probabilistic Decoders: Use hierarchical latent variables to generate data
samples at different resolutions, improving the quality of generated data,
especially in high-dimensional applications like 3D medical imaging.
• Justification: Incorporating hierarchical latent variables enables the model to
disentangle complex data into simpler components, making it better suited for
capturing intricate dependencies in high-dimensional datasets.
MODULE 5:
3. Show that if the activation function of the hidden units is linear, a 3-layer (1
input layer x, 1 hidden layer h and 1 output layer y) network is equivalent to a
2-layer one. Use your result to explain why a three-layer network with linear
hidden units cannot solve a non-linearly separable problem such as XOR.
To show that a 3-layer neural network with a linear activation function for the hidden units is
equivalent to a 2-layer network, we need to analyze the structure of the network.
Network Structure
Hidden Layer (h): The hidden layer takes input from the input layer and applies a linear transformation. Let the weight matrix be W1 and the bias vector be b1. The output of the hidden layer is:
h = W1·x + b1
Output Layer (y): The output layer takes the hidden layer's output and applies another linear transformation. Let the weight matrix be W2 and the bias vector be b2. The output is:
y = W2·h + b2 = W2·(W1·x + b1) + b2
Expanding this gives:
y = W2·W1·x + W2·b1 + b2
Notice that W2·W1 can be treated as a new weight matrix W, and W2·b1 + b2 as a new bias b. Thus, we can express the output as:
y = W·x + b, where W = W2·W1 and b = W2·b1 + b2.
Conclusion
This shows that a 3-layer network with linear activation functions can be reduced to a network that computes a single linear transformation, i.e., a 2-layer network consisting of just the input layer and the output layer.
Therefore, it does not add any additional representational power beyond what is provided by
a 2-layer network.
Now, regarding why a 3-layer network with linear hidden units cannot solve a non-linearly
separable problem, such as the XOR problem:
The XOR problem is not linearly separable; it cannot be solved by a single linear decision
boundary. A linear classifier can only separate data points with a single hyperplane.
Since we have established that a 3-layer network with linear activation functions behaves like
a 2-layer network (which is effectively just a linear transformation), it cannot model the
complex decision boundary needed to separate the classes in the XOR problem.
In summary, because both the hidden layer and the output layer of a 3-layer network are
linear, the overall network remains linear and thus cannot solve non-linearly separable
problems like XOR.
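The reduction can be checked numerically; this NumPy sketch (with arbitrary random weights) confirms that the composed linear layers and the collapsed single layer give identical outputs.

import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)
x = rng.normal(size=3)

y_three_layer = W2 @ (W1 @ x + b1) + b2
W, b = W2 @ W1, W2 @ b1 + b2  # collapsed weight matrix and bias
y_two_layer = W @ x + b

print(np.allclose(y_three_layer, y_two_layer))  # True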
Let's calculate the number of parameters in the CNN with the given layers:
The deployment of CNNs in facial recognition systems raises several ethical issues:
• Privacy: Facial recognition can be used for mass surveillance without consent,
leading to potential violations of privacy. It is crucial to establish guidelines for data
collection, storage, and use to protect individuals' privacy.
• Bias: CNN-based facial recognition systems can exhibit bias, especially when trained
on unbalanced datasets. These biases can lead to higher error rates for certain
demographic groups, such as racial or gender minorities, leading to unfair outcomes.
• Societal Impact: The widespread use of facial recognition can lead to negative social
consequences, such as the chilling effect on freedom of expression or the misuse for
authoritarian control.
1. **Backpropagation:**
- Backpropagation is not an optimization algorithm itself but a method for
calculating gradients of the loss function with respect to the weights in a neural
network. It enables the updating of weights using optimization algorithms.
2. **Convergence Speed:**
- **SGD:** Generally requires more epochs to converge because it can be slow to
escape local minima due to its reliance on the learning rate and can oscillate
significantly.
- **Adam:** Tends to converge faster in practice due to its adaptive learning rates
and momentum terms, making it less sensitive to the learning rate setting.
**Evidence:** Studies show that models trained with Adam often achieve lower
training loss more quickly than those trained with SGD, especially on complex tasks
like machine translation.
3. **Final Accuracy:**
- **SGD:** When tuned properly (e.g., using learning rate schedules), it can
achieve competitive final performance. However, it may obtain solutions that are
more sensitive to initialization.
- **Adam:** Typically provides better performance in terms of final accuracy
without extensive hyperparameter tuning.
**Evidence:** In tasks such as text generation, models trained using Adam
consistently surpass those trained with SGD in terms of accuracy and BLEU scores, a
metric used for evaluating machine translation quality.
4. **Robustness to Hyperparameters:**
- **SGD:** Sensitive to the choice of learning rate; requires careful tuning. The
learning rate schedule can dramatically affect performance.
- **Adam:** More robust to hyperparameter settings, often producing good results
with default parameters, which can save time in experiments.
5. **Gradient Noise:**
- **SGD:** The stochastic nature introduces noise, which can help escape local
minima but may lead to divergent behavior in poorly conditioned landscapes.
- **Adam:** The momentum aspect helps smooth the optimization path, potentially
allowing it to navigate more complex loss surfaces more effectively.
6. **Memory Requirements:**
- **SGD:** Requires less memory as it only stores gradients.
- **Adam:** Requires more memory because it maintains two additional vectors
(the first and second moment estimates) for each parameter, which can be a limitation
for extremely large models or datasets.
### Conclusion
In the context of NLP tasks like machine translation or text generation, Adam is often
preferred due to its faster convergence and higher final accuracy, especially when
computational resources are limited. However, SGD can still perform well when
tuned correctly and may deliver better results in terms of generalization on certain
tasks.
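In code, switching between the two is a one-line change; this PyTorch sketch (toy model and data) shows where each optimizer plugs into the training loop.

import torch

model = torch.nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
# optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # drop-in alternative

x, y = torch.randn(32, 10), torch.randn(32, 1)
for _ in range(100):
    optimizer.zero_grad()
    loss = torch.nn.functional.mse_loss(model(x), y)
    loss.backward()   # backpropagation computes the gradients
    optimizer.step()  # the optimizer applies its update rule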
9. Write short notes on the following:
1. Biological neuron
In the human brain, a neuron is a specialized cell that serves as the basic unit
of the nervous system, responsible for transmitting information. A neuron has
three main parts:
• Dendrites: Branch-like structures that receive signals from other neurons.
• Cell Body (Soma): Processes incoming signals from dendrites and, if a threshold is
reached, generates an electrical impulse.
• Axon: A long, slender projection that transmits the impulse away from the cell body
to other neurons, muscles, or glands via the axon terminals.
In artificial neural networks, the biological neuron serves as an inspiration for
artificial neurons (also called nodes or units), where inputs are combined, processed,
and transmitted to other neurons. This forms the foundation of neural network design
in AI.
2. ReLU function
The Rectified Linear Unit (ReLU) function is an activation function
commonly used in deep learning, defined by f(x)=max(0,x).
• For any input x≤0, ReLU outputs 0.
• For x>0, it outputs x.
ReLU introduces non-linearity into the model, which is essential for learning complex
patterns. It also reduces computational overhead because the function is simple to
compute. Compared to functions like sigmoid and tanh, ReLU is less prone to the
vanishing gradient problem, where gradients become too small during
backpropagation, slowing learning. However, ReLU has issues like "dying ReLUs,"
where some neurons output zero for all inputs, effectively turning them off.
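A plain NumPy sketch of ReLU and its gradient; the zero gradient for x ≤ 0 is what makes "dying ReLUs" possible.

import numpy as np

def relu(x):
    return np.maximum(0, x)

def relu_grad(x):
    return (x > 0).astype(float)  # zero gradient for x <= 0

x = np.array([-2.0, 0.0, 3.0])
print(relu(x))       # [0. 0. 3.]
print(relu_grad(x))  # [0. 0. 1.]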
5. Recurrent networks
Recurrent Neural Networks (RNNs) are a class of neural networks designed
for processing sequential data, such as time-series data, text, or speech. The
key feature of RNNs is that they have feedback loops that allow information
from previous time steps to be used when processing new data. This makes
them suitable for tasks where the current output depends on previous inputs or
states.
Structure:
• RNNs have a "memory" of previous inputs in the form of hidden states, which are
updated at each time step based on both the current input and the previous hidden
state.
• At each step, the network passes information through hidden layers, and the output
depends on both the current input and the past states.
Challenges:
• Vanishing Gradient Problem: During training, especially with long sequences,
gradients can become very small and cause the network to stop learning effectively.
This issue can be mitigated by using more advanced architectures like LSTMs (Long
Short-Term Memory) or GRUs (Gated Recurrent Units), which are specifically
designed to capture long-range dependencies in the data.
• Exploding Gradients: Sometimes the gradients can become too large, causing the
model's weights to become unstable. This can be managed by gradient clipping.
Applications: RNNs are widely used for tasks that involve sequence
prediction, such as language translation, speech recognition, and stock price
forecasting.
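Gradient clipping is a one-line addition in PyTorch; this sketch (toy RNN and data) rescales gradients whose norm exceeds a chosen threshold before the weight update.

import torch

rnn = torch.nn.RNN(8, 16, batch_first=True)
out, _ = rnn(torch.randn(4, 20, 8))
out.sum().backward()
torch.nn.utils.clip_grad_norm_(rnn.parameters(), max_norm=1.0)  # cap gradient norm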
10. Describe the structure of an artificial neuron. How is it similar to a biological neuron?
What are its main components?
The structure of an artificial neuron is modeled after a biological neuron, mimicking its
signal-processing capability. In an artificial neural network, each neuron processes input data
and sends output signals to other neurons, helping to identify patterns and make predictions.
1. Inputs (x): These are the incoming signals, similar to how dendrites receive signals in
a biological neuron. Each input value represents a feature or data point, and multiple
inputs are often used to capture different aspects of the input data.
2. Weights (w): Each input has an associated weight that indicates its importance.
Weights determine the influence of each input on the final output. The weight values
can be positive or negative, and during training, they are adjusted to optimize the
model's performance.
3. Summation Function: The weighted inputs are summed up to produce a single value.
This is similar to the cell body (or soma) in a biological neuron that integrates
incoming signals. Mathematically, the summation is represented as:
z = Σᵢ₌₁ⁿ wᵢxᵢ + b
4. Bias (b): The bias is an additional parameter that allows the neuron to produce a non-
zero output even if all inputs are zero. It shifts the activation function, helping the
network model patterns that don’t pass through the origin.
5. Activation Function: After summing the weighted inputs and adding the bias, the
result is passed through an activation function to introduce non-linearity. This
function determines whether the neuron should be "activated" (fired) and how
strongly. Common activation functions include:
o Sigmoid: Outputs values between 0 and 1, useful for binary classification.
o ReLU (Rectified Linear Unit): Outputs zero for negative values and the input
itself for positive values, introducing non-linearity effectively.
o Tanh: Outputs values between -1 and 1, often used in hidden layers of neural
networks.
6. Output (y): The final output of the neuron is the result of the activation function. This
output may serve as the input to another neuron or layer in the network, allowing
information to propagate through the network.
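Putting the components together, a single artificial neuron is only a few lines of NumPy (sigmoid chosen here as the example activation):

import numpy as np

def neuron(x, w, b):
    z = np.dot(w, x) + b             # summation function plus bias
    return 1.0 / (1.0 + np.exp(-z))  # sigmoid activation -> output y

x = np.array([0.5, -1.0, 2.0])  # inputs
w = np.array([0.8, 0.2, -0.4])  # weights
print(neuron(x, w, b=0.1))      # ~0.378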
Similarities to a Biological Neuron
The parallels are already visible in the components above: the inputs correspond to dendrites receiving signals, the summation function to the cell body (soma) integrating them, the activation function to the firing threshold, and the output to the axon transmitting the signal onward.
1.
2. On one bank of a river are 3 missionaries and 3 cannibals. There is 1 boat available that can
carry at most 2 people and that they would like to use to cross the river. If the cannibals ever
outnumber the missionaries on either of the river's banks or on the boat, the missionaries
will get eaten. How can the boat be used to carry all the missionaries and cannibals across
the river safely? The boat cannot cross the river by itself with no people on board and there
is no island in the middle of the river.
To solve this classic problem, we need to ensure that at any point on either bank, the number
of cannibals never exceeds the number of missionaries, or the missionaries will be eaten.
Here’s a step-by-step safe solution to transport everyone across without risking anyone’s
safety.
We’ll label the two sides of the river as the starting side (left bank) and the destination side
(right bank). Let’s denote:
• M: Missionaries
• C: Cannibals
• B: Boat
Initial Setup
• Left bank: 3M, 3C, 1B
• Right bank: 0M, 0C
Steps
1. Two cannibals cross to the right bank. (Left: 3M, 1C; Right: 0M, 2C)
2. One cannibal returns. (Left: 3M, 2C; Right: 0M, 1C)
3. Two cannibals cross. (Left: 3M, 0C; Right: 0M, 3C)
4. One cannibal returns. (Left: 3M, 1C; Right: 0M, 2C)
5. Two missionaries cross. (Left: 1M, 1C; Right: 2M, 2C)
6. One missionary and one cannibal return. (Left: 2M, 2C; Right: 1M, 1C)
7. Two missionaries cross. (Left: 0M, 2C; Right: 3M, 1C)
8. One cannibal returns. (Left: 0M, 3C; Right: 3M, 0C)
9. Two cannibals cross. (Left: 0M, 1C; Right: 3M, 2C)
10. One cannibal returns. (Left: 0M, 2C; Right: 3M, 1C)
11. Two cannibals cross. (Left: 0M, 0C; Right: 3M, 3C)
Now all missionaries and cannibals are safely across the river.
3.
To find the most cost-effective path from A to G using the A* algorithm, we’ll use the
formula:
f(n)=g(n)+h(n)
where:
• g(n) is the cost of the path from the start node to n, and
• h(n) is the heuristic estimate of the cost from n to the goal.
Step-by-Step Execution
1. Start at A:
o g(A)= 0
o h(A)= 11
o f(A)=g(A)+h(A)=0+11=11
2. Move to B:
o g(B)=g(A)+cost(A,B)=0+2=2
o h(B)= 6
o f(B)=g(B)+h(B)=2+6=8
Move to E:
o g(E)=g(A)+cost(A,E)=0+3=3
o h(E)= 7
o f(E)=g(E)+h(E)=3+7=10
Move to C:
o g(C)=g(B)+cost(B,C)=2+1=3
o h(C)= 99
o f(C)=g(C)+h(C)=3+99=102
Move to D:
o g(D)=g(B)+cost(B,D)=2+9=11
o h(D)= 3
o f(D)=g(D)+h(D)=11+3=14
Move to D:
o g(D)=g(E)+cost(E,D)=3+6=9
o h(D)= 3
o f(D)=g(D)+h(D)=9+3=12
Now, the path to D via E has the lowest f-value (12), so we expand this D next.
Move to G:
o g(G)=g(D)+cost(D,G)=9+1=10
o h(G)= 0
o f(G)=g(G)+h(G)=10+0=10
Solution Path
A→E→D→ G
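A compact A* sketch over this example; the graph is reconstructed from the costs and heuristic values used in the steps above.

import heapq

graph = {"A": {"B": 2, "E": 3}, "B": {"C": 1, "D": 9}, "E": {"D": 6},
         "C": {}, "D": {"G": 1}, "G": {}}
h = {"A": 11, "B": 6, "C": 99, "D": 3, "E": 7, "G": 0}

def a_star(start, goal):
    open_list = [(h[start], 0, start, [start])]  # (f, g, node, path)
    best_g = {start: 0}
    while open_list:
        f, g, node, path = heapq.heappop(open_list)
        if node == goal:
            return path, g
        for nbr, cost in graph[node].items():
            ng = g + cost
            if ng < best_g.get(nbr, float("inf")):
                best_g[nbr] = ng
                heapq.heappush(open_list, (ng + h[nbr], ng, nbr, path + [nbr]))
    return None, float("inf")

print(a_star("A", "G"))  # (['A', 'E', 'D', 'G'], 10)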
4. Suppose you are designing an AI agent that plays a two-player game using the
minimax algorithm. How would you explain the concept of alpha-beta pruning and
how it optimizes the algorithm by reducing the number of nodes explored in a game
tree? Additionally, how can you implement alpha-beta pruning in your agent, and
what are some potential limitations of this technique?
Alpha-beta pruning is an optimization technique used in the minimax algorithm to
reduce the number of nodes the algorithm needs to evaluate in a game tree. This
technique leverages two bounds, alpha and beta, to eliminate branches of the game
tree that don’t need to be explored because they cannot influence the final decision.
In a minimax tree, where the algorithm aims to find the best move by maximizing the
minimum gains (for a maximizing player) or minimizing the maximum losses (for a
minimizing player), alpha-beta pruning works as follows:
• Alpha: The best score that the maximizing player can guarantee at that point or
higher.
• Beta: The best score that the minimizing player can guarantee at that point or lower.
When exploring a node in the tree:
• If the maximizing player finds a move with a value greater than or equal to beta, it
stops considering other moves at that node because the minimizing player will never
allow reaching that branch (it would choose an alternative path with a lower value).
• Conversely, if the minimizing player finds a move with a value less than or equal to
alpha, it stops considering other moves at that node because the maximizing player
will not allow reaching that branch (it would choose an alternative path with a higher
value).
By pruning branches of the tree that cannot affect the outcome, alpha-beta pruning
reduces the search space significantly, allowing the algorithm to examine only
relevant moves. This means it can reach deeper levels of the tree within the same time
constraints, leading to more accurate evaluations.
Implementation:
def minimax(node, depth, is_maximizing, alpha, beta):
    # is_terminal, evaluate, and get_children are game-specific helpers.
    if depth == 0 or is_terminal(node):
        return evaluate(node)
    if is_maximizing:
        max_eval = float('-inf')
        for child in get_children(node):
            score = minimax(child, depth - 1, False, alpha, beta)
            max_eval = max(max_eval, score)
            alpha = max(alpha, score)
            if beta <= alpha:
                break  # beta cut-off: the minimizer will avoid this branch
        return max_eval
    else:
        min_eval = float('inf')
        for child in get_children(node):
            score = minimax(child, depth - 1, True, alpha, beta)
            min_eval = min(min_eval, score)
            beta = min(beta, score)
            if beta <= alpha:
                break  # alpha cut-off: the maximizer will avoid this branch
        return min_eval
Benefits of Alpha-Beta Pruning
• It can reduce the time complexity of minimax from O(b^d) to O(b^{d/2}), where b is
the branching factor and d is the depth of the tree. This effectively doubles the depth
the algorithm can explore within the same time limit.
• It allows the agent to explore deeper levels in the game tree, potentially leading to
better moves.
5. Let b be the branching factor of a search tree. If the optimal goal is reached after d
actions from the initial state, in the worst case, how many times will the initial state be
expanded for iterative deepening depth first search (IDDFS) and iterative Deepening
A* search (IDA*)?
To analyze the worst-case number of expansions of the initial state for Iterative Deepening Depth-First Search (IDDFS) and Iterative Deepening A* (IDA*), let's look at how each algorithm operates in the worst case, assuming a branching factor b and depth d for the optimal solution.
• The initial state will be expanded once for each depth limit, from 1 up to d.
• Therefore, the initial state will be expanded d times in total.
IDA* works similarly to IDDFS but uses a cost threshold rather than a depth limit. It starts
with an initial threshold (often the heuristic value of the initial state) and increases this
threshold iteratively as it fails to find the goal within the current cost limit.
• IDA* will expand the initial state at every iteration as it increases the cost threshold.
• For each threshold, it could potentially expand the initial state, so the worst-case
number of times it expands the initial state will depend on the number of unique
threshold values it explores until it reaches the goal.
• Typically, if we assume each threshold grows linearly (which is common with simple heuristic increments), the initial state could be expanded around d times, similar to IDDFS.
However, note that with some heuristics, the number of threshold increments could be
slightly more or less than d, but in the worst case, the initial state would be expanded
approximately d times for both IDDFS and IDA*.
Worst-case expansions of the initial state for IDA*: approximately d times.
6.
To determine which subtrees are pruned due to alpha-beta pruning in this game tree, we need
to evaluate the tree step-by-step while applying alpha (lower bound for the maximizing
player) and beta (upper bound for the minimizing player) cutoffs.
Let’s proceed through each level in the tree, applying alpha-beta pruning rules:
Given:
Let's calculate the error rate, sensitivity, precision, and F-measure of the model.
a) Error Rate
The error rate is the ratio of incorrect predictions to the total number of predictions:
Error Rate = (FP + FN) / (TP + TN + FP + FN)
b) Sensitivity
Sensitivity measures the proportion of actual positives that are correctly identified.
Sensitivity = TP / (TP + FN) = 15 / (15 + 3) = 15/18 ≈ 0.833, or 83.3%
c) Precision
Precision measures the proportion of positive predictions that are correct.
Precision = TP / (TP + FP) = 15 / (15 + 7) = 15/22 ≈ 0.682, or 68.2%
d) F-Measure
F1 Score = 2 × (Precision × Sensitivity) / (Precision + Sensitivity) = 2 × (0.682 × 0.833) / (0.682 + 0.833) ≈ 0.75, or 75%
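The same numbers in a few lines of Python (TN is not shown in the problem, so the error rate is omitted here):

TP, FP, FN = 15, 7, 3

sensitivity = TP / (TP + FN)  # 0.833
precision = TP / (TP + FP)    # 0.682
f1 = 2 * precision * sensitivity / (precision + sensitivity)  # 0.75
print(round(sensitivity, 3), round(precision, 3), round(f1, 3))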
2. (a) What is underfitting in the context of machine learning models? What is the major cause of underfitting? (b) What is overfitting? When does it happen? (c) Explain when overfitting happens in a model.
(a) Underfitting
Definition: Underfitting occurs when a machine learning model is too simple to capture the
underlying patterns in the data. This leads to poor performance on both training and test data.
Causes: The major causes of underfitting include a model that is too simple (e.g., using linear
regression for non-linear data), insufficient features, or excessive regularization that overly
restricts the model's complexity.
(b) Overfitting
Definition: Overfitting happens when a model learns not only the underlying pattern in the
data but also the noise or random fluctuations. This leads to excellent performance on
training data but poor generalization on test data.
Causes: Overfitting typically occurs when the model is too complex for the amount of data
available (e.g., a deep neural network with too many layers for a small dataset), or when it is
trained for too many epochs, capturing noise rather than meaningful patterns.
Overfitting occurs in a model when it learns the training data too well, capturing noise and
fluctuations that do not generalize to new, unseen data. This typically happens when:
1. The model is too complex: If the model has too many parameters or is highly flexible
(e.g., deep neural networks with many layers or a polynomial regression with high
degree), it can "memorize" the training data, leading to poor performance on new
data.
2. Training data is limited or unrepresentative: When there is not enough data, or if
the training data doesn't represent the real-world variations adequately, the model may
latch onto specific patterns that are actually just noise.
3. Lack of regularization: Regularization techniques like L1/L2 penalties or dropout in
neural networks help constrain the model's ability to learn complex patterns. Without
these, the model is more likely to overfit.
3. An antibiotic resistance test (random variable T) has 1% false positives (i.e., 1% of those not resistant to an antibiotic show a positive result in the test) and 5% false negatives (i.e., 5% of those actually resistant to an antibiotic test negative). Let us assume that 2% of those tested are resistant to antibiotics. Determine the probability that somebody who
Given:
• P(R) = 0.02: probability that a tested person is resistant
• P(T+ | R) = 0.95: probability of a positive test given resistance (5% false negatives)
• P(T+ | ¬R) = 0.01: probability of a positive test given no resistance (1% false positives)
By the law of total probability:
P(Positive) = P(T+ | R)·P(R) + P(T+ | ¬R)·P(¬R) = (0.95 × 0.02) + (0.01 × 0.98) = 0.019 + 0.0098 = 0.0288
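A numeric check of the calculation, plus the posterior P(resistant | positive) under the assumption that the truncated question asks for it:

p_r = 0.02                # prevalence of resistance
p_pos_given_r = 0.95      # 1 minus the 5% false-negative rate
p_pos_given_not_r = 0.01  # 1% false-positive rate

p_pos = p_pos_given_r * p_r + p_pos_given_not_r * (1 - p_r)
print(p_pos)                        # 0.0288
print(p_pos_given_r * p_r / p_pos)  # ~0.66, the assumed asked-for posterior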
4. Discuss the impact of class imbalance on the confusion matrix and how metrics
derived from it can be misleading.
Class imbalance occurs when one class is significantly more frequent than the other(s). In
such cases:
• Accuracy becomes misleading, as a model that always predicts the majority class can
still achieve high accuracy without actually learning the distinctions between classes.
• Precision and Recall for the minority class may also be low, despite high accuracy.
• F1 Score can be more informative, but even it can be impacted if one class is very
rare.
Class imbalance often necessitates using additional metrics like Precision-Recall AUC,
balanced accuracy, or F1 Score for the minority class.
5. Invent a new metric derived from the confusion matrix that addresses a specific
limitation of existing metrics (e.g., sensitivity to class imbalance, interpretability).
Define the metric, describe its calculation, and demonstrate its advantages with
theoretical analysis and empirical evidence.
### New Metric: Balanced F1-Score (BF1S)
#### Motivation
The traditional F1-score, which is the harmonic mean of precision and recall, often
suffers in scenarios of class imbalance because it treats both classes equally,
potentially masking poor performance on the minority class. While the weighted F1-
score helps mitigate this to some extent, it still does not fully address the issue of
interpretability and the importance of capturing performance across both classes.
#### Definition
The Balanced F1-Score (BF1S) is designed to provide a more nuanced view of the
model performance, especially in imbalanced datasets. It balances the F1-scores of
both classes by using a geometric mean, which can emphasize the performance of the
minority class while still considering overall accuracy.
#### Calculation
For each class c, let Precision_c = TP_c / (TP_c + FP_c) and Recall_c = TP_c / (TP_c + FN_c), and let F1_c be their harmonic mean. BF1S is then the geometric mean of the per-class F1-scores (a computational sketch follows the examples below):
BF1S = √(F1_positive × F1_negative)
1. **Balanced Dataset**: If both classes have the same number of instances and the
model correctly classifies all, BF1S will be equal to 1 (perfect performance). This
indicates that it aligns well with the traditional F1-score.
2. **Imbalanced Dataset (70% Negative, 30% Positive)**: If the model performs well
on the majority class (e.g., Precision and Recall for Negative Class are high) but
poorly on the minority class, the BF1S reflects this by dropping significantly due to
the geometric mean's properties, thus highlighting the poor performance on the
minority class.
By observing the BF1S in these experiments, one can see how it effectively reflects
the true performance of models on imbalanced datasets, encouraging the selection of
models that do not neglect the minority class.
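A small sketch of the BF1S computation (the per-class counts are illustrative):

import math

def f1(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

f1_pos = f1(tp=20, fp=10, fn=15)   # minority class
f1_neg = f1(tp=60, fp=15, fn=10)   # majority class
bf1s = math.sqrt(f1_pos * f1_neg)  # drops sharply if either class is poor
print(round(f1_pos, 3), round(f1_neg, 3), round(bf1s, 3))  # 0.615 0.828 0.714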
### Conclusion
The Balanced F1-Score (BF1S) can serve as a valuable tool for practitioners dealing
with imbalanced datasets, offering a more holistic view of model performance that is
sensitive to class imbalance while being interpretable and actionable.
6. Design a comprehensive framework for model evaluation that integrates k-fold cross-validation with other validation techniques such as leave-one-out cross-validation and nested cross-validation. Explain the framework and illustrate its effectiveness with an end-to-end machine learning project.
### Comprehensive Model Evaluation Framework
In the machine learning workflow, ensuring robust model evaluation is crucial for
selecting the best-performing model and avoiding pitfalls like overfitting. This
framework integrates multiple validation techniques—k-fold cross-validation, leave-
one-out cross-validation (LOOCV), and nested cross-validation—to provide a
comprehensive assessment of model performance.
1. **Data Preparation**:
- Start with data cleaning, preprocessing (handling missing values, normalization,
encoding categorical variables), and feature engineering.
- Split the dataset into training and test sets. Typically, 70-80% of data is used for
training, and the remaining 20-30% for testing.
5. **Performance Aggregation**:
- After completing the nested cross-validation, aggregate the performance metrics
from the outer loop. This could involve calculating the mean and standard deviation
of the performance metrics across all outer folds.
8. **Reporting**:
- Compile the performance metrics and visualizations (e.g., confusion matrix, ROC
curve) to provide a comprehensive report of the model evaluation.
1. **Data Preparation**:
- Load the dataset containing customer features (e.g., age, service usage) and churn
labels (churned or not).
- Clean the data by handling missing values, converting categorical variables into
dummy variables, and scaling numerical features.
2. **Initial Split**:
- Split the data into a training set (80%) and a test set (20%).
5. **Performance Aggregation**:
- After all outer folds are complete, calculate the mean and standard deviation of
performance metrics (like accuracy, precision, recall, F1-score) across the outer folds.
7. **Testing**:
- Evaluate the final model on the 20% test set and gather metrics such as accuracy,
confusion matrix, and ROC curve.
8. **Reporting**:
- Create a comprehensive report that includes visualizations and discussions about
model performance, insights gained from the evaluation, and any potential biases
detected.
### Effectiveness of the Framework
### Conclusion
### Varying k
- Run the cross-validation with different values of k (e.g., k=5, 10, 15).
- For each value of k, compute and average the performance metrics over all folds.
2. **Computational Cost:**
- Increasing k increases the number of times the model is trained, leading to higher
computational costs.
3. **Bias-Variance Tradeoff:**
- A smaller k may lead to lower variance but higher bias as it relies on fewer
samples for evaluation.
- A larger k typically yields lower bias but higher variance, making model
evaluation less stable.
4. **Best Practices:**
- It’s often recommended to use k=10 as a compromise between bias and variance.
- Perform experiments across multiple values of k and compare results to ensure
robustness.
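A scikit-learn sketch of this comparison (dataset and model chosen purely for illustration):

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)
model = LogisticRegression(max_iter=5000)

for k in (5, 10, 15):
    scores = cross_val_score(model, X, y, cv=k)
    print(k, round(scores.mean(), 3), round(scores.std(), 3))  # mean and spread per k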
### Conclusion
By systematically comparing the performance of different algorithms using k-fold
cross-validation with varying k, you can gain insights into the reliability and
robustness of model evaluation. This method enables thorough understanding and
selection of the most suitable algorithm for your datasets.
8. Compare and contrast the advantages of confusion matrices with other performance
evaluation methods such as ROC curves and precision-recall curves in assessing the
predictive power of multi-class classification models. Provide empirical evidence and
theoretical insights into when each method is most appropriate for different types of
classification tasks and dataset characteristics.
Confusion Matrix
• Advantages:
o Clear representation of actual versus predicted outcomes across multiple
classes.
o Allows computation of a variety of metrics that can help understand both per-
class and overall performance.
o Easy to interpret when analyzing misclassifications and understanding where
specific classes are confused with each other.
• Limitations:
o Not well-suited for highly imbalanced datasets, as accuracy can be misleading.
o In multi-class settings with numerous classes, confusion matrices can become
large and difficult to interpret.
ROC Curve
The Receiver Operating Characteristic (ROC) curve plots the true positive rate
(sensitivity) against the false positive rate for a binary classifier, allowing us to observe the
trade-off between sensitivity and specificity at different threshold levels. For multi-class
classification, the ROC curve can be extended using one-vs-all or one-vs-one strategies.
• Advantages:
o Provides a comprehensive view of model performance across various
threshold levels.
o ROC AUC (Area Under the Curve) is a useful summary metric to evaluate
overall performance.
o Suitable for balanced datasets and when false positives and false negatives are
equally important.
• Limitations:
o Not ideal for highly imbalanced datasets because the false positive rate is less
informative when there are few positive instances.
o In multi-class classification, interpreting multiple one-vs-all or one-vs-one
ROC curves can be complex and less intuitive.
Precision-Recall Curve
The precision-recall (PR) curve is more suitable for imbalanced datasets, as it focuses on
precision (positive predictive value) and recall (sensitivity). The PR curve is especially
informative when the positive class is rare, as it does not take true negatives into account
(thus making it more sensitive to the performance of the positive class).
• Advantages:
o Effective in highlighting performance on rare positive classes, particularly for
imbalanced datasets.
o The PR AUC (Area Under the Precision-Recall Curve) provides a summary
measure that is often more informative than ROC AUC in imbalanced
scenarios.
• Limitations:
o Less interpretable for balanced datasets or when false negatives are not as
critical.
o Requires threshold adjustment and is often more complex to analyze in multi-
class settings, particularly when each class requires a separate PR curve.
When each method is most appropriate:
• Confusion Matrix: Best used for balanced multi-class tasks where per-class errors are important to understand. Ideal for model tuning based on error patterns.
• ROC Curve: Suitable for binary classification and balanced datasets. In multi-class
classification, ROC curves are helpful if multiple one-vs-one or one-vs-all
comparisons are feasible and valuable.
• Precision-Recall Curve: Most appropriate for imbalanced datasets or cases where
detecting the positive class is critical, as in medical diagnostics or rare event
prediction.
Consider a scenario with a dataset having 1% positive cases and 99% negative cases:
• ROC AUC might show a high value even if the model performs poorly on the
positive class because it accounts for true negatives, which dominate.
• PR AUC would provide a clearer picture, as it focuses on positive predictions. In
imbalanced tasks, PR AUC often shows a model’s true performance on rare classes
more accurately.
9. Evaluate the ethical implications of training data collection and labelling processes in
developing machine learning models for facial recognition technology. Discuss the
challenges of bias, diversity representation, and privacy considerations in the creation
and usage of training datasets. Propose strategies to enhance fairness and
accountability in training data practices for facial recognition systems.
The ethical implications of data collection and labeling in facial recognition are significant, as
this technology is widely used in sensitive applications. Let’s address these implications
across three key areas: bias, diversity representation, and privacy.
Challenges
1. Bias and Diversity Representation:
o Issue: Facial recognition models often exhibit bias when trained on non-
representative data. If the training set lacks diversity (e.g., under-representation of
certain racial or gender groups), the model’s accuracy can vary widely across
demographics, leading to disproportionately high error rates for some groups.
o Impact: This bias can result in unfair treatment, such as higher misidentification rates
for certain groups, which can perpetuate inequality and harm marginalized
communities.
o Evidence: Studies have shown that facial recognition algorithms often misidentify
individuals from ethnic minorities at higher rates than those from majority groups.
For instance, in a study by the National Institute of Standards and Technology (NIST),
it was found that algorithms performed significantly worse on African-American and
Asian faces compared to Caucasian faces.
2. Privacy Concerns:
o Issue: Facial recognition datasets often consist of images of individuals captured
without consent, raising privacy concerns. Even when publicly available images are
used, the individuals in those images may not be aware that their data is being used
for training AI systems.
o Impact: Unauthorized use of personal data infringes on individual privacy rights and
can lead to public mistrust and backlash against AI applications.
o Evidence: In recent years, several high-profile cases have involved companies being
sued for using individuals' images without consent to train facial recognition models,
violating privacy laws such as the GDPR in Europe and BIPA (Biometric Information
Privacy Act) in Illinois.
Strategies to Enhance Fairness and Accountability
1. Data Collection and Diversity Audits:
o Approach: Conduct diversity audits on training datasets to ensure balanced
representation across demographic groups, including gender, age, and ethnicity.
o Implementation: Prioritize data collection that actively includes diverse individuals
and verify demographic balance throughout the data pipeline.
o Advantage: Helps to mitigate bias in model predictions, reducing the likelihood of
harm to underrepresented groups.
2. Transparency and Consent in Data Usage:
o Approach: Implement strict informed consent protocols for data collection, ensuring
that individuals are aware of how their data will be used.
o Implementation: Use anonymization techniques to protect individual identities and
only collect data from users who have explicitly agreed to be included.
o Advantage: Reduces the risk of privacy infringements and builds public trust in AI
technology.
3. Fairness-Aware Algorithms:
o Approach: Integrate fairness constraints and bias detection mechanisms directly into
model training and evaluation.
o Implementation: Use fairness-aware machine learning techniques, such as re-
weighting samples or adversarial debiasing, to ensure that predictions remain
equitable across demographic groups.
o Advantage: Prevents biases from arising in model output, enhancing accountability
in facial recognition systems.
4. Third-Party Audits and Accountability Frameworks:
o Approach: Encourage independent third-party audits of facial recognition models to
verify ethical compliance and bias mitigation.
o Implementation: Establish partnerships with regulatory bodies and ethics
committees to evaluate models before deployment.
o Advantage: Holds companies accountable for their practices and helps ensure
models align with ethical standards.
10. Using the k-Nearest Neighbors algorithm and the given athlete training data (weight,
speed, and sprinter class), classify Sayan (Weight = 56 kg, Speed = 10 kmph) as a Good,
Average, or Poor sprinter.
To classify Sayan as a Good, Average, or Poor sprinter using the k-Nearest Neighbors (k-
NN) algorithm, we’ll need to calculate the Euclidean distance between Sayan's attributes
(Weight = 56 kg, Speed = 10 kmph) and the attributes of each athlete in the training data.
Then, we'll use a chosen value of k to determine Sayan's class based on the majority class
among the nearest neighbors.
1. Calculate Euclidean Distance: The Euclidean distance between two points (x1, y1)
and (x2, y2) is calculated as:
d = √((x2 − x1)² + (y2 − y1)²)
2. Compute Distances for Each Athlete: We’ll calculate the distance between Sayan’s
point (56,10) and each athlete’s point in the table.
3. Determine the Class: After calculating the distances, select the k closest neighbors. If
we choose k=3 (a common choice for small datasets), we’ll classify Sayan based on
the majority class among the 3 nearest neighbors.
Since the classes of the nearest neighbors are "Good," "Average," and "Poor," the majority
class among these three is "Average." Thus, Sayan would be classified as an Average
sprinter.
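A minimal Python sketch of this classification follows; since the athlete table is not reproduced here, the training rows below are hypothetical placeholders to be replaced with the actual (weight, speed, class) data:

# Sketch: k-NN (k=3, Euclidean distance) for Sayan's point (56, 10).
# The training rows are invented placeholders, not the table from the question.
from sklearn.neighbors import KNeighborsClassifier

X_train = [[55, 9], [58, 11], [60, 12], [50, 8], [65, 13]]  # (weight kg, speed kmph)
y_train = ["Average", "Good", "Good", "Poor", "Good"]

knn = KNeighborsClassifier(n_neighbors=3)  # Euclidean (Minkowski p=2) by default
knn.fit(X_train, y_train)
print(knn.predict([[56, 10]]))  # majority class among the 3 nearest neighbours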
11. In a software project, the team is trying to identify the similarity of software defects
identified during testing. They wanted to create 5 clusters of similar defects based on the
text analytics of the defect descriptions. Once the 5 clusters of defects are identified, any
new defect created is to be classified as one of the types identified through clustering. Create
this approach through a neat diagram. Assume 20 Defect data points which are clustered
among 5 clusters and k-means algorithm was used.
To illustrate this approach, I’ll describe and create a diagram that visualizes the process of
clustering software defects using k-means, where we start with 20 defect data points and
group them into 5 clusters based on their similarities.
1. Clustering of Defects: Using the k-means algorithm, the defect descriptions are
grouped into 5 clusters based on their similarity (determined through text analytics,
like TF-IDF or word embeddings).
2. Classification of New Defects: Once the clusters are created, any new defect is
classified as one of these 5 types by determining the cluster to which it is closest.
[Diagram: 20 defect data points grouped by k-means into 5 clusters, with a new defect
assigned to the nearest cluster centroid.]
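As a rough sketch of this pipeline (the defect texts below are invented placeholders; TF-IDF and scikit-learn's KMeans are one possible realization of the text-analytics step):

# Sketch: cluster 20 defect descriptions into 5 groups with k-means over
# TF-IDF vectors, then assign a new defect to its nearest cluster.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

defects = [f"defect description {i}" for i in range(20)]  # stand-in for real text
vec = TfidfVectorizer()
X = vec.fit_transform(defects)

km = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X)
print("cluster labels:", km.labels_)

# A new defect is classified by the cluster whose centroid is closest.
new_defect = vec.transform(["new defect description"])
print("assigned cluster:", km.predict(new_defect)[0])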
12. a. Discuss the major drawbacks of K-nearest Neighbour algorithm and how it can be
corrected.
b. A sample from class-A is located at (X, Y, Z) = (1, 2, 3), a sample from class-B is at (7, 4, 5)
and a sample from class-C is at (6, 2, 1). How would a sample at (3, 4, 5) be classified using
the Nearest Neighbour technique and Euclidean distance?
• Computational Complexity:
• Drawback: k-NN stores the entire training set and must compute a distance to every
stored point at prediction time, which becomes slow for large datasets.
• Solution: Index structures such as KD-trees or ball trees, or approximate nearest-
neighbour search, can greatly reduce the cost of the distance lookups.
• Curse of Dimensionality:
• Drawback: When there are many features (high-dimensional data), the distances
between points tend to become similar, reducing the algorithm’s effectiveness.
• Solution: Dimensionality reduction techniques like Principal Component
Analysis (PCA) or t-SNE can help by reducing the number of features while
retaining important information. Alternatively, feature selection can help in reducing
irrelevant or redundant features.
• Sensitivity to Irrelevant Features:
• Drawback: k-NN does not automatically select the most important features, which
can lead to noisy or irrelevant features affecting the classification.
• Solution: Feature selection methods (like mutual information, correlation-based
feature selection) or feature scaling (like normalization) can help to focus the
algorithm on the most relevant features.
• Imbalanced Data:
• Drawback: k-NN can struggle with class imbalance. The algorithm may be biased
toward the majority class if most of the nearest neighbors are from it.
• Solution: Adjusting the distance measure or using weighted k-NN, where closer
neighbors have more influence on the decision, can help with imbalanced data. Also,
using techniques such as SMOTE (Synthetic Minority Over-sampling Technique) to
balance the dataset can improve performance.
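A short sketch of two of the remedies above, distance-weighted k-NN and SMOTE (SMOTE comes from the separate imbalanced-learn package, assumed installed; the synthetic data is illustrative):

# Sketch: distance-weighted k-NN plus SMOTE oversampling for class imbalance.
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier
from imblearn.over_sampling import SMOTE  # assumes imbalanced-learn is installed

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

# Closer neighbours get more influence than distant ones
knn = KNeighborsClassifier(n_neighbors=5, weights="distance").fit(X, y)

# Synthesize minority-class samples to rebalance the training set
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print("positives before:", int(y.sum()), "after:", int(y_res.sum()))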
For this part, we will use the Euclidean distance formula to determine the nearest class for the
point (3,4,5).
Given Points:
1. Class A: (1,2,3)
2. Class B: (7,4,5)
3. Class C: (6,2,1)
4. Query Point: (3,4,5)
The Euclidean distance d between two points (x1, y1, z1) and (x2, y2, z2) is given by:
d = √((x2 − x1)² + (y2 − y1)² + (z2 − z1)²)
Calculations:
• Distance to Class A (1, 2, 3): √((3−1)² + (4−2)² + (5−3)²) = √12 ≈ 3.46
• Distance to Class B (7, 4, 5): √((3−7)² + (4−4)² + (5−5)²) = √16 = 4.00
• Distance to Class C (6, 2, 1): √((3−6)² + (4−2)² + (5−1)²) = √29 ≈ 5.39
Since the smallest distance is 3.46 (to Class A), the sample at (3, 4, 5) would be classified
as Class A using the nearest neighbour technique.
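A quick NumPy check of these three distances:

# Sketch: verify the nearest-neighbour distances for the query point (3, 4, 5).
import numpy as np

query = np.array([3, 4, 5])
classes = {"A": np.array([1, 2, 3]),
           "B": np.array([7, 4, 5]),
           "C": np.array([6, 2, 1])}

dists = {c: float(np.linalg.norm(query - p)) for c, p in classes.items()}
print(dists)                                        # A: ~3.46, B: 4.0, C: ~5.39
print("nearest class:", min(dists, key=dists.get))  # -> A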
13. Using the Naive Bayes algorithm and the given word-frequency table (12 messages:
8 Normal, 4 Spam), classify the messages "Dear Friend" and "Friend Money" as Normal or
Spam.
Let's use the Naive Bayes algorithm to classify the two given messages based on whether they are
Normal or Spam.
Given Data
• Total messages: 12 (8 Normal, 4 Spam).
• P(Normal) = 8/12 ≈ 0.67
• P(Spam) = 4/12 ≈ 0.33
To avoid zero probabilities, we'll apply Laplace smoothing, assuming each word can appear at least
once in both Normal and Spam messages.
Let the smoothed likelihoods from the word-count table be P(Dear | Normal) = 0.45,
P(Friend | Normal) = 0.3, P(Dear | Spam) = 0.25, and P(Friend | Spam) = 0.17. For the
message "Dear Friend":
P(Normal) × P(Dear | Normal) × P(Friend | Normal) = 0.67 × 0.45 × 0.3 = 0.09045
P(Spam) × P(Dear | Spam) × P(Friend | Spam) = 0.33 × 0.25 × 0.17 = 0.014025
Since P(Normal | Dear, Friend) > P(Spam | Dear, Friend), the message with "Dear Friend" is
classified as Normal.
Since P(Spam | Friend, Money) > P(Normal | Friend, Money), the message with "Friend Money" is
classified as Spam.
Summary of Results
• "Dear Friend" → Normal
• "Friend Money" → Spam
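A small Python sketch of the same score computation (the likelihood values are the ones recovered above; tiny rounding differences from using exact rather than rounded priors are expected):

# Sketch: Naive Bayes scores for the message "Dear Friend".
p_prior = {"Normal": 8 / 12, "Spam": 4 / 12}
likelihood = {  # smoothed P(word | class), as used in the worked answer
    "Normal": {"Dear": 0.45, "Friend": 0.3},
    "Spam":   {"Dear": 0.25, "Friend": 0.17},
}

def score(cls, words):
    s = p_prior[cls]
    for w in words:
        s *= likelihood[cls][w]
    return s

msg = ["Dear", "Friend"]
n, s = score("Normal", msg), score("Spam", msg)
print(f"Normal: {n:.5f}  Spam: {s:.6f}  ->", "Normal" if n > s else "Spam")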
14. For the given transaction table, calculate the Support, Confidence, Lift, Leverage, and
Conviction of the association rule {Butter, Bread} ⇒ Milk.
To calculate the Support, Confidence, Lift, Leverage, and Conviction for the association
rule {Butter, Bread}⇒Milk, let's analyze the data provided in the table.
Transaction Analysis
From the transaction table: Support({Butter, Bread}) = 0.2, Support({Butter, Bread, Milk}) = 0.2,
and Support(Milk) = 0.4.
• Support = Support({Butter, Bread, Milk}) = 0.2
• Confidence = Support({Butter, Bread, Milk}) / Support({Butter, Bread}) = 0.2 / 0.2 = 1.0
• Lift = Confidence / Support(Milk) = 1.0 / 0.4 = 2.5
• Leverage = Support({Butter, Bread, Milk}) − Support({Butter, Bread}) × Support(Milk)
= 0.2 − 0.2 × 0.4 = 0.12
• Conviction = (1 − Support(Milk)) / (1 − Confidence) = (1 − 0.4) / (1 − 1.0) = 0.6 / 0,
which is undefined
Summary of Results
• Support: 0.2
• Confidence: 1.0
• Lift: 2.5
• Leverage: 0.12
• Conviction: Undefined (due to 100% confidence in this case)
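The same five metrics can be computed directly from the supports (a sketch; the support values are the ones used above):

# Sketch: rule metrics for {Butter, Bread} => Milk from the supports.
sup_ab  = 0.2   # Support({Butter, Bread})
sup_abc = 0.2   # Support({Butter, Bread, Milk})
sup_c   = 0.4   # Support(Milk)

support    = sup_abc
confidence = sup_abc / sup_ab
lift       = confidence / sup_c
leverage   = sup_abc - sup_ab * sup_c
# Conviction divides by (1 - confidence), so it is undefined at confidence = 1
conviction = (1 - sup_c) / (1 - confidence) if confidence < 1 else float("inf")

print(support, confidence, lift, round(leverage, 2), conviction)
# -> 0.2 1.0 2.5 0.12 inf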
15. Given a dataset with two features, (x1, x2) and two classes y = {-1, 1}. Suppose the
optimal separating hyperplane found by the SVM is defined by the equation 0.5x1
+0.75x2-1=0 Find the margin of the hyperplane.
The margin of an SVM hyperplane is defined as the distance between the hyperplane and the
closest points from either class, also known as the support vectors. For a hyperplane given by
w⋅x+b=0, where w is the weight vector and b is the bias, the margin M is calculated as:
M = 2 / ∥w∥
Step-by-Step Solution
1. Identify w and b from the hyperplane equation 0.5x1 + 0.75x2 − 1 = 0:
o w = (0.5, 0.75)
o b = −1
2. Calculate ∥w∥:
∥w∥ = √(0.5² + 0.75²) = √(0.25 + 0.5625) = √0.8125 ≈ 0.9014
3. Calculate the margin:
M = 2 / ∥w∥ = 2 / 0.9014 ≈ 2.218
Answer: The margin of the hyperplane is approximately 2.218.
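A two-line NumPy check of this margin:

# Sketch: margin of the hyperplane 0.5*x1 + 0.75*x2 - 1 = 0.
import numpy as np

w = np.array([0.5, 0.75])
print(2 / np.linalg.norm(w))  # M = 2 / ||w|| ~ 2.218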
16. For an SVM with the separating hyperplane x1 + x2 − 3 = 0, determine which of the
given labelled data points are support vectors.
To determine which data points are support vectors, we need to check the distance of each
point to the hyperplane. Support vectors lie on the margins, meaning they satisfy
yi(w⋅xi+b)= 1.
Given Information
• Hyperplane: x1 + x2 − 3 = 0, so w = (1, 1) and b = −3.
• Point (1, 2), y = 1: w⋅x + b = 1×1 + 1×2 − 3 = 0, so y(w⋅x + b) = 1 × 0 = 0
• Point (2, 1), y = 1: w⋅x + b = 1×2 + 1×1 − 3 = 0, so y(w⋅x + b) = 1 × 0 = 0
• Point (2, 3), y = −1: w⋅x + b = 1×2 + 1×3 − 3 = 2, so y(w⋅x + b) = −1 × 2 = −2
• Point (3, 3), y = −1: w⋅x + b = 1×3 + 1×3 − 3 = 3, so y(w⋅x + b) = −1 × 3 = −3
Conclusion
Support vectors are the points that satisfy y(w⋅x + b) = 1, i.e., that lie exactly on the
margin. Here, none of the points meets this exact condition, but the points closest to it,
those with y(w⋅x + b) = 0, lie directly on the decision boundary and are treated as the
support vectors in practice.
Thus, points (1, 2) with y = 1 and (2, 1) with y = 1 are likely the support vectors, as they
lie closest to the decision boundary x1 + x2 = 3.
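A short sketch that computes y(w⋅x + b) for each point, mirroring the calculations above:

# Sketch: functional margins for the boundary x1 + x2 - 3 = 0.
import numpy as np

w, b = np.array([1.0, 1.0]), -3.0
points = [((1, 2), 1), ((2, 1), 1), ((2, 3), -1), ((3, 3), -1)]

for x, y in points:
    print(x, y, "y(w.x+b) =", y * (w @ np.array(x) + b))
# (1,2) and (2,1) give 0: they sit on the boundary and act as the
# support vectors in practice.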
17.
To calculate the entropy of this dataset with respect to the target function classification, we
need to determine the proportions of positive and negative classifications.
• P(+) = 3/6 = 0.5
• P(−) = 3/6 = 0.5
The entropy is H = −(P(+)·log2 P(+) + P(−)·log2 P(−)). Applying this:
H = −(0.5·log2 0.5 + 0.5·log2 0.5)
Since log2 0.5 = −1:
H = −(0.5 × −1 + 0.5 × −1) = −(−0.5 − 0.5) = 1
Conclusion
The entropy of this collection of training examples with respect to the target function
classification is 1. This indicates a high level of uncertainty, as there is an equal distribution
of positive and negative classifications.
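A minimal sketch of the entropy computation:

# Sketch: entropy of a 3-positive / 3-negative collection.
import math

def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c)

print(entropy([3, 3]))  # -> 1.0, maximal uncertainty for two classes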
18. Consider a binary classification problem where we have 200 instances in total, evenly
distributed between two classes (100 instances per class). We build a decision tree that
perfectly classifies the training data without any errors. What is the Gini impurity of the final
leaf nodes of this decision tree?
The Gini impurity of a node in a decision tree measures the probability of misclassifying a
randomly chosen element from that node if it were randomly labeled according to the
distribution of labels in the node.
In this case, since the decision tree perfectly classifies the training data without any errors,
each final leaf node contains instances from only one class (either 100% class 1 or 100%
class 2). Therefore, for each leaf node:
The Gini impurity G for a node is calculated as:
G = 1 − Σi pi²
where pi is the proportion of instances belonging to class i in the node. In this case, p1 = 1
and p2 = 0 (or vice versa), so:
G = 1 − (1² + 0²) = 1 − 1 = 0
Conclusion:
The Gini impurity of the final leaf nodes of this perfectly classified decision tree is 0. This
reflects a pure node where all instances belong to a single class, indicating no impurity.
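The corresponding one-liner for Gini impurity (a sketch):

# Sketch: Gini impurity from class proportions.
def gini(proportions):
    return 1 - sum(p * p for p in proportions)

print(gini([1.0, 0.0]))  # -> 0.0 for a pure leaf, as derived above
print(gini([0.5, 0.5]))  # -> 0.5, the two-class maximum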
19. Suppose we have a dataset with 100 instances and 5 features. We decide to build a
decision tree classifier. During training, the algorithm splits the data based on the
feature that provides the best information gain at each node. If the tree has a depth of
4, how many nodes will the decision tree have in total?
For a complete binary decision tree of depth d, the total number of nodes is 2^(d+1) − 1.
With a depth of 4, the tree therefore has at most 2^5 − 1 = 31 nodes in total (16 leaf nodes
and 15 internal nodes).
20. A decision tree classifier learned from a fixed training set achieves 100% accuracy on
the test set. Which algorithms trained using the same training set is guaranteed to give
a model with 100% accuracy?
If a decision tree classifier achieves 100% accuracy on the test set, this suggests that the
data can be perfectly partitioned by the feature splits the decision tree has chosen.
However, the guarantee of achieving 100% accuracy with other algorithms depends on their
ability to make perfect partitions of the training data.
Algorithms that can reproduce the decision tree's exact partitioning of the feature space
are the ones likely to achieve similar accuracy under these conditions.
Important Caveat: Not all algorithms will achieve 100% accuracy on the test set, even if the
decision tree does. Algorithms like logistic regression or neural networks, which rely on
different underlying assumptions and structures, may not capture the same partitioning as a
decision tree, especially if the data isn't naturally linearly separable.
So, strictly speaking, no other algorithm trained on the same data is guaranteed to reach
100% accuracy; only models that can replicate the same perfect partitioning of the feature
space, such as another decision tree grown on the identical training set, can be expected to
match it.
21. Given a dataset with K binary value attributes (K>2) for a two-class classification
task. How will you estimate the number of parameters for learning a Naïve Bayes
Classifier and what will be the number?
To estimate the number of parameters for a Naïve Bayes classifier with K binary
attributes for a two-class classification task, we need to consider the following:
1. Class Probabilities:
Since it's a two-class problem, we need to estimate the probability of each class,
P(Y=1) and P(Y=0). This requires 1 parameter, since P(Y=0) = 1 − P(Y=1).
2. Conditional Probabilities for Each Attribute:
For each of the K binary attributes Xi where Xi∈{0,1}, we need to estimate:
o P(Xi=1∣Y=1) and P(Xi=0∣Y=1)
o P(Xi=1∣Y=0) and P(Xi=0∣Y=0)
Since each binary attribute requires 2 probabilities per class but only 1 unique parameter
(the other is its complement), we need K parameters per class, i.e., K × 2 = 2K parameters
for the conditional probabilities across both classes.
Total Number of Parameters: 1 + 2K
Example
For K = 3: 1 + 2×3 = 1 + 6 = 7
So, in general, for a Naïve Bayes classifier with K binary attributes and two classes, the
number of parameters required is 1+2K.
1. Describe the structure of an artificial neuron. What are its main components?
An artificial neuron typically consists of the following main components:
• Dendrites: Inputs or signals received from other neurons.
• Summation Junction (or Cell Body): This is where inputs are processed. It
sums the incoming signals.
• Activation Function: Determines if the neuron should be activated, based on
the aggregated input.
• Axon: Transmits the output signal to other neurons or to the output layer in a
network.
2. What is the function of a summation junction of a neuron? What is threshold
activation function?
The summation junction computes the weighted sum of incoming signals. If this sum
exceeds a specified threshold, the neuron activates and sends a signal to the next layer.
The threshold activation function fires the neuron only when the aggregated input crosses
that threshold.
3. What is a step function? What is the difference between step function and
threshold-based activation function?
A step function outputs a fixed value (often 0 or 1) based on whether the input
exceeds a certain threshold. The key difference between a step function and other
threshold-based activation functions is that the step function is binary and does not
vary gradually, while other functions like the sigmoid provide a smooth transition.
4. Why should activation functions be non-linear and (in most cases) differentiable?
Non-linear activation functions allow networks to learn complex patterns and
relationships within data. Differentiability is crucial for optimizing the model via
gradient descent, enabling the calculation of gradients for backpropagation.
5. What is the constraint of a simple perceptron? Why may it fail with a real-world
dataset?
A simple perceptron is limited to linearly separable data. It may fail on real-world
datasets that are non-linearly separable, such as those involving the XOR problem.
6. Explain the XOR problem in case of a simple perceptron.
The XOR problem illustrates that a simple perceptron cannot classify points
belonging to the XOR function, where the outputs are not linearly separable. A more
complex structure, such as a multi-layer perceptron, is required to handle this.
7. Explain the basic structure of a multi-layer perceptron. Explain how it can solve
the XOR problem.
A multi-layer perceptron consists of an input layer, one or more hidden layers, and an
output layer. By adding hidden layers, the MLP can learn non-linear mappings,
enabling it to solve the XOR problem by creating decision boundaries that correctly
classify the inputs.
8. What are some thumb rules that can be used for selecting activation functions?
Some thumb rules for selecting activation functions include:
• Use ReLU for hidden layers due to its efficiency and sparsity.
• Use sigmoid or softmax for the output layer in binary classification problems.
• Choose tanh for outputs expected to center around zero.
9. Show mathematically why the derivative of a Sigmoid function is very low when
the value of z is very large or very small.
The sigmoid function is σ(z) = 1/(1 + e^(−z)), and its derivative is σ′(z) = σ(z)(1 − σ(z)).
As z → +∞, σ(z) → 1, and as z → −∞, σ(z) → 0; in both cases the product σ(z)(1 − σ(z))
approaches zero. The function therefore saturates at 0 or 1 for large |z|, and the vanishing
gradient makes learning slow.
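A quick numerical check of this saturation (a sketch):

# Sketch: the sigmoid gradient sigma(z)*(1 - sigma(z)) vanishes for large |z|.
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

for z in [-10, -2, 0, 2, 10]:
    s = sigmoid(z)
    print(z, "gradient =", s * (1 - s))  # ~0 at z = -10 and z = 10, 0.25 at z = 0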
10. Write short notes on: (a) Single-layer feed forward ANN (b) Learning rate.
(a) A single-layer feedforward ANN consists of an input layer connected directly to
an output layer without any hidden layers, typically suited for linear problems.
(b) The learning rate is a hyperparameter that controls how much to change the model
parameters in response to the estimated error during each update in training.
11. Consider a fixed weight vector w and show that the input vector x that maximizes the
scalar product wTx, subject to the constraint that ∥x∥2 is constant, is given by x = αw for
some scalar α.
To show that the input vector x that maximizes the scalar product wTx, subject to the
constraint that ∥x∥2 is constant, is given by x=αw for some scalar α, we can proceed as
follows:
Problem Setup
Given:
• A fixed weight vector w.
• A constraint that ∥x∥2 is constant, say ∥x∥2 = c for some constant c.
We want to maximize the scalar product wTx subject to this constraint.
Solution
1. Formulate the Objective Function: We wish to maximize wTx with respect to x.
2. Set Up the Constraint: The constraint is ∥x∥2=c, which is equivalent to xTx=c.
3. Lagrange Function: To solve this constrained optimization problem, we can use the
method of Lagrange multipliers. Define the Lagrangian function:
L(x,λ)=wTx−λ(xTx−c)
where λ is the Lagrange multiplier associated with the constraint xTx=c.
4. Take the Gradient: To find the stationary points, we take the gradient of L with
respect to x and set it to zero:
∇xL=w−2λx=0
which implies:
w=2λx
or equivalently,
x = w/(2λ)
5. Determine λ Using the Constraint: Substitute x = w/(2λ) into the constraint xTx = c:
(w/(2λ))T(w/(2λ)) = c
wTw/(4λ²) = c
Solving for λ, we get:
λ = ±∥w∥/(2c^(1/2))
6. Solution for x: Substituting λ back into the expression x = w/(2λ), we get:
x = ±c^(1/2) w/∥w∥
which can be written as:
x = αw
where α = ±c^(1/2)/∥w∥ is a scalar.
Conclusion
Thus, the vector x that maximizes wTx subject to the constraint ∥x∥2=c is given by:
x=αw
for some scalar α = ±c^(1/2)/∥w∥; the maximum of wTx corresponds to the positive sign,
so the optimal x points in the same direction as w.
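A numerical sanity check of this result (a sketch with an arbitrary w and c):

# Sketch: x = sqrt(c) * w / ||w|| should beat random vectors of the same norm.
import numpy as np

rng = np.random.default_rng(0)
w, c = np.array([0.5, 0.75]), 4.0

x_star = np.sqrt(c) * w / np.linalg.norm(w)  # the claimed maximizer
best_random = max(
    w @ (np.sqrt(c) * v / np.linalg.norm(v))
    for v in rng.normal(size=(10000, 2))
)
print(w @ x_star, ">=", best_random)  # analytic optimum dominates the samples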
12. If an image I has J × K pixels and a filter K has L × M elements, a convolution is
defined by C(j, k) = Σl Σm I(j − l, k − m) K(l, m). (10.19) Write down the limits for the
summations in (10.19). Show that (10.19) can be written in the equivalent 'flipped' form
C(j, k) = Σl Σm I(j + l, k + m) K(l, m) and again write down the limits for the summations.
13. In mathematics, a convolution for a continuous variable x is defined by
F(x) = ∫ G(y) k(x − y) dy, where k(x − y) is the kernel function. By considering a discrete
approximation to the integral, explain the relationship to a convolutional layer, defined by
(10.19), in a CNN.
14. Consider an image of size J × K that is padded with an additional P pixels on all sides
and which is then convolved using a kernel of size M × M where M is an odd number.
Show that if we choose P = (M −1)/2, then the resulting feature map will have size J ×
K and hence will be the same size as the original image.
15. Show that if a kernel of size M×M is convolved with an image of size J×K with
padding of depth P and strides of length S then the dimensionality of the resulting
feature map is given by (10.5)
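As a sketch, assuming (10.5) denotes the standard output-size formula ⌊(J + 2P − M)/S⌋ + 1 per spatial dimension, a quick numeric check:

# Sketch: feature-map size for a J-wide input, padding P, kernel M, stride S.
def feature_map_size(j, p, m, s):
    return (j + 2 * p - m) // s + 1

print(feature_map_size(32, 1, 3, 1))  # "same" padding P=(M-1)/2 keeps size 32
print(feature_map_size(32, 0, 5, 2))  # -> 14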
16. For each of the 16 layers in the VGG-16 CNN shown in Figure 10.10, evaluate (i) the
number of weights (i.e., connections) including biases and (ii) the number of
independently learnable parameters. Confirm that the total number of learnable
parameters in the network is approximately 138 million.
17. In this exercise we use one-dimensional vectors to demonstrate why a convolutional
up-sampling is sometimes called a transpose convolution. Consider a one-dimensional
strided convolutional layer with an input having four units with activations (x1, x2, x3, x4),
which is padded with zeros to give (0, x1, x2, x3, x4, 0), and a filter with parameters
(w1, w2, w3). Write down the one-dimensional activation vector of the output layer
assuming a stride of 2. Express this output in the form of a matrix A multiplied by the
vector (0, x1, x2, x3, x4, 0). Now consider an up-sampling convolution in which the input
layer has activations (z1, z2) with a filter having values (w1, w2, w3) and an output stride
of 2. Write down the six-dimensional output vector assuming that overlapping filter values
are summed and that the activation function is just the identity. Show that this can be
expressed as a matrix multiplication using the transpose matrix A^T.