ML Interview Questions
1. What is the difference between supervised and unsupervised learning?
Key Concepts:
1. Supervised Learning:
Data Requirement: Supervised learning requires labeled data, which can be expensive; unsupervised learning can work with raw, unlabeled data, which is more abundant.
Key Points:
1. Sparse Data Distribution:
Key Characteristics:
1. High Variance:
The model has learned the noise and random fluctuations in the training
data, resulting in high variance.
2. Complex Model:
Overfit models are often overly complex, with too many parameters
relative to the amount of training data.
3. Poor Generalization:
The model performs well on the training data but fails to generalize to
unseen data, leading to poor performance in real-world scenarios.
Underfitting:
Underfitting occurs when a machine learning model is too simple to capture the
underlying structure of the data. As a result, the model performs poorly on both
the training data and new, unseen data. Underfitting typically indicates that the
model is not able to learn from the training data effectively.
Key Characteristics:
1. High Bias:
The model is too simplistic and unable to capture the underlying patterns
in the data, resulting in high bias.
2. Too Simple Model:
Underfit models are often too simple or have too few parameters to
adequately represent the complexity of the data.
3. Poor Performance:
The model performs poorly on both the training data and unseen data,
indicating a lack of ability to learn from the data effectively.
[Figure: overfitting vs. underfitting]
Key Characteristics:
1. Output Variable:
The output variable is categorical, taking one of a discrete set of class labels.
Regression:
Regression is also a type of supervised learning task, but unlike classification,
the goal is to predict a continuous numeric value. The output variable is
continuous, and the model learns to map input features to a continuous range of
output values.
Key Characteristics:
1. Output Variable:
The output variable is continuous, representing a real-valued quantity rather than a discrete class.
Key Differences:
1. Output Variable Type:
Classification is commonly used in tasks where the output is categorical
or involves making binary or multi-class decisions. Regression is used in
tasks where the output is continuous and involves predicting a quantity
or value.
[Figure: classification vs. regression]
Cross-validation ensures that every data point is used for both training and evaluation, maximizing the utilization of the available information.
1. K-Fold Cross-Validation:
The dataset is divided into k equal-sized folds. The model is trained and evaluated k times, each time using a different fold as the test set and the remaining folds as the training set.
2. Stratified K-Fold Cross-Validation:
Similar to k-fold cross-validation, but each fold preserves the overall class distribution, which is important for imbalanced classification problems.
3. Leave-One-Out Cross-Validation (LOOCV):
Each data point is used as the test set once, with the rest of the data used for training. This approach is computationally expensive but provides a less biased estimate of model performance, especially for small datasets.
[Figure: k-fold cross-validation]
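A minimal k-fold cross-validation sketch, assuming scikit-learn is available; the dataset and model here are arbitrary illustrative choices:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# 5-fold CV: each fold is used once as the test set, the rest as training data.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv)
print("Fold accuracies:", scores)
print("Mean accuracy:", scores.mean())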
Recall:
Recall, also known as sensitivity or true positive rate, is a measure of the
completeness of the positive predictions made by a classification model. It
represents the proportion of true positive predictions among all actual positive
instances in the dataset.
F1 Score:
The F1 score is the harmonic mean of precision and recall. It provides a balance
between precision and recall and is a single metric that summarizes the
performance of a classification model. F1 score reaches its best value at 1
(perfect precision and recall) and worst at 0.
Key Points:
Precision is important when the cost of false positives is high.
Recall is important when the cost of false negatives is high.
F1 score is useful when there is an uneven class distribution (imbalanced
dataset) as it considers both precision and recall.
High precision means that the model produces fewer false positives, while
high recall means that the model captures most positive instances.
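A minimal sketch of computing these metrics with scikit-learn (assumed available) on hypothetical toy labels:

from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

print("Precision:", precision_score(y_true, y_pred))  # TP / (TP + FP)
print("Recall:", recall_score(y_true, y_pred))         # TP / (TP + FN)
print("F1 score:", f1_score(y_true, y_pred))           # harmonic mean of the two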
In a confusion matrix, each row represents the actual class, while each column
represents the predicted class. The name "confusion matrix" is derived from the
fact that it makes it easy to see if the system is confusing two classes.
True Positives (TP): The number of instances that were correctly classified
as positive.
True Negatives (TN): The number of instances that were correctly
classified as negative.
False Positives (FP): The number of instances that were incorrectly
classified as positive (i.e., the model predicted positive when the actual class
was negative).
False Negatives (FN): The number of instances that were incorrectly
classified as negative (i.e., the model predicted negative when the actual
class was positive).
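A minimal sketch of building a confusion matrix and reading off TP, TN, FP, and FN, assuming scikit-learn and hypothetical toy labels:

from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

# Rows are actual classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()
print(cm)
print("TP:", tp, "TN:", tn, "FP:", fp, "FN:", fn)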
Key Concepts:
1. Bias-Variance Tradeoff:
Benefits of Regularization:
Prevents Overfitting: Regularization helps prevent overfitting by
discouraging the model from learning overly complex patterns that may not
generalize well to new data.
Improves Generalization: By finding a balance between bias and variance,
regularization improves the model's ability to generalize to unseen data.
Feature Selection: L1 regularization (Lasso) can perform automatic feature
selection by setting some coefficients to zero, which can simplify the model
and improve interpretability.
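A minimal sketch contrasting L2 (Ridge) and L1 (Lasso) regularization, assuming scikit-learn; the dataset and alpha values are arbitrary:

from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge, Lasso

X, y = make_regression(n_samples=200, n_features=20, n_informative=5, noise=10.0, random_state=0)

ridge = Ridge(alpha=1.0).fit(X, y)   # L2: shrinks all coefficients toward zero
lasso = Lasso(alpha=1.0).fit(X, y)   # L1: can set some coefficients exactly to zero

print("Non-zero Ridge coefficients:", (ridge.coef_ != 0).sum())
print("Non-zero Lasso coefficients:", (lasso.coef_ != 0).sum())  # feature selection effect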
Training Process:
Boosting:
Approach: Boosting involves iteratively training multiple weak learners
sequentially, where each subsequent model focuses on the instances that
were misclassified by the previous models. It assigns higher weights to
misclassified instances to emphasize their importance.
Training Process:
The first base model is trained on the entire training dataset.
Subsequent models are trained on modified versions of the training data,
where the weights of misclassified instances are increased.
The final prediction is made by combining the predictions of all base
models, weighted by their individual performance.
Example Algorithms:
Common boosting algorithms include AdaBoost, Gradient Boosting Machines (GBM), and XGBoost.
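As an illustration, here is a minimal AdaBoost sketch, assuming scikit-learn; the synthetic dataset and hyperparameters are arbitrary:

from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each new weak learner (a shallow tree by default) focuses on previously misclassified samples.
boost = AdaBoostClassifier(n_estimators=100, random_state=0)
boost.fit(X_train, y_train)
print("Test accuracy:", boost.score(X_test, y_test))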
Key Differences:
1. Training Process:
Key Concepts:
1. Hyperparameters:
Hyperparameters are settings or configurations that are not learned from
the data but rather specified by the practitioner before training the
model. Examples of hyperparameters include the learning rate in
gradient descent, the number of layers in a neural network, and the
depth of a decision tree.
2. Hyperparameter Search Space:
Specify the range or possible values for each hyperparameter that will
be considered during the optimization process.
3. Select an Optimization Method:
Common approaches include grid search, random search, and Bayesian optimization, which explore the search space in different ways. A grid search sketch is shown below.
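A minimal grid search sketch with cross-validation, assuming scikit-learn; the model, grid values, and dataset are arbitrary illustrative choices:

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

param_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}  # the search space
search = GridSearchCV(SVC(), param_grid, cv=5)                 # evaluates each combination with 5-fold CV
search.fit(X, y)

print("Best hyperparameters:", search.best_params_)
print("Best CV accuracy:", search.best_score_)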
Key Concepts:
1. Decision Making:
The decision tree algorithm recursively partitions the feature space into
subsets based on the values of input features, aiming to minimize
impurity or maximize homogeneity within each subset.
3. Interpretability:
Decision trees are easy to visualize and interpret, since each prediction can be traced through a sequence of simple feature-based rules.
4. Nonlinearity:
Decision trees can capture nonlinear relationships and interactions between features, making them suitable for complex datasets with nonlinear decision boundaries.
Purpose:
1. Classification:
In classification tasks, decision trees are used to predict the class label of
a sample by traversing the tree from the root node to a leaf node based
on the feature values of the sample.
2. Regression:
In regression tasks, decision trees predict a continuous value, typically the mean of the target values of the training samples that fall into the same leaf node.
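A minimal classification-tree sketch, assuming scikit-learn; the dataset and depth limit are arbitrary illustrative choices:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(max_depth=3, random_state=0)  # limiting depth curbs overfitting
tree.fit(X_train, y_train)

print("Test accuracy:", tree.score(X_test, y_test))
print(export_text(tree))  # the learned if/else rules, one per split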
Key Concepts:
1. Cost or Loss Function:
In machine learning, the cost or loss function measures the error
between the predicted values of the model and the actual values in the
training data. The goal of gradient descent is to minimize this error by
finding the optimal set of model parameters.
2. Gradient:
The gradient of the cost function is a vector that represents the direction
of the steepest ascent (positive gradient) or descent (negative gradient)
of the function at a specific point. It points towards the direction of the
greatest rate of increase of the function.
3. Learning Rate:
The learning rate is a hyperparameter that controls the size of the step taken in the direction of the negative gradient at each update. A learning rate that is too large can cause divergence, while one that is too small slows convergence.
Optimization Process:
1. Initialization:
Gradient descent starts with an initial guess for the model parameters
(weights and biases), usually chosen randomly or initialized with zeros.
2. Compute Gradient:
At each iteration, the gradient of the cost function with respect to the
model parameters is computed using techniques like backpropagation
for neural networks. This gradient indicates the direction of the steepest
increase in the cost function.
3. Update Parameters:
The parameters are updated by moving a small step in the direction of the negative gradient (parameter = parameter - learning_rate * gradient). Steps 2 and 3 are repeated until the cost function converges.
Role in Optimization:
Minimization of Cost Function: Gradient descent plays a crucial role in
optimizing machine learning models by minimizing the cost or loss function,
which measures the discrepancy between the predicted and actual values.
Parameter Updates: By iteratively adjusting the model parameters in the
direction of the negative gradient, gradient descent gradually converges
towards the optimal set of parameters that minimize the cost function.
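A minimal batch gradient descent sketch for simple linear regression with NumPy; the synthetic data, learning rate, and iteration count are arbitrary illustrative choices:

import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(100, 1))
y = 3.0 * X[:, 0] + 2.0 + rng.normal(0, 0.1, size=100)   # true slope 3, intercept 2

w, b = 0.0, 0.0          # initialization
lr = 0.5                 # learning rate
for _ in range(2000):
    y_pred = w * X[:, 0] + b
    error = y_pred - y
    grad_w = 2 * np.mean(error * X[:, 0])   # d(MSE)/dw
    grad_b = 2 * np.mean(error)             # d(MSE)/db
    w -= lr * grad_w                        # step opposite the gradient
    b -= lr * grad_b

print("Learned slope and intercept:", round(w, 2), round(b, 2))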
Convergence: Batch gradient descent converges more slowly, as it requires processing the entire dataset in each iteration; stochastic gradient descent converges faster, as it updates the parameters more frequently based on individual instances.
Stability: Batch gradient descent is more stable but can get stuck in local minima; stochastic gradient descent is less stable but can escape local minima due to its frequent updates.
Ensemble methods combine the predictions of multiple individual models to produce a stronger overall model. By aggregating the predictions of multiple models, ensemble methods can often achieve better performance than any single model alone.
Key Concepts:
1. Base Learners:
Base learners are individual models, typically of the same type or diverse
types, trained on different subsets of the training data or with different
algorithms.
2. Aggregation Method:
The predictions of the base learners are combined using an aggregation method such as majority voting (for classification) or averaging (for regression).
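A minimal voting-ensemble sketch, assuming scikit-learn; the base learners and synthetic dataset are arbitrary illustrative choices:

from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

ensemble = VotingClassifier(
    estimators=[("lr", LogisticRegression(max_iter=1000)),
                ("dt", DecisionTreeClassifier(random_state=0)),
                ("knn", KNeighborsClassifier())],
    voting="hard",   # majority vote over the base learners' predicted labels
)
ensemble.fit(X_train, y_train)
print("Ensemble test accuracy:", ensemble.score(X_test, y_test))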
16. Explain the concept of a neural network in machine
learning.
A neural network is a powerful machine learning model inspired by the structure
and functioning of the human brain. It consists of interconnected nodes called
neurons organized in layers. Neural networks are capable of learning complex
patterns and relationships in data, making them widely used for various tasks
such as classification, regression, and pattern recognition.
Key Concepts:
1. Neurons:
Neurons are the basic building blocks of a neural network. Each neuron
receives input signals, performs a computation, and produces an output
signal. The output signal is typically passed to the neurons in the next
layer.
2. Layers:
Neural networks are organized into layers, with each layer containing
one or more neurons. The three main types of layers are:
Input Layer: The first layer of the neural network that receives the
input data.
Hidden Layers: Intermediate layers between the input and output
layers where computations are performed. Deep neural networks
have multiple hidden layers.
Output Layer: The final layer of the neural network that produces the
output prediction or classification.
3. Weights and Biases:
Each connection between neurons has an associated weight, and each neuron has a bias term; these are the parameters that the network learns during training.
4. Feedforward and Backpropagation:
Feedforward is the process of passing the input data through the network
to produce predictions. Backpropagation is the process of updating the
weights and biases of the network based on the error between the
predicted and actual output, allowing the network to learn from its
mistakes.
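A minimal feedforward-network sketch trained with backpropagation via scikit-learn's MLPClassifier (assumed available); the dataset and layer sizes are arbitrary illustrative choices:

from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Two hidden layers; the weights and biases are updated by backpropagation during fit().
net = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500, random_state=0)
net.fit(X_train, y_train)
print("Test accuracy:", net.score(X_test, y_test))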
Applications:
Image Classification: CNNs are used for tasks such as object detection,
facial recognition, and image segmentation.
Natural Language Processing: RNNs and LSTM networks are used for
tasks such as text generation, machine translation, and sentiment analysis.
Speech Recognition: Neural networks are used to convert spoken
language into text, enabling applications like virtual assistants and voice-
controlled devices.
Financial Forecasting: Neural networks can analyze financial data to
predict stock prices, identify patterns in market trends, and make investment
decisions.
Generative and discriminative models are two fundamental approaches in
machine learning for modeling the probability distributions of data or making
predictions. They differ in their underlying principles and the tasks they are
suited for.
Generative Models:
Objective: Generative models aim to learn the joint probability distribution
(P(X, Y)) of input features (X) and corresponding labels or outputs (Y).
Discriminative Models:
Objective: Discriminative models aim to learn the conditional probability
distribution (P(Y|X)), which predicts the label or output (Y) given the input
features (X).
Tasks: Discriminative models are commonly used for tasks such as:
Classification.
Regression.
Ranking.
Named Entity Recognition.
Part-of-Speech Tagging.
Examples: Logistic Regression, Support Vector Machines (SVM), Decision
Trees, Random Forests, Gradient Boosting Machines (GBM), and Neural
Networks (in many cases).
Key Differences:
1. Data Distribution:
Generative models learn the joint probability distribution P(X, Y), whereas discriminative models learn only the conditional distribution P(Y|X).
2. Versatility:
Generative models are versatile and can be used for various tasks beyond classification and regression, such as data generation and semi-supervised learning. Discriminative models are primarily used for classification and regression tasks.
4. Complexity:
Reinforcement Learning:
Learning Process: In reinforcement learning, the agent learns by trial and
error through interaction with the environment. It takes actions based on its
current state, receives feedback in the form of rewards or penalties, and
adjusts its behavior to maximize long-term rewards.
Supervised Learning:
Learning Process: In supervised learning, the model learns to map input
data to output labels based on a labeled dataset provided during training.
The goal is to learn a mapping function that generalizes well to unseen data.
Key Differences:
1. Feedback Type:
Reinforcement learning relies on evaluative feedback (rewards or penalties) received after taking actions, whereas supervised learning relies on explicit labels provided for each training example.
Key Components:
1. States (S):
States represent the possible situations or configurations of the environment. The state space, denoted by S, contains all states the agent can occupy.
2. Actions (A):
Actions represent the choices available to the agent at each state. The
action space, denoted by A, contains all possible actions the agent can
take.
3. Transition Probabilities (P):
The transition probabilities P(s'|s, a) specify the probability of moving to state s' when the agent takes action a in state s.
Dynamics of an MDP:
At each time step t, the agent observes the current state s_t, selects an
action a_t according to its policy π, receives a reward r_t, and transitions to a
new state s_{t+1} according to the transition probabilities P(s_{t+1}|s_t,
a_t).
The goal of the agent is to learn an optimal policy π* that maximizes the
expected cumulative reward (return) over time.
Solving an MDP:
Dynamic Programming: Techniques like Value Iteration and Policy Iteration
can be used to compute the optimal value function and policy iteratively.
Monte Carlo Methods: Monte Carlo methods estimate the value function by
sampling episodes of the agent interacting with the environment and
averaging the observed returns.
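A minimal value iteration sketch with NumPy; the 3-state, 2-action MDP below is a hypothetical toy example with made-up transition probabilities and rewards:

import numpy as np

n_states, n_actions, gamma = 3, 2, 0.9

# P[a][s][s'] = transition probability; R[s][a] = immediate reward (made-up numbers).
P = np.array([
    [[0.8, 0.2, 0.0], [0.0, 0.9, 0.1], [0.0, 0.0, 1.0]],   # action 0
    [[0.1, 0.9, 0.0], [0.0, 0.1, 0.9], [0.0, 0.0, 1.0]],   # action 1
])
R = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0]])

V = np.zeros(n_states)
for _ in range(500):
    # Bellman optimality update: V(s) = max_a [ R(s, a) + gamma * sum_s' P(s'|s, a) V(s') ]
    Q = R + gamma * np.einsum("ast,t->sa", P, V)
    V_new = Q.max(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-8:
        break
    V = V_new

policy = Q.argmax(axis=1)   # greedy policy with respect to the converged value function
print("Optimal values:", V.round(3))
print("Optimal policy:", policy)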
Feature scaling should be applied whenever the scale or magnitude of
features varies significantly, or when using algorithms sensitive to feature
scales such as SVM, kNN, and algorithms based on gradient descent
optimization.
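A minimal feature scaling sketch with scikit-learn's StandardScaler (assumed available); the toy matrix is an arbitrary example of features on very different scales:

import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])  # features on very different scales

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)   # fit on training data, then transform

print(X_scaled)
print("Column means:", X_scaled.mean(axis=0))  # approximately 0
print("Column stds:", X_scaled.std(axis=0))    # approximately 1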
Kernel functions enable SVMs to implicitly map the input features from
the original space into a higher-dimensional feature space, where the
data may become linearly separable. This transformation allows SVMs to
handle nonlinear decision boundaries effectively.
2. Computational Efficiency:
The kernel trick computes inner products in the transformed feature space directly from the original inputs, so the SVM never has to construct the high-dimensional feature vectors explicitly.
1. Linear Kernel (K(x, y) = x^T y):
The linear kernel performs no transformation and corresponds to a linear decision boundary in the original input space.
2. Polynomial Kernel (K(x, y) = (x^T y + c)^d):
The polynomial kernel maps the data into a space of polynomial combinations of the original features, producing polynomial decision boundaries of degree d.
3. Sigmoid Kernel (K(x, y) = tanh(a x^T y + c)):
The sigmoid kernel maps the data into a feature space using hyperbolic tangent functions, suitable for problems with non-Gaussian distributions and complex decision boundaries.
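A minimal sketch comparing SVM kernels on a dataset that is not linearly separable, assuming scikit-learn; the dataset and hyperparameters are arbitrary illustrative choices:

from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=300, noise=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for kernel in ["linear", "poly", "rbf"]:
    clf = SVC(kernel=kernel, degree=3, C=1.0)   # degree only affects the polynomial kernel
    clf.fit(X_train, y_train)
    print(kernel, "kernel accuracy:", round(clf.score(X_test, y_test), 3))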
Training a GMM:
1. Initialization:
To train a GMM, initial values for the means, covariance matrices, and
mixing coefficients are typically chosen randomly or using a predefined
strategy.
2. Expectation-Maximization (EM) Algorithm:
The EM algorithm then alternates between an E-step, which computes the probability (responsibility) that each Gaussian component generated each data point, and an M-step, which updates the means, covariance matrices, and mixing coefficients to maximize the likelihood of the data. The two steps are repeated until convergence.
Applications of GMM:
1. Clustering:
GMMs can be used for clustering tasks, where they assign data points to
clusters based on their probability of belonging to each component. This
allows for soft clustering, where data points may belong to multiple
clusters with different probabilities.
2. Density Estimation:
GMMs can estimate the probability density function of the data, enabling
density-based anomaly detection, generation of synthetic data, and
visualization of data distributions.
3. Data Compression:
By summarizing the data with the parameters of a small number of Gaussian components, GMMs can provide a compact, compressed representation of the dataset.
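A minimal GMM sketch fitted with EM and used for soft clustering and density scoring, assuming scikit-learn; the blob dataset and component count are arbitrary illustrative choices:

from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=500, centers=3, random_state=0)

gmm = GaussianMixture(n_components=3, covariance_type="full", random_state=0)
gmm.fit(X)                            # parameters are estimated with the EM algorithm

soft_probs = gmm.predict_proba(X)     # responsibility of each component (soft clustering)
print("First point's component probabilities:", soft_probs[0].round(3))
print("Hard assignments of first 10 points:", gmm.predict(X)[:10])
print("Average log-likelihood per sample:", gmm.score(X))   # density estimate quality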
Centroid-based: K-means is centroid-based; hierarchical clustering is not.
Structure: K-means produces a flat partition of the data; hierarchical clustering produces a dendrogram (tree-like structure).
3. Differentiability:
Activation functions need to be differentiable so that gradients can be propagated backward through the network, allowing for effective optimization of the network's parameters (weights and biases).
4. Dealing with Vanishing/Exploding Gradients:
The choice of activation function affects how well gradients flow through deep networks; a poor choice can cause gradients to vanish or explode during training.
Common Activation Functions:
1. Sigmoid Function:
Sigmoid functions squish the input values into the range (0, 1), making them suitable for binary classification tasks. However, they suffer from the vanishing gradient problem.
2. Hyperbolic Tangent (Tanh) Function:
Tanh functions squash the input values into the range (-1, 1), offering a
stronger non-linearity compared to sigmoid functions.
3. Rectified Linear Unit (ReLU):
ReLU functions are piecewise linear and set negative inputs to zero while
leaving positive inputs unchanged. They are computationally efficient
and have become the default choice for many neural network
architectures.
4. Leaky ReLU:
Leaky ReLU allows a small, non-zero slope for negative inputs instead of setting them to zero, which helps avoid "dying" neurons.
5. Exponential Linear Unit (ELU):
ELU functions resemble ReLU functions for positive inputs but have an exponential component for negative inputs, which helps alleviate the vanishing gradient problem.
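A minimal NumPy sketch of these activation functions, evaluated on a small arbitrary input vector:

import numpy as np

def sigmoid(x):                  # squashes inputs into (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):                     # squashes inputs into (-1, 1)
    return np.tanh(x)

def relu(x):                     # zero for negative inputs, identity for positive inputs
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):   # small slope for negative inputs avoids "dead" neurons
    return np.where(x > 0, x, alpha * x)

def elu(x, alpha=1.0):           # exponential branch for negative inputs
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
for fn in (sigmoid, tanh, relu, leaky_relu, elu):
    print(fn.__name__, ":", fn(x).round(3))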
Benefits of Dropout:
1. Prevents Overfitting:
By randomly deactivating a fraction of neurons at each training step, dropout prevents the network from relying too heavily on any single neuron, which reduces overfitting.
2. Implicit Model Averaging:
Dropout can accelerate the training process by effectively training multiple subnetworks in parallel, leading to faster convergence and better optimization.
Non-parametric models learn from the training data itself, adapting their
complexity to fit the training data more closely. This allows them to
capture complex relationships in the data without making strong
assumptions about the underlying data distribution.
4. Potentially Higher Computational Cost:
29. Define the terms ROC curve and AUC in the context
of classification models.
In the context of classification models, the Receiver Operating Characteristic
(ROC) curve and the Area Under the ROC Curve (AUC) are commonly used
evaluation metrics for assessing the performance of binary classifiers.
ROC Curve:
1. Definition:
The ROC curve plots the true positive rate (TPR) on the y-axis against the
false positive rate (FPR) on the x-axis for various threshold values used
to classify instances as positive or negative.
3. Interpretation:
A classifier that performs well will have an ROC curve that hugs the
upper left corner of the plot, indicating high TPR and low FPR across
different threshold values. A random classifier would produce an ROC
curve that is close to the diagonal line (y = x).
4. Threshold Selection:
Each point on the ROC curve corresponds to a different classification threshold, so the curve can be used to select a threshold that trades off TPR against FPR for the application at hand.
AUC:
1. Definition:
The AUC, or Area Under the ROC Curve, quantifies the overall performance of a binary classification model across all possible threshold values. It represents the area under the ROC curve and ranges from 0 to 1.
2. Interpretation:
An AUC of 1 indicates a perfect classifier, an AUC of 0.5 indicates performance no better than random guessing, and values below 0.5 indicate performance worse than random.
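A minimal sketch of computing an ROC curve and AUC from predicted probabilities, assuming scikit-learn; the synthetic dataset and classifier are arbitrary illustrative choices:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, weights=[0.7, 0.3], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
probs = clf.predict_proba(X_test)[:, 1]          # probability of the positive class

fpr, tpr, thresholds = roc_curve(y_test, probs)  # one (FPR, TPR) point per threshold
print("AUC:", roc_auc_score(y_test, probs))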
Definition:
Cross-entropy loss quantifies the difference between the predicted probability
distribution and the true distribution of class labels. It penalizes incorrect
predictions by assigning higher loss to them, encouraging the model to make
more accurate predictions.
Interpretation:
Cross-entropy loss is minimized when the predicted probability distribution
closely matches the true distribution of class labels. In binary classification, the
loss is higher when the predicted probability diverges from the true label.
Similarly, in multiclass classification, the loss increases as the predicted
probability assigned to the true class decreases.
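A minimal sketch computing binary cross-entropy by hand and with scikit-learn's log_loss (assumed available); the labels and probabilities are arbitrary toy values:

import numpy as np
from sklearn.metrics import log_loss

y_true = np.array([1, 0, 1, 1, 0])
y_prob = np.array([0.9, 0.2, 0.7, 0.6, 0.1])   # predicted probability of class 1

# Average of -[y*log(p) + (1-y)*log(1-p)]; confident wrong predictions are penalized heavily.
manual = -np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))
print("Manual cross-entropy:", manual)
print("sklearn log_loss:", log_loss(y_true, y_prob))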
Applications:
Training Neural Networks: Cross-entropy loss is commonly used as the
objective function during the training of neural networks for classification
tasks. Minimizing the loss helps in adjusting the model parameters to
improve prediction accuracy.
Application:
L1 regularization (Lasso regression) is often preferred when feature selection
is desirable or when dealing with high-dimensional datasets with many
irrelevant features.
L2 regularization (Ridge regression) is commonly used when all features are
expected to contribute to the model's performance, or when the dataset
contains multicollinear features.
Key Concepts:
1. Dimensionality Reduction:
PCA reduces the number of features by projecting the data onto a smaller set of uncorrelated directions (principal components) that capture most of the variance in the data.
3. Explained Variance:
The principal components are ranked by the amount of variance they capture; the variance explained by each principal component is indicated by its corresponding eigenvalue.
4. Orthogonal Transformation:
The principal components are mutually orthogonal directions in the feature space, obtained from the eigenvectors of the data's covariance matrix (or via singular value decomposition).
Workflow:
1. Standardization:
PCA typically begins with standardizing the features to have zero mean
and unit variance. This ensures that features with larger scales do not
dominate the analysis.
2. Covariance Matrix Calculation:
The covariance matrix of the standardized features is computed, and its eigenvectors and eigenvalues are extracted to identify the principal components and the variance each one explains.
3. Projection:
Finally, PCA projects the original dataset onto the selected principal components to obtain the lower-dimensional representation of the data.
Applications:
1. Dimensionality Reduction:
PCA is widely used to reduce the number of features while retaining most of the information, which can speed up training and mitigate the curse of dimensionality.
2. Feature Extraction:
PCA can be used for feature extraction and engineering, where the principal components serve as new features that capture the most important patterns in the data.
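A minimal PCA sketch following this workflow (standardize, fit, project), assuming scikit-learn; the dataset and number of components are arbitrary illustrative choices:

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)
X_std = StandardScaler().fit_transform(X)   # zero mean, unit variance per feature

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_std)        # 4 features projected onto 2 principal components

print("Explained variance ratio:", pca.explained_variance_ratio_.round(3))
print("Reduced shape:", X_reduced.shape)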
Key Concepts:
1. Ensemble Learning:
Random Forest is an ensemble method: it trains many decision trees on different bootstrap samples of the data and combines their predictions to reduce variance.
Workflow:
1. Bootstrapping:
Random Forest randomly selects samples with replacement from the
training data to create multiple bootstrap samples. Each bootstrap
sample is used to train a decision tree.
2. Feature Subsetting:
At each split, only a random subset of the features is considered as candidate splitting variables, which decorrelates the individual trees.
3. Tree Growing:
Each decision tree is grown recursively by selecting the best split at each node based on a criterion such as Gini impurity (for classification) or mean squared error (for regression). The tree grows until a stopping criterion is met, such as reaching a maximum depth or minimum number of samples per leaf.
4. Voting or Averaging:
For classification tasks, the mode of the class labels predicted by all
trees is taken as the final output. For regression tasks, the mean
prediction of all trees is computed.
Applications:
1. Classification and Regression:
Advantages:
Random Forest is robust to overfitting, thanks to the averaging of multiple
trees and feature randomness.
It performs well on both classification and regression tasks and is less
sensitive to noisy data.
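A minimal Random Forest sketch with bootstrapping and feature subsetting, assuming scikit-learn; the dataset and hyperparameters are arbitrary illustrative choices:

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(
    n_estimators=200,        # number of bootstrapped trees
    max_features="sqrt",     # random feature subset considered at each split
    random_state=0,
)
forest.fit(X_train, y_train)
print("Test accuracy:", forest.score(X_test, y_test))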
Clustering is a fundamental unsupervised learning technique used to group
similar objects or data points together based on their inherent characteristics or
features. The goal of clustering is to partition the dataset into clusters, where
data points within the same cluster are more similar to each other than to those
in other clusters.
Key Concepts:
1. Unsupervised Learning:
Clustering is an unsupervised learning technique: it works on unlabeled data and discovers structure without predefined target labels.
2. Similarity Measure:
Clustering algorithms aim to group data points into clusters such that data points within the same cluster are more similar to each other and dissimilar to those in other clusters. The notion of similarity is defined based on the chosen distance metric or similarity measure.
3. Cluster Centroids or Prototypes:
Many clustering algorithms represent each cluster by a centroid or prototype that summarizes the data points assigned to it.
Types of Clustering Algorithms:
1. Partitioning Methods:
Methods such as K-means divide the data into a fixed number of non-overlapping clusters, typically by iteratively assigning points to the nearest centroid and updating the centroids.
2. Hierarchical Methods:
Hierarchical clustering algorithms build a hierarchy of clusters by iteratively merging or splitting clusters based on similarity measures. They produce a dendrogram that represents the cluster hierarchy.
3. Density-Based Methods:
Methods such as DBSCAN form clusters from regions of high point density, can discover clusters of arbitrary shape, and label points in sparse regions as noise.
Applications:
1. Customer Segmentation:
Key Concepts:
1. Binary Classification:
Logistic Regression is used for binary classification problems, where the target takes one of two classes; it models the probability of the positive class with the logistic (sigmoid) function.
Workflow:
1. Model Training:
The model learns a weight for each input feature (plus an intercept) by minimizing the cross-entropy loss on the labeled training data, typically with an optimizer such as gradient descent.
2. Prediction:
Once trained, the Logistic Regression model can predict the probability that a given input belongs to the positive class using the logistic function. By applying a threshold (e.g., 0.5), the predicted probabilities can be converted into binary class labels.
Applications:
1. Medical Diagnosis:
Advantages:
Logistic Regression is computationally efficient and easy to implement.
It provides interpretable results, allowing for the analysis of the contribution
of individual features to the classification decision.
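A minimal Logistic Regression sketch producing probabilities and thresholded class labels, assuming scikit-learn; the dataset and threshold are arbitrary illustrative choices:

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=5000)
clf.fit(X_train, y_train)

probs = clf.predict_proba(X_test)[:, 1]   # P(positive class | x) from the logistic function
labels = (probs >= 0.5).astype(int)       # apply a 0.5 threshold to get class labels
print("Test accuracy:", (labels == y_test).mean())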
Learning Type: KNN (K-Nearest Neighbors) is a supervised learning algorithm; K-means clustering is an unsupervised learning algorithm.
Parameter Selection: In KNN, the K value is the number of neighbors; in K-means, the K value is the number of clusters.
Handling Outliers: Both KNN and K-means are sensitive to outliers.
Application Areas: KNN is used for classification, regression, and recommendation systems; K-means is used for customer segmentation, image compression, and anomaly detection.
Sigmoid Function:
Purpose: The Sigmoid function is commonly used to produce binary output
in binary classification tasks or to squash the output of a neural network
layer to a range between 0 and 1.
Output Range: The output of the Sigmoid function is always between 0 and
1, making it suitable for binary classification tasks where the output
represents the probability of belonging to the positive class.
Softmax Function:
Purpose: The Softmax function is used to produce a probability distribution
over multiple classes in multi-class classification tasks, where the output
represents the probabilities of belonging to each class.
Output Range: The output of the Softmax function is a probability
distribution where each element is between 0 and 1, and the sum of all
elements equals 1.
Key Differences:
1. Output Range: Sigmoid outputs a single probability between 0 and 1 for
binary classification, while Softmax outputs a probability distribution over
multiple classes, ensuring that the sum of probabilities is 1.
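A minimal NumPy sketch contrasting a sigmoid output for a single binary score with a softmax distribution over arbitrary example logits:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    exp_z = np.exp(z - np.max(z))   # subtract the max for numerical stability
    return exp_z / exp_z.sum()

print("Sigmoid(0.8):", sigmoid(0.8))            # a single probability in (0, 1)

logits = np.array([2.0, 1.0, 0.1])
probs = softmax(logits)
print("Softmax:", probs.round(3), "sum =", probs.sum())   # a distribution summing to 1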
Example:
2. Load and Preprocess Data:
Load the dataset and perform any necessary preprocessing steps, such as handling missing values, encoding categorical variables, and scaling features.
3. Split Data:
Split the dataset into training and testing sets to evaluate the
performance of the algorithm.
4. Define Distance Metric:
Choose a distance metric (commonly Euclidean distance) to measure the similarity between data points.
7. Evaluate Model:
Use the testing set to evaluate the performance of the KNN algorithm. Calculate metrics such as accuracy, precision, recall, and F1-score to assess the model's performance.
8. Deploy Model:
Code:
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
import numpy as np
from sklearn.model_selection import train_test_split

# Load the iris dataset and split it into training and test sets.
iris_dataset = load_iris()
A_train, A_test, B_train, B_test = train_test_split(
    iris_dataset["data"], iris_dataset["target"], random_state=0)

# Fit a 1-nearest-neighbor classifier on the training data.
kn = KNeighborsClassifier(n_neighbors=1)
kn.fit(A_train, B_train)

# Predict the class of a new measurement and report the test accuracy.
A_new = np.array([[8, 2.5, 1, 1.2]])
prediction = kn.predict(A_new)
print("Predicted target value: {}\n".format(prediction))
print("Predicted target name: {}\n".format(
    iris_dataset["target_names"][prediction]))
print("Test score: {:.2f}".format(kn.score(A_test, B_test)))
Output:
Predicted target value: [0]
Predicted target name: ['setosa']
Momentum: Adam incorporates momentum-like behavior by using exponentially moving averages of gradients and squared gradients; RMSProp does not explicitly incorporate momentum, but it can be combined with momentum.
Usage: Adam is widely used and often recommended for training deep neural networks; RMSProp is commonly used as an alternative to Adam, particularly when computational resources are limited.
1. Tokenization:
Tokenization splits the raw text into smaller units such as words, subwords, or sentences that serve as the basic units for further processing.
2. Part-of-Speech Tagging:
Part-of-speech tagging assigns a grammatical category to each token, determining the syntactic roles of words and their relationships with other words in the sentence.
3. Phrase Structure Parsing:
Example:
Lemmatization:
Example:
3. Homoscedasticity:
The variance of the errors should be constant across all levels of the independent variables. This assumption ensures that the spread of the residuals remains consistent throughout the range of the predictor variables.
4. Normality of Errors:
The error terms are assumed to be normally distributed, which supports valid confidence intervals and hypothesis tests for the regression coefficients.
These assumptions are essential for the validity and reliability of the linear
regression model's results.
1. Identify the Minority Class:
First, identify the minority class in the dataset, i.e., the class with fewer instances compared to the majority class(es).
2. Select Samples:
For each sample in the minority class, find its k nearest neighbors. The
number of neighbors (k) is typically chosen based on a hyperparameter.
3. Generate Synthetic Samples:
For each minority class sample, randomly select one of its k nearest
neighbors. Then, generate a synthetic sample along the line connecting
the selected sample and its neighbor in the feature space.
4. Repeat:
Repeat steps 2 and 3 until the desired balance between the minority and
majority classes is achieved.
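A minimal SMOTE sketch using the imbalanced-learn library (assumed to be installed separately); the synthetic imbalanced dataset and k value are arbitrary illustrative choices:

from collections import Counter

from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
print("Before SMOTE:", Counter(y))

smote = SMOTE(k_neighbors=5, random_state=0)   # k nearest neighbors used to synthesize samples
X_res, y_res = smote.fit_resample(X, y)
print("After SMOTE:", Counter(y_res))          # classes are now balanced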
Advantages of SMOTE:
SMOTE balances the class distribution without simply duplicating existing minority samples, giving the classifier new, plausible minority-class examples and reducing the risk of overfitting to repeated points.
Considerations:
The choice of the number of nearest neighbors (k) and the strategy for
generating synthetic samples can affect the performance of SMOTE and
should be carefully tuned.
SMOTE may not be suitable for all types of datasets, particularly those with
complex class distributions or overlapping classes.
1. Ensemble Learning:
XGBoost is based on the ensemble learning paradigm, where multiple
models (weak learners) are combined to create a stronger model. It builds an
ensemble of decision trees sequentially, with each tree learning to correct
the errors of the previous ones.
2. Gradient Boosting:
XGBoost uses the gradient boosting framework, where each new tree is
trained to predict the gradient (residuals) of the loss function of the previous
trees. This approach focuses on minimizing the errors made by the ensemble
model, leading to incremental improvements in prediction accuracy.
3. Tree Boosting:
4. Regularization:
5. Parallel Processing:
7. Feature Importance:
8. Early Stopping:
XGBoost supports early stopping, a technique to prevent overfitting by
monitoring the performance of the model on a separate validation dataset
during training. Training stops when the performance on the validation set
stops improving.
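A minimal sketch of training XGBoost with early stopping on a validation set, assuming the xgboost library is installed; the synthetic dataset and parameter values are arbitrary illustrative choices:

import xgboost as xgb
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

dtrain = xgb.DMatrix(X_tr, label=y_tr)
dval = xgb.DMatrix(X_val, label=y_val)

params = {"objective": "binary:logistic", "max_depth": 3, "eta": 0.1, "lambda": 1.0}
booster = xgb.train(
    params,
    dtrain,
    num_boost_round=500,
    evals=[(dval, "validation")],
    early_stopping_rounds=20,   # stop if the validation loss does not improve for 20 rounds
    verbose_eval=False,
)
print("Best iteration:", booster.best_iteration)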