1. Explain the terms Artificial Intelligence (AI), Machine Learning (ML), and Deep Learning (DL).
Artificial Intelligence (AI) is the domain of producing intelligent machines. ML refers to systems that can learn from experience (training data), and Deep Learning (DL) refers to systems that learn from experience on large data sets. ML can be considered a subset of AI, and DL is ML applied to large data sets.
In summary, DL is a subset of ML, and both are subsets of AI.
Additional information: ASR (Automatic Speech Recognition) and NLP (Natural Language Processing) fall under AI and overlap with ML and DL, as ML is often used for NLP and ASR tasks.
2. What are the different types of Learning/ Training models in ML?
ML algorithms can be primarily classified depending on the presence/absence of
target variables.
A. Supervised learning [target is present]: The machine learns using labelled data. The model is trained on an existing data set before it starts making decisions on new data.
If the target variable is continuous: Linear Regression, Polynomial Regression, Quadratic Regression.
If the target variable is categorical: Logistic Regression, Naive Bayes, KNN, SVM, Decision Tree, Gradient Boosting, AdaBoost, Bagging, Random Forest, etc.
B. Unsupervised learning [target is absent]: The machine is trained on unlabelled data without any guidance. It automatically infers patterns and relationships in the data, for example by creating clusters. The model learns through observations and deduces structures in the data. Examples: Principal Component Analysis, Factor Analysis, Singular Value Decomposition, etc.
C. Reinforcement learning: The model learns through trial and error. This kind of learning involves an agent that interacts with the environment, takes actions, and then discovers the errors or rewards of those actions.
3. What is the difference between Machine Learning and Deep Learning?
Machine Learning involves algorithms that learn from patterns in data and then apply that learning to decision making. Deep Learning, on the other hand, is able to learn by processing data on its own and is quite similar to the human brain in the way it identifies something, analyses it, and makes a decision.
4. What is the main key difference between supervised and unsupervised machine
learning?
The supervised learning technique needs labelled data to train the model. For example, to solve a classification problem (a supervised learning task), you need labelled data to train the model and to classify the data into your labelled groups. Unsupervised learning does not need any labelled dataset. This is the main key difference between supervised learning and unsupervised learning.
6. There are many machine learning algorithms available today. Given a data set, how can one determine which algorithm to use?
The machine learning algorithm to be used depends purely on the type of data in the given dataset. If the data is linear, we use linear regression. If the data shows non-linearity, a bagging algorithm would do better. If the data is to be analyzed/interpreted for some business purpose, we can use decision trees or SVM. If the dataset consists of images, videos, or audio, then neural networks would help get an accurate solution.
So, there is no single metric to decide which algorithm should be used for a given situation or data set. We need to explore the data using EDA (Exploratory Data Analysis) and understand the purpose of the dataset to come up with the best-fit algorithm. So, it is important to study all the algorithms in detail.
18. Explain the handling of missing or corrupted values in the given dataset.
An easy way to handle missing or corrupted values is to drop the corresponding rows or columns. If too many rows or columns would have to be dropped, we consider replacing the missing or corrupted values with some new value.
Identifying missing values and dropping rows or columns can be done using the isnull() and dropna() functions in Pandas. The fillna() function in Pandas replaces the missing values with a placeholder value.
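A minimal Pandas sketch of these calls (the column names and the imputation choice are illustrative):
import pandas as pd
import numpy as np

df = pd.DataFrame({"age": [25, np.nan, 32], "salary": [50000, 60000, np.nan]})
print(df.isnull().sum())          # count missing values per column
df_dropped = df.dropna()          # drop rows containing any missing value
df_filled = df.fillna(df.mean())  # or impute, e.g. with the column mean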
19. What is Time series?
A Time series is a sequence of numerical data points in successive order. It tracks
the movement of the chosen data points, over a specified period of time and records
the data points at regular intervals. Time series doesn’t require any minimum or
maximum time input. Analysts often use time series analysis to examine data according to their specific requirements.
24. Explain the differences between Random Forest and Gradient Boosting machines.
Random forests: a random forest pools a large number of decision trees, combined at the end using averaging or majority voting. Each tree is created independently of the others. Random forests perform well for multiclass object detection.
Gradient boosting: gradient boosting machines also combine decision trees, but they build them sequentially, one tree at a time, rather than independently. Gradient boosting yields better outcomes than random forests if parameters are carefully tuned, but it is not a good option if the data set contains a lot of outliers/anomalies/noise, as it can result in overfitting of the model. Gradient boosting performs well when the data is not balanced, such as in real-time risk assessment.
25. What is a confusion matrix and why do you need it?
A confusion matrix (also called the error matrix) is a table that is frequently used to illustrate the performance of a classification model (i.e., a classifier) on a set of test data for which the true values are known.
It allows us to visualize the performance of an algorithm/model. It allows us to
easily identify the confusion between different classes. It is used as a
performance measure of a model/algorithm.
A confusion matrix is a summary of the predictions of a classification model. The numbers of right and wrong predictions are summarized with count values and broken down by each class label. It gives us information about the errors made by the classifier and also the types of errors made.
Support is a measure of how often the “item set” appears in the data set and
Confidence is a measure of how often a particular rule has been found to be true.
28. What is Marginalisation? Explain the process.
Marginalisation is summing the probability of a random variable X given joint
probability distribution of X with other variables. It is an application of the law
of total probability.
P(X = x) = Σ_y P(X = x, Y = y)
Given the joint probability P(X=x,Y), we can use marginalization to find P(X=x).
So, marginalisation finds the distribution of one random variable by summing over (exhausting) the possible values of the other random variables.
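A tiny numeric illustration in Python (the joint probabilities are made up for the example):
# Joint distribution P(X, Y) over X in {0, 1} and Y in {'a', 'b'}
joint = {(0, 'a'): 0.1, (0, 'b'): 0.2, (1, 'a'): 0.3, (1, 'b'): 0.4}

# Marginalise out Y: P(X = x) = sum over y of P(X = x, Y = y)
p_x = {}
for (x, y), p in joint.items():
    p_x[x] = p_x.get(x, 0.0) + p
print(p_x)   # {0: 0.3, 1: 0.7} (up to floating-point rounding)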
29. Explain the phrase “Curse of Dimensionality”.
The Curse of Dimensionality refers to the situation when your data has too many
features.
The phrase is used to express the difficulty of using brute force or grid search to
optimize a function with too many inputs.
It can also refer to several other issues like:
If we have more features than observations, we have a risk of overfitting the
model.
When we have too many features, observations become harder to cluster. Too many
dimensions cause every observation in the dataset to appear equidistant from all
others and no meaningful clusters can be formed.
Dimensionality reduction techniques like PCA come to the rescue in such cases.
30. What is Principal Component Analysis?
The idea here is to reduce the dimensionality of the data set by reducing the number of variables that are correlated with each other, while retaining the variation in the data to the maximum extent.
The variables are transformed into a new set of variables known as Principal Components. These PCs are the eigenvectors of the covariance matrix and are therefore orthogonal.
A data point that is considerably distant from the other similar data points is
known as an outlier. They may occur due to experimental errors or variability in
measurement. They are problematic and can mislead a training process, which
eventually results in longer training time, inaccurate models, and poor results.
The three methods to deal with outliers are:
Univariate method – looks for data points having extreme values on a single variable.
Multivariate method – looks for unusual combinations on all the variables.
Minkowski error – reduces the contribution of potential outliers in the training process.
33. What is the difference between regularization and normalisation?
Normalisation adjusts the data. If your data is on very different scales (especially low to high), you would want to normalise the data: alter each column so the columns have comparable basic statistics. This can help make sure there is no loss of accuracy.
Regularisation adjusts the prediction function. One of the goals of model training is to identify the signal and ignore the noise; if the model is given free rein to minimize error, there is a possibility of overfitting. Regularisation imposes some control on this by favouring simpler fitting functions over complex ones.
34. Explain the difference between Normalization and Standardization.
Normalization and Standardization are the two very popular methods used for feature
scaling.
Normalization refers to re-scaling the values to fit into a range of [0, 1]. Normalization is useful when all parameters need to have an identical positive scale; however, the outliers from the data set are lost.
Standardization refers to re-scaling data to have a mean of 0 and a standard deviation of 1 (unit variance).
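A short scikit-learn sketch contrasting the two (the toy data is illustrative):
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])

X_norm = MinMaxScaler().fit_transform(X)    # each column rescaled to [0, 1]
X_std = StandardScaler().fit_transform(X)   # each column rescaled to mean 0, std 1
print(X_norm)
print(X_std)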
35. List the most popular distribution curves along with scenarios where you will
use them in an algorithm.
The most popular distribution curves are as follows: Bernoulli Distribution, Uniform Distribution, Binomial Distribution, Normal Distribution, Poisson Distribution, and Exponential Distribution. Each of these distribution curves is used in various scenarios.
Bernoulli Distribution can be used to model outcomes with exactly two possibilities: whether a team will win a championship or not, whether a newborn child is male or female, whether you pass an exam or not, etc.
Uniform distribution is a probability distribution with a constant probability. Rolling a single die is one example, because it has a fixed number of equally likely outcomes.
Binomial distribution is a probability distribution where each trial has only two possible outcomes; the prefix ‘bi’ means two or twice. An example of this would be a coin toss: the outcome will either be heads or tails.
Normal distribution describes how the values of a variable are distributed. It is
typically a symmetric distribution where most of the observations cluster around
the central peak. The values further away from the mean taper off equally in both
directions. An example would be the height of students in a classroom.
Poisson distribution helps predict the probability of certain events happening when
you know how often that event has occurred. It can be used by businessmen to make
forecasts about the number of customers on certain days and allows them to adjust
supply according to the demand.
Exponential distribution is concerned with the amount of time until a specific
event occurs. For example, how long a car battery would last, in months.
36. How do we check the normality of a data set or a feature?
Visually, we can check it using plots (for example, a histogram or a Q-Q plot). There is also a list of formal normality tests:
Shapiro-Wilk W Test
Anderson-Darling Test
Martinez-Iglewicz Test
Kolmogorov-Smirnov Test
D’Agostino Skewness Test
Linear relationship
Multivariate normality
No or little multicollinearity
No auto-correlation
Homoscedasticity
41. When does the linear regression line stop rotating or finds an optimal spot
where it is fitted on data?
The line comes to rest at the point where the highest R-squared value is found. R-squared represents the amount of variance captured by the fitted linear regression line with respect to the total variance in the dataset.
42. Why is logistic regression a type of classification technique and not a
regression? Name the function it is derived from?
Since the target column is categorical, logistic regression uses a linear combination of the inputs to model the log-odds, and passes it through the logistic (sigmoid) function so that the output can be used as a class probability. Hence, it is a type of classification technique and not a regression. It is derived from the logistic (sigmoid) function.
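A minimal sketch of this idea (the weights below are purely illustrative):
import numpy as np

def sigmoid(z):
    # squashes the linear output into a probability in (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

w, b = np.array([0.8, -0.4]), 0.1   # illustrative coefficients
x = np.array([2.0, 1.0])
p = sigmoid(w @ x + b)              # predicted probability of the positive class
label = int(p >= 0.5)               # threshold to obtain a class label
print(p, label)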
43. What could be the issue when the beta value for a certain variable varies way
too much in each subset when regression is run on different subsets of the given
dataset?
Variation in the beta values across subsets implies that the dataset is heterogeneous. To overcome this problem, we can use a different model for each of the dataset’s clustered subsets, or a non-parametric model such as decision trees.
44. What does the term Variance Inflation Factor mean?
Variance Inflation Factor (VIF) is the ratio of the variance of a coefficient in the full model (with all predictors) to its variance in a model containing only that single independent variable. VIF estimates the amount of multicollinearity in a set of multiple regression variables.
VIF = (variance of the model with multiple terms) / (variance of the model with one independent variable)
45. Which machine learning algorithm is known as the lazy learner, and why is it
called so?
KNN is a Machine Learning algorithm known as a lazy learner. K-NN is a lazy learner because it doesn’t learn any model parameters from the training data; instead it memorizes the training dataset and dynamically calculates distances every time it needs to classify a new point.
Machine Learning Interview Questions for Experienced
We know what the companies are looking for, and with that in mind, we have prepared
the set of Machine Learning interview questions an experienced professional may be
asked. So, prepare accordingly if you wish to ace the interview in one go.
46. Is it possible to use KNN for image processing?
Yes, it is possible to use KNN for image processing. It can be done by converting
the 3-dimensional image into a single-dimensional vector and using the same as
input to KNN.
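A sketch of the idea with scikit-learn (the image shapes and labels are illustrative):
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Suppose we have 100 small RGB images of shape (8, 8, 3) with binary labels
images = np.random.rand(100, 8, 8, 3)
labels = np.random.randint(0, 2, size=100)

X = images.reshape(len(images), -1)   # flatten each image into a 1-D feature vector
knn = KNeighborsClassifier(n_neighbors=3).fit(X, labels)
print(knn.predict(X[:5]))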
47. Differentiate between K-Means and KNN algorithms?
KNN is a supervised learning algorithm, whereas K-Means is unsupervised. With KNN, we predict the label of an unidentified element based on its nearest neighbours, and we extend this approach to solve classification/regression problems. With K-Means there are no labels (no target variable), so we try to cluster the data based on the coordinates (features) of the points.
K-fold
Stratified K-fold
Leave-one-out
Bootstrapping
Random search CV
Grid search CV
57. Is it possible to test for the probability of improving model accuracy without
cross-validation techniques? If yes, please explain.
Yes, it is possible to test for the probability of improving model accuracy without cross-validation techniques. We can do so by running the ML model for, say, n iterations and recording the accuracy of each run. Plot all the accuracies and discard the 5% of extreme, low-probability values, then measure the left [low] cut-off and right [high] cut-off. With 95% confidence, we can say that the model’s accuracy will lie between those cut-off points.
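A rough sketch of this idea, using repeated train/test splits to build an accuracy distribution (the dataset and model choices are illustrative):
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
accuracies = []
for seed in range(100):   # n = 100 iterations
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=seed)
    model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    accuracies.append(model.score(X_te, y_te))

low, high = np.percentile(accuracies, [2.5, 97.5])   # central 95% of the accuracy distribution
print(f"Accuracy expected between {low:.3f} and {high:.3f}")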
58. Name a popular dimensionality reduction algorithm.
Popular dimensionality reduction algorithms are Principal Component Analysis and
Factor Analysis. Principal Component Analysis creates one or more index variables from a larger set of measured variables. Factor Analysis is a model for the measurement of a latent variable: the latent variable cannot be measured with a single variable and is instead seen through the relationships it causes in a set of observed (y) variables.
59. How can we use a dataset without the target variable into supervised learning
algorithms?
Input the data set into a clustering algorithm, generate optimal clusters, and label the cluster numbers as the new target variable. The dataset then has both independent variables and a target variable, so it is ready to be used with supervised learning algorithms.
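A small sketch of this trick with scikit-learn (the number of clusters and the models are illustrative):
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier

X, _ = load_iris(return_X_y=True)   # pretend no target variable exists
pseudo_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# The cluster ids now act as the target variable for a supervised learner
clf = RandomForestClassifier(random_state=0).fit(X, pseudo_labels)
print(clf.score(X, pseudo_labels))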
60. List all types of popular recommendation systems? Name and explain two
personalized recommendation systems along with their ease of implementation.
Popularity-based recommendation, content-based recommendation, user-based collaborative filtering, and item-based recommendation are the popular types of recommendation systems.
Personalized recommendation systems include content-based recommendation, user-based collaborative filtering, and item-based recommendation; of these, user-based collaborative filtering and item-based recommendation are more personalized. Item-based recommendation is also easy to maintain, since the item–item similarity matrix can be maintained easily.
61. How do we deal with sparsity issues in recommendation systems? How do we
measure its effectiveness? Explain.
Singular value decomposition can be used to generate the prediction matrix. RMSE is
the measure that helps us understand how close the prediction matrix is to the
original matrix.
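A minimal NumPy sketch: approximate a sparse ratings matrix with a truncated SVD and measure RMSE against the known entries (the matrix and the rank k are illustrative):
import numpy as np

R = np.array([[5, 3, 0, 1],
              [4, 0, 0, 1],
              [1, 1, 0, 5],
              [0, 0, 5, 4]], dtype=float)     # 0 = missing rating

U, s, Vt = np.linalg.svd(R, full_matrices=False)
k = 2                                          # keep the top-k latent factors
R_hat = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]  # prediction matrix

mask = R > 0                                   # evaluate only on observed ratings
rmse = np.sqrt(np.mean((R[mask] - R_hat[mask]) ** 2))
print(round(rmse, 3))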
62. Name and define techniques used to find similarities in the recommendation
system.
Pearson correlation and cosine similarity are techniques used to find similarities in recommendation systems.
63. State the limitations of Fixed Basis Function.
Linear separability in feature space doesn’t imply linear separability in input space. So, inputs are non-linearly transformed using vectors of basis functions with increased dimensionality. Limitations of fixed basis functions are:
Non-Linear transformations cannot remove overlap between two classes but they can
increase overlap.
Often it is not clear which basis functions are the best fit for a given task, so learning the basis functions can be preferable to using fixed basis functions.
If we want to use only fixed ones, we can use a lot of them and let the model
figure out the best fit but that would lead to overfitting the model thereby making
it unstable.
64. Define and explain the concept of Inductive Bias with some examples.
Inductive Bias is a set of assumptions that humans use to predict outputs given
inputs that the learning algorithm has not encountered yet. When we are trying to
learn Y from X and the hypothesis space for Y is infinite, we need to reduce the
scope by our beliefs/assumptions about the hypothesis space which is also called
inductive bias. Through these assumptions, we constrain our hypothesis space and also gain the capability to incrementally test and improve on the data using hyper-parameters. Examples: linear regression assumes the output is a linear function of the inputs; k-nearest neighbours assumes that points close to each other in feature space belong to the same class.
The metric used to assess the performance of a classification model is the confusion matrix. It can be further interpreted with the following terms:
True Positives (TP) – These are the correctly predicted positive values. It implies
that the value of the actual class is yes and the value of the predicted class is
also yes.
True Negatives (TN) – These are the correctly predicted negative values. It implies
that the value of the actual class is no and the value of the predicted class is
also no.
False Positives (FP) and False Negatives (FN) – these values occur when the actual class contradicts the predicted class.
Recall, also known as Sensitivity, is the ratio of true positives to all observations in the actual positive class: Recall = TP/(TP+FN)
Precision, also known as the positive predictive value, measures how many of the positives the model predicted are actually positive: Precision = TP/(TP+FP)
Accuracy is the most intuitive performance measure; it is simply the ratio of correctly predicted observations to the total observations: Accuracy = (TP+TN)/(TP+FP+FN+TN)
F1 Score is the harmonic mean of Precision and Recall: F1 = 2 × (Precision × Recall)/(Precision + Recall). Therefore, this score takes both false positives and false negatives into account. Intuitively it is not as easy to understand as accuracy, but F1 is usually more useful than accuracy, especially if you have an uneven class distribution. Accuracy works best if false positives and false negatives have a similar cost; if their costs are very different, it’s better to look at both Precision and Recall.
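These quantities can be computed directly with scikit-learn (the labels below are illustrative):
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score, accuracy_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tp, tn, fp, fn)
print(precision_score(y_true, y_pred), recall_score(y_true, y_pred),
      f1_score(y_true, y_pred), accuracy_score(y_true, y_pred))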
68. Plot validation score and training score with data set size on the x-axis and
another plot with model complexity on the x-axis.
For high bias in the models, the performance of the model on the validation data
set is similar to the performance on the training data set. For high variance in
the models, the performance of the model on the validation set is worse than the
performance on the training set.
69. What is Bayes’ Theorem? State at least 1 use case with respect to the machine
learning context?
Bayes’ Theorem describes the probability of an event, based on prior knowledge of
conditions that might be related to the event. For example, if cancer is related to
age, then, using Bayes’ theorem, a person’s age can be used to more accurately
assess the probability that they have cancer than can be done without the knowledge
of the person’s age. The chain rule for Bayesian probability can be used to predict the likelihood of the next word in a sentence.
70. What is Naive Bayes? Why is it Naive?
Naive Bayes classifiers are a family of classification algorithms based on Bayes’ theorem. These algorithms share a common principle: every pair of features is treated as independent, given the class.
Naive Bayes is considered naive because the attributes in it (for a given class) are assumed to be independent of the others in the same class. This assumed lack of dependence between attributes of the same class is what creates the quality of naiveness.
71. Explain how a Naive Bayes Classifier works.
Naive Bayes classifiers are a family of algorithms which are derived from the Bayes
theorem of probability. It works on the fundamental assumption that every set of
two features that is being classified is independent of each other and every
feature makes an equal and independent contribution to the outcome.
72. What do the terms prior probability and marginal likelihood in context of Naive
Bayes theorem mean?
Prior probability is the proportion of each class of the dependent variable in the data set. For example, if you are given a dataset where the dependent variable is either 1 or 0, and 65% of the values are 1 and 35% are 0, then the prior probability that any new input for that variable is 1 would be 65%.
Marginal likelihood is the denominator of the Bayes equation; it ensures that the posterior probability is a valid probability distribution (its total area is 1).
73. Explain the difference between Lasso and Ridge?
Lasso (L1) and Ridge (L2) are regularization techniques in which we penalize the coefficients to find the optimum solution. In ridge, the penalty is the sum of the squares of the coefficients, while in lasso we penalize the sum of the absolute values of the coefficients. Another regularization method is ElasticNet, which uses a hybrid of the lasso and ridge penalty functions.
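A quick scikit-learn sketch of the three penalties (the alpha values are illustrative):
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge, Lasso, ElasticNet

X, y = load_diabetes(return_X_y=True)

ridge = Ridge(alpha=1.0).fit(X, y)                     # L2: sum of squared coefficients
lasso = Lasso(alpha=0.1).fit(X, y)                     # L1: sum of absolute coefficients
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)   # hybrid of L1 and L2
print(lasso.coef_)   # L1 can drive some coefficients exactly to zero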
74. What’s the difference between probability and likelihood?
Probability is the measure of the likelihood that an event will occur that is, what
is the certainty that a specific event will occur? Where-as a likelihood function
is a function of parameters within the parameter space that describes the
probability of obtaining the observed data.So the fundamental difference is,
Probability attaches to possible results; likelihood attaches to hypotheses.
76. Model accuracy or Model performance? Which one will you prefer and why?
This is a trick question, one should first get a clear idea, what is Model
Performance? If Performance means speed, then it depends upon the nature of the
application, any application related to the real-time scenario will need high speed
as an important feature. Example: The best of Search Results will lose its virtue
if the Query results do not appear fast.
If “performance” hints at why accuracy is not the most important virtue: for any imbalanced data set, an F1 score will explain the business case better than accuracy, and when the data is imbalanced, precision and recall matter more than the rest.
77. List the advantages and limitations of the Temporal Difference Learning Method.
The Temporal Difference (TD) Learning method is a mix of the Monte Carlo method and dynamic programming. It can learn online, before the final outcome of an episode is known. Some of its limitations include:
It gives biased estimates.
It is more sensitive to initialization.
Univariate visualization
Bivariate visualization
Multivariate visualization
80. Mention why feature engineering is important in model building and list
out some of the techniques used for feature engineering.
Algorithms necessitate features with some specific characteristics to work
appropriately. The data is initially in a raw form. You need to extract features
from this data before supplying it to the algorithm. This process is called feature
engineering. When you have relevant features, the complexity of the algorithms
reduces. Then, even if a non-ideal algorithm is used, results come out to be
accurate.
Feature engineering primarily has two goals:
Prepare the suitable input data set to be compatible with the machine learning
algorithm constraints.
Enhance the performance of machine learning models.
Some of the techniques used for feature engineering include imputation, binning, outlier handling, log transform, grouping operations, one-hot encoding, feature split, scaling, and date extraction.
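A tiny Pandas sketch of a few of these techniques (the column names and bin edges are illustrative):
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [23, 45, np.nan, 67],
                   "city": ["NY", "SF", "NY", "LA"],
                   "income": [40000, 85000, 62000, 120000]})

df["age"] = df["age"].fillna(df["age"].median())     # imputation
df["age_bin"] = pd.cut(df["age"], bins=[0, 30, 50, 100],
                       labels=["young", "mid", "senior"])   # binning
df["log_income"] = np.log(df["income"])              # log transform
df = pd.get_dummies(df, columns=["city"])            # one-hot encoding
print(df.head())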
81. Differentiate between Statistical Modeling and Machine Learning?
Machine learning models are about making accurate predictions about situations, such as footfall in restaurants or stock prices, whereas statistical models are designed for inference about the relationships between variables, such as what drives the sales in a restaurant: the food or the ambience.
82. Differentiate between Boosting and Bagging?
Bagging and Boosting are variants of Ensemble Techniques.
Bootstrap Aggregation, or bagging, is a method used to reduce the variance of algorithms that have very high variance. Decision trees are a particular family of classifiers that are susceptible to high variance.
Decision trees are very sensitive to the data they are trained on, so generalization is often hard to achieve despite heavy fine-tuning, and the results vary greatly if the training data is changed.
Hence, bagging is utilised: multiple decision trees are built, each trained on a bootstrap sample of the original data, and the final result is the average (or majority vote) of all these individual models.
Boosting is the process of using a sequence of n weak classifiers for prediction, such that every weak classifier compensates for the weaknesses of its predecessors. By weak classifier, we mean a classifier that performs only slightly better than random guessing on a given data set.
It’s evident that boosting is not an algorithm but rather a process. The weak classifiers used are generally logistic regression, shallow decision trees, etc.
There are many algorithms that make use of the boosting process; the most commonly used are AdaBoost, Gradient Boosting, and XGBoost.
83. What is the significance of Gamma and Regularization in SVM?
Gamma defines how far the influence of a single training example reaches: low values mean ‘far’ and high values mean ‘close’. If gamma is too large, the radius of the area of influence of the support vectors only includes the support vectors themselves, and no amount of regularization with C will be able to prevent overfitting. If gamma is very small, the model is too constrained and cannot capture the complexity of the data.
The regularization parameter C serves as a degree of importance given to misclassifications and can be used to control the trade-off with overfitting.
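A short scikit-learn sketch showing the two knobs (the parameter values are illustrative):
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Large gamma + large C -> very wiggly boundary, risk of overfitting
# Small gamma + small C -> very smooth boundary, risk of underfitting
for gamma, C in [(1e-4, 1.0), (1e-2, 10.0)]:
    clf = SVC(kernel="rbf", gamma=gamma, C=C).fit(X_tr, y_tr)
    print(gamma, C, round(clf.score(X_te, y_te), 3))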
84. Define the ROC curve and how it works.
The graphical representation of the contrast between true positive rates and the
false positive rate at various thresholds is known as the ROC curve. It is used as
a proxy for the trade-off between true positives vs the false positives.
Disadvantages:
Addition and deletion of records is time consuming even though we get the element
of interest immediately through random access. This is due to the fact that the
elements need to be reordered after insertion or deletion.
If contiguous blocks of memory are not available in the memory, then there is an
overhead on the CPU to search for the most optimal contiguous location available
for the requirement.
Now that we know what arrays are, we shall understand them in detail by solving
some interview questions. Before that, let us see the functions that Python as a
language provides for arrays, also known as, lists.
append() – adds an element at the end of the list
copy() – returns a copy of the list
reverse() – reverses the elements of the list
sort() – sorts the elements in ascending order by default
96. What are Lists in Python?
# Deep copy: nested objects are copied as well
from copy import deepcopy

a = [1, 2]
b = [a, a]        # there's only 1 object a, referenced twice
c = deepcopy(b)   # c holds a fresh copy, independent of the original a
# Count the jumps needed, working greedily from the right (assumes arr is defined)
n = len(arr)
right = prev_r = n - 1
count = 0
# We start from the rightmost index and traverse the array to find the leftmost index
# from which we can reach index 'right'
while True:
    for j in range(prev_r - 1, -1, -1):
        if j + arr[j] >= prev_r:
            right = j
    if prev_r != right:
        prev_r = right
    else:
        break
    count += 1
98. Given a string S consisting only ‘a’s and ‘b’s, print the last index of the ‘b’
present in it.
When we are given a string of a’s and b’s, we can immediately find the first location at which a character occurs. Therefore, to find the last occurrence of a character, we reverse the string and find the first occurrence, which is equivalent to the last occurrence in the original string.
Here, we are given input as a string. Therefore, we begin by splitting the
characters element wise using the function split. Later, we reverse the array, find
the first occurrence position value, and get the index by finding the value len –
position -1, where position is the index value.
def split(word):
    return [char for char in word]

a = input()
a = split(a)
a_rev = a[::-1]
pos = -1
for i in range(len(a_rev)):
    if a_rev[i] == 'b':
        pos = len(a_rev) - i - 1
        print(pos)
        break
    else:
        continue
if pos == -1:
    print(-1)
99. Rotate the elements of an array by d positions to the left. Let us initially
look at an example.
A = [1,2,3,4,5]
A <<2
[3,4,5,1,2]
A<<3
[4,5,1,2,3]
There exists a pattern here: the first d elements are moved behind the remaining n-d elements. Therefore, we could just swap the two slices. Correct? But what if the size of the array is huge, say 10,000 elements? There are chances of memory errors, run-time errors, etc. Therefore, in the case of large arrays we do it more carefully and rotate the elements one by one in order to prevent the above errors.
# Rotate all the elements left by 1 position
def rot_left_once(arr):
    n = len(arr)
    tmp = arr[0]
    for i in range(n - 1):   # indices 0 .. n-2
        arr[i] = arr[i + 1]
    arr[n - 1] = tmp
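To rotate by d positions, the single-step helper above can simply be applied d times; a small wrapper (the name rot_left is illustrative):
def rot_left(arr, d):
    # apply the single left rotation d times
    for _ in range(d):
        rot_left_once(arr)

A = [1, 2, 3, 4, 5]
rot_left(A, 2)
print(A)   # [3, 4, 5, 1, 2]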
Therefore, let us start with the extreme elements and move towards the centre.
n = int(input())
arr = [int(i) for i in input().split()]
left, right = [arr[0]], [0] * n
# left = [arr[0]]
# right = [0, 0, ..., 0]  (n terms)
right[n - 1] = arr[-1]   # rightmost element
# We use two arrays, left[] and right[], which keep track of the running maximum element
# seen so far in the left-to-right and right-to-left traversals respectively.
for elem in arr[1:]:
    left.append(max(left[-1], elem))
for i in range(len(arr) - 2, -1, -1):
    right[i] = max(arr[i], right[i + 1])
water = 0
# Once we have the arrays left and right, we can find the water capacity between these bars.
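The final accumulation step did not survive in the text above; a minimal completion consistent with the variables already defined:
# Water trapped above each bar is bounded by the smaller of the running maxima on either side
for i in range(n):
    water += min(left[i], right[i]) - arr[i]
print(water)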
Splitting criteria
Min_leaves
Min_samples
Max_depth
Manhattan
Minkowski
Tanimoto
Jaccard
Mahalanobis
Sometimes it also gives the impression that the data is noisy. Hence, noise should be removed from the data so that the model can find the most important signals and make effective predictions.
Increasing the number of epochs increases the duration of training of the model; it can be helpful in reducing the error.
128. Which type of sampling is better for a classification model and why?
Ans. Stratified sampling is better in the case of classification problems because it takes into account the balance of classes in the train and test sets: the proportion of classes is maintained, and hence the model performs better. With random sampling, the data is divided into two parts without taking into consideration the class balance of the train and test sets; some classes might then be present only in the train set or only in the validation set, and the resulting model performs poorly.
129. What is a good metric for measuring the level of multicollinearity?
Ans. VIF, or 1/tolerance, is a good measure of multicollinearity in models. Tolerance is the proportion of a predictor’s variance that is not explained by the other predictors, so the higher the VIF value, the greater the multicollinearity amongst the predictors.
A rule of thumb for interpreting the variance inflation factor:
1 = not correlated.
Between 1 and 5 = moderately correlated.
Greater than 5 = highly correlated.
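VIF can be computed with statsmodels; a short sketch (the DataFrame is illustrative):
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

df = pd.DataFrame({"x1": [1, 2, 3, 4, 5],
                   "x2": [2, 4, 6, 8, 11],
                   "x3": [5, 3, 6, 2, 7]})
X = add_constant(df)   # include an intercept term when computing VIF

vif = pd.Series([variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
                index=X.columns)
print(vif)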
130. When can be a categorical value treated as a continuous variable and what
effect does it have when done so?
Ans. A categorical predictor can be treated as a continuous one when the data points it represents are ordinal in nature. If the predictor variable has ordinal data, it can be treated as continuous, and its inclusion in the model may increase the performance of the model.
131. What is the role of maximum likelihood in logistic regression.
Ans. Maximum likelihood estimation finds the most probable values of the predictor variable coefficients, i.e., the coefficient values under which the observed data are most likely, and which are therefore close to the true values.
132. Which distance do we measure in the case of KNN?
Ans. KNN typically uses Euclidean distance (or Manhattan/Minkowski distance) for continuous features, and Hamming distance for categorical features, to determine the nearest neighbours. K-means likewise uses Euclidean distance.
133. What is a pipeline?
Ans. A pipeline is a sophisticated way of writing software such that each intended action while building a model (preprocessing, transformation, estimation) can be serialized, and the process calls the individual functions for the individual tasks in sequence for a given set of data points. In scikit-learn, this is achieved with composite estimators such as Pipeline.
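A minimal scikit-learn Pipeline sketch (the steps and data are illustrative):
from sklearn.datasets import load_iris
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

pipe = Pipeline([
    ("scale", StandardScaler()),                  # step 1: preprocessing
    ("clf", LogisticRegression(max_iter=1000)),   # step 2: model
])
pipe.fit(X, y)   # each step runs in sequence
print(pipe.score(X, y))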
134. Which sampling technique is most suitable when working with time-series data?
Ans. We can use custom iterative sampling (forward chaining), where we continuously add samples to the train set. We should keep in mind that the sample used for validation in one round is added to the next round’s train set, and a new, later sample is used for validation.
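scikit-learn provides TimeSeriesSplit for exactly this forward-chaining scheme; a small sketch:
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(10).reshape(-1, 1)   # 10 time-ordered observations
tscv = TimeSeriesSplit(n_splits=3)

for train_idx, val_idx in tscv.split(X):
    # the train window always precedes the validation window in time
    print("train:", train_idx, "validate:", val_idx)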
135. What are the benefits of pruning?
Ans. Pruning helps in the following:
Reduces overfitting
Shortens the size of the tree
Reduces complexity of the model
Increases bias
Works well with small datasets compared to DTs, which need more data
Less overfitting
Smaller in size and faster in processing
Decision Trees:
Decision Trees are very flexible, easy to understand, and easy to debug
No preprocessing or transformation of features required
Prone to overfitting but you can use pruning or Random forests to avoid that.
Akaike Information Criteria (AIC): In simple terms, AIC estimates the relative
amount of information lost by a given model. So the less information lost the
higher the quality of the model. Therefore, we always prefer models with minimum
AIC.
Receiver operating characteristics (ROC curve): ROC curve illustrates the
diagnostic ability of a binary classifier. It is calculated/ created by plotting
True Positive against False Positive at various threshold settings. The performance
metric of the ROC curve is AUC (area under the curve). The higher the area under the curve, the better the predictive power of the model.
Confusion Matrix: In order to find out how well the model does in predicting the
target variable, we use a confusion matrix/ classification rate. It is nothing but
a tabular representation of actual Vs predicted values which helps us to find the
accuracy of the model.
159. Are Gaussian Naive Bayes the same as binomial Naive Bayes?
Binomial (Bernoulli) Naive Bayes: it assumes that all our features are binary, i.e., they take only two values. This means 0 can represent “word does not occur in the document” and 1 “word occurs in the document”.
Gaussian Naive Bayes: Because of the assumption of the normal distribution,
Gaussian Naive Bayes is used in cases when all our features are continuous. For
example in Iris dataset features are sepal width, petal width, sepal length, petal
length. So its features can have different values in the data set as width and
length can vary. We can’t represent features in terms of their occurrences. This
means data is continuous. Hence we use Gaussian Naive Bayes here.
160. What is the difference between the Naive Bayes Classifier and the Bayes
classifier?
Naive Bayes assumes conditional independence: P(X|Y, Z) = P(X|Z). More general Bayes Nets (sometimes called Bayesian Belief Networks) allow the user to specify which attributes are, in fact, conditionally independent.
For the Bayesian network as a classifier, the features are selected based on some
scoring functions like Bayesian scoring function and minimal description length(the
two are equivalent in theory to each other given that there is enough training
data). The scoring functions mainly restrict the structure (connections and
directions) and the parameters(likelihood) using the data. After the structure has
been learned the class is only determined by the nodes in the Markov blanket(its
parents, its children, and the parents of its children), and all variables given
the Markov blanket are discarded.
161. In what real-world applications is the Naive Bayes classifier used?
Some real-world examples are given below:
Discriminant Functions
Probabilistic Generative Models
Bayesian Theorem
Naive Assumptions of Independence and Equal Importance of feature vectors.
Understand the business model: Try to understand the related attributes for the
spam mail
Data acquisitions: Collect the spam mail to read the hidden pattern from them
Data cleaning: Clean the unstructured or semi structured data
Exploratory data analysis: Use statistical concepts to understand the data like
spread, outlier, etc.
Use machine learning algorithms to make a model: can use naive bayes or some other
algorithms as well
Use unknown dataset to check the accuracy of the model
1) What are the basic differences between Machine Learning and Deep Learning?
Differences between Machine Learning and Deep Learning are:
Definition – ML: a sub-discipline of AI; DL: a subset of machine learning.
Data – ML: parses the data; DL: creates an artificial neural network.
Accuracy – ML: requires manual intervention, which means decreased accuracy; DL: self-learning capabilities mean higher accuracy.
Interpretability – ML: easier to interpret and faster to train; DL: harder to interpret and slower to train.
Output – ML: models produce a numerical output; DL: output can range from an image to text or even audio.
Data dependencies – ML: low (works with smaller datasets); DL: high (needs large datasets).
Hardware dependencies – ML: can work on low-end machines; DL: heavily depends on high-end machines.
Future scope – ML: limited for perception tasks due to data processing limitations; DL: effective for image recognition and face recognition in mobiles.
2) What is the difference between Bias and Variance?
Bias: Bias can be defined as the error introduced by the simplifying assumptions made by the learning algorithm.
Variance: Variance is the error caused by the complexity of the algorithm being used to analyze the data, i.e., by its sensitivity to the particular training data.
Model Building: In this stage, we will choose the ideal algorithm for the model,
and we will train it based on our requirements.
Model Testing: In this stage, we will check the model's accuracy by using test
data.
Applying Model: After testing, we have to make the changes, and then we can use the
model for real-time projects.
Fraud identification: supervised learning trains the model to identify suspicious patterns, so we can flag possible fraud instances.
Healthcare: given labelled images of a disease, supervised machine learning can train a model to detect whether a person is affected by an illness or not.
Email spam identification: We train the model through historical data which
contains emails that are classified as spam or not spam. This labeled data is
supplied as the input to the model.
Sentiment Analysis: This relates to the process of using algorithms for mining the
documents and determining if they are negative, neutral, positive in sentiment.
Clustering: It includes the data that must be divided into subsets. These subsets
are also known as clusters. Diverse clusters disclose details about objects, unlike
regression or classification.
Association: In the association problem, we can recognize the association patterns
between different items and variables. For instance, e-commerce can indicate other
items for us to buy according to our previous purchases.
Recall: also known as the true positive rate, it is the number of positives your model correctly claims compared to the actual number of positives present throughout the data.
Precision: also known as the positive predicted value, it is more based on the prediction; it measures the number of accurate positives the model claims compared to the total number of positives it claims.
13) What is your favorite algorithm and also explain the algorithm briefly in a
minute?
This type of question is very common and asked by the interviewers to understand
the candidate's skills and assess how well he can communicate complex theories in
the simplest language.
This one is a tough question and usually, individuals are not at all prepared for
this situation so please be prepared and have a choice of algorithms and make sure
you practice a lot before going into any sort of interviews.
14) What is the difference between Type1 and Type2 errors?
Type 1 error is classified as a false positive. I.e. This error claims that
something has happened but the fact is nothing has happened. It is like a false
fire alarm. The alarm rings but there is no fire.
Type 2 error is classified as a false negative. I.e. This error claims that nothing
has happened but the fact is that actually, something happened at the instance.
A simple way to differentiate a Type 1 vs Type 2 error: a Type 1 error is a false alarm (the alarm rings but there is no fire), while a Type 2 error is a missed detection (there is a fire but the alarm does not ring).
34) Explain how we can capture the correlation between continuous and categorical
variables?
Yes, it is possible by using the ANCOVA (Analysis of Covariance) technique. It is used to measure the association between continuous and categorical variables.
35) Explain the concept of machine learning and assume that you are explaining this
to a 5-year-old baby?
Yes, the question itself is the answer.
Machine learning is exactly the same way how babies do their day-to-day activities,
the way they walk or sleep, etc. It is a common fact that babies cannot walk
straight away and they fall and then they get up again and then try. This is the
same thing when it comes to machine learning, it is all about how the algorithm is
working and at the same time redefining every time to make sure the end result is
as perfect as possible.
One has to take real-time examples while explaining these questions.
36) What is the difference between Machine learning and Data Mining?
Data mining is about working on (often unstructured) data and extracting from it interesting, previously unknown patterns. Machine learning is a process, or field of study, that closely relates to the design and development of algorithms that give machines the capacity to learn.
37) What is inductive machine learning?
Inductive machine learning is the process of learning a general model or rule from a set of observed examples.
38) Please state a few popular Machine Learning algorithms?
Few popular Machine Learning algorithms are:
Nearest Neighbour
Neural Networks
Decision Trees etc
Support vector machines
39) What are the different types of algorithm techniques available in machine learning?
Some of them are :
Supervised learning
Unsupervised learning
Semi-supervised learning
Transduction
Learning to learn
40) What are the three stages to build the model in machine learning?
The three stages to build the model in machine learning is:
Model building
Model testing
Applying the model
False-positive rate
Type 1 Error - A Type 1 error, also called a false positive, is asserting that something is true when it is actually false.
Type 2 Error - A Type 2 error, also called a false negative, is a test result indicating that a condition does not hold when in actuality it does.
44) What is the difference between machine learning and deep learning?
Deep learning is a subset of machine learning and is called so because it makes use
of deep neural networks. Let’s find out machine learning Vs Deep learning
Data dependencies – ML: performs better on small and medium datasets; DL: works better for big datasets.
Hardware dependencies – ML: works on low-end machines; DL: requires a powerful machine, preferably with a GPU.
Interpretability – ML: algorithms are easy to interpret; DL: difficult to interpret.
Execution time – ML: from a few minutes to hours; DL: may take up to a week.
Feature engineering – ML: need to understand the features that represent the data; DL: no need to hand-pick the best features that represent the data.
45) What is Bayes Theorem and how it is used in machine learning?
Bayes theorem is a way of calculating conditional probability ie. finding the
probability of an event occurring based on the given probability of other events
that have already occurred. Mathematically, it is stated as -
P(A|B) = P(B|A) × P(A) / P(B)
Bayes theorem has become a very useful tool in applied machine learning. It
provides a way of thinking about the relationship shared by data and the models.
A machine learning model is a specific way of thinking about the structured
relationship in the data such as relationships shared by input (x) and output (y).
If we have some prior domain knowledge about the hypothesis, Then the Bayes theorem
can help in solving machine learning problems.
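A tiny worked example of the formula (the disease/test numbers are made up):
# P(disease) = 1%, test sensitivity P(pos | disease) = 95%, false positive rate P(pos | no disease) = 5%
p_d, p_pos_given_d, p_pos_given_not_d = 0.01, 0.95, 0.05

# Total probability of a positive test (law of total probability)
p_pos = p_pos_given_d * p_d + p_pos_given_not_d * (1 - p_d)

# Bayes theorem: probability of disease given a positive test
p_d_given_pos = (p_pos_given_d * p_d) / p_pos
print(round(p_d_given_pos, 3))   # ~0.161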
46) What cross-validation technique would you use on a time series dataset?
Cross-validation is used for tuning hyperparameters and producing measurements of model performance. With time series data, we can't use the traditional cross-validation technique, mainly because of the temporal dependencies between observations: randomly shuffled folds would leak future information into the training set, so the folds must respect time order (forward chaining).
Bagging
Stacking
Boosting
Scenario - suppose you want to buy a new pair of headphones. What will you do?
Being an aware consumer, first, you will do research on which company offers the
best headphones and also take some suggestions from your friends. In short, you will be making an informed decision after doing thorough research, which is the intuition behind combining multiple models in an ensemble.
53) What are the data types supported by JSON?
Here, the interviewer wants to test your knowledge of JSON. There are six basic
data types supported by JSON: strings, numbers, objects, arrays, booleans, and null
values.
54) According to you, what is the most valuable data in our business?
Through this question, the interviewer tries to test you on two dimensions: your knowledge and understanding of business models, and how you correlate data and apply that thinking to the company. To answer this question, you’ll have to research the business model, learn about their business problems, and work out which of those you could solve most effectively with their data.
55) Tell us about machine learning papers you’ve read lately?
To answer this question, you need to keep yourself updated with the latest
scientific literature on machine learning to demonstrate your interest in a machine
learning position.
56) What GPU/hardware do you use and what models do you train for?
This question tests if you have handled machine learning projects outside of a
corporate role and understand how to resource projects and allocate GPU time
efficiently. These kinds of questions are usually asked by hiring managers as they
want to know what you’ve done independently.
There are some general questions that an interviewer may ask you depending upon
your working experiences and awareness. Some of them are as follows -
We can collect more data so that we can train the model with more diverse samples.
We can avoid overfitting by using ensembling methods, such as Random Forest; following the bagging idea, these minimize the variance in the predictions by combining the results of multiple decision trees built on different samples of the data set.
We can also avoid overfitting by selecting the correct algorithm.