Data Science Interview Questions
©Topperworld
Ans: Data analysis cannot be done on the whole volume of data at a time, especially when it involves large datasets. It becomes crucial to take samples of the data that can represent the whole population and then perform the analysis on them. While doing this, it is very important to draw sample data carefully out of the huge dataset so that it truly represents the entire dataset.
There are majorly two categories of sampling techniques based on the usage of statistics: probability sampling (e.g. simple random, clustered, and stratified sampling) and non-probability sampling (e.g. convenience, quota, and snowball sampling).
Ans:
➢ Overfitting: The model performs well only on the sample training data. If any new data is given as input, the model performs poorly and fails to generalize. This condition occurs due to low bias and high variance in the model. Decision trees are more prone to overfitting.
➢ Underfitting: Here, the model is so simple that it is not able to identify the correct relationships in the data, and hence it does not perform well even on the training data. This can happen due to high bias and low variance. Linear regression is more prone to underfitting.
Q 7. What does it mean when the p-values are high and low?
Ans:
⚫ A low p-value (≤ 0.05) means that the null hypothesis can be rejected: the observed data is unlikely under a true null.
⚫ A high p-value (> 0.05) indicates strength in favor of the null hypothesis: the observed data is likely under a true null.
⚫ A p-value of exactly 0.05 is marginal, and the hypothesis can go either way.
Resampling is also done in cases where models need to be validated using random subsets, or when substituting labels on data points while performing significance tests.
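A p-value can be estimated by simulation, in the spirit of the label-substitution tests mentioned above. The sketch below uses made-up numbers (60 heads in 100 coin tosses) and estimates the one-sided p-value as the fraction of simulated fair-coin experiments that look at least as extreme as the observed data.

```python
import numpy as np

rng = np.random.default_rng(42)

# Observed data: 60 heads out of 100 tosses.
observed_heads = 60
n_tosses = 100

# Simulate the null hypothesis (a fair coin) many times.
sims = rng.binomial(n=n_tosses, p=0.5, size=100_000)

# One-sided p-value: how often a fair coin does at least this well.
p_value = np.mean(sims >= observed_heads)

if p_value <= 0.05:
    print(f"p = {p_value:.4f}: reject the null (the coin is unlikely to be fair)")
else:
    print(f"p = {p_value:.4f}: fail to reject the null")
```

For 60 heads in 100 tosses the exact binomial p-value is about 0.028, so the simulation lands below the conventional 0.05 threshold.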
Q 10. Are there any differences between the expected value and
mean value?
Ans: There are not many differences between the two, but it is to be noted that they are used in different contexts. The mean value generally refers to a probability distribution, whereas the expected value is used in contexts involving random variables.
Ans: Survivorship bias refers to the logical error of focusing on the aspects that survived some process while overlooking those that did not work, owing to their lack of prominence. This bias can lead to wrong conclusions.
Q 12. Define the terms KPI, lift, model fitting, robustness and DOE.
Ans:
➢ KPI: KPI stands for Key Performance Indicator that measures how well
the business achieves its objectives.
➢ Lift: This is a performance measure of the target model measured
against a random choice model. Lift indicates how good the model is at
prediction versus if there was no model.
➢ Model fitting: This indicates how well the model under consideration
fits given observations.
➢ Robustness: This represents the system's capability to handle differences and variances effectively.
➢ DOE: DOE stands for Design of Experiments, which represents the design of a task aiming to describe and explain information variation under hypothesized conditions to reflect variables.
Ans: Selection bias occurs when the researcher has to decide which participants to study and the selection of participants is not random. It is also called the selection effect, and it is caused by the method of sample collection.
Ans:
A model that is too complex fits the training data well but performs badly on the test data set. This may lead to overfitting as well as high sensitivity to noise in the training data.
As you can see from the image above, before the optimal point,
increasing the complexity of the model reduces the error (bias).
However, after the optimal point, we see that the increase in the
complexity of the machine learning model increases the variance.
So, the trade-off is simple. If we increase the bias, the variance will decrease
and vice versa.
Ans: A confusion matrix is a matrix with 2 rows and 2 columns that summarizes the four outcomes produced by a binary classifier. It is used to derive various measures like specificity, error rate, accuracy, precision, sensitivity (recall), and the F-score.
The test data set should contain both the correct and the predicted labels. For instance, if the binary classifier performs perfectly, the predicted labels are exactly the same as the observed labels; in real-world scenarios they match only part of the observed labels. The four outcomes in the confusion matrix are: true positive (TP), false positive (FP), false negative (FN), and true negative (TN).
The formulas for calculating the basic measures that come from the confusion matrix are:
⚫ Accuracy = (TP + TN) / (TP + TN + FP + FN)
⚫ Error rate = (FP + FN) / (TP + TN + FP + FN)
⚫ Sensitivity (recall) = TP / (TP + FN)
⚫ Specificity = TN / (TN + FP)
⚫ Precision = TP / (TP + FP)
⚫ F-score = 2 × (Precision × Recall) / (Precision + Recall)
In these formulas:
FP = false positive
FN = false negative
TP = true positive
TN = true negative
Also,
Sensitivity is the measure of the true positive rate. It is also called recall.
Specificity is the measure of the true negative rate.
Precision is the measure of the positive predicted value.
F-score is the harmonic mean of precision and recall.
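These formulas can be checked with a quick Python sketch; the four counts below are hypothetical.

```python
# Hypothetical counts from a binary classifier's confusion matrix.
TP, FP, FN, TN = 40, 10, 5, 45
total = TP + TN + FP + FN

accuracy = (TP + TN) / total
error_rate = (FP + FN) / total
sensitivity = TP / (TP + FN)      # recall / true positive rate
specificity = TN / (TN + FP)      # true negative rate
precision = TP / (TP + FP)        # positive predicted value
f_score = 2 * precision * sensitivity / (precision + sensitivity)

print(f"accuracy={accuracy:.3f} error={error_rate:.3f} "
      f"recall={sensitivity:.3f} specificity={specificity:.3f} "
      f"precision={precision:.3f} F={f_score:.3f}")
```

Note that accuracy and error rate always sum to 1, and the F-score sits between precision and recall as their harmonic mean.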
Ans: Logistic regression is a classification technique that predicts a binary outcome from a combination of predictor variables. For example, let us say that we want to predict the outcome of elections for a particular political leader, i.e. whether this leader is going to win the election or not. So, the result is binary: win (1) or loss (0).
However, the input is a combination of linear variables like the money spent
on advertising, the past work done by the leader and the party, etc.
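The election example above can be sketched with scikit-learn's `LogisticRegression`; the features and numbers below are invented purely for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical features: [money spent on advertising, past works score]
X = np.array([[1, 2], [2, 1], [3, 2], [6, 8], [7, 9], [8, 7]])
y = np.array([0, 0, 0, 1, 1, 1])   # 0 = loss, 1 = win

model = LogisticRegression()
model.fit(X, y)

# Predict for a hypothetical new candidate.
new_leader = [[5, 6]]
print(model.predict(new_leader))        # predicted class: win (1) or loss (0)
print(model.predict_proba(new_leader))  # probability of each outcome
```

The model outputs a probability via the sigmoid function and thresholds it (at 0.5 by default) to produce the binary win/loss prediction.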
One such classification technique that is near the top of the classifier hierarchy is the random forest classifier. To see how its building block, the decision tree, works, suppose we have a string with 5 ones and 4 zeroes, and we want to classify the characters of this string using their features.
These features are colour (red or green in this case) and whether the observation (i.e. character) is underlined or not. Now, let us say that we are only interested in red and underlined observations.
So, we started with the colour first as we are only interested in the red
observations and we separated the red and the green-coloured characters.
After that, the “No” branch i.e. the branch that had all the green coloured
characters was not expanded further as we want only red-underlined
characters. So, we expanded the “Yes” branch and we again got a “Yes”
and a “No” branch based on the fact whether the characters were
underlined or not.
So, this is how we draw a typical decision tree. However, the data in real life
is not this clean but this was just to give an idea about the working of the
decision trees.
➢ Random Forest
⚫ Build several decision trees on the samples of data and record their
predictions.
⚫ Each time a split is considered for a tree, choose a random sample of m predictors as the split candidates out of all the p predictors. This happens to every tree in the random forest.
⚫ Apply the rule of thumb: at each split, m = √p.
⚫ Apply the predictions to the majority rule.
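The steps above can be sketched with scikit-learn's `RandomForestClassifier`, here using the built-in Iris data purely as a stand-in dataset. `max_features="sqrt"` applies the m = √p rule of thumb at each split.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# 100 trees, each built on a bootstrap sample of the data; at every split,
# a random subset of sqrt(p) predictors is considered, and the final
# prediction is the majority vote across all the trees.
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                                random_state=0)
forest.fit(X_tr, y_tr)
print("test accuracy:", forest.score(X_te, y_te))
```

Randomizing both the samples (bootstrapping) and the candidate predictors decorrelates the trees, which is what makes the ensemble's majority vote more accurate than any single decision tree.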
Ans: Suppose the probability that we see at least one shooting star in a 15-minute interval is 0.2. Let Prob = 0.2.
Now, the probability that we do not see any shooting star in a 15-minute interval is:
= 1 - Prob = 1 - 0.2 = 0.8
An hour consists of four independent 15-minute intervals, so the probability that we do not see any shooting star for an hour is:
= (1 - Prob)⁴ = 0.8 × 0.8 × 0.8 × 0.8 = (0.8)⁴ ≈ 0.41
So, the probability that we see at least one shooting star in the time interval of an hour is:
= 1 - 0.41 = 0.59
So, there is approximately a 59% chance that we see at least one shooting star in the span of an hour.
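The arithmetic can be verified in a couple of lines of Python:

```python
# P(at least one shooting star in 15 min) = 0.2, so P(none in 15 min) = 0.8.
# An hour is four independent 15-minute windows.
p_none_hour = (1 - 0.2) ** 4           # 0.8^4 = 0.4096
p_at_least_one_hour = 1 - p_none_hour  # the complement

print(round(p_at_least_one_hour, 4))   # 0.5904
```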
Q 21. What is deep learning? What is the difference between deep
learning and machine learning?
Ans: Deep learning is a subset of machine learning that uses artificial neural networks with many layers to learn representations of data. The difference between machine learning and deep learning is that deep learning is a paradigm, or a part, of machine learning that is inspired by the structure and functions of the human brain, called artificial neural networks.
Ans:
The gradient descent update rule can be written as b = a - γ∇F(a). If a person is climbing down a hill, the next position that the climber comes to is denoted by "b" in this equation, and the current position by "a". There is a minus sign because it denotes minimization (as gradient descent is a minimization algorithm). γ (gamma) is called the weighting factor (the learning rate), and the remaining term, the gradient ∇F(a), gives the direction of the steepest descent.
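A minimal sketch of this update rule, minimizing the toy function f(x) = (x - 3)², whose minimum is at x = 3; the learning rate and step count are arbitrary illustrative choices.

```python
# Gradient descent: repeatedly apply b = a - gamma * f'(a).
def gradient_descent(grad, start, gamma=0.1, steps=200):
    a = start
    for _ in range(steps):
        a = a - gamma * grad(a)   # step against the gradient (downhill)
    return a

# f(x) = (x - 3)^2, so f'(x) = 2 * (x - 3); the minimum is at x = 3.
minimum = gradient_descent(grad=lambda x: 2 * (x - 3), start=10.0)
print(round(minimum, 4))  # 3.0
```

Each step shrinks the distance to the minimum by a constant factor here, so the climber converges to x = 3 regardless of the starting point (for a small enough γ).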
Ans:
➢ RMSE: RMSE stands for Root Mean Square Error. In a linear regression
model, RMSE is used to test the performance of the machine learning
model. It is used to evaluate the data spread around the line of
best fit. So, in simple words, it is used to measure the deviation of the
residuals.
➢ MSE: Mean Squared Error is used to find how close the line is to the actual data. We take the difference between each data point and the line, and square it. This is done for all the data points, and the sum of the squared differences divided by the total number of data points gives us the Mean Squared Error (MSE).
MSE = (1/N) × Σ (Yᵢ - Ŷᵢ)², where:
⚫ Yᵢ is the actual value of the output variable (the ith data point),
⚫ Ŷᵢ (Y-cap) is the predicted value, and
⚫ N is the total number of data points.
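Both measures can be computed in a few lines of NumPy; the values below are made up.

```python
import numpy as np

# Hypothetical actual values and predictions from a regression line.
y_actual = np.array([3.0, 5.0, 7.0, 9.0])
y_pred   = np.array([2.5, 5.0, 7.5, 10.0])

mse = np.mean((y_actual - y_pred) ** 2)  # sum of squared residuals / N
rmse = np.sqrt(mse)                      # same units as the output variable

print(f"MSE={mse:.4f}  RMSE={rmse:.4f}")
```

RMSE is simply the square root of MSE, which brings the error back to the same units as the output variable and makes it easier to interpret.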
In the above diagram, we can see that the thin lines mark the distance from
the classifier to the closest data points (darkened data points). These are
called support vectors. So, we can define the support vectors as the data
points or vectors that are nearest (closest) to the hyperplane. They affect
the position of the hyperplane. Since they support the hyperplane, they are
known as support vectors.
Q 26. So, you have done some projects in machine learning and data
science and we see you are a bit experienced in the field. Let’s say
your laptop's RAM is only 4 GB and you want to train your model on a 10 GB data set.
What will you do? Have you experienced such an issue before?
Ans:
For Neural Networks:
1. A memory-mapped NumPy array can be used to load the data. It never stores the entire data in RAM; rather, it just creates a mapping to the data on disk.
2. Now, in order to get some desired data, pass the index into the NumPy array.
3. This data can be passed as input to the neural network while maintaining a small batch size.
For SVM:
1. Small data sets can be obtained by dividing the big data set.
2. A subset of the data set can be used as input when using the partial fit function.
3. Repeat the step of using the partial fit method for the other subsets as well.
Now, you may describe the situation if you have faced such an issue in your projects or while working in machine learning/data science.
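Both ideas can be sketched together, assuming NumPy and scikit-learn are available. The file path, shapes, and batch size below are invented; `np.memmap` provides the on-disk mapping, and `SGDClassifier.partial_fit` stands in for any estimator that supports incremental fitting.

```python
import os
import tempfile

import numpy as np
from sklearn.linear_model import SGDClassifier

# Simulate a dataset that lives on disk (in real use it would be too big for RAM).
path = os.path.join(tempfile.mkdtemp(), "big.dat")
rng = np.random.default_rng(0)
full = rng.normal(size=(10_000, 4)).astype(np.float32)
labels = (full[:, 0] + full[:, 1] > 0).astype(int)

mm = np.memmap(path, dtype=np.float32, mode="w+", shape=full.shape)
mm[:] = full
mm.flush()

# The memmap maps the file into memory without reading it all into RAM;
# indexing it pulls in only the requested rows.
X = np.memmap(path, dtype=np.float32, mode="r", shape=(10_000, 4))

# Feed the model small batches via partial_fit (incremental learning).
clf = SGDClassifier(random_state=0)
for start in range(0, len(X), 1_000):
    batch = np.asarray(X[start:start + 1_000])
    clf.partial_fit(batch, labels[start:start + 1_000], classes=[0, 1])

print("accuracy:", clf.score(np.asarray(X), labels))
```

The key point in an interview answer is that only one batch ever resides in RAM at a time, while the model's parameters accumulate knowledge across all the batches.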
Q 27. Explain Neural Network Fundamentals.
Ans: In the human brain, different neurons are present. These neurons
combine and perform various tasks. The Neural Network in deep learning
tries to imitate human brain neurons. The neural network learns the patterns
from the data and uses the knowledge that it gains from various patterns to
predict the output for new data, without any human assistance.
There are some other neural networks that are more complicated. Such
networks consist of the following three layers:
➢ Input Layer: The neural network has the input layer to receive the input.
➢ Hidden Layer: There can be multiple hidden layers between the input layer and the output layer. The initial hidden layers are used for detecting the low-level patterns, whereas the later layers are responsible for combining the output from previous layers to find more complex patterns.
➢ Output Layer: This layer outputs the prediction.
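A minimal forward pass through such a three-layer network can be sketched in NumPy; the weights here are random and untrained, purely to show how data flows from layer to layer.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(0, z)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# One hidden layer: 3 inputs -> 4 hidden units -> 1 output.
W1, b1 = rng.normal(size=(3, 4)), np.zeros(4)
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)

def forward(x):
    hidden = relu(x @ W1 + b1)        # hidden layer detects patterns
    return sigmoid(hidden @ W2 + b2)  # output layer produces the prediction

x = np.array([[0.5, -1.2, 3.0]])     # one input example
out = forward(x)
print(out)  # a single prediction in (0, 1)
```

Training would then adjust W1, b1, W2, and b2 (e.g. by gradient descent on a loss function) so that the predictions match the data, without any human assistance.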
Ans: A Generative Adversarial Network (GAN) can be understood with the famous example of the wine seller. Let us say that there is a wine seller who has his own shop. This wine seller purchases wine from dealers who sell it to him at a low cost so that he can sell the wine at a higher cost to the customers.
Now, let us say that the dealers whom he is purchasing the wine from, are
selling him fake wine. They do this as the fake wine costs way less than the
original wine and the fake and the real wine are indistinguishable to a normal
consumer (customer in this case).
The shop owner has some friends who are wine experts and he sends his
wine to them every time before keeping the stock for sale in his shop. So, his
friends, the wine experts, give him feedback that the wine is probably fake.
Since the wine seller has been purchasing the wine for a long time from the
same dealers, he wants to make sure that their feedback is right before he
complains to the dealers about it. Now, let us say that the dealers also have
got a tip from somewhere that the wine seller is suspicious of them.
So, in this situation, the dealers will try their best to sell the fake wine
whereas the wine seller will try his best to identify the fake wine. Let us see
this with the help of a diagram shown below:
From the image above, it is clear that a noise vector is entering the generator
(dealer) and he generates the fake wine and the discriminator has to
distinguish between the fake wine and real wine. This is a Generative
Adversarial Network (GAN).
In a GAN, there are 2 main components, viz. the Generator and the Discriminator. The generator is a CNN that keeps producing images, and the discriminator tries to identify the real images from the fake ones.
Ans: In an autoencoder, multiple layers are added between the input and the output layer, and the layers in between are smaller than the input layer. It receives unlabelled input, which is encoded so that the input can be reconstructed later.
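A minimal structural sketch in NumPy follows; the weights are random and untrained, and the layer sizes are invented, since a real autoencoder would learn its weights by minimizing reconstruction error.

```python
import numpy as np

rng = np.random.default_rng(0)

# A tiny autoencoder shape: 6-D input -> 2-D bottleneck -> 6-D reconstruction.
# The bottleneck layer is smaller than the input, which forces the network
# to learn a compressed code for the unlabelled input.
W_enc = rng.normal(size=(6, 2))
W_dec = rng.normal(size=(2, 6))

def encode(x):
    return x @ W_enc        # compress the input down to 2 dimensions

def decode(code):
    return code @ W_dec     # attempt to reconstruct the original input

x = rng.normal(size=(1, 6))
code = encode(x)
reconstruction = decode(code)
print(code.shape, reconstruction.shape)  # (1, 2) (1, 6)
```

Training would adjust W_enc and W_dec to make the reconstruction match the input as closely as possible, so the 2-D code ends up capturing the input's most important structure.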
ABOUT US
➢ Our Vision
❖ Our vision is to create a world where every college student can easily
access high-quality educational content, connect with peers, and achieve
their academic goals.
❖ We believe that education should be accessible, affordable, and engaging,
and that's exactly what we strive to offer through our platform.
❖ Education is not just about textbooks and lectures; it's also about forming
connections and growing together.
❖ TopperWorld encourages you to engage with your fellow students, ask
questions, and share your knowledge.
❖ We believe that collaborative learning is the key to academic success.
“Unlock Your
Potential”
With- Topper World
Explore More
topperworld.in
Follow Us On
E-mail
[email protected]