120-Data-Science-Interview-Questions
6. Can you cite some examples where both false positives and false negatives are equally
important?
Answer: In the banking industry, giving loans is the primary source of revenue, but if the repayment rate
is poor the bank will not make any profit and instead risks huge losses.
Banks don’t want to lose good customers and, at the same time, they don’t want to acquire bad
customers. In this scenario, both false positives and false negatives become very important to
measure.
7. What is logistic regression? State an example when you have used logistic regression recently.
Answer: Logistic regression, often referred to as the logit model, is a technique for predicting a binary
outcome from a linear combination of predictor variables.
For example, suppose you want to predict whether a particular political leader will win an election. In this
case, the outcome of the prediction is binary, i.e. 1 or 0 (win/lose). The predictor variables here would be the
amount of money spent on the candidate’s election campaign, the amount of time spent
campaigning, etc.
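The election example can be sketched in a few lines of plain Python. This is only an illustrative sketch: the six data points, the feature choice (spend in millions, months of campaigning) and the helper names `fit_logistic`, `scale` and `predict_win_probability` are all hypothetical, and in practice a library implementation would normally be used instead of hand-written gradient descent.

```python
import math
from statistics import mean, pstdev

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, y, lr=0.1, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log-loss."""
    n, k = len(X), len(X[0])
    weights, bias = [0.0] * k, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * k, 0.0
        for xi, yi in zip(X, y):
            # Prediction error for one example: predicted probability minus label
            err = sigmoid(bias + sum(w * v for w, v in zip(weights, xi))) - yi
            grad_b += err
            for j, v in enumerate(xi):
                grad_w[j] += err * v
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias

# Hypothetical toy data: [campaign spend in millions, months spent campaigning]
X_raw = [[1.0, 2.0], [1.5, 3.0], [2.0, 2.5], [4.0, 6.0], [4.5, 7.0], [5.0, 6.5]]
y = [0, 0, 0, 1, 1, 1]  # 1 = win, 0 = lose

# Standardize each feature so gradient descent behaves well
cols = list(zip(*X_raw))
mus = [mean(c) for c in cols]
sds = [pstdev(c) for c in cols]

def scale(x):
    return [(v - m) / s for v, m, s in zip(x, mus, sds)]

weights, bias = fit_logistic([scale(x) for x in X_raw], y)

def predict_win_probability(x):
    return sigmoid(bias + sum(w * v for w, v in zip(weights, scale(x))))

p_win = predict_win_probability([4.8, 6.8])   # heavy campaigner
p_lose = predict_win_probability([1.2, 2.2])  # light campaigner
print(p_win, p_lose)
```

The model output is a probability between 0 and 1; thresholding it (commonly at 0.5) gives the binary win/lose prediction.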
12. How and by what methods can data visualizations be used effectively?
Answer: In addition to giving insights in a very effective and efficient manner, data visualization does not
have to be restricted to bar charts, line charts or other stereotypical graphs. Data can be
represented in a much more visually pleasing manner.
The one thing to take care of is conveying the intended insight or finding correctly to the audience.
Once that baseline is set, innovation and creativity can help you come up with better-looking and more
functional dashboards. There is a fine line between a simple, insightful dashboard and an awesome-looking
dashboard that delivers zero useful insight.
Depending on the amount or size of the data, suitable tools or methods should be used to clean the data from
the database or big data environment. A data source can contain different types of data, such as
dirty data, clean data, mixed clean and dirty data, and sample clean data.
Modern data science applications rely on machine learning models in which the learner learns from the
existing data. So, the existing data should always be clean and well maintained to get
good outcomes when the system is optimized.
Power analysis lets you estimate a sample size that is neither too high nor too low. With too small a
sample, the results are not reliable enough to support conclusions, and with too large a sample
resources are wasted.
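As a rough illustration of power analysis, the sketch below estimates the per-group sample size for comparing two group means, using the standard normal approximation n = 2 * ((z_alpha + z_beta) / d)^2. The function name `sample_size_per_group` and the default values (5% two-sided significance, 80% power) are assumptions chosen for the example; an exact t-test calculation gives slightly larger numbers.

```python
import math
from statistics import NormalDist

def sample_size_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate per-group n for a two-sided, two-sample comparison of means.

    effect_size is the standardized difference (Cohen's d).
    Uses the normal approximation, not the exact t distribution.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for the significance level
    z_beta = z.inv_cdf(power)           # quantile corresponding to the desired power
    return math.ceil(2 * ((z_alpha + z_beta) / effect_size) ** 2)

n_medium = sample_size_per_group(0.5)  # medium effect
n_small = sample_size_per_group(0.2)   # small effect needs far more data
print(n_medium, n_small)
```

Smaller effects need larger samples, which is exactly the trade-off between unreliable answers and wasted resources described above.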
21. What tools or devices help you succeed in your role as a data scientist?
Answer: The purpose of this question is to learn which programming languages and applications the candidate
knows and has experience using. The answer will show whether the candidate needs additional training in
basic programming languages and platforms, and whether they have any transferable skills. This is vital to understand, as it can
cost more time and money to train a candidate who is not knowledgeable in all of the languages and
applications required for the position.
25. Can you enumerate the various differences between Supervised and Unsupervised Learning?
Answer: Supervised learning is a type of machine learning where a function is inferred from labeled
training data. The training data contains a set of training examples.
Unsupervised learning, on the other hand, is a type of machine learning where inferences are drawn from
datasets containing input data without labeled responses. Following are the various other differences
between the two types of machine learning:
Algorithms Used – Supervised learning makes use of Decision Trees, K-nearest Neighbor algorithm,
Neural Networks, Regression, and Support Vector Machines. Unsupervised learning uses Anomaly
Detection, Clustering, Latent Variable Models, and Neural Networks.
Enables – Supervised learning enables classification and regression, whereas unsupervised learning
enables clustering, dimension reduction, and density estimation.
Use – While supervised learning is used for prediction, unsupervised learning is used for analysis.
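The contrast can be made concrete with two tiny routines on the same one-dimensional data: a nearest-neighbour classifier that needs labels, and a two-cluster k-means that does not. Both functions and the data are hypothetical illustrations, not production implementations; `two_means` assumes the input contains at least two distinct values.

```python
def nearest_neighbor_predict(train_x, train_y, x):
    """Supervised: predict the label of the closest labeled training point."""
    idx = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[idx]

def two_means(values, iters=10):
    """Unsupervised: split unlabeled values into two clusters (1-D k-means, k=2)."""
    c0, c1 = min(values), max(values)  # start the centers at the extremes
    for _ in range(iters):
        g0 = [v for v in values if abs(v - c0) <= abs(v - c1)]
        g1 = [v for v in values if abs(v - c0) > abs(v - c1)]
        c0, c1 = sum(g0) / len(g0), sum(g1) / len(g1)
    return c0, c1

values = [1.0, 1.2, 0.8, 5.0, 5.3, 4.7]
labels = ["low", "low", "low", "high", "high", "high"]

# Supervised learning uses the labels to predict a label for a new point
prediction = nearest_neighbor_predict(values, labels, 4.9)

# Unsupervised learning sees only the values and discovers two groups
centers = two_means(values)
print(prediction, centers)
```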
It is a strong platform to engage when it comes to handling huge data.
Graphical capacities: It comes with functional graphical capabilities, though with a limited knowledge
field, and it is useful for customizing plots.
Better tool management: Updates are released under controlled conditions, which is the main reason
why it is well tested. R and Python, by contrast, rely on open
contribution, so the risk of errors in current development is higher.
Python, a general-purpose programming language, is preferred in most cases and can be
found in many applications beyond data science. R is mostly seen only in the data science area,
where it is used for data analysis on standalone servers or for separate computing tasks.
46. What are outlier values and how do you treat them?
Answer: Outlier values, or simply outliers, are data points that don’t belong to a certain
population. An outlier is an abnormal observation that differs greatly from the other values
in the set.
Outliers can be identified using univariate analysis or some other graphical analysis method.
A few outliers can be assessed individually, but handling a large set of outliers requires
substituting them with either the 99th or the 1st percentile value.
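The percentile substitution described above (often called capping or winsorizing) might look like the following sketch. The helper names `percentile` and `cap_outliers` and the sample data are hypothetical, and the percentile uses simple linear interpolation, which is one of several common conventions.

```python
def percentile(sorted_values, q):
    """Linear-interpolation percentile (q in [0, 100]) of a pre-sorted list."""
    pos = (len(sorted_values) - 1) * q / 100.0
    lower = int(pos)
    upper = min(lower + 1, len(sorted_values) - 1)
    frac = pos - lower
    return sorted_values[lower] + frac * (sorted_values[upper] - sorted_values[lower])

def cap_outliers(values, low_q=1, high_q=99):
    """Replace values below the 1st / above the 99th percentile with those bounds."""
    ordered = sorted(values)
    low, high = percentile(ordered, low_q), percentile(ordered, high_q)
    return [min(max(v, low), high) for v in values]

data = list(range(1, 100)) + [10_000]  # 1..99 plus one extreme value
capped = cap_outliers(data)
print(max(data), max(capped))
```

Values inside the two bounds are left unchanged; only the extremes are pulled in, so the order and length of the data are preserved.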
Less is more. Rather than pushing too much information into the reader’s brain, we need to figure out how
easily we can help them consume a dashboard or a chart.
The process is simple to describe but difficult to implement. You must bring the complex business value out in
a self-explanatory chart. It’s a skill every data scientist should strive towards and a good one to have in their
arsenal.
50. What are the types of biases that can occur during sampling?
Answer: Some simple forms of selection bias are described below. Undercoverage occurs when some
members of the population are inadequately represented in the sample. … The survey relied on a convenience
sample, drawn from telephone directories and car registration lists.
• Selection bias
• Undercoverage bias
• Survivorship bias
Concordance helps identify the ability of the logistic model to differentiate between the event
happening and not happening.
Lift helps assess the logistic model by comparing it with random selection.
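Both measures can be computed directly from true labels and predicted scores. The sketch below is an illustration under stated assumptions: concordance is taken as the share of (event, non-event) pairs in which the event gets the higher score, with ties counted as half (which makes it equal to the AUC), and lift is the event rate in the top-scored fraction divided by the overall event rate. The function names and the ten data points are hypothetical.

```python
def concordance(y_true, y_score):
    """Share of (event, non-event) pairs where the event has the higher score."""
    events = [s for y, s in zip(y_true, y_score) if y == 1]
    non_events = [s for y, s in zip(y_true, y_score) if y == 0]
    concordant = sum(1 for e in events for n in non_events if e > n)
    ties = sum(1 for e in events for n in non_events if e == n)
    return (concordant + 0.5 * ties) / (len(events) * len(non_events))

def lift_at(y_true, y_score, fraction=0.1):
    """Event rate among the top-scored fraction, relative to the overall event rate."""
    ranked = sorted(zip(y_score, y_true), reverse=True)
    k = max(1, int(len(ranked) * fraction))
    top_rate = sum(y for _, y in ranked[:k]) / k
    base_rate = sum(y_true) / len(y_true)  # what random selection would achieve
    return top_rate / base_rate

y_true = [1, 0, 1, 0, 0, 1, 0, 0, 0, 0]
y_score = [0.9, 0.2, 0.8, 0.4, 0.1, 0.55, 0.3, 0.6, 0.05, 0.5]

c = concordance(y_true, y_score)
lift = lift_at(y_true, y_score, fraction=0.2)
print(c, lift)
```

A lift above 1 means that targeting the highest-scored cases finds more events than picking the same number of cases at random.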
65. Can you explain the difference between a Test Set and a Validation Set?
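Answer: A validation set can be considered a part of the training set, as it is used for parameter
selection and to avoid overfitting of the model being built. On the other hand, a test set is used for
testing or evaluating the performance of a trained machine learning model.
In simple terms, the differences can be summarized as follows: the training set is used to fit the parameters,
i.e. the weights, and the test set is used to assess the performance of the model, i.e. its predictive power
and generalization.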
69. Python or R – Which one would you prefer for text analytics?
Answer: The best possible answer for this would be Python, because it has the pandas library, which provides
easy-to-use data structures and high-performance data analysis tools.
70. What is an Auto-Encoder?
Answer: Auto-encoders are learning networks that transform inputs into outputs with
no error or minimal error, which means the output must be very close to the input. We add a few layers
between the input and the output, and these layers are smaller than the input layer.
The auto-encoder is given unlabelled input, which it encodes and then uses to reconstruct the
input.
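To make the idea concrete, here is a deliberately tiny linear auto-encoder in plain Python: two inputs are squeezed through a single hidden unit and then reconstructed. Everything in it (the function name, the toy data lying on the line y = 2x, the learning rate and the initial weights) is an assumption chosen so that the example stays small; real auto-encoders use non-linear layers and a deep learning library.

```python
def train_autoencoder(data, lr=0.05, epochs=500):
    """Train a 2 -> 1 -> 2 linear auto-encoder with batch gradient descent."""
    w = [0.1, 0.1]  # encoder weights: 2 inputs -> 1 hidden unit
    v = [0.1, 0.1]  # decoder weights: 1 hidden unit -> 2 outputs
    n = len(data)
    losses = []
    for _ in range(epochs):
        grad_w, grad_v, loss = [0.0, 0.0], [0.0, 0.0], 0.0
        for x in data:
            h = w[0] * x[0] + w[1] * x[1]            # encode to the bottleneck
            recon = [v[0] * h, v[1] * h]             # decode back to 2 values
            err = [recon[0] - x[0], recon[1] - x[1]]
            loss += err[0] ** 2 + err[1] ** 2        # squared reconstruction error
            dh = 2 * (err[0] * v[0] + err[1] * v[1])  # gradient w.r.t. hidden unit
            for j in range(2):
                grad_v[j] += 2 * err[j] * h
                grad_w[j] += dh * x[j]
        losses.append(loss / n)
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        v = [vi - lr * g / n for vi, g in zip(v, grad_v)]
    return w, v, losses

# Unlabelled toy inputs that all lie on the line y = 2x, so one hidden unit suffices
data = [[x, 2 * x] for x in (-1.0, -0.5, 0.0, 0.5, 1.0)]
w, v, losses = train_autoencoder(data)
print(losses[0], losses[-1])
```

No labels are involved: the training target is the input itself, and the falling reconstruction error shows the bottleneck has learned a compact representation.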
73. Explain the difference between Univariate, Bivariate and Multivariate analysis?
Answer: Univariate analysis is a descriptive analysis that involves only one
variable at a given point in time. For instance, an analysis of the sales of a particular territory involves only one
variable, so it is treated as a univariate analysis.
Bivariate analysis is used to understand the relationship between two variables at a given time, for example on a
scatter plot. A good example of bivariate analysis is studying the relationship between the sales and expenses
of a particular product.
Multivariate analysis deals with more than two variables and is used to understand the effect of the variables on the responses.
74. What is the difference between “Long” and “Wide” format data?
Answer: In wide format, a subject’s repeated responses are recorded in a
single row, and each response is in a separate column. In long format,
each row represents one time point per subject. In wide format, the columns are generally divided into
groups, whereas in long format the rows are divided into groups.
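A small reshaping example shows the two layouts side by side. The records, the column names (`subject`, `week1` and so on) and the helper `wide_to_long` are hypothetical; with pandas the same reshaping is usually done with a built-in function rather than by hand.

```python
# Wide format: one row per subject, one column per repeated measurement
wide = [
    {"subject": "A", "week1": 5.0, "week2": 5.5, "week3": 6.1},
    {"subject": "B", "week1": 4.2, "week2": 4.0, "week3": 4.8},
]

def wide_to_long(rows, id_key, value_keys):
    """Turn each (subject, measurement column) pair into its own row."""
    return [
        {id_key: row[id_key], "time": key, "value": row[key]}
        for row in rows
        for key in value_keys
    ]

# Long format: one row per subject per time point
long_rows = wide_to_long(wide, "subject", ["week1", "week2", "week3"])
for row in long_rows:
    print(row)
```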
75. Are there different types of selection bias? If yes, what are they?
Answer: Sampling bias: This bias arises when you select only particular people or when the
selection of samples is non-random. In general terms, it means that the majority of the people selected
belong to one group.
Time interval: Sometimes a trial is terminated earlier than planned (probably for ethical
reasons), but the extreme value is most likely to be reached by the variable with the most variance, even though
all variables have a similar mean.
Data: Data bias arises when a specific subset of data is chosen to support a conclusion, or when
bad data is eliminated on arbitrary grounds instead of according to generally stated
criteria.
Attrition bias: Attrition bias is an error that occurs due to unequal loss of participants from a
randomized controlled trial (RCT).
85. How does data cleaning play a vital role in the analysis?
Answer: Data cleaning can help in the analysis because:
Cleaning data from multiple sources helps to transform it into a format that data analysts or data scientists
can work with.
Data cleaning helps to increase the accuracy of the model in machine learning.
It is a cumbersome process because, as the number of data sources increases, the time taken to clean
the data grows rapidly with the number of sources and the volume of data generated by
these sources.
Cleaning alone might take up to 80% of the time, making it a critical part of the analysis task.
86. Can you explain the difference between a Validation Set and a Test Set?
Answer: A Validation set can be considered as a part of the training set as it is used for parameter
selection and to avoid overfitting of the model being built.
On the other hand, a Test Set is used for testing or evaluating the performance of a trained machine
learning model.
In simple terms, the differences can be summarized as follows: the training set is used to fit the parameters, i.e. the weights,
and the test set is used to assess the performance of the model, i.e. its predictive power and
generalization.
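A three-way split along these lines can be sketched as follows. The function name `train_val_test_split`, the 60/20/20 proportions and the fixed seed are assumptions for illustration; the key point is that the validation data is carved out of the data available for training, while the test data is set aside and touched only once at the end.

```python
import random

def train_val_test_split(items, val_frac=0.2, test_frac=0.2, seed=0):
    """Shuffle once, then carve out disjoint test, validation and training parts."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_test = int(len(shuffled) * test_frac)
    n_val = int(len(shuffled) * val_frac)
    test = shuffled[:n_test]                 # final, one-off performance estimate
    val = shuffled[n_test:n_test + n_val]    # parameter selection, overfitting checks
    train = shuffled[n_test + n_val:]        # fitting the model weights
    return train, val, test

train, val, test = train_val_test_split(range(100))
print(len(train), len(val), len(test))
```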
87. What do you mean by Deep Learning and why has it become popular now?
Answer: Deep Learning is a paradigm of machine learning that has shown incredible
promise in recent years. This is because Deep Learning shows a great analogy with the
functioning of the human brain.
Although Deep Learning has been around for many years, the major breakthroughs from these
techniques have come only in recent years.
Nevertheless, there are several reasons for using data cleaning in data analysis.
A bivariate analysis deals with the relationship between two sets of data. These sets of paired data come
from related sources or samples. There are various tools for analyzing such data, including chi-squared
tests and t-tests, when the data are correlated.
If the data can be quantified, it can be analyzed using a graph plot or a scatter plot. A bivariate analysis
tests the strength of the correlation between the two data sets.
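One common way to quantify the strength of a linear relationship between two paired variables is the Pearson correlation coefficient. The sketch below computes it by hand; the function name `pearson_r` and the sales and expenses figures are made up for illustration.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical paired observations for one product
sales = [10, 12, 15, 19, 24]
expenses = [4, 5, 6, 8, 10]

r = pearson_r(sales, expenses)
print(r)
```

Values near +1 or -1 indicate a strong linear relationship, and values near 0 indicate a weak one.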
The multivariate analysis deals with the study of more than two variables to understand the effect of
variables on the responses.
102. Can you cite some examples where a false negative is more important than a false positive?
Answer: 1: Assume there is an airport ‘A’ which has received high-security threats and, based on certain
characteristics, identifies whether a particular passenger can be a threat or not. Due to a shortage of
staff, the airport decides to scan only passengers predicted as risk positives by its predictive model. What will
happen if a passenger who is a true threat is flagged as a non-threat by the airport’s model?
2: What if the jury or judge decides to let a criminal go free?
3: What if you refused to marry a very good person based on your predictive model, and you happen to
meet him/her after a few years and realize that you had a false negative?
104. What do you understand by the Selection Bias? What are its various types?
Answer: Selection bias is typically associated with research that doesn’t have a random selection of
participants. It is a type of error that occurs when a researcher decides who is going to be studied. On
some occasions, selection bias is also referred to as the selection effect.
In other words, selection bias is a distortion of statistical analysis that results from the sample collecting
method. When selection bias is not taken into account, some conclusions made by a research study
might not be accurate. Following are the various types of selection bias:
Sampling Bias – A systematic error resulting from a non-random sample of a population, which causes certain
members to be less likely to be included than others and so produces a biased sample.
Time Interval – A trial might be ended at an extreme value, usually due to ethical reasons, but the
extreme value is most likely to be reached by the variable with the most variance, even though all
variables have a similar mean.
Data – Results when specific data subsets are selected to support a conclusion, or when bad
data is rejected arbitrarily.
Attrition – Caused by attrition, i.e. loss of participants, and by discounting trial subjects or tests that didn’t run
to completion.
112. How has your prior experience prepared you for a role in data science?
Answer: This question helps determine the candidate’s experience from a holistic perspective and
reveals experience in demonstrating interpersonal, communication and technical skills. It is important to
understand this because data scientists must be able to communicate their findings, work in a team
environment and have the skills to perform the task.
Here are some possible answers to look for:
• Project management skills
• Examples of working in a team environment
• Ability to identify errors
A substantial response may include the following: “My experience in my previous positions has prepared
me for this job by giving me the skills I need to work in a group setting, manage projects and quickly
identify errors.”
115. Can you compare the validation set with the test set?
Answer: A validation set is part of the training set used for parameter selection as well as for avoiding
overfitting of the machine learning model being developed. On the contrary, a test set is meant for
evaluating or testing the performance of a trained machine learning model.
116. Please explain the concept of a Boltzmann Machine.
Answer: A Boltzmann Machine features a simple learning algorithm that enables it to discover
interesting features representing complex regularities in the training data. It is mainly used for
optimizing the weights and quantities for a given problem.
This simple learning algorithm is very slow in networks that have many
layers of feature detectors.
118. Companies are now investing heavily in money and time to build dashboards. Why?
Answer: To make stakeholders more aware of the business through data. Working on visualization
projects helps you develop one of the key skills every data scientist should possess, i.e. thinking from the
perspective of the end-user.
If you’re learning any visualization tool, download a dataset from Kaggle. Building charts and graphs for
the dashboard should be the last step. First research the domain and think about the KPIs you
would like to see in the dashboard if you were the end-user. Then start building the dashboard
piece by piece.