What Are The Types of Machine Learning?
Among all the ML interview questions we are going to discuss, this is one of the most basic.
Supervised Learning: In this type of Machine Learning, machines learn under the supervision of labeled data. The machine is trained on a training dataset, and it gives the output according to its training.
For example, imagine that we want to predict customer churn for a particular product based on some recorded data. Either the customers will churn or they will not, so the labels for this problem would be 'Yes' and 'No.'
Regression: It is the process of creating a model that predicts continuous real values instead of classes or discrete values. It can also identify the trend of the distribution based on historical data, and it is used for predicting the occurrence of an event depending on the degree of association between variables.
For example, the prediction of weather conditions depends on factors such as temperature, air pressure, solar radiation, elevation of the area, and distance from the sea. The relation between these factors assists us in predicting the weather conditions.
Below is the best-fit line that shows the data of weight (Y, the dependent variable) and height (X, the independent variable) of 21-year-old candidates scattered over the plot. This straight line shows the best linear relationship that helps in predicting the weight of candidates according to their height.
To get this best-fit line of the form Y = aX + b, we try to find the best values of a and b. By adjusting the values of a and b, we try to reduce the error in the prediction of Y.
This is how linear regression helps in finding the linear relationship and predicting the output.
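As a quick illustration (the height and weight numbers below are made up for the example), here is how such a best-fit line can be computed in Python with NumPy:
# Fitting a best-fit line Y = a*X + b with NumPy (illustrative height/weight values)
import numpy as np

height = np.array([150, 155, 160, 165, 170, 175, 180])  # X: independent variable (cm)
weight = np.array([50, 53, 57, 62, 66, 71, 75])          # Y: dependent variable (kg)

# polyfit returns the slope (a) and intercept (b) that minimize the squared error
a, b = np.polyfit(height, weight, deg=1)
print(f"Y = {a:.2f} * X + {b:.2f}")

# Predict the weight of a 172 cm candidate using the fitted line
print("Predicted weight:", a * 172 + b)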
4. How will you determine the Machine Learning algorithm that is suitable for
your problem?
To identify the right Machine Learning algorithm for our problem, we should follow the steps below:
Step 1: Categorizing the problem: First, we look at the expected output. If the problem gives the output as a number, then we must use regression techniques; if the output is a different cluster of inputs, then we should use clustering techniques.
Step 2: Checking the algorithms in hand: After classifying the problem, we have to look for the
available algorithms that can be deployed for solving the classified problem.
Step 3: Implementing the algorithms: If there are multiple algorithms available, then we will
implement each one of them, one by one. Finally, we would select the algorithm that gives the
best performance.
Let's now learn about these clustering techniques in detail so that you become capable of differentiating between them:
K-means clustering: This algorithm is commonly used when you have data with no
specific group or category. It allows you to find hidden patterns in the data that
can be used to group the points into various clusters. The variable k represents
the number of groups the data points are divided into, and the points are clustered
based on the similarity of their features. The centroids of the clusters are then used
for labeling new data (a short scikit-learn sketch follows this list).
Mean-shift clustering: The main aim of this algorithm is to update the center-point
candidates to be the mean of the points within a given region and, in this way, find
the center points of all the groups. Unlike k-means clustering, you do not need to
specify the number of clusters, as it can be discovered automatically by the mean shift.
Density-based spatial clustering of applications with noise (DBSCAN): This
clustering is based on density and has similarities with mean-shift clustering. There is
no need to pre-set the number of clusters, but unlike mean-shift, it identifies outliers
and treats them like noise. Moreover, it can identify arbitrarily sized and shaped
clusters without much effort.
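Here is a minimal k-means sketch with scikit-learn (synthetic two-dimensional data assumed), showing how the centroids are used to label a new point:
# Minimal k-means sketch with scikit-learn on synthetic 2-D data
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.RandomState(42)
# Two synthetic blobs of points with no predefined labels
data = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=42).fit(data)
print("Cluster centroids:\n", kmeans.cluster_centers_)

# The centroids are used to label new, unseen points
print("Label of a new point:", kmeans.predict([[4.5, 5.2]]))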
In hypothesis notation, lowercase h is used for a specific hypothesis, while uppercase H is used for the hypothesis space that is being searched.
8. What are the differences between Deep Learning and Machine Learning?
Two of the most significant applications of Bayes' theorem in Machine Learning are Bayesian optimization and Bayesian belief networks. This theorem is also the foundation of the branch of Machine Learning that involves the Naive Bayes classifier.
Holdout method
K-fold cross-validation
Stratified k-fold cross-validation
Leave p-out cross-validation
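As a minimal sketch of the k-fold and stratified k-fold methods listed above (scikit-learn, synthetic data, and a simple logistic regression model are assumptions here):
# Minimal k-fold cross-validation sketch with scikit-learn (synthetic data assumed)
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Stratified k-fold keeps the class ratio roughly constant in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)
print("Fold accuracies:", scores, "Mean:", scores.mean())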
In case there is more than one batch, d*e = i*b is the formula used, wherein 'd' is the number of examples in the dataset, 'e' is the number of epochs, 'i' is the number of iterations, and 'b' is the batch size.
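For instance, with illustrative numbers:
# Worked example of d * e = i * b (illustrative numbers)
d = 10000       # examples in the dataset
e = 5           # epochs
b = 100         # batch size
i = d * e // b  # iterations, so that d * e = i * b
print(i)        # 500 iterations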
Bias is the difference between the average prediction of our model and the correct
value. If the bias is high, then the predictions of the model are not accurate. Hence,
the bias should be as low as possible to make the desired predictions.
Variance measures how much the model's predictions on one training set differ from
the expected predictions over other training sets. High variance may lead to large
fluctuations in the output. Therefore, the model's output should have low variance.
VIF = Variance of the model / Variance of the model with a single independent variable
We have to calculate this ratio for every independent variable. A high VIF indicates strong collinearity among the independent variables.
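A small sketch of how VIF can be computed for each independent variable, using statsmodels on deliberately collinear synthetic data:
# Sketch of computing VIF per independent variable with statsmodels (synthetic data assumed)
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

rng = np.random.RandomState(0)
x1 = rng.normal(size=100)
x2 = 2 * x1 + rng.normal(scale=0.1, size=100)  # deliberately collinear with x1
x3 = rng.normal(size=100)
X = add_constant(pd.DataFrame({"x1": x1, "x2": x2, "x3": x3}))

# A high VIF for a variable signals strong collinearity with the other predictors
for idx, col in enumerate(X.columns):
    if col != "const":
        print(col, round(variance_inflation_factor(X.values, idx), 2))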
16. Explain false negative, false positive, true negative, and true positive with a
simple example.
True Positive (TP): When the Machine Learning model correctly predicts the positive condition or class, it is said to have a True Positive value.
True Negative (TN): When the Machine Learning model correctly predicts the negative condition or class, it is said to have a True Negative value.
False Positive (FP): When the Machine Learning model incorrectly predicts the positive class for a condition that is actually negative, it is said to have a False Positive value.
False Negative (FN): When the Machine Learning model incorrectly predicts the negative class for a condition that is actually positive, it is said to have a False Negative value.
A confusion matrix gives the count of correct and incorrect predictions and also the error types.
Accuracy of the model:
For example, consider a confusion matrix that consists of the True Positive, True Negative, False Positive, and False Negative counts for a classification model. The accuracy of the model can be calculated as follows:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
For the confusion matrix considered here, this works out to 0.78, which means that the model's accuracy is 0.78, corresponding to its True Positive, True Negative, False Positive, and False Negative values.
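As a small illustration, the counts below are hypothetical and are chosen only so that the ratio works out to the 0.78 mentioned above:
# Hypothetical confusion-matrix counts chosen so that the accuracy is 0.78
TP, TN, FP, FN = 40, 38, 12, 10

# Accuracy = (TP + TN) / (TP + TN + FP + FN)
accuracy = (TP + TN) / (TP + TN + FP + FN)
print(accuracy)  # 0.78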
Type I Error: Type I error (False Positive) is an error where the outcome of a test shows the rejection of a true condition.
For example, a cricket match is going on and, when a batsman is not out, the umpire declares that he is out. This is a false positive condition. Here, the test does not accept the true condition that the batsman is not out.
Type II Error: Type II error (False Negative) is an error where the outcome of a test shows the acceptance of a false condition.
For example, the CT scan of a person shows that he does not have a disease when, in reality, he does have it. Here, the test accepts the false condition that the person does not have the disease.
The classification method is chosen over regression when the output of the model needs to indicate the category to which the data points in a dataset belong.
For example, suppose we have the names of some bikes and cars. We would not be interested in finding how these names are correlated with bikes and cars; rather, we would check whether each name belongs to the bike category or to the car category.
Binary Logistic Regression: In this, there are only two outcomes possible.
21. Imagine, you are given a dataset consisting of variables having more than
30% missing values. Let’s say, out of 50 variables, 8 variables have missing
values, which is higher than 30%. How will you deal with them?
To deal with the missing values, we will do the following:
isnull(): For detecting the missing values, we can use the pandas isnull() method.
dropna(): For removing the columns/rows with null values, we can use the dropna() method.
Also, we can use fillna() to fill the missing values with a placeholder value.
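A short pandas sketch of these three methods on a hypothetical DataFrame:
# Handling missing values with pandas (hypothetical DataFrame)
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25, np.nan, 32], "salary": [50000, 60000, np.nan]})

print(df.isnull().sum())      # count the missing values per column
print(df.dropna())            # drop rows that contain any null value
print(df.fillna(df.mean()))   # or fill the gaps with a placeholder such as the column mean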
In the real world, we deal with multi-dimensional data. Thus, data visualization and computation become more challenging as the number of dimensions increases. In such a scenario, we might have to reduce the dimensions to analyze and visualize the data easily. We do this with dimensionality-reduction techniques such as Principal Component Analysis (PCA):
Example: Consider two graphs showing data points (objects) and two directions, one 'green' and the other 'yellow.' We get Graph 2 by rotating Graph 1 so that the x-axis and y-axis represent the 'green' and 'yellow' directions, respectively.
After the rotation of the data points, we can infer that the green direction (x-axis) gives us the line that best fits the data points.
Here, we are representing 2-dimensional data. But in real life, the data would be multi-dimensional and complex. So, after recognizing the importance of each direction, we can reduce the dimensionality of the analysis by cutting off the less significant 'directions.'
Now, we will look into another important Machine Learning Interview Question on PCA.
24. Why is rotation required in PCA? What will happen if you don't rotate the
components?
Rotation is a significant step in PCA as it maximizes the separation within the variance obtained
by components. Due to this, the interpretation of components becomes easier.
The motive behind doing PCA is to choose fewer components that can explain the greatest
variance in a dataset. When rotation is performed, the original coordinates of the points get
changed. However, there is no change in the relative position of the components.
If the components are not rotated, then we will need more components to describe the same amount of variance.
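A minimal PCA sketch with scikit-learn, on synthetic correlated data, showing how the components capture the variance and how the less significant direction can be dropped:
# Minimal PCA sketch with scikit-learn (synthetic correlated 2-D data assumed)
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(1)
x = rng.normal(size=200)
data = np.column_stack([x, 3 * x + rng.normal(scale=0.5, size=200)])  # two correlated features

pca = PCA(n_components=2).fit(data)
# The explained-variance ratio shows how much variance each component captures
print(pca.explained_variance_ratio_)

# Keeping only the first, most significant component reduces the dimensionality
reduced = PCA(n_components=1).fit_transform(data)
print(reduced.shape)  # (200, 1)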
25. We know that one hot encoding increases the dimensionality of a dataset,
but label encoding doesn’t. How?
When we use one hot encoding, there is an increase in the dimensionality of a dataset. The
reason for the increase in dimensionality is that, for every class in the categorical variables, it
forms a different variable.
Example: Suppose there is a variable 'Color' with three levels: Yellow, Purple, and Orange. One hot encoding 'Color' will create three different variables: Color.Yellow, Color.Purple, and Color.Orange.
In label encoding, the classes of a variable are encoded as values such as 0 and 1 within the same column, so no new columns are created. This is why we generally use label encoding for binary variables.
This is the reason that one hot encoding increases the dimensionality of data and label encoding
does not.
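A small sketch contrasting the two encodings on a hypothetical 'Color' column, using pandas and scikit-learn:
# Contrasting one hot encoding and label encoding (hypothetical 'Color' column)
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({"Color": ["Yellow", "Purple", "Orange", "Yellow"]})

# One hot encoding: one new column per class, so the dimensionality grows
print(pd.get_dummies(df, columns=["Color"]))

# Label encoding: the same single column, with each class mapped to an integer
df["Color_label"] = LabelEncoder().fit_transform(df["Color"])
print(df)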
26. What is Overfitting in Machine Learning and how can you avoid it?
Overfitting happens when a machine learns from an inadequate dataset too closely, capturing its noise along with the signal. So, the chance of overfitting is inversely proportional to the amount of data.
For small datasets, we can reduce overfitting with the cross-validation method. In this approach, we divide the dataset into two sections: a training set and a testing set. To train the model, we use the training dataset and, for testing the model on new inputs, we use the testing dataset.
1. Training set: We use the training set for building the model and adjusting its variables. But we cannot rely on the correctness of a model built only on the training set; it might give incorrect outputs when fed new inputs.
2. Validation set: We use a validation set to look into the model's response on samples that do not exist in the training dataset. Then, we tune the hyperparameters on the basis of the model's performance on the validation data.
When we evaluate the model's response using the validation set, we are indirectly training the model with the validation set. This may lead to overfitting of the model to this specific data. So, such a model won't be strong enough to give the desired response to real-world data.
3. Test set: The test dataset is the subset of the actual dataset that has not yet been used to train the model. The model is unaware of this dataset. So, by using the test dataset, we can compute the response of the created model on unseen data. We evaluate the model's performance on the basis of the test dataset.
Note: We expose the model to the test dataset only after tuning the hyperparameters on the validation set.
As we know, evaluating the model on the basis of the validation set alone would not be enough. Thus, we use the test set for computing the efficiency of the model.
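A minimal sketch of such a three-way split with scikit-learn, on synthetic data (the split ratios are illustrative):
# Sketch of a train / validation / test split with scikit-learn (synthetic data assumed)
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)

# First carve out the test set; the model never sees it during training or tuning
X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Then split the remainder into training and validation sets for hyperparameter tuning
X_train, X_val, y_train, y_val = train_test_split(X_temp, y_temp, test_size=0.25, random_state=0)

print(len(X_train), len(X_val), len(X_test))  # 600 / 200 / 200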
We can create an algorithm for a decision tree on the basis of the hierarchy of actions that we have set.
In the decision tree for this example, we have made a sequence of actions for driving a vehicle with or without a license.
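As a toy sketch of this idea (the 'has_license' feature and the labels below are hypothetical), a scikit-learn decision tree learns exactly such a sequence of splits:
# Minimal decision tree sketch (hypothetical 'has_license' feature)
from sklearn.tree import DecisionTreeClassifier, export_text

# Feature: [has_license]; target: 1 = allowed to drive, 0 = not allowed
X = [[0], [0], [1], [1]]
y = [0, 0, 1, 1]

tree = DecisionTreeClassifier().fit(X, y)
# export_text prints the learned sequence of splits (the hierarchy of actions)
print(export_text(tree, feature_names=["has_license"]))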
Here, we use dimensionality reduction to cut down the irrelevant and redundant features with
the help of principal variables. These principal variables are the subgroup of the parent variables
that conserve the feature of the parent variables.
31. Both being tree-based algorithms, how is Random Forest different from
Gradient Boosting Algorithm (GBM)?
The main difference between a random forest and GBM is the use of techniques. Random forest
advances predictions using a technique called ‘bagging.’ On the other hand, GBM advances
predictions with the help of a technique called ‘boosting.’
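As a small hedged sketch (synthetic data and default scikit-learn settings assumed), the two approaches can be compared like this:
# Comparing bagging (random forest) and boosting (GBM) in scikit-learn (synthetic data assumed)
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, random_state=0)

# Random forest: many trees trained independently on bootstrap samples (bagging)
rf_score = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=5).mean()

# GBM: trees built sequentially, each one correcting the errors of the previous ones (boosting)
gbm_score = cross_val_score(GradientBoostingClassifier(random_state=0), X, y, cv=5).mean()

print("Random forest:", rf_score, "GBM:", gbm_score)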
32. Suppose, you found that your model is suffering from high variance. Which
algorithm do you think could handle this situation and why?
Handling High Variance
For handling issues of high variance, we should use the bagging algorithm.
The bagging algorithm splits the data into subgroups by sampling randomly with replacement.
Once the data is split, we build a model on each random sample using a particular training algorithm.
After that, we combine the predictions of these models by voting or averaging, as in the sketch below.
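A minimal sketch of this with scikit-learn's BaggingClassifier (whose default base estimator is a decision tree), on synthetic data:
# Sketch of bagging to reduce variance (synthetic data assumed)
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, random_state=0)

# Each of the 50 base trees is trained on a bootstrap sample (sampling with replacement);
# their predictions are then combined by voting, which reduces the variance of a single tree
bagged = BaggingClassifier(n_estimators=50, random_state=0)
print(cross_val_score(bagged, X, y, cv=5).mean())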
In ROC analysis, the AUC (Area Under the Curve) gives us an idea of the accuracy of the model.
The ROC curve plots the true positive rate against the false positive rate; the greater the Area Under the Curve, the better the performance of the model.
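A small scikit-learn sketch (synthetic data and a simple logistic regression model assumed) that computes the ROC curve points and the AUC:
# Sketch of an ROC curve and AUC with scikit-learn (synthetic data assumed)
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

probs = LogisticRegression(max_iter=1000).fit(X_train, y_train).predict_proba(X_test)[:, 1]

fpr, tpr, _ = roc_curve(y_test, probs)       # points on the ROC curve
print("AUC:", roc_auc_score(y_test, probs))  # closer to 1.0 means better performance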
We can rescale the data using Scikit-learn. The code for rescaling the data using MinMaxScaler is
as follows:
#Rescaling data
import pandas
import numpy
from sklearn.preprocessing import MinMaxScaler

# Path or URL of the CSV file to load (placeholder; point this at your own data source)
url = "data.csv"
# Column names for the nine columns in the CSV (eight inputs plus one output)
names = ['Abhi', 'Piyush', 'Pranay', 'Sourav', 'Sid', 'Mike', 'pedi', 'Jack', 'Tim']
dataframe = pandas.read_csv(url, names=names)
array = dataframe.values
# Splitting the array into input and output
X = array[:, 0:8]
Y = array[:, 8]
scaler = MinMaxScaler(feature_range=(0, 1))
rescaledX = scaler.fit_transform(X)
# Summarizing the modified data
numpy.set_printoptions(precision=3)
print(rescaledX[0:5, :])
Converting data into binary values on the basis of threshold values is known as the binarizing of
data. The values that are less than the threshold are set to 0 and the values that are greater than
the threshold are set to 1. This process is useful when we have to perform feature engineering,
and we can also use it for adding unique features.
We can binarize data using Scikit-learn. The code for binarizing the data using Binarizer is as
follows:
from sklearn.preprocessing import Binarizer
import pandas
import numpy

# Path or URL of the CSV file to load (placeholder; point this at your own data source)
url = "data.csv"
# Column names for the nine columns in the CSV (eight inputs plus one output)
names = ['Abhi', 'Piyush', 'Pranay', 'Sourav', 'Sid', 'Mike', 'pedi', 'Jack', 'Tim']
dataframe = pandas.read_csv(url, names=names)
array = dataframe.values
# Splitting the array into input and output
X = array[:, 0:8]
Y = array[:, 8]
# Values below the threshold become 0; values above it become 1
binarizer = Binarizer(threshold=0.0).fit(X)
binaryX = binarizer.transform(X)
# Summarizing the modified data
numpy.set_printoptions(precision=3)
print(binaryX[0:5, :])
We can standardize the data using Scikit-learn. The code for standardizing the data using
StandardScaler is as follows:
# Python code to standardize data (0 mean, 1 stdev)
from sklearn.preprocessing import StandardScaler
import pandas
import numpy

# Path or URL of the CSV file to load (placeholder; point this at your own data source)
url = "data.csv"
# Column names for the nine columns in the CSV (eight inputs plus one output)
names = ['Abhi', 'Piyush', 'Pranay', 'Sourav', 'Sid', 'Mike', 'pedi', 'Jack', 'Tim']
dataframe = pandas.read_csv(url, names=names)
array = dataframe.values
# Separate the array into input and output components
X = array[:, 0:8]
Y = array[:, 8]
scaler = StandardScaler().fit(X)
rescaledX = scaler.transform(X)
# Summarize the transformed data
numpy.set_printoptions(precision=3)
print(rescaledX[0:5, :])
37. Executing a binary classification tree algorithm is a simple task. But, how
does a tree splitting take place? How does the tree determine which variable to
break at the root node and which at its child nodes?
Gini index and Node Entropy assist the binary classification tree to take decisions. Basically, the
tree algorithm determines the feasible feature that is used to distribute data into the most
genuine child nodes.
According to the Gini index, if we arbitrarily pick a pair of objects from a node, they should be of the same class; for a perfectly pure node, the probability of this event is 1.
1. Compute Gini for the sub-nodes with the formula: the sum of the squares of the probabilities of success and failure (p^2 + q^2)
2. Compute Gini for the split using the weighted Gini score of every node of the split
Entropy is highest when both classes are present in a node in a 50-50 proportion.
Finally, to determine the suitability of a node as the root node, its entropy should be very low.
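A quick worked example with hypothetical class probabilities and node sizes:
# Worked Gini example (hypothetical numbers)
p = 0.7              # probability of the 'success' class in a node
q = 1 - p            # probability of the 'failure' class

gini = p**2 + q**2   # Gini score as defined above: 1.0 for a pure node, 0.5 for a 50-50 node
print(gini)          # 0.58

# Weighted Gini for a split: weight each child node's Gini by its share of the samples
children = [(0.6, 0.9), (0.4, 0.55)]    # (fraction of samples, Gini) for each child node
print(sum(w * g for w, g in children))  # 0.76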