Information Security Awareness - Refresher Course
Try out the code snippets given for the case study.
Introduction
Unstructured data, as the name suggests, does not have a structured format and may
contain data such as dates, numbers or facts.
This results in irregularities and ambiguities which make it difficult to understand using
traditional programs when compared to data stored in fielded form in databases or
annotated (semantically tagged) in documents.
Source: Wikipedia.
Common examples of unstructured data include:
Emails
PDF files
Spreadsheets
Digital Images
Video
Audio
Problem Description
Let us understand unstructured data classification through the following case study:
In our day-to-day lives, we receive a large number of spam/junk messages, either in the form of text messages (SMS) or e-mails. It is important to filter out these spam messages, since they are not truthful or trustworthy.
In this case study, we apply various machine learning algorithms to categorize the
messages depending on whether they are spam or not.
Your Playground
You can try the hands-on exercises using Katacoda or by setting up the coding environment on your local machine.
You can use the Python editor (by default you have an app.py file) for trying out the code snippets given in this course.
You can execute the Python code by clicking the Run command from the left pane.
Your Playground...
Note: In case you don't find any of the required packages while playing around with the case study, you can do the following:
pip install nltk --target=./
Here, nltk is the package you need to download, as an example.
For NLTK, you have a few other dependent packages. You can perform the following steps to download them:
o In a Python session, type nltk.download()
Note: You can find brief descriptions of the Python packages here.
Dataset Download
curl https://www.researchgate.net/profile/Tiago_Almeida4/publication/258050002_SMS_Spam_Collection_v1/data/00b7d526d127ded162000000/SMSSpamCollection.txt > dataset.csv
Dataset Description
The dataset is the SMS Spam Collection: a set of SMS messages, each labeled as either spam or ham (legitimate).
Data Loading
To start with data loading, import the required python package and load the downloaded
CSV file.
The data can be stored as dataframe for easy data manipulation/analysis. Pandas is
one of the most widely used libraries for this.
import pandas as pd

# Data Loading: the SMS Spam Collection file is tab-separated, with no header row
messages = pd.read_csv('dataset.csv', sep='\t', names=['label', 'message'])
print(len(messages))
As you can see, our dataset has 2 columns without any headers.
This code snippet reads the data using pandas and labels the column names
as label and message.
Data Analysis
Analyzing data is a must in any classification problem. The goal of data analysis is to
derive useful information from the given data for making decisions.
In this section, we will analyze the dataset in terms of size, headers, view data summary
and a sample data.
data_size=messages.shape
print(data_size)
messages_col_names=list(messages.columns)
print(messages_col_names)
print(messages.groupby('label').describe())
To see a sample data, use the following command :
print(messages.head(3))
Target Identification
In this case, you aim to identify whether the message is spam or not.
By observing the columns, the label column has the values spam and ham. We can call this case study a Binary Classification, since it has only two possible outcomes.
message_target=messages['label']
print(message_target)
Tokenization
In Natural Language Processing (NLP), tokenization is the initial preprocessing step. Splitting a
sentence into tokens helps to remove unwanted information in the raw text such as
white spaces, line breaks and so on.
import nltk
from nltk.tokenize import word_tokenize

def split_tokens(message):
    message = message.lower()
    word_tokens = word_tokenize(message)
    return word_tokens
Lemmatization and Stop Word Removal
from nltk.corpus import stopwords

def stopword_removal(message):
    stop_words = set(stopwords.words('english'))
    # Keep only tokens that are not English stop words
    filtered_sentence = [word for word in message if word not in stop_words]
    return filtered_sentence
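Note that the apply call below consumes a lemmatized_message column that is never created in the snippets shown. A minimal sketch of the missing step, assuming NLTK's WordNetLemmatizer and the split_tokens function defined earlier (the column names follow the course code):

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()  # requires the NLTK wordnet data: nltk.download('wordnet')

def lemmatization(tokens):
    # Map each token to its base form using WordNet's lexical knowledge base
    return [lemmatizer.lemmatize(token) for token in tokens]

messages['tokenized_message'] = messages.apply(lambda row: split_tokens(row['message']), axis=1)
messages['lemmatized_message'] = messages.apply(lambda row: lemmatization(row['tokenized_message']), axis=1)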
messages['preprocessed_message'] = messages.apply(lambda row:
stopword_removal(row['lemmatized_message']),axis=1)
Training_data=pd.Series(list(messages['preprocessed_message']))
Training_label=pd.Series(list(messages['label']))
There are many feature extraction techniques for text; we will be looking into a few specific ones used for unstructured data.
Bag Of Words(BOW)
Bag of Words (BOW) is one of the most widely used methods for generating
features in Natural Language Processing.
The Term Document Matrix (TDM) is a matrix that contains the frequency of
occurrence of terms in a collection of documents.
In a TDM, the rows represent documents and columns represent the terms.
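The transform call below uses a Total_Dictionary_TDM object that is never constructed in the text. A minimal sketch of the missing step, assuming scikit-learn's CountVectorizer (the token lists are joined back into strings, since CountVectorizer expects raw text):

from sklearn.feature_extraction.text import CountVectorizer

# CountVectorizer expects strings, so join each preprocessed token list
Training_data = Training_data.apply(' '.join)
count_vectorizer = CountVectorizer()
Total_Dictionary_TDM = count_vectorizer.fit(Training_data)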
message_data_TDM = Total_Dictionary_TDM.transform(Training_data)
In a Term Frequency-Inverse Document Frequency (TFIDF) matrix, term importance is expressed by Inverse Document Frequency (IDF). IDF diminishes the weight of the most commonly occurring words and increases the weightage of rare words.
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer()
Total_Dictionary_TFIDF = tfidf_vectorizer.fit(Training_data)
message_data_TFIDF = Total_Dictionary_TFIDF.transform(Training_data)
Let's take the TDM matrix for further evaluation. You can also try out the same using
TFIDF matrix.
Which preprocessing technique is used to remove the most commonly used words?
Tokenization
Lemmatization
Stopword removal
Classification Algorithms
There are various algorithms to solve classification problems. The code to try out a few of these algorithms will be presented in the upcoming cards.
Note: The explanations for these algorithms are given in the Machine Learning Axioms course. Refer to that course for further details.
1. Initialize the classifier model.
2. Train the classifier - fit(X, y) learns a model from the training data X and labels y.
3. Predict the target - Given an unlabeled observation X, predict(X) returns the predicted label y.
4. Evaluate the classifier model - score(X, y) returns the score for the given test data X and test label y.
The decision tree model predicts the class/target by learning simple decision
rules from the features of the data.
Here, a random forest fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve the predictive accuracy.
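A minimal end-to-end sketch of the four steps listed above, assuming the message_data_TDM features and Training_label series built earlier; the names test_label, message_predicted_target and classifier are reused by the evaluation sections below:

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hold out part of the data for evaluation
train_data, test_data, train_label, test_label = train_test_split(
    message_data_TDM, Training_label, test_size=0.2, random_state=7)

classifier = DecisionTreeClassifier()                      # 1. Initialize
classifier.fit(train_data, train_label)                    # 2. Train
message_predicted_target = classifier.predict(test_data)  # 3. Predict
print(classifier.score(test_data, test_label))             # 4. Evaluate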
Model Tuning
The classification algorithms in machine learning are parameterized. Modifying any of
those parameters can influence the results. So algorithm/model tuning is essential to
find out the best model.
For example, let's take the Random Forest Classifier and change the values of a few parameters (n_estimators, max_features), as sketched after the list below.
Split the data into a train set, a validation set and a test set.
o Training Set: The data used to train the classifier model.
o Validation Set: The data used to tune the classifier model parameters, i.e., to understand how well the model has been trained (a part of the training data).
o Testing Set: The data used to evaluate the performance of the classifier (unseen data by the classifier).
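A minimal tuning sketch, assuming scikit-learn's RandomForestClassifier and the train/test split from the classifier-building sketch above (the parameter grids are illustrative, not prescribed by the course):

from sklearn.ensemble import RandomForestClassifier

# Compare a few settings of n_estimators and max_features on held-out data
for n_estimators in (10, 50, 100):
    for max_features in ('sqrt', 'log2'):
        model = RandomForestClassifier(n_estimators=n_estimators, max_features=max_features)
        model.fit(train_data, train_label)
        print(n_estimators, max_features, model.score(test_data, test_label))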
Cross Validation
Cross validation is a model validation technique to evaluate the performance of
a model on unseen data (validation set).
Points to remember:
Cross validation gives high variance if the testing set and training set are not drawn from the same population.
Allowing training data to be included in testing data will not give actual
performance results.
In cross validation, the number of samples used for training the model is reduced and
the results depend on the choice of the pair of training and testing sets.
StratifiedShuffleSplit would suit our case study, as the dataset has a class imbalance (far more ham than spam), which you can verify with messages['label'].value_counts(). The following code snippet uses it:
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.svm import SVC, LinearSVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.multiclass import OneVsRestClassifier

seed = 7
# Stratified shuffling preserves the spam/ham ratio in each split
sss = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=seed)
train_index, test_index = next(sss.split(message_data_TDM, Training_label))
X_train, X_test = message_data_TDM[train_index], message_data_TDM[test_index]
y_train, y_test = Training_label[train_index], Training_label[test_index]
classifiers = [
    DecisionTreeClassifier(),
    SGDClassifier(loss='modified_huber', shuffle=True),
    SVC(kernel="linear", C=0.025),
    KNeighborsClassifier(),
    OneVsRestClassifier(LinearSVC()),
]
for clf in classifiers:
    score = 0
    clf.fit(X_train, y_train)
    score = score + clf.score(X_test, y_test)
    print(score)
The above code evaluates a set of candidate classifiers on a stratified split. It helps to select the best classifier based on the cross validation scores; the classifier with the highest score can be used for building the classification model.
Classification Accuracy
The classification accuracy is defined as the percentage of correct predictions.
from sklearn.metrics import accuracy_score

print('Accuracy Score', accuracy_score(test_label, message_predicted_target))
score = classifier.score(test_data, test_label)
test_label.value_counts()
This simple classification accuracy does not tell us the types of errors made by our classifier.
It is an easy metric to compute, but it does not reveal the underlying distribution of response values.
Confusion Matrix
It is a technique to evaluate the performance of a classifier.
The rows and columns of the table show the count of false positives, false
negatives, true positives and true negatives.
from sklearn.metrics import confusion_matrix

print('Confusion Matrix', confusion_matrix(test_label, message_predicted_target))
The first parameter shows true values and the second parameter shows predicted
values.
Confusion Matrix
[Image: confusion matrix for a two-class classifier, with cells for true positives, false positives, false negatives and true negatives. For our case study, the confusion matrix of the Decision Tree Classifier is plotted.]
Classification Report
The classification_report function shows a text report with the commonly used
classification metrics.
from sklearn.metrics import classification_report

target_names = ['ham', 'spam']  # the two classes in our dataset
print(classification_report(test_label, message_predicted_target, target_names=target_names))
Precision
When the prediction is positive, how often is it correct? Precision = TP / (TP + FP).
Recall
It is the true positive rate: of all actual positive instances, how many are predicted positive? Recall = TP / (TP + FN).
Other Libraries
For demonstration purposes, we have used Python with NLTK. There are many more libraries, including ones specific to Java/Ruby, etc.
NLP Libraries
True Negative is when the predicted instance and the actual instance are positive. TRUE OR FALSE
True Positive is when the predicted instance and the actual instance are not negative. TRUE OR FALSE
Q&A
To view the first 3 rows of the dataset, which of the following commands are used?
sentiment_analysis_data.get(3)
sentiment_analysis_data.select(3)
sentiment_analysis_data.top(3)
sentiment_analysis_data.head(3)
General strategies
The existing multi-class classification techniques can be categorized into (i) Transformation to binary
(ii) Extension from binary and (iii) Hierarchical classification.[1]
Transformation to binary
This section discusses strategies for reducing the problem of multiclass classification to multiple
binary classification problems. It can be categorized into One vs Rest and One vs One. The
techniques developed based on reducing the multi-class problem into multiple binary problems can
also be called problem transformation techniques.
One-vs.-rest
One-vs.-rest[2]:182, 338 (or one-vs.-all, OvA or OvR, one-against-all, OAA) strategy involves training a
single classifier per class, with the samples of that class as positive samples and all other samples
as negatives. This strategy requires the base classifiers to produce a real-valued confidence score
for its decision, rather than just a class label; discrete class labels alone can lead to ambiguities,
where multiple classes are predicted for a single sample.[3]:182
In pseudocode, the training algorithm for an OvA learner constructed from a binary classification learner L is as follows:
Inputs:
- L, a learner (training algorithm for binary classifiers)
- samples X
- labels y where yi ∈ {1, …, K} is the label for the sample Xi
Output:
- a list of classifiers fk for k ∈ {1, …, K}
Procedure:
- For each k in {1, …, K}: construct a new label vector z where zi = 1 if yi = k and zi = 0 otherwise, and apply L to X, z to obtain fk.
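A minimal Python sketch of this procedure, assuming a scikit-learn-style binary learner that exposes fit and decision_function (all names here are illustrative, not part of the original text):

import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression

def train_ova(X, y, base_learner):
    # One binary classifier per class: samples of class k vs. all others
    classifiers = {}
    for k in np.unique(y):
        z = (y == k).astype(int)  # relabelled targets for class k
        classifiers[k] = clone(base_learner).fit(X, z)
    return classifiers

def predict_ova(classifiers, X):
    # Choose the class whose classifier reports the highest confidence score
    keys = list(classifiers)
    scores = np.vstack([classifiers[k].decision_function(X) for k in keys])
    return np.array(keys)[np.argmax(scores, axis=0)]

# Example: models = train_ova(X_train, y_train, LogisticRegression())
#          y_pred = predict_ova(models, X_test)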
Although this strategy is popular, it is a heuristic that suffers from several problems.
Firstly, the scale of the confidence values may differ between the binary classifiers.
Second, even if the class distribution is balanced in the training set, the binary
classification learners see unbalanced distributions because typically the set of
negatives they see is much larger than the set of positives.[3]:338
One-vs.-one
In the one-vs.-one (OvO) reduction, one trains K (K − 1) / 2 binary classifiers for
a K-way multiclass problem; each receives the samples of a pair of classes from the
original training set, and must learn to distinguish these two classes. At prediction
time, a voting scheme is applied: all K (K − 1) / 2 classifiers are applied to an
unseen sample and the class that got the highest number of "+1" predictions gets
predicted by the combined classifier.[3]:339
Like OvR, OvO suffers from ambiguities in that some regions of its input space may
receive the same number of votes.[3]:183
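A rough sketch of OvO training and voting under the same assumptions (illustrative names, scikit-learn-style binary learner):

import itertools
import numpy as np
from sklearn.base import clone
from sklearn.svm import LinearSVC

def train_ovo(X, y, base_learner):
    # One binary classifier per unordered pair of classes: K(K-1)/2 in total
    classifiers = {}
    for a, b in itertools.combinations(np.unique(y), 2):
        mask = (y == a) | (y == b)
        classifiers[(a, b)] = clone(base_learner).fit(X[mask], y[mask])
    return classifiers

def predict_ovo(classifiers, X):
    # Every pairwise classifier casts one vote per sample; the majority wins
    votes = np.stack([clf.predict(X) for clf in classifiers.values()])
    predictions = []
    for column in votes.T:
        values, counts = np.unique(column, return_counts=True)
        predictions.append(values[np.argmax(counts)])
    return np.array(predictions)

# Example: models = train_ovo(X_train, y_train, LinearSVC())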
Hierarchical classification
Hierarchical classification tackles the multi-class classification problem by dividing the output space into a tree. Each parent node is divided into multiple child nodes, and the process continues until each child node represents only one class.
Several methods have been proposed based on hierarchical classification.
Learning paradigms
Based on learning paradigms, the existing multi-class classification techniques can
be classified into batch learning and online learning. Batch learning algorithms
require all the data samples to be available beforehand. It trains the model using the
entire training data and then predicts the test sample using the found relationship.
The online learning algorithms, on the other hand, incrementally build their models
in sequential iterations. In iteration t, an online algorithm receives a sample, xt and
predicts its label ŷt using the current model; the algorithm then receives yt, the true
label of xt and updates its model based on the sample-label pair: (xt, yt). Recently, a
new learning paradigm called progressive learning technique has been
developed.[4] The progressive learning technique is capable not only of learning from new samples but also of learning new classes of data, while retaining the knowledge learnt thus far.
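A toy sketch of the online loop described above, assuming scikit-learn's SGDClassifier and its partial_fit API (the data-generating rule is invented purely for illustration):

import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
model = SGDClassifier()
classes = np.array([0, 1])  # all classes must be declared up front for partial_fit
for t in range(100):
    x_t = rng.normal(size=(1, 5))         # receive a sample x_t
    y_t = np.array([int(x_t.sum() > 0)])  # toy rule standing in for the true label
    if t > 0:
        y_hat = model.predict(x_t)        # predict with the current model
        # (y_hat could be compared with y_t to track online accuracy)
    model.partial_fit(x_t, y_t, classes=classes)  # update on the pair (x_t, y_t)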
The Term Document Matrix (TDM) is a matrix that contains the frequency of occurrence of
terms in a collection of documents. In a TDM, the rows represent documents and columns
represent the terms.
To view the first 3 rows of the dataset, which of the following commands are used?
sentiment_analysis_data.get(3)(X)
sentiment_analysis_data.select(3)
sentiment_analysis_data.head(3)
sentiment_analysis_data.top(3)
Classification Algorithms
A technique used to depict the performance in a tabular form that has 2 dimensions
namely “actual” and “predicted” sets of data.
Higher value of which of the following hyperparameters is better for decision tree
algorithm?
Number of samples used for split Depth of tree
Cannot say Samples for leaf
Usually, if we increase the depth of the tree, it will cause overfitting. Learning rate is not a hyperparameter in random forest. Increasing the number of trees does not cause overfitting.
27/07/18 (2)
What is the output of the sentence “Good words bring good feelings to the heart”
after performing tokenization, lemmatization and stop word removal.
'Good words bring good feelings heart'
['Good', 'words', 'bring', 'good', 'feelings', 'to', 'the', 'heart']
['Good', 'word', 'bring', 'good', 'feeling', 'to', 'the', 'heart']
'Good word bring good feeling heart'
26/07/18 (3)
To view the first 3 rows of the dataset, which of the following commands are used?
sentiment_analysis_data.get(3)
sentiment_analysis_data.head(3)
sentiment_analysis_data.top(3)
sentiment_analysis_data.select(3)
SVM is a
weakly supervised learning algorithm.
Semi-supervised learning algorithm.
supervised learning algorithm.
unsupervised learning algorithm.
Choose the correct sequence for classifier building from the following:
None of the options
Train -> Test -> Initialize -> Predict
Initialize -> Evaluate -> Train -> Predict
Initialize -> Train -> Predict -> Evaluate
27/07/18 (1)
The data you have is called 'mixed data' because it has both numerical and categorical values. And since you have class labels, it is a classification problem. One option is to go with decision trees, which you already tried. Other possibilities are naive Bayes, where you model numeric attributes by a Gaussian distribution or similar. You can also employ a minimum distance or KNN based approach; however, the cost function must be able to handle both data types together. If these approaches don't work, then try ensemble techniques: try bagging with decision trees, or Random Forest, which combines bagging and random subspaces. With mixed data, choices are limited and you need to be cautious and creative with your choices.
To view the first 3 rows of the dataset, which of the following commands are used?
sentiment_analysis_data.head(3)
sentiment_analysis_data.select(3)
sentiment_analysis_data.get(3)
sentiment_analysis_data.top(3)
Which of the following command is used to view the dataset SIZE and what is the
value returned?
sentiment_analysis_data.shape,(7086, 3)
sentiment_analysis_data.shape(),(7086, 2)
sentiment_analysis_data.size(),(7086, 2)
sentiment_analysis_data.size,(7086, 3)
A technique used to depict the performance in a tabular form that has 2 dimensions namely "actual" and "predicted" sets of data.
Classification Report
Classification Accuracy
Confusion Matrix
Cross Validation
In a TDM, the rows represent documents and columns represent the terms.
Which of the following command is used to view the dataset SIZE and what is the value returned?
sentiment_analysis_data.size,(7086, 3)
sentiment_analysis_data.shape(),(7086, 2)
sentiment_analysis_data.size(),(7086, 2)
sentiment_analysis_data.shape,(7086, 3)
Movie recommendation systems are an example of which of the following?
1. Classification
2. Clustering
3. Reinforcement Learning
4. Regression
Options:
A. 2 Only
B. 1 and 2
C. 1 and 3
D. 2 and 3
E. 1, 2 and 3
F. 1, 2, 3 and 4
Solution: (E)
Generally, movie recommendation systems cluster the users in a finite number of similar groups
based on their previous activities and profile. Then, at a fundamental level, people in the same
cluster are made similar recommendations.
In some scenarios, this can also be approached as a classification problem for assigning the most
appropriate movie class to the user of a specific group of users. Also, a movie recommendation
system can be viewed as a reinforcement learning problem where it learns by its previous
recommendations and improves the future recommendations.
Sentiment analysis is an example of which of the following?
1. Regression
2. Classification
3. Clustering
4. Reinforcement Learning
Options:
A. 1 Only
B. 1 and 2
C. 1 and 3
D. 1, 2 and 3
E. 1, 2 and 4
F. 1, 2, 3 and 4
Solution: (E)
Sentiment analysis at the fundamental level is the task of classifying the sentiments represented
in an image, text or speech into a set of defined sentiment classes like happy, sad, excited,
positive, negative, etc. It can also be viewed as a regression problem for assigning a sentiment
score of say 1 to 10 for a corresponding image, text or speech.
Another way of looking at sentiment analysis is to consider it using a reinforcement learning
perspective where the algorithm constantly learns from the accuracy of past sentiment analysis
performed to improve the future performance.
Q3. Can decision trees be used for performing clustering?
A. True
B. False
Solution: (A)
Decision trees can also be used to cluster the data, but clustering often generates natural clusters and is not dependent on any objective function.
Q4. Which of the following is the most appropriate strategy for data cleaning before
performing clustering analysis, given a less than desirable number of data points:
1. Capping and flooring of variables
2. Removal of outliers
Options:
A. 1 only
B. 2 only
C. 1 and 2
Solution: (A)
Removal of outliers is not recommended if the data points are few in number. In this scenario, capping and flooring of variables is the most appropriate strategy.
Q5. What is the minimum no. of variables/ features required to perform clustering?
A. 0
B. 1
C. 2
D. 3
Solution: (B)
At least a single variable is required to perform clustering analysis. Clustering analysis with a
single variable can be visualized with the help of a histogram.
Q6. For two runs of K-Means clustering, is it expected to get the same clustering results?
A. Yes
B. No
Solution: (B)
The K-Means clustering algorithm converges to local minima, which might also correspond to the global minima in some cases, but not always. Therefore, it's advised to run the K-Means algorithm multiple times before drawing inferences about the clusters.
However, note that it's possible to receive the same clustering results from K-Means by setting the same seed value for each run. But that is done by simply making the algorithm choose the same set of random numbers for each run.
Q7. Is it possible that the assignment of observations to clusters does not change between successive iterations in K-Means?
A. Yes
B. No
C. Can’t say
D. None of these
Solution: (A)
When the K-Means algorithm has reached the local or global minima, it will not alter the
assignment of data points to clusters for two successive iterations.
Q8. Which of the following can act as possible termination conditions in K-Means?
1. For a fixed number of iterations.
2. Assignment of observations to clusters does not change between iterations (except for cases with a bad local minimum).
3. Centroids do not change between successive iterations.
4. Terminate when RSS falls below a threshold.
Options:
A. 1, 3 and 4
B. 1, 2 and 3
C. 1, 2 and 4
D. All of the above
Solution: (D)
All four conditions can be used as possible termination condition in K-Means clustering:
1. This condition limits the runtime of the clustering algorithm, but in some cases the
quality of the clustering will be poor because of an insufficient number of iterations.
2. Except for cases with a bad local minimum, this produces a good clustering, but runtimes
may be unacceptably long.
3. This also ensures that the algorithm has converged at the minima.
4. Terminate when RSS falls below a threshold. This criterion ensures that the clustering is
of a desired quality after termination. Practically, it’s a good practice to combine it with a
bound on the number of iterations to guarantee termination.
Q9. Which of the following clustering algorithms suffers from the problem of convergence
at local optima?
Options:
A. 1 only
B. 2 and 3
C. 2 and 4
D. 1 and 3
E. 1,2 and 4
Solution: (D)
Out of the options given, only the K-Means clustering algorithm and the EM clustering algorithm have the drawback of converging at local minima.
Q10. Which of the following clustering algorithms is most sensitive to outliers?
Solution: (A)
Out of all the options, K-Means clustering algorithm is most sensitive to outliers as it uses the
mean of cluster data points to find the cluster center.
Q11. After performing K-Means Clustering analysis on a dataset, you observed the
following dendrogram. Which of the following conclusion can be drawn from the
dendrogram?
A. There were 28 data points in clustering analysis
D. The above dendrogram interpretation is not possible for K-Means clustering analysis
Solution: (D)
A dendrogram is not possible for K-Means clustering analysis. However, one can create a clustergram based on K-Means clustering analysis.
Q12. How can Clustering (Unsupervised Learning) be used to improve the accuracy of
Linear Regression model (Supervised Learning):
A. 1 only
B. 1 and 2
C. 1 and 4
D. 3 only
E. 2 and 4
F. All of the above
Solution: (F)
Creating an input feature for cluster ids as ordinal variable or creating an input feature for cluster
centroids as a continuous variable might not convey any relevant information to the regression
model for multidimensional data. But for clustering in a single dimension, all of the given
methods are expected to convey meaningful information to the regression model. For example,
to cluster people in two groups based on their hair length, storing clustering ID as ordinal
variable and cluster centroids as continuous variables will convey meaningful information.
Q13. What could be the possible reason(s) for producing two different dendrograms using
agglomerative clustering algorithm for the same dataset?
A. Proximity function used
B. No. of data points used
C. No. of variables used
D. B and C only
E. All of the above
Solution: (E)
Change in either of Proximity function, no. of data points or no. of variables will lead to different
clustering results and hence different dendrograms.
Q14. In the figure below, if you draw a horizontal line on y-axis for y=2. What will be the
number of clusters formed?
A. 1
B. 2
C. 3
D. 4
Solution: (B)
Since the number of vertical lines intersecting the red horizontal line at y=2 in the dendrogram
are 2, therefore, two clusters will be formed.
Q15. What is the most appropriate no. of clusters for the data points represented by the
following dendrogram:
A. 2
B. 4
C. 6
D. 8
Solution: (B)
The number of clusters that best depicts different groups can be chosen by observing the dendrogram. The best choice is the number of vertical lines in the dendrogram cut by a horizontal line that can traverse the maximum distance vertically without intersecting a cluster.
In the above example, the best choice of no. of clusters will be 4 as the red horizontal line in the
dendrogram below covers maximum vertical distance AB.
Q16. In which of the following cases will K-Means clustering fail to give good results?
1. Data points with outliers
2. Data points with different densities
3. Data points with round shapes
4. Data points with non-convex shapes
Options:
A. 1 and 2
B. 2 and 3
C. 2 and 4
D. 1, 2 and 4
E. 1, 2, 3 and 4
Solution: (D)
K-Means clustering algorithm fails to give good results when the data contains outliers, the
density spread of data points across the data space is different and the data points follow non-
convex shapes.
Q17. Which of the following metrics, do we have for finding dissimilarity between two
clusters in hierarchical clustering?
1. Single-link
2. Complete-link
3. Average-link
Options:
A. 1 and 2
B. 1 and 3
C. 2 and 3
D. 1, 2 and 3
Solution: (D)
All of the three methods i.e. single link, complete link and average link can be used for finding
dissimilarity between two clusters in hierarchical clustering.
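A quick way to try all three criteria, assuming SciPy's hierarchical clustering API (the six random points are illustrative, echoing the six-point examples below):

import numpy as np
from scipy.cluster.hierarchy import linkage

rng = np.random.default_rng(0)
points = rng.random((6, 2))  # six toy points
for method in ('single', 'complete', 'average'):
    Z = linkage(points, method=method)
    print(method, Z[:, 2])  # the third column of Z holds the merge distances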
Which of the following is/are true about clustering analysis?
1. Clustering analysis is negatively affected by multicollinearity of features
2. Clustering analysis is negatively affected by heteroscedasticity
Options:
A. 1 only
B. 2 only
C. 1 and 2
D. None of them
Solution: (A)
Clustering analysis is not negatively affected by heteroscedasticity but the results are negatively
impacted by multicollinearity of features/ variables used in clustering as the correlated feature/
variable will carry extra weight on the distance calculation than desired.
Which of the following clustering representations and dendrogram depicts the use of MIN or Single link proximity function in hierarchical clustering:
A. - D. [The answer options are dendrogram images, not reproduced here.]
Solution: (A)
For the single link or MIN version of hierarchical clustering, the proximity of two clusters is
defined to be the minimum of the distance between any two points in the different clusters. For
instance, from the table, we see that the distance between points 3 and 6 is 0.11, and that is the
height at which they are joined into one cluster in the dendrogram. As another example, the
distance between clusters {3, 6} and {2, 5} is given by dist({3, 6}, {2, 5}) = min(dist(3, 2),
dist(6, 2), dist(3, 5), dist(6, 5)) = min(0.1483, 0.2540, 0.2843, 0.3921) = 0.1483.
Which of the following clustering representations and dendrogram depicts the use of MAX
or Complete link proximity function in hierarchical clustering:
A. - D. [The answer options are dendrogram images, not reproduced here.]
Solution: (B)
For the complete link or MAX version of hierarchical clustering, the proximity of two clusters is
defined to be the maximum of the distance between any two points in the different clusters.
Similarly, here points 3 and 6 are merged first. However, {3, 6} is merged with {4}, instead of
{2, 5}. This is because the dist({3, 6}, {4}) = max(dist(3, 4), dist(6, 4)) = max(0.1513, 0.2216) =
0.2216, which is smaller than dist({3, 6}, {2, 5}) = max(dist(3, 2), dist(6, 2), dist(3, 5), dist(6, 5))
= max(0.1483, 0.2540, 0.2843, 0.3921) = 0.3921 and dist({3, 6}, {1}) = max(dist(3, 1), dist(6,
1)) = max(0.2218, 0.2347) = 0.2347.
Which of the following clustering representations and dendrogram depicts the use of Group average proximity function in hierarchical clustering:
A. - D. [The answer options are dendrogram images, not reproduced here.]
Solution: (C)
For the group average version of hierarchical clustering, the proximity of two clusters is defined
to be the average of the pairwise proximities between all pairs of points in the different clusters.
This is an intermediate approach between MIN and MAX. This is expressed by the following equation:
proximity(Ci, Cj) = (sum of dist(x, y) over all points x in Ci and y in Cj) / (mi * mj), where mi and mj are the numbers of points in clusters Ci and Cj.
Here are the distances between some clusters: dist({3, 6, 4}, {1}) = (0.2218 + 0.3688 + 0.2347)/(3 ∗
1) = 0.2751. dist({2, 5}, {1}) = (0.2357 + 0.3421)/(2 ∗ 1) = 0.2889. dist({3, 6, 4}, {2, 5}) =
(0.1483 + 0.2843 + 0.2540 + 0.3921 + 0.2042 + 0.2932)/(6∗1) = 0.2637. Because dist({3, 6, 4},
{2, 5}) is smaller than dist({3, 6, 4}, {1}) and dist({2, 5}, {1}), these two clusters are merged at
the fourth stage
Which of the following clustering representations and dendrogram depicts the use of Ward's method proximity function in hierarchical clustering:
A. - D. [The answer options are dendrogram images, not reproduced here.]
Solution: (D)
Ward's method is a centroid method. The centroid method calculates the proximity between two clusters by calculating the distance between the centroids of the clusters. For Ward's method, the proximity between two clusters is defined as the increase in the squared error that results when two clusters are merged. The figure shows the results of applying Ward's method to the sample data set of six points; the resulting clustering is somewhat different from those produced by MIN, MAX, and group average.
Q23. What should be the best choice of no. of clusters based on the following results:
A. 1
B. 2
C. 3
D. 4
Solution: (C)
The silhouette coefficient is a measure of how similar an object is to its own cluster compared to
other clusters. Number of clusters for which silhouette coefficient is highest represents the best
choice of the number of clusters.
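A small sketch of this selection rule, assuming scikit-learn's KMeans and silhouette_score on toy data:

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)  # toy data
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, silhouette_score(X, labels))  # the highest value suggests the best k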
Q24. Which of the following is/are valid iterative strategy for treating missing values before
clustering analysis?
Solution: (C)
All of the mentioned techniques are valid for treating missing values before clustering analysis
but only imputation with EM algorithm is iterative in its functioning.
Q25. The K-Means algorithm has some limitations. One of its limitations is that it makes hard assignments of points to clusters (a point either completely belongs to a cluster or does not belong at all).
Note: Soft assignment can be considered as the probability of being assigned to each cluster (say K = 3 and for some point xn, p1 = 0.7, p2 = 0.2, p3 = 0.1).
Which of the following algorithm(s) allows soft assignments?
1. Gaussian mixture models
2. Fuzzy K-means
Options:
A. 1 only
B. 2 only
C. 1 and 2
D. None of these
Solution: (C)
Both Gaussian mixture models and Fuzzy K-means allow soft assignments.
Q26. Assume, you want to cluster 7 observations into 3 clusters using K-Means clustering
algorithm. After the first iteration, clusters C1, C2, C3 have the following observations [table not reproduced here]:
What will be the cluster centroids if you want to proceed for the second iteration?
D. None of these
Solution: (A)
Q27. Assume, you want to cluster 7 observations into 3 clusters using K-Means clustering
algorithm. After the first iteration, clusters C1, C2, C3 have the following observations [table not reproduced here]:
What will be the Manhattan distance for observation (9, 9) from cluster centroid C1 in the second iteration?
A. 10
B. 5*sqrt(2)
C. 13*sqrt(2)
D. None of these
Solution: (A)
Manhattan distance between centroid C1 i.e. (4, 4) and (9, 9) = (9-4) + (9-4) = 10
Q28. If two variables V1 and V2 are used for clustering, which of the following are true for K-Means clustering with k = 3?
1. If V1 and V2 have a correlation of 1, the cluster centroids will be in a straight line
2. If V1 and V2 have a correlation of 0, the cluster centroids will be in a straight line
Options:
A. 1 only
B. 2 only
C. 1 and 2
Solution: (A)
If the correlation between the variables V1 and V2 is 1, then all the data points will be in a
straight line. Hence, all the three cluster centroids will form a straight line as well.
Q29. Feature scaling is an important step before applying the K-Means algorithm. What is the reason behind this?
A. In distance calculation it will give the same weights for all features
B. You always get the same clusters. If you use or don’t use feature scaling
D. None of these
Solution: (A)
Feature scaling ensures that all the features get the same weight in the clustering analysis. Consider a scenario of clustering people based on their weights (in kg), with range 55-110, and heights (in feet), with range 5.6 to 6.4. In this case, the clusters produced without scaling can be very misleading, as the range of weight is much higher than that of height. Therefore, it's necessary to bring them to the same scale so that they have equal weightage on the clustering result.
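A minimal sketch of such scaling, assuming scikit-learn's StandardScaler (the numbers are illustrative):

import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical weight (kg) and height (feet) columns on very different scales
people = np.array([[55.0, 5.6], [80.0, 6.0], [110.0, 6.4]])
scaled = StandardScaler().fit_transform(people)
print(scaled)  # both features now contribute comparably to distance calculations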
Q30. Which of the following methods is used for finding the optimal number of clusters in the K-Means algorithm?
A. Elbow method
B. Manhattan method
C. Euclidean method
E. None of these
Solution: (A)
Out of the given options, only elbow method is used for finding the optimal number of clusters.
The elbow method looks at the percentage of variance explained as a function of the number of
clusters: One should choose a number of clusters so that adding another cluster doesn’t give
much better modeling of the data.
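A minimal sketch of the elbow method, assuming scikit-learn's KMeans on toy data (inertia_ is the within-cluster SSE):

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=5, random_state=1)  # toy data
for k in range(1, 10):
    km = KMeans(n_clusters=k, n_init=10, random_state=1).fit(X)
    print(k, km.inertia_)  # look for the "elbow" where this curve flattens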
Q31. Which of the following statements about K-Means initialization is/are true?
1. K-Means is extremely sensitive to cluster center initialization
2. Bad initialization can lead to poor convergence speed
3. Bad initialization can lead to bad overall clustering
Options:
A. 1 and 3
B. 1 and 2
C. 2 and 3
D. 1, 2 and 3
Solution: (D)
All three of the given statements are true. K-means is extremely sensitive to cluster center
initialization. Also, bad initialization can lead to Poor convergence speed as well as bad overall
clustering.
Q32. Which of the following can be applied to get good results for K-means algorithm
corresponding to global minima?
Options:
A. 2 and 3
B. 1 and 3
C. 1 and 2
D. All of above
Solution: (D)
All of these are standard practices that are used in order to obtain good clustering results.
Q33. What should be the best choice for number of clusters based on the following results:
A. 5
B. 6
C. 14
D. Greater than 14
Solution: (B)
Based on the above results, the best choice of number of clusters using elbow method is 6.
Q34. What should be the best choice for number of clusters based on the following results:
A. 2
B. 4
C. 6
D. 8
Solution: (C)
Generally, a higher average silhouette coefficient indicates better clustering quality. In this plot,
the optimal clustering number of grid cells in the study area should be 2, at which the value of
the average silhouette coefficient is highest. However, the SSE of this clustering solution (k = 2)
is too large. At k = 6, the SSE is much lower. In addition, the value of the average silhouette
coefficient at k = 6 is also very high, which is just lower than k = 2. Thus, the best choice is k =
6.
Q35. Which of the following sequences is correct for a K-Means algorithm using Forgy
method of initialization?
1. Specify the number of clusters
2. Assign cluster centroids randomly
3. Assign each data point to the nearest cluster centroid
4. Re-assign each point to nearest cluster centroids
5. Re-compute cluster centroids
Options:
A. 1, 2, 3, 5, 4
B. 1, 3, 2, 4, 5
C. 2, 1, 3, 4, 5
D. None of these
Solution: (A)
The methods used for initialization in K means are Forgy and Random Partition. The Forgy
method randomly chooses k observations from the data set and uses these as the initial means.
The Random Partition method first randomly assigns a cluster to each observation and then
proceeds to the update step, thus computing the initial mean to be the centroid of the cluster’s
randomly assigned points.
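A minimal NumPy sketch of the two initialization schemes described above (function names are illustrative):

import numpy as np

def forgy_init(X, k, rng):
    # Forgy: pick k random observations as the initial means
    return X[rng.choice(len(X), size=k, replace=False)]

def random_partition_init(X, k, rng):
    # Random Partition: randomly assign every point to a cluster, then use
    # each cluster's centroid as the initial mean (assumes no cluster is empty)
    labels = rng.integers(0, k, size=len(X))
    return np.stack([X[labels == j].mean(axis=0) for j in range(k)])

# Example: centers = forgy_init(X, 3, np.random.default_rng(0))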
Q36. If you are using Multinomial mixture models with the expectation-maximization
algorithm for clustering a set of data points into two clusters, which of the assumptions are
important:
Solution: (C)
In the EM algorithm for clustering, it's essential to choose the same number of clusters to classify the data points into as the number of different distributions they are expected to be generated from, and the distributions must be of the same type.
Q37. Which of the following is/are not true about Centroid based K-Means clustering
algorithm and Distribution based expectation-maximization clustering algorithm:
Options:
A. 1 only
B. 5 only
C. 1 and 3
D. 6 and 7
E. 4, 6 and 7
Solution: (B)
All of the statements are true except the 5th; in fact, K-Means is a special case of the EM algorithm in which only the centroids of the cluster distributions are calculated at each iteration.
Q38. Which of the following is/are not true about DBSCAN clustering algorithm:
Options:
A. 1 only
B. 2 only
C. 4 only
D. 2 and 3
E. 1 and 5
F. 1, 3 and 5
Solution: (D)
DBSCAN can form a cluster of any arbitrary shape and does not have strong assumptions
for the distribution of data points in the dataspace.
DBSCAN has a low time complexity of order O(n log n) only.
Q39. Which of the following are the high and low bounds for the existence of F-Score?
A. [0,1]
B. (0,1)
C. [-1,1]
Solution: (A)
The lowest and highest possible values of the F score are 0 and 1, with 1 representing that every data point is assigned to the correct cluster and 0 representing that the precision and/or recall of the clustering analysis are both 0. In clustering analysis, a high value of F score is desired.
Q40. Following are the results observed for clustering 6000 data points into 3 clusters: A, B
and C:
What is the F1-Score with respect to cluster B?
A. 3
B. 4
C. 5
D. 6
Solution: (D)
[The clustering results table and the supporting F1 calculation are not reproduced here.]
29-Aug-18
Select the correct option which directly achieves multi-class classification (without support of binary classifiers):
K Nearest Neighbor
SVM
Neural networks
Decision trees
Classification where each data is mapped to more than one class is called
Multi class classification (X)
Multi label classification
Binary classification
The classification where each data point is mapped to more than one class is called Multi Label Classification.
Sentiment classification is a special task of text classification whose objective is to classify a text according to the sentimental polarities of opinions it contains (Pang et al., 2002), e.g., favorable or unfavorable, positive or negative. Scikit-learn is an open source machine learning library for the Python programming language.
Imagine you have just finished training a decision tree for spam classification and it is showing abnormally bad performance on both your training and test sets. Assume that your implementation has no bugs. What could be the reason for this problem?
Your decision trees are too shallow.
You need to increase the learning rate
You are overfitting.
All the options
19/09/2018
Select the correct statements about Nonlinear classification
kernel tricks are used by Nonlinear classifiers to achieve maximum-margin hyperplanes.
The concept of slack variables is used in SVM for Nonlinear classification
kernel trick is used in SVM for non-linear classification
Which of the given hyper parameter(s), when increased, may cause random forest to over fit the data?
Number of Trees
Learning Rate
Depth of Tree
Usually, if we increase the depth of the tree, it will cause overfitting. Learning rate is not a hyperparameter in random forest. Increasing the number of trees does not cause overfitting.
Which of the following is not a preprocessing method used for unstructured data classification?
confusion_matrix
stop word removal
lemmatization
stemming
Which NLP technique uses lexical knowledge base to obtain the correct base form
of the words?
IDF diminishes the weight of the most commonly occurring words and increases the
weightage of rare words.
SVM is a
weakly supervised learning algorithm.
supervised learning algorithm.
Semi-supervised learning algorithm.
unsupervised learning algorithm.
What is the advantage of the Naive Bayes classifier?
1. It will converge quicker than discriminative models like logistic regression AND it requires less
training data
2. Requires less training data
3. None of the options
4. It will converge quicker than discriminative models like logistic regression
Higher value of which of the following hyper-parameters is better for decision tree algorithm?
1. Cannot say
2. Number of samples used for split
3. Depth of tree
4. Samples for leaf
Which of the given hyper parameter(s), when increased may cause random forest to over fit the
data?
1. Number of Trees
2. Learning Rate
3. Depth of Tree
Choose the correct sequence for classifier building from the following:
Which numerical statistics is used to identify the importance of a rare word in a document?
1. TF
2. TF-IDF
3. None of the options
4. DF
Supervised learning differs from unsupervised learning in that supervised learning requires
1. Raw data
2. Labeled data
3. Unlabeled data
4. None of the options
Which NLP technique uses lexical knowledge base to obtain the correct base form of the words?
1. lemmatization
2. tokenization
3. object standardization
4. stop word removal
What is the output of the sentence “Good words bring good feelings to the heart” after
performing tokenization, lemmatization and stop word removal.
Classification where each data is mapped to more than one class is called
1. Binary classification
2. Multi Label Classification
3. Multi Class Classification
1. Structured Data
2. Unstructured Data
SVM is a
false
Which type of cross validation is used for imbalanced dataset? StratifiedShuffleSplit
TF (Term Frequency): predominantly used for calculating the term (word) frequency, i.e., the number of times a term occurs in a document/sentence.
The Term Document Matrix (TDM) is a matrix that contains the frequency of occurrence of
terms in a collection of documents.
In a Term Frequency Inverse Document Frequency (TFIDF) matrix, the term importance is
expressed by Inverse Document Frequency (IDF). IDF diminishes the weight of the most
commonly occurring words and increases the weightage of rare words.
1. Logistic regression
2. SVM
3. Linear regression
4. Decision tree
Unstructured data
true
Term Frequency-Inverse Document Frequency
Which of the following is not a pre-processing method used for unstructured data classification?
1. stemming
2. confusion matrix
3. lemmatization
4. stop word removal
Confusion Matrix
Classification Report
Decision Tree X
Accuracy score
Which of the following command is used to view the dataset SIZE and what is the value
returned?
a) Download the dataset from:
https://inclass.kaggle.com/c/si650winter11/download/training.txt and load it
to the variable 'sentiment_analysis_data'.
b) Give the column names as 'label' and 'message'
c) Try out the code snippets and answer the questions.
sentiment_analysis_data.shape,(7086, 3)
Imagine you have just finished training a decision tree for spam classification and it is showing abnormally bad performance on both your training and test sets. Assume that your implementation has no bugs. What could be the reason for this problem?
sklearn
What is the tokenized output of the sentence "if you cannot do great things, do small things in a
great way"
A technique used to depict the performance in a tabular form that has 2 dimensions namely
'actual' and 'predicted' sets of data.
Confusion Matrix
What is the output of the sentence "Good words bring good feelings to the heart" after
performing tokenization, lemmatization and stop word removal.
YES
Select the correct option which directly achieve multi-class classification (without support of
binary classifiers)
To view the first 3 rows of the dataset, which of the following commands are used?
a) Download the dataset from:
https://inclass.kaggle.com/c/si650winter11/download/training.txt and load it
to the variable 'sentiment_analysis_data'.
b) Give the column names as 'label' and 'message'
c) Try out the code snippets and answer the questions.
sentiment_analysis_data.head(3)
Stopword removal
Lemmatization
All the options
Tokenization
Stemming
True
TFIDF, short for term frequency–inverse document frequency, is a numerical statistic that is
intended to reflect how important a word is to a document in a collection or corpus.
True
1. SGDClassifier
2. StratifiedShuffleSplit
3. SVM
4. Random Forest
True
[yes no]
None of these
[1 0] ?
[true false]
What command should be given to tokenize a sentence into words?
a) Download the dataset from:
https://inclass.kaggle.com/c/si650winter11/download/training.txt and load it
to the variable 'sentiment_analysis_data'.
b) Give the column names as 'label' and 'message'
c) Try out the code snippets and answer the questions.
from nltk.tokenize import word_tokenize
word_tokens = word_tokenize(sentence)
Let's assume you are solving a classification problem with a highly imbalanced class. The majority class is observed 99% of the time in the training data. Which of the following is true when your model has 99% accuracy after taking the predictions on test data?
1. For imbalanced class problems, precision and recall metrics aren’t good.
2. For imbalanced class problems, accuracy metric is a good idea.
3. For imbalanced class problems, accuracy metric is not a good idea.
Which of the following command is used to view the dataset SIZE and what is the
value returned?
sentiment_analysis_data.shape,(7086, 3)
sentiment_analysis_data.size,(7086, 3)
sentiment_analysis_data.size(),(7086, 2)
sentiment_analysis_data.shape(),(7086, 2)
What is the tokenized output of the sentence "Only do what your heart tells you"?
'Only', 'heart', 'tells'
'Only', 'do', 'what', 'your', 'heart', 'tell', 'you' (X)
'Only', 'do', 'what', 'heart', 'tells'
'Only', 'do', 'what', 'your', 'heart', 'tells', 'you'
Choose the correct sequence from the following:
Data Analysis -> PreProcessing -> Model Building -> Predict
PreProcessing -> Predict -> Train XX
PreProcessing -> Model Building -> Predict XX
Data Analysis -> PreProcessing -> Predict -> Train
Which of the given hyper parameter(s), when increased may cause random forest to over fit the
data?
1. Number of Trees
2. Learning Rate
3. Depth of Tree