
Unstructured Data Classification

Why this course?


This course gives you practical experience in solving unstructured text classification problems. If you're wondering why you need unstructured text:

"80% of business-relevant information originates in unstructured form, primarily text," says Seth Grimes, a leading analytics strategy consultant.

What Would You Need to Follow Along?


 Have a basic understanding of machine learning concepts.

 Try out the code snippets given for the case study.

 Refer to the links to gain an in-depth understanding of other machine learning techniques.

"Programming is usually taught by examples" -Niklaus Wirth

Introduction

Unstructured data, as the name suggests, does not have a pre-defined format; it is typically text-heavy, but may also contain data such as dates, numbers or facts.
This results in irregularities and ambiguities which make it difficult to understand using traditional programs, compared to data stored in fielded form in databases or annotated (semantically tagged) in documents.

 Source : Wikipedia.

A few examples of unstructured data are:

 Emails

 Word Processing Files

 PDF files

 Spreadsheets

 Digital Images

 Video

 Audio

 Social Media Posts etc.

Identify the unstructured data from the following:


Excel Data | Image | Data from mySQL DB

Problem Description

Let us understand unstructured data classification through the following case study:

SMS Spam Detection:

In our day-to-day lives, we receive a large number of spam/junk messages, either as text (SMS) or e-mail. It is important to filter these spam messages since they are neither truthful nor trustworthy.

In this case study, we apply various machine learning algorithms to categorize the
messages depending on whether they are spam or not.

Your Playground
You can try the hands-on exercises using Katacoda or by setting up a coding environment on your local machine.

For Katacoda Users:

 Open the link : https://www.katacoda.com/courses/python/playground

 Type the terminal commands in the pane below.

 You can use the Python editor (by default you have app.py file) for trying out the
code snippets given in this course.

 You can execute the Python code by clicking the Run command from the left
pane.

Your Playground...
Note: In case you don't find any of the required packages while playing around with the case study, you can do the following:

 pip install nltk --target=./ (here, e.g., nltk is the package you need to download).

 For NLTK, you have a few other dependent packages. You can perform the following steps to download them:

o Open the Python terminal in the command prompt (type python).

o Type import nltk

o Type nltk.download()

o Type d for download

o Type all to download all dependent packages of NLTK.

Setup Your Local Machine


To run the code locally:

 Install Python 2.7+ on your machine.

 Install the required packages - pandas, scikit-learn and NumPy (use pip install).


 Use any IDE (PyCharm, Spyder etc.) for trying out the code snippets.

Note: You can find brief descriptions of the Python packages here.

Dataset Download

The dataset is available at the SMS Spam dataset link.

Open the terminal and type the following command to download.

curl https://www.researchgate.net/profile/Tiago_Almeida4/publication/258050002_SMS_Spam_Collection_v1/data/00b7d526d127ded162000000/SMSSpamCollection.txt > dataset.csv

This command downloads the data and saves it as dataset.csv.

Dataset Description
The dataset contains a collection of SMS messages, each labeled as spam or ham.

The following is a description of our dataset:

 No. of Classes: 2 (Spam / Ham)

 No. of attributes (Columns): 2

 No. of instances (Rows) : 5574

Data Loading

To start with data loading, import the required python package and load the downloaded
CSV file.

The data can be stored as a DataFrame for easy data manipulation/analysis. pandas is one of the most widely used libraries for this.

import pandas as pd
import csv

# Data loading
messages = [line.rstrip() for line in open('dataset.csv')]
print(len(messages))

# Appending column headers
messages = pd.read_csv('dataset.csv', sep='\t',
                       quoting=csv.QUOTE_NONE, names=["label", "message"])

As you can see, our dataset has 2 columns without any headers.

This code snippet reads the data using pandas and labels the columns as label and message.

Data Analysis
Analyzing data is a must in any classification problem. The goal of data analysis is to
derive useful information from the given data for making decisions.

In this section, we will analyze the dataset in terms of size and column names, and view the data summary and a sample of the data.

You can see the dataset size using :

data_size=messages.shape

print(data_size)

Column names can be viewed by :

messages_col_names=list(messages.columns)

print(messages_col_names)

To understand aggregate statistics easily, use the following command :

print(messages.groupby('label').describe())
To see a sample data, use the following command :

print(messages.head(3))

Target Identification

Target is the class/category to which you will assign the data.

 In this case, you aim to identify whether the message is spam or not.

 By observing the columns, you can see that the label column has the values spam or ham. We can call this case study a binary classification, since it has only two possible outcomes.

#Identifying the outcome/target variable.

message_target=messages['label']

print(message_target)

What kind of classification is our case study 'Spam Detection'?


Binary | Multi-class | Multi-label

Tokenization

Tokenization is a method to split a sentence/string into substrings. These substrings are called tokens.

In Natural Language Processing (NLP), tokenization is the initial step. Splitting a sentence into tokens helps to remove unwanted information in the raw text, such as white spaces, line breaks and so on.

import nltk
from nltk.tokenize import word_tokenize

def split_tokens(message):
    message = message.lower()
    message = unicode(message, 'utf8')  # convert bytes into proper unicode (Python 2)
    word_tokens = word_tokenize(message)
    return word_tokens

messages['tokenized_message'] = messages.apply(lambda row:
    split_tokens(row['message']), axis=1)

Lemmatization

 Lemmatization is a method to convert a word into its base/root form.

 Lemmatizer removes affixes of the words present in its dictionary.
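
The lemmatization snippet itself is not listed on this card, but the stop word removal step below consumes a lemmatized_message column. The following is a minimal sketch, assuming NLTK's WordNetLemmatizer and the tokenized_message column created earlier (the helper name split_into_lemmas is our own):

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

def split_into_lemmas(message):
    # Lemmatize each token; words not found in the WordNet dictionary
    # pass through unchanged.
    return [lemmatizer.lemmatize(word) for word in message]

messages['lemmatized_message'] = messages.apply(lambda row:
    split_into_lemmas(row['tokenized_message']), axis=1)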

Stop Word Removal


Stop words are common words that do not add any relevance for classification (for e.g., "the", "a", "an", "in", etc.). Hence, it is essential to remove these words.

from nltk.corpus import stopwords

def stopword_removal(message):
    stop_words = set(stopwords.words('english'))
    # Keep only the words that are not stop words and join them back into a sentence.
    filtered_sentence = ' '.join([word for word in message if word not in stop_words])
    return filtered_sentence

messages['preprocessed_message'] = messages.apply(lambda row:
    stopword_removal(row['lemmatized_message']), axis=1)

Training_data=pd.Series(list(messages['preprocessed_message']))

Training_label=pd.Series(list(messages['label']))

Why is Feature Extraction Important?


To perform machine learning on text documents, you first need to turn the text content into numerical feature vectors.

In Python, you have a few packages for this defined under sklearn.

We will be looking into a few specific ones used for unstructured data.

Bag of Words (BOW)

 Bag of Words (BOW) is one of the most widely used methods for generating features in Natural Language Processing.

 Representing/transforming a text as a bag of words helps to identify various measures to characterize the text.

 It is predominantly used for calculating the term (word) frequency, i.e., the number of times a term occurs in a document/sentence.

 These frequencies can be used as features for training the classifier.

Term Document Matrix

 The Term Document Matrix (TDM) is a matrix that contains the frequency of
occurrence of terms in a collection of documents.

 In a TDM, the rows represent documents and columns represent the terms.

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# 1.0 keeps the ratio a float under Python 2 (1/len would be integer division).
tf_vectorizer = CountVectorizer(ngram_range=(1, 2),
                                min_df=(1.0 / len(Training_label)),
                                max_df=0.7)
Total_Dictionary_TDM = tf_vectorizer.fit(Training_data)

message_data_TDM = Total_Dictionary_TDM.transform(Training_data)

Term Frequency Inverse Document Frequency (TFIDF)

 In a Term Frequency Inverse Document Frequency (TFIDF) matrix, the term importance is expressed by Inverse Document Frequency (IDF).

 IDF diminishes the weight of the most commonly occurring words and increases the weightage of rare words.

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer(ngram_range=(1, 2),
                                   min_df=(1.0 / len(Training_label)),
                                   max_df=0.7)
Total_Dictionary_TFIDF = tfidf_vectorizer.fit(Training_data)

message_data_TFIDF = Total_Dictionary_TFIDF.transform(Training_data)

Let's take the TDM matrix for further evaluation. You can also try out the same using the TFIDF matrix.

Which preprocessing technique is used to remove the most commonly used words?
Tokenization | Lemmatization | Stopword removal

Classification Algorithms
There are various algorithms to solve the classification problems. The code to try out a
few of these algorithms will be presented in the upcoming cards.

We will discuss the following :

 Decision Tree Classifier

 Stochastic Gradient Descent Classifier

 Support Vector Machine Classifier

 Random Forest Classifier

Note: The explanations for these algorithms are given in the Machine Learning Axioms course. Refer to the course for further details.

How Does a Classifier Work?

The following are the steps involved in building a classification model:

1. Initialize the classifier to be used.

2. Train the classifier - All classifiers in scikit-learn use a fit(X, y) method to fit the model (training) for the given train data X and train label y.

3. Predict the target - Given an unlabeled observation X, predict(X) returns the predicted label y.

4. Evaluate the classifier model - score(X, y) returns the score for the given test data X and test label y.
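
As a minimal sketch of these four steps (assuming the train_data/train_label/test_data/test_label variables produced by the split described in the next card, with a decision tree standing in for any classifier):

from sklearn.tree import DecisionTreeClassifier

classifier = DecisionTreeClassifier()                      # 1. Initialize
classifier = classifier.fit(train_data, train_label)       # 2. Train
message_predicted_target = classifier.predict(test_data)   # 3. Predict
print(classifier.score(test_data, test_label))             # 4. Evaluate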

Train and Test Data


The data must be partitioned into train and test sets for building the classifier model. This split will be used to explain the classification algorithms; a minimal sketch is given below.
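
A minimal sketch using scikit-learn's train_test_split, assuming the TDM features and labels built earlier (in older scikit-learn versions the function lives in sklearn.cross_validation instead of sklearn.model_selection):

from sklearn.model_selection import train_test_split

# Hold out 10% of the TDM features and their labels for testing.
train_data, test_data, train_label, test_label = train_test_split(
    message_data_TDM, Training_label, test_size=0.1, random_state=7)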

Decision Tree Classification

 It is one of the commonly used classification techniques, capable of performing binary as well as multi-class classification.

 The decision tree model predicts the class/target by learning simple decision rules from the features of the data.
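
A minimal initialization sketch (the parameter values shown are scikit-learn defaults, included only for illustration):

from sklearn.tree import DecisionTreeClassifier

# 'gini' impurity is the default splitting criterion; limiting max_depth
# is a common way to control overfitting.
dt_classifier = DecisionTreeClassifier(criterion='gini', max_depth=None)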

Stochastic Gradient Descent Classifier

 It is used for large-scale learning.

 It supports different loss functions and penalties for classification.
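
A minimal initialization sketch (the loss and penalty shown match the settings used in the cross-validation card later):

from sklearn.linear_model import SGDClassifier

# 'modified_huber' is a smooth loss that tolerates outliers;
# 'hinge' would instead give a linear SVM trained by SGD.
sgd_classifier = SGDClassifier(loss='modified_huber', penalty='l2', shuffle=True)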

Support Vector Machine

 Support Vector Machine (SVM) is effective in high-dimensional spaces.

 It is effective in cases where the number of dimensions is greater than the number of samples.

 It works well when there is a clear margin of separation.
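
A minimal initialization sketch (the kernel and C value match the settings used in the cross-validation card later):

from sklearn.svm import SVC

# A linear kernel with a small C gives a soft, wide margin.
svm_classifier = SVC(kernel='linear', C=0.025)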


Random Forest Classifier

 It controls overfitting.

 A random forest fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve the predictive accuracy.
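
A minimal initialization sketch (the parameter values match the settings used in the cross-validation card later):

from sklearn.ensemble import RandomForestClassifier

# 10 shallow trees; averaging their votes reduces the variance of a single tree.
rf_classifier = RandomForestClassifier(n_estimators=10, max_depth=5)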

Model Tuning
The classification algorithms in machine learning are parameterized. Modifying any of those parameters can influence the results, so algorithm/model tuning is essential to find the best model.

For example, let's take the Random Forest Classifier and change the values of a few parameters (n_estimators, max_features), as sketched below.
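
A minimal tuning sketch, assuming the train/test variables from the earlier split (the specific parameter grid is an illustrative assumption):

from sklearn.ensemble import RandomForestClassifier

# Try a few combinations of n_estimators and max_features and compare scores.
for n_estimators in (10, 50, 100):
    for max_features in ('sqrt', 'log2'):
        clf = RandomForestClassifier(n_estimators=n_estimators,
                                     max_features=max_features)
        clf.fit(train_data, train_label)
        print(n_estimators, max_features, clf.score(test_data, test_label))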

Partitioning the Data


It is a methodological mistake to test and train on the same dataset: a classifier that merely memorizes its training samples would score perfectly on them, yet fail to predict correctly for any unseen data. This situation is called overfitting.

To avoid this problem,

 Split the data to train set, validation set and test set.

o Training Set: The data used to train the classifier.

o Validation Set: The data used to tune the classifier model parameters i.e.,
to understand how well the model has been trained (a part of training
data).

o Testing Set: The data used to evaluate the performance of the classifier
(unseen data by the classifier).

 This will help you know the efficiency of your model.

Cross Validation
 Cross validation is a model validation technique to evaluate the performance of
a model on unseen data (validation set).

 It is a better estimate to evaluate testing accuracy than training accuracy on


unseen data.

Points to remember:

 Cross validation gives high variance if the testing set and training set are not drawn from the same population.

 Allowing training data to be included in testing data will not give actual
performance results.

In cross validation, the number of samples used for training the model is reduced and
the results depend on the choice of the pair of training and testing sets.

You can refer to the various CV approaches here.

Stratified Shuffle Split


StratifiedShuffleSplit shuffles the data and splits it into train and test sets while preserving the percentage of samples of each class in every split.

StratifiedShuffleSplit suits our case study, as the dataset has a class imbalance, which can be seen from the following code snippet:

seed = 7

from sklearn.cross_validation import StratifiedShuffleSplit
# Note: in newer scikit-learn versions, StratifiedShuffleSplit lives in
# sklearn.model_selection and has a slightly different API.

# Creating a cross validation object with 10% test size
cross_val = StratifiedShuffleSplit(Training_label, 1, test_size=0.1, random_state=seed)

test_size=0.1 denotes that 10% of the dataset is used for testing.

Stratified Shuffle Split Contd...


This selection is then used to split the data into test and train sets.

from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC
from sklearn import svm

classifiers = [
    DecisionTreeClassifier(),
    SGDClassifier(loss='modified_huber', shuffle=True),
    SVC(kernel="linear", C=0.025),
    KNeighborsClassifier(),
    OneVsRestClassifier(svm.LinearSVC()),
    RandomForestClassifier(max_depth=5, n_estimators=10, max_features=10),
]

for clf in classifiers:
    score = 0
    for train_index, test_index in cross_val:
        X_train, X_test = message_data_TDM[train_index], message_data_TDM[test_index]
        y_train, y_test = Training_label[train_index], Training_label[test_index]
        clf.fit(X_train, y_train)
        score = score + clf.score(X_test, y_test)
    print(score)

The above code cross-validates a set of candidate classifiers. It helps to select the best classifier based on the cross validation scores. The classifier with the highest score can be used for building the classification model.

Note: You may add or remove classifiers based on the requirement.


Cross-validation technique is used to evaluate a classifier by dividing the data
set into training set to train the classifier and testing set to test the same.
TRUE or FALSE

Classification Accuracy

 The classification accuracy is defined as the percentage of correct predictions.

classifier = classifier.fit(train_data, train_label)
message_predicted_target = classifier.predict(test_data)
score = classifier.score(test_data, test_label)

from sklearn.metrics import accuracy_score
print('Accuracy Score', accuracy_score(test_label, message_predicted_target))

test_label.value_counts()

 This simple classification accuracy will not tell us the types of errors our classifier makes.

 It is the easiest metric to compute, but it does not give us the latent distribution of response values.

Confusion Matrix
It is a technique to evaluate the performance of a classifier.

 It depicts the performance in a tabular form that has 2 dimensions, namely the “actual” and “predicted” sets of data.

 The rows and columns of the table show the count of false positives, false negatives, true positives and true negatives.

from sklearn.metrics import confusion_matrix

print('Confusion Matrix', confusion_matrix(test_label, message_predicted_target))

The first parameter shows the true values and the second parameter shows the predicted values.

Confusion Matrix Contd...

This image is a confusion matrix for a two-class classifier.

In the table,

 TP (True Positive) - The number of correct predictions that the occurrence is positive.

 FP (False Positive) - The number of incorrect predictions that the occurrence is positive.

 FN (False Negative) - The number of incorrect predictions that the occurrence is negative.

 TN (True Negative) - The number of correct predictions that the occurrence is negative.

 TOTAL - The total number of occurrences.

Plotting Confusion Matrix


To evaluate the quality of the output, it is always better to plot and analyze the results.

For our case study, we have plotted the confusion matrix of the Decision Tree Classifier, shown in the image above.

The function for plotting the confusion matrix is given here.
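
The helper itself is not reproduced in this text; a minimal sketch of what such a plot could look like with matplotlib is:

import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(test_label, message_predicted_target)
plt.imshow(cm, interpolation='nearest', cmap=plt.cm.Blues)  # cells shaded by count
plt.title('Confusion Matrix')
plt.colorbar()
plt.xlabel('Predicted label')
plt.ylabel('True label')
plt.show()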

Classification Report
The classification_report function shows a text report with the commonly used classification metrics.

from sklearn.metrics import classification_report

# target_names must follow the sorted order of the labels ('ham' sorts before 'spam').
target_names = ['ham', 'spam']

print(classification_report(test_label, message_predicted_target,
                            target_names=target_names))

Precision

 When a positive value is predicted, how often is the prediction correct?

Recall

 It is the true positive rate.

 When the actual value is positive, how often is the prediction correct?
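
Precision and recall can also be computed directly; a minimal sketch, assuming the test labels and predictions from earlier (pos_label picks which class counts as "positive"):

from sklearn.metrics import precision_score, recall_score

print('Precision', precision_score(test_label, message_predicted_target, pos_label='spam'))
print('Recall', recall_score(test_label, message_predicted_target, pos_label='spam'))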

To know more about model evaluation, check this link.

Other Libraries
For demonstration purposes, we have used Python with NLTK. There are many more libraries specific to Java, Ruby, etc.

You can find the reference link here:

NLP Libraries

True Negative is when the predicted instance and the actual is positive.
TRUE OR FALSE
True Positive is when the predicted instance and the actual instance is not
negative. TRUE OR FALSE

Unstructured Data Classification - Course Summary

In this course, we discussed the following:

 Identifying unstructured data.

 Selecting the ideal features for processing.

 Various pre-processing steps for text classification, with practical exercises.

 A few of the classification algorithms.

 Classifier performance evaluation.

Q&A

Cross-validation causes over-fitting. TRUE OR FALSE


In document classification, each document has to be converted from full text to a document vector. TRUE / FALSE

a) Download the dataset from:


https://inclass.kaggle.com/c/si650winter11/download/training.txt and load it to the
variable 'sentiment_analysis_data'.
b) Give the column names as 'label' and 'message'
c) Try out the code snippets and answer the questions.

What is the output of the following command:
print(sentiment_analysis_data['label'].unique())

[yes no]
None of these
[true false]
[1 0]

A classifier that can compute using numeric as well as categorical values is


Naive Bayes Classifier Decision Tree Classifier
SVM Classifier Random Forest Classifier

Stemming and lemmatization gives the same result. True or false

a) Download the dataset from:


https://inclass.kaggle.com/c/si650winter11/download/training.txt and load it to the
variable 'sentiment_analysis_data'.
b) Give the column names as 'label' and 'message'
c) Try out the code snippets and answer the questions.

To view the first 3 rows of the dataset, which of the following commands are used?
sentiment_analysis_data.get(3)
sentiment_analysis_data.select(3)
sentiment_analysis_data.top(3)
sentiment_analysis_data.head(3)

In Supervised learning, class labels of the training samples are


Partially known Known Unknown Does not matter
An optimal scenario will allow for the algorithm to correctly determine the class labels for unseen
instances. This requires the learning algorithm to generalize from the training data to unseen
situations in a "reasonable" way (see inductive bias).
The parallel task in human and animal psychology is often referred to as concept learning.

a) Download the dataset from:


https://inclass.kaggle.com/c/si650winter11/download/training.txt and load it to the
variable 'sentiment_analysis_data'.
b) Give the column names as 'label' and 'message'
c) Try out the code snippets and answer the questions.

High classification accuracy always indicates a good classifier. False or True


In machine learning, multiclass or multinomial classification is the problem
of classifying instances into one of three or more classes. (Classifying instances into one of the two
classes is called binary classification.)
While some classification algorithms naturally permit the use of more than two classes, others are by
nature binary algorithms; these can, however, be turned into multinomial classifiers by a variety of
strategies.
Multiclass classification should not be confused with multi-label classification, where multiple labels
are to be predicted for each instance.

General strategies

The existing multi-class classification techniques can be categorized into (i) transformation to binary, (ii) extension from binary and (iii) hierarchical classification.[1]

Transformation to binary

This section discusses strategies for reducing the problem of multiclass classification to multiple binary classification problems. They can be categorized into one-vs.-rest and one-vs.-one. The techniques developed based on reducing the multi-class problem into multiple binary problems can also be called problem transformation techniques.

One-vs.-rest

The one-vs.-rest (or one-vs.-all, OvA or OvR, one-against-all, OAA) strategy involves training a single classifier per class, with the samples of that class as positive samples and all other samples as negatives.[2] This strategy requires the base classifiers to produce a real-valued confidence score for their decision, rather than just a class label; discrete class labels alone can lead to ambiguities, where multiple classes are predicted for a single sample.[3]

In pseudocode, the training algorithm for an OvA learner constructed from a binary classification learner L is as follows:
Inputs:

 L, a learner (training algorithm for binary classifiers)

 samples X

 labels y, where yi ∈ {1, …, K} is the label for the sample Xi

Output:

 a list of classifiers fk for k ∈ {1, …, K}

Procedure:

 For each k in {1, …, K}:

o Construct a new label vector z where zi = 1 if yi = k and zi = 0 otherwise

o Apply L to X, z to obtain fk

Making decisions means applying all classifiers to an unseen sample x and predicting the label k for which the corresponding classifier reports the highest confidence score:

ŷ = argmax over k ∈ {1, …, K} of fk(x)

Although this strategy is popular, it is a heuristic that suffers from several problems. Firstly, the scale of the confidence values may differ between the binary classifiers. Second, even if the class distribution is balanced in the training set, the binary classification learners see unbalanced distributions, because typically the set of negatives they see is much larger than the set of positives.[3]
One-vs.-one

In the one-vs.-one (OvO) reduction, one trains K(K − 1)/2 binary classifiers for a K-way multiclass problem; each receives the samples of a pair of classes from the original training set, and must learn to distinguish these two classes. At prediction time, a voting scheme is applied: all K(K − 1)/2 classifiers are applied to an unseen sample, and the class that got the highest number of "+1" predictions gets predicted by the combined classifier.[3]

Like OvR, OvO suffers from ambiguities in that some regions of its input space may receive the same number of votes.[3]
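
Both reductions are available in scikit-learn. A minimal sketch (X and y stand for any multiclass feature matrix and label vector; they are placeholders, not variables from the case study):

from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier
from sklearn.svm import LinearSVC

ovr = OneVsRestClassifier(LinearSVC()).fit(X, y)   # trains K classifiers
ovo = OneVsOneClassifier(LinearSVC()).fit(X, y)    # trains K(K-1)/2 classifiers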

Extension from binary

This section discusses strategies for extending existing binary classifiers to solve multi-class classification problems. Several algorithms have been developed based on neural networks, decision trees, k-nearest neighbors, naive Bayes, support vector machines and extreme learning machines to address multi-class classification problems. These types of techniques can also be called algorithm adaptation techniques.
Neural networks
Multilayer perceptrons provide a natural extension to the multi-class problem.
Instead of just having one neuron in the output layer, with binary output, one could
have N binary neurons leading to multi-class classification. In practice, the last layer
of a neural network is usually a softmax function layer, which is the algebraic
simplification of N logistic classifiers, normalized per class by the sum of the N-1
other logistic classifiers.
Extreme learning machines

Extreme Learning Machines (ELM) is a special case of single hidden layer feed-forward neural networks (SLFNs), wherein the input weights and the hidden node biases can be chosen at random. Many variants and developments have been made to the ELM for multiclass classification.

k-nearest neighbours

k-nearest neighbors (kNN) is considered among the oldest non-parametric classification algorithms. To classify an unknown example, the distance from that example to every other training example is measured. The k smallest distances are identified, and the class most represented by these k nearest neighbours is considered the output class label.
Naive Bayes

Naive Bayes is a successful classifier based upon the principle of maximum a posteriori (MAP). This approach is naturally extensible to the case of having more than two classes, and was shown to perform well in spite of the underlying simplifying assumption of conditional independence.

Decision trees

Decision trees are a powerful classification technique. The tree tries to infer a split of the training data based on the values of the available features to produce a good generalization. The algorithm can naturally handle binary or multiclass classification problems. The leaf nodes can refer to any of the K classes concerned.

Support vector machines

Support vector machines are based upon the idea of maximizing the margin, i.e. maximizing the minimum distance from the separating hyperplane to the nearest example. The basic SVM supports only binary classification, but extensions have been proposed to handle the multiclass classification case as well. In these extensions, additional parameters and constraints are added to the optimization problem to handle the separation of the different classes.

Hierarchical classification

Hierarchical classification tackles the multi-class classification problem by dividing the output space into a tree. Each parent node is divided into multiple child nodes, and the process is continued until each child node represents only one class. Several methods have been proposed based on hierarchical classification.

Learning paradigms

Based on learning paradigms, the existing multi-class classification techniques can be classified into batch learning and online learning. Batch learning algorithms require all the data samples to be available beforehand; they train the model using the entire training data and then predict the test sample using the found relationship. Online learning algorithms, on the other hand, incrementally build their models in sequential iterations. In iteration t, an online algorithm receives a sample xt and predicts its label ŷt using the current model; the algorithm then receives yt, the true label of xt, and updates its model based on the sample-label pair (xt, yt). Recently, a new learning paradigm called progressive learning has been developed.[4] The progressive learning technique is capable of not only learning from new samples but also of learning new classes of data, while retaining the knowledge learnt thus far.

Which of the following is not a performance evaluation measure?


Accuracy score (X) DecisionTree Confusion matrix Classification report
26/07/2018 (1)

In a Term Document Matrix (TDM) each row represents ______?


TF-IDF value TF value document word

The Term Document Matrix (TDM) is a matrix that contains the frequency of occurrence of
terms in a collection of documents. In a TDM, the rows represent documents and columns
represent the terms.

email spam data is an example of


Unstructured Data Structured Data

High classification accuracy always indicates a good classifier. TRUE FALSE (X)


It is false because accuracy may be high while the errors the classifier still makes (for example, on the minority class) are unacceptable.

a) Download the dataset from: https://inclass.kaggle.com/c/si650winter11/download/training.txt

and load it to the variable 'sentiment_analysis_data'.

b) Give the column names as 'label' and 'message'

c) Try out the code snippets and answer the questions.

To view the first 3 rows of the dataset, which of the following commands are used?
sentiment_analysis_data.get(3) (X)
sentiment_analysis_data.select(3)
sentiment_analysis_data.head(3)
sentiment_analysis_data.top(3)

Which one of the following is not a classification technique?


SGDClassifier SVM StratifiedShuffleSplit Random Forest

Classification Algorithms

 Decision Tree Classifier

 Stochastic Gradient Descent Classifier


 Support Vector Machine Classifier

 Random Forest Classifier

A technique used to depict the performance in a tabular form that has 2 dimensions
namely “actual” and “predicted” sets of data.

Confusion Matrix Cross Validation Classification Report Classification Accuracy

Confusion Matrix is a technique to evaluate the performance of a classifier. It


depicts the performance in a tabular form that has 2 dimensions namely “actual”
and “predicted” sets of data.The rows and columns of the table show the count of
false positives, false negatives, true positives and true negatives.

a) Download the dataset from: https://inclass.kaggle.com/c/si650winter11/download/training.txt

and load it to the variable 'sentiment_analysis_data'.

b) Give the column names as 'label' and 'message'

c) Try out the code snippets and answer the questions.

(Maybe, as multiclass classification is for more than two classes.)

Higher value of which of the following hyperparameters is better for the decision tree algorithm?
Number of samples used for split Depth of tree
Cannot say Samples for leaf
Usually, if we increase the depth of the tree, it will cause overfitting. Learning rate is not a hyperparameter in random forest. An increase in the number of trees will cause underfitting.

27/07/18 (2)

Pruning is a technique associated with


Decision tree Logistic regression SVM Linear regression

In document classification, each document has to be converted from full text to a document vector. TRUE / FALSE

a) Download the dataset from: https://inclass.kaggle.com/c/si650winter11/download/training.txt

and load it to the variable 'sentiment_analysis_data'.

b) Give the column names as 'label' and 'message'

c) Try out the code snippets and answer the questions.

Supervised learning differs from unsupervised learning in that supervised learning requires

None of the options raw data Labeled data Unlabeled data

The key difference between supervised and unsupervised learning is that supervised learning tries to predict the labels, P(Y|X), while unsupervised learning tries to model the distribution of X, P(X). And you may ask: what is P(X), and what is a good P(X)?

a) Download the dataset from: https://inclass.kaggle.com/c/si650winter11/download/training.txt

and load it to the variable 'sentiment_analysis_data'.

b) Give the column names as 'label' and 'message'

c) Try out the code snippets and answer the questions.

What is the output of the following command:


print(sentiment_analysis_data['label'].unique())
[yes no] [true false] None of these [1 0]

a) Download the dataset from: https://inclass.kaggle.com/c/si650winter11/download/training.txt

and load it to the variable 'sentiment_analysis_data'.

b) Give the column names as 'label' and 'message'

c) Try out the code snippets and answer the questions.


Select pre-processing techniques from the options
All the options Lemmatization Stemming Stopword removal Tokenization

What is the output of the sentence “Good words bring good feelings to the heart”
after performing tokenization, lemmatization and stop word removal.
'Good words bring good feelings heart'
['Good', 'words', 'bring', 'good', 'feelings', 'to', 'the', 'heart']
['Good', 'word', 'bring', 'good', 'feeling', 'to', 'the', 'heart']
'Good word bring good feeling heart'

Which of the following is not a performance evaluation measure?


Confusion matrix Accuracy score DecisionTree Classification report

26/07/18 (3)

Cross-validation causes over-fitting. TRUE FALSE


In document classification, each document has to be converted from full text to a document vector. TRUE FALSE
High classification accuracy always indicates a good classifier. TRUE FALSE

Pruning is a technique associated with


Decision tree Logistic regression SVM Linear regression

a) Download the dataset from: https://inclass.kaggle.com/c/si650winter11/download/training.txt

and load it to the variable 'sentiment_analysis_data'.

b) Give the column names as 'label' and 'message'

c) Try out the code snippets and answer the questions.

Supervised learning differs from unsupervised learning in that supervised learning


requires
Unlabeled data None of the options Labeled data raw data
a) Download the dataset from: https://inclass.kaggle.com/c/si650winter11/download/training.txt

and load it to the variable 'sentiment_analysis_data'.

b) Give the column names as 'label' and 'message'

c) Try out the code snippets and answer the questions.

To view the first 3 rows of the dataset, which of the following commands are used?
sentiment_analysis_data.get(3)
sentiment_analysis_data.head(3)
sentiment_analysis_data.top(3)
sentiment_analysis_data.select(3)


SVM is a
weakly supervised learning algorithm.
Semi-supervised learning algorithm.
supervised learning algorithm.
unsupervised learning algorithm.

TF-IDF is a feature extraction technique

False True

Lemmatization offers better precision than stemming


True False

Choose the correct sequence for classifier building from the following:
None of the options
Train -> Test -> Initialize -> Predict
Initialize -> Evaluate -> Train -> Predict
Initialize -> Train -> Predict -> Evaluate

27/07/18 (1)

The data you have is called 'mixed data' because it has both numerical and categorical values. And since you have class labels, it is a classification problem. One option is to go with decision trees, which you already tried. Other possibilities are naive Bayes, where you model numeric attributes by a Gaussian distribution or similar. You can also employ a minimum distance or KNN based approach; however, the cost function must be able to handle data of both types together. If these approaches don't work, then try ensemble techniques. Try bagging with decision trees, or else Random Forest, which combines bagging and random subspace. With mixed data, choices are limited and you need to be cautious and creative with your choices.

a) Download the dataset from:

https://inclass.kaggle.com/c/si650winter11/download/training.txt and load it to the variable

'sentiment_analysis_data'.

b) Give the column names as 'label' and 'message'

c) Try out the code snippets and answer the questions.

To view the first 3 rows of the dataset, which of the following commands are used?
sentiment_analysis_data.head(3) sentiment_analysis_data.select(3)
sentiment_analysis_data.get(3) sentiment_analysis_data.top(3)

In document classification, each document has to be converted from full text to a document vector. False True

a) Download the dataset from: https://inclass.kaggle.com/c/si650winter11/download/training.txt

and load it to the variable 'sentiment_analysis_data'.

b) Give the column names as 'label' and 'message'

c) Try out the code snippets and answer the questions.

Is there a class imbalance problem in the given data set? Yes No

Inverse Document frequency is used in term document matrix. False True


a) Download the dataset from: https://inclass.kaggle.com/c/si650winter11/download/training.txt

and load it to the variable 'sentiment_analysis_data'.

b) Give the column names as 'label' and 'message'

c) Try out the code snippets and answer the questions.

Which of the following command is used to view the dataset SIZE and what is the
value returned?
sentiment_analysis_data.shape,(7086, 3)
sentiment_analysis_data.shape(),(7086, 2)
sentiment_analysis_data.size(),(7086, 2)
sentiment_analysis_data.size,(7086, 3)

a) Download the dataset from: https://inclass.kaggle.com/c/si650winter11/download/training.txt

and load it to the variable 'sentiment_analysis_data'.

b) Give the column names as 'label' and 'message'

c) Try out the code snippets and answer the questions.

Is there a class imbalance problem in the given data set? Yes No

A technique used to depict the performance in a tabular form that has 2 dimensions
namely “actual” and “predicted” sets of data.
Classification Report Classification Accuracy
Confusion Matrix Cross Validation

a) Download the dataset from: https://inclass.kaggle.com/c/si650winter11/download/training.txt

and load it to the variable 'sentiment_analysis_data'.

b) Give the column names as 'label' and 'message'

c) Try out the code snippets and answer the questions.


What is the output of the following command:
print(sentiment_analysis_data['label'].unique())
[true false] [yes no] None of these [1 0]

Select pre-processing techniques from the options


Stemming Lemmatization Tokenization Stopword removal All the options

In a Term Document Matrix (TDM) each row represents ______?


Word document TF-IDF value TF value

In a TDM, the rows represent documents and columns represent the terms.

a) Download the dataset from:


https://inclass.kaggle.com/c/si650winter11/download/training.txt and load it
to the variable 'sentiment_analysis_data'.
b) Give the column names as 'label' and 'message'
c) Try out the code snippets and answer the questions.

Which of the following command is used to view the dataset SIZE and what is the value
returned?

sentiment_analysis_data.size,(7086, 3) sentiment_analysis_data.shape(),(7086, 2)

sentiment_analysis_data.size(),(7086, 2) sentiment_analysis_data.shape,(7086, 3)

In Supervised learning, class labels of the training samples are


Unknown Doesn’t matter Known Partially known

TF-IDF is a feature extraction technique. TRUE (X) or FALSE

Which of the following is not a performance evaluation measure?


Accuracy score DecisionTree Classification report Confusion matrix

a) Download the dataset from: https://inclass.kaggle.com/c/si650winter11/download/training.txt

and load it to the variable 'sentiment_analysis_data'.

b) Give the column names as 'label' and 'message'

c) Try out the code snippets and answer the questions.


What does the command sentiment_analysis_data['label'].value_counts()
return?
The total count of elements in 'label' column
Number of columns in the dataset
Number of rows in the dataset
counts of unique values in the 'label' column

a) Download the dataset from: https://inclass.kaggle.com/c/si650winter11/download/training.txt

and load it to the variable 'sentiment_analysis_data'.

b) Give the column names as 'label' and 'message'

c) Try out the code snippets and answer the questions.

In document classification, each document has to be converted from full text to a document vector. TRUE FALSE
SVM is a
supervised learning algorithm. unsupervised learning algorithm.
Semi-supervised learning algorithm. weakly supervised learning algorithm.

An algorithm that counts how many times a word appears in a document


Bag-of-words (BOW) TF-IDF DTM TDM

Q1. Movie Recommendation systems are an example of:

1. Classification
2. Clustering
3. Reinforcement Learning
4. Regression

Options:

A. 2 Only

B. 1 and 2

C. 1 and 3

D. 2 and 3

E. 1, 2 and 3

F. 1, 2, 3 and 4

Solution: (E)

Generally, movie recommendation systems cluster the users in a finite number of similar groups
based on their previous activities and profile. Then, at a fundamental level, people in the same
cluster are made similar recommendations.

In some scenarios, this can also be approached as a classification problem for assigning the most
appropriate movie class to the user of a specific group of users. Also, a movie recommendation
system can be viewed as a reinforcement learning problem where it learns by its previous
recommendations and improves the future recommendations.

Q2. Sentiment Analysis is an example of:

1. Regression
2. Classification
3. Clustering
4. Reinforcement Learning

Options:

A. 1 Only

B. 1 and 2

C. 1 and 3

D. 1, 2 and 3

E. 1, 2 and 4

F. 1, 2, 3 and 4

Solution: (E)

Sentiment analysis at the fundamental level is the task of classifying the sentiments represented
in an image, text or speech into a set of defined sentiment classes like happy, sad, excited,
positive, negative, etc. It can also be viewed as a regression problem for assigning a sentiment
score of say 1 to 10 for a corresponding image, text or speech.
Another way of looking at sentiment analysis is to consider it using a reinforcement learning
perspective where the algorithm constantly learns from the accuracy of past sentiment analysis
performed to improve the future performance.

Q3. Can decision trees be used for performing clustering?

A. True

B. False

Solution: (A)

Decision trees can also be used to form clusters in the data, but clustering often generates natural clusters and is not dependent on any objective function.

Q4. Which of the following is the most appropriate strategy for data cleaning before
performing clustering analysis, given less than desirable number of data points:

1. Capping and flooring of variables


2. Removal of outliers

Options:

A. 1 only

B. 2 only

C. 1 and 2

D. None of the above

Solution: (A)

Removal of outliers is not recommended if the data points are few in number. In this scenario, capping and flooring of variables is the most appropriate strategy.

Q5. What is the minimum no. of variables/ features required to perform clustering?

A. 0
B. 1

C. 2

D. 3

Solution: (B)

At least a single variable is required to perform clustering analysis. Clustering analysis with a
single variable can be visualized with the help of a histogram.

Q6. For two runs of K-Mean clustering is it expected to get same clustering results?

A. Yes

B. No

Solution: (B)

The K-Means clustering algorithm converges to local minima, which may correspond to the global minimum in some cases, but not always. Therefore, it's advised to run the K-Means algorithm multiple times before drawing inferences about the clusters.

However, note that it's possible to get the same clustering results from K-Means by setting the same seed value for each run. But that is done by simply making the algorithm choose the same set of random numbers for each run.

Q7. Is it possible that Assignment of observations to clusters does not change between
successive iterations in K-Means

A. Yes

B. No

C. Can’t say

D. None of these

Solution: (A)

When the K-Means algorithm has reached the local or global minima, it will not alter the
assignment of data points to clusters for two successive iterations.
Q8. Which of the following can act as possible termination conditions in K-Means?

1. For a fixed number of iterations.


2. Assignment of observations to clusters does not change between iterations. Except
for cases with a bad local minimum.
3. Centroids do not change between successive iterations.
4. Terminate when RSS falls below a threshold.

Options:

A. 1, 3 and 4

B. 1, 2 and 3

C. 1, 2 and 4

D. All of the above

Solution: (D)

All four conditions can be used as possible termination condition in K-Means clustering:

1. This condition limits the runtime of the clustering algorithm, but in some cases the
quality of the clustering will be poor because of an insufficient number of iterations.
2. Except for cases with a bad local minimum, this produces a good clustering, but runtimes
may be unacceptably long.
3. This also ensures that the algorithm has converged at the minima.
4. Terminate when RSS falls below a threshold. This criterion ensures that the clustering is
of a desired quality after termination. Practically, it’s a good practice to combine it with a
bound on the number of iterations to guarantee termination.

Q9. Which of the following clustering algorithms suffers from the problem of convergence
at local optima?

1. K-Means clustering algorithm


2. Agglomerative clustering algorithm
3. Expectation-Maximization clustering algorithm
4. Diverse clustering algorithm

Options:

A. 1 only
B. 2 and 3

C. 2 and 4

D. 1 and 3

E. 1,2 and 4

F. All of the above

Solution: (D)

Out of the options given, only the K-Means clustering algorithm and the EM clustering algorithm have the drawback of converging at local minima.

Q10. Which of the following algorithm is most sensitive to outliers?

A. K-means clustering algorithm

B. K-medians clustering algorithm

C. K-modes clustering algorithm

D. K-medoids clustering algorithm

Solution: (A)

Out of all the options, K-Means clustering algorithm is most sensitive to outliers as it uses the
mean of cluster data points to find the cluster center.

Q11. After performing K-Means Clustering analysis on a dataset, you observed the
following dendrogram. Which of the following conclusion can be drawn from the
dendrogram?
A. There were 28 data points in clustering analysis

B. The best no. of clusters for the analyzed data points is 4

C. The proximity function used is Average-link clustering

D. The above dendrogram interpretation is not possible for K-Means clustering analysis

Solution: (D)

A dendrogram is not possible for K-Means clustering analysis. However, one can create a clustergram based on K-Means clustering analysis.

Q12. How can Clustering (Unsupervised Learning) be used to improve the accuracy of
Linear Regression model (Supervised Learning):

1. Creating different models for different cluster groups.


2. Creating an input feature for cluster ids as an ordinal variable.
3. Creating an input feature for cluster centroids as a continuous variable.
4. Creating an input feature for cluster size as a continuous variable.
Options:

A. 1 only

B. 1 and 2

C. 1 and 4

D. 3 only

E. 2 and 4

F. All of the above

Solution: (F)

Creating an input feature for cluster ids as ordinal variable or creating an input feature for cluster
centroids as a continuous variable might not convey any relevant information to the regression
model for multidimensional data. But for clustering in a single dimension, all of the given
methods are expected to convey meaningful information to the regression model. For example,
to cluster people in two groups based on their hair length, storing clustering ID as ordinal
variable and cluster centroids as continuous variables will convey meaningful information.

Q13. What could be the possible reason(s) for producing two different dendrograms using
agglomerative clustering algorithm for the same dataset?

A. Proximity function used

B. No. of data points used

C. No. of variables used

D. B and C only

E. All of the above

E. All of the above

Solution: (E)

Change in either of Proximity function, no. of data points or no. of variables will lead to different
clustering results and hence different dendrograms.
Q14. In the figure below, if you draw a horizontal line on y-axis for y=2. What will be the
number of clusters formed?

A. 1

B. 2

C. 3

D. 4

Solution: (B)

Since the number of vertical lines intersecting the red horizontal line at y=2 in the dendrogram
are 2, therefore, two clusters will be formed.

Q15. What is the most appropriate no. of clusters for the data points represented by the
following dendrogram:
A. 2

B. 4

C. 6

D. 8

Solution: (B)

The decision of the no. of clusters that can best depict different groups can be chosen by
observing the dendrogram. The best choice of the no. of clusters is the no. of vertical lines in the
dendrogram cut by a horizontal line that can transverse the maximum distance vertically without
intersecting a cluster.
In the above example, the best choice of no. of clusters will be 4 as the red horizontal line in the
dendrogram below covers maximum vertical distance AB.

Q16. In which of the following cases will K-Means clustering fail to give good results?

1. Data points with outliers


2. Data points with different densities
3. Data points with round shapes
4. Data points with non-convex shapes

Options:

A. 1 and 2

B. 2 and 3

C. 2 and 4
D. 1, 2 and 4

E. 1, 2, 3 and 4

Solution: (D)

K-Means clustering algorithm fails to give good results when the data contains outliers, the
density spread of data points across the data space is different and the data points follow non-
convex shapes.

Q17. Which of the following metrics, do we have for finding dissimilarity between two
clusters in hierarchical clustering?

1. Single-link
2. Complete-link
3. Average-link

Options:

A. 1 and 2

B. 1 and 3

C. 2 and 3

D. 1, 2 and 3

Solution: (D)
All of the three methods i.e. single link, complete link and average link can be used for finding
dissimilarity between two clusters in hierarchical clustering.

Q18. Which of the following are true?

1. Clustering analysis is negatively affected by multicollinearity of features


2. Clustering analysis is negatively affected by heteroscedasticity

Options:

A. 1 only

B. 2 only

C. 1 and 2

D. None of them

Solution: (A)

Clustering analysis is not negatively affected by heteroscedasticity but the results are negatively
impacted by multicollinearity of features/ variables used in clustering as the correlated feature/
variable will carry extra weight on the distance calculation than desired.

Q19. Given, six points with the following attributes:


Which of the following clustering representations and dendrogram depicts the use of MIN
or Single link proximity function in hierarchical clustering:

A. B. C. D. (the four candidate clustering representations and dendrograms are images, not reproduced in this text)
Solution: (A)

For the single link or MIN version of hierarchical clustering, the proximity of two clusters is
defined to be the minimum of the distance between any two points in the different clusters. For
instance, from the table, we see that the distance between points 3 and 6 is 0.11, and that is the
height at which they are joined into one cluster in the dendrogram. As another example, the
distance between clusters {3, 6} and {2, 5} is given by dist({3, 6}, {2, 5}) = min(dist(3, 2),
dist(6, 2), dist(3, 5), dist(6, 5)) = min(0.1483, 0.2540, 0.2843, 0.3921) = 0.1483.

Q20 Given, six points with the following attributes:

Which of the following clustering representations and dendrogram depicts the use of MAX
or Complete link proximity function in hierarchical clustering:
A. B. C. D. (the four candidate clustering representations and dendrograms are images, not reproduced in this text)

Solution: (B)

For the complete link or MAX version of hierarchical clustering, the proximity of two clusters is defined to be the maximum of the distance between any two points in the different clusters. Similarly, here points 3 and 6 are merged first. However, {3, 6} is merged with {4}, instead of {2, 5}. This is because dist({3, 6}, {4}) = max(dist(3, 4), dist(6, 4)) = max(0.1513, 0.2216) = 0.2216, which is smaller than dist({3, 6}, {2, 5}) = max(dist(3, 2), dist(6, 2), dist(3, 5), dist(6, 5)) = max(0.1483, 0.2540, 0.2843, 0.3921) = 0.3921 and dist({3, 6}, {1}) = max(dist(3, 1), dist(6, 1)) = max(0.2218, 0.2347) = 0.2347.

Q21 Given, six points with the following attributes:


Which of the following clustering representations and dendrogram depicts the use of
Group average proximity function in hierarchical clustering:

A. B. C. D. (the four candidate clustering representations and dendrograms are images, not reproduced in this text)

Solution: (C)

For the group average version of hierarchical clustering, the proximity of two clusters is defined to be the average of the pairwise proximities between all pairs of points in the different clusters. This is an intermediate approach between MIN and MAX. It is expressed by the following equation:

proximity(Ci, Cj) = ( Σ over x in Ci, y in Cj of dist(x, y) ) / (mi × mj), where mi and mj are the sizes of clusters Ci and Cj.

Here are the distances between some clusters: dist({3, 6, 4}, {1}) = (0.2218 + 0.3688 + 0.2347)/(3 ∗ 1) = 0.2751. dist({2, 5}, {1}) = (0.2357 + 0.3421)/(2 ∗ 1) = 0.2889. dist({3, 6, 4}, {2, 5}) = (0.1483 + 0.2843 + 0.2540 + 0.3921 + 0.2042 + 0.2932)/(6 ∗ 1) = 0.2637. Because dist({3, 6, 4}, {2, 5}) is smaller than dist({3, 6, 4}, {1}) and dist({2, 5}, {1}), these two clusters are merged at the fourth stage.

Q22. Given, six points with the following attributes:


Which of the following clustering representations and dendrogram depicts the use of
Ward’s method proximity function in hierarchical clustering:

A. B. C. D. (the four candidate clustering representations and dendrograms are images, not reproduced in this text)

Solution: (D)

Ward's method is a centroid method. The centroid method calculates the proximity between two clusters by calculating the distance between the centroids of the clusters. For Ward's method, the proximity between two clusters is defined as the increase in the squared error that results when the two clusters are merged. Applying Ward's method to the sample data set of six points yields a clustering somewhat different from those produced by MIN, MAX, and group average.

Q23. What should be the best choice of no. of clusters based on the following results:
A. 1

B. 2

C. 3

D. 4

Solution: (C)

The silhouette coefficient is a measure of how similar an object is to its own cluster compared to
other clusters. Number of clusters for which silhouette coefficient is highest represents the best
choice of the number of clusters.
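
A minimal sketch of this selection rule with scikit-learn (X stands for any feature matrix; it is a placeholder, not data from the questions above):

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Pick the k with the highest mean silhouette coefficient.
for k in range(2, 7):
    labels = KMeans(n_clusters=k).fit_predict(X)
    print(k, silhouette_score(X, labels))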

Q24. Which of the following is/are valid iterative strategy for treating missing values before
clustering analysis?

A. Imputation with mean

B. Nearest Neighbor assignment

C. Imputation with Expectation Maximization algorithm

D. All of the above

Solution: (C)
All of the mentioned techniques are valid for treating missing values before clustering analysis
but only imputation with EM algorithm is iterative in its functioning.

Q25. K-Mean algorithm has some limitations. One of the limitation it has is, it makes hard
assignments(A point either completely belongs to a cluster or not belongs at all) of points to
clusters.

Note: Soft assignment can be considered as the probability of being assigned to each cluster (say K = 3, and for some point xn, p1 = 0.7, p2 = 0.2, p3 = 0.1).

Which of the following algorithm(s) allows soft assignments?

1. Gaussian mixture models


2. Fuzzy K-means

Options:

A. 1 only

B. 2 only

C. 1 and 2

D. None of these

Solution: (C)

Both, Gaussian mixture models and Fuzzy K-means allows soft assignments.

Q26. Assume, you want to cluster 7 observations into 3 clusters using K-Means clustering
algorithm. After first iteration clusters, C1, C2, C3 has following observations:

C1: {(2,2), (4,4), (6,6)}

C2: {(0,4), (4,0)}

C3: {(5,5), (9,9)}

What will be the cluster centroids if you want to proceed for second iteration?

A. C1: (4,4), C2: (2,2), C3: (7,7)


B. C1: (6,6), C2: (4,4), C3: (9,9)

C. C1: (2,2), C2: (0,0), C3: (5,5)

D. None of these

Solution: (A)

Finding centroid for data points in cluster C1 = ((2+4+6)/3, (2+4+6)/3) = (4, 4)

Finding centroid for data points in cluster C2 = ((0+4)/2, (4+0)/2) = (2, 2)

Finding centroid for data points in cluster C3 = ((5+9)/2, (5+9)/2) = (7, 7)

Hence, C1: (4,4), C2: (2,2), C3: (7,7)
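The same centroid update can be checked with a few lines of NumPy using the observations from the question:

import numpy as np

C1 = np.array([[2, 2], [4, 4], [6, 6]])
C2 = np.array([[0, 4], [4, 0]])
C3 = np.array([[5, 5], [9, 9]])
# The new centroid of each cluster is the per-coordinate mean of its points.
print(C1.mean(axis=0), C2.mean(axis=0), C3.mean(axis=0))
# -> [4. 4.] [2. 2.] [7. 7.]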

Q27. Assume, you want to cluster 7 observations into 3 clusters using K-Means clustering
algorithm. After the first iteration, clusters C1, C2 and C3 have the following observations:

C1: {(2,2), (4,4), (6,6)}

C2: {(0,4), (4,0)}

C3: {(5,5), (9,9)}

What will be the Manhattan distance of observation (9, 9) from cluster centroid C1 in the
second iteration?

A. 10

B. 5*sqrt(2)

C. 13*sqrt(2)

D. None of these

Solution: (A)

Manhattan distance between centroid C1 i.e. (4, 4) and (9, 9) = (9-4) + (9-4) = 10
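The same computation with SciPy's cityblock distance:

from scipy.spatial.distance import cityblock
# Manhattan (city block) distance = sum of absolute coordinate differences.
print(cityblock([4, 4], [9, 9]))  # |9-4| + |9-4| = 10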

Q28. If two variables V1 and V2 are used for clustering, which of the following are true
for K-means clustering with k = 3?
1. If V1 and V2 have a correlation of 1, the cluster centroids will be in a straight line
2. If V1 and V2 have a correlation of 0, the cluster centroids will be in a straight line

Options:

A. 1 only

B. 2 only

C. 1 and 2

D. None of the above

Solution: (A)

If the correlation between the variables V1 and V2 is 1, then all the data points will be in a
straight line. Hence, all the three cluster centroids will form a straight line as well.

Q29. Feature scaling is an important step before applying the K-Means algorithm. What is
the reason behind this?

A. In distance calculation it will give the same weight to all features

B. You always get the same clusters whether or not you use feature scaling

C. In Manhattan distance it is an important step but in Euclidean distance it is not

D. None of these

Solution: (A)

Feature scaling ensures that all the features get the same weight in the clustering analysis. Consider a
scenario of clustering people based on their weights (in kg) with range 55-110 and heights (in
feet) with range 5.6 to 6.4. In this case, the clusters produced without scaling can be very
misleading as the range of weight is much higher than that of height. Therefore, it is necessary to
bring them to the same scale so that they have equal weightage on the clustering result.
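A minimal sketch of standardization with scikit-learn (the weight/height values are illustrative):

import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical people: weight in kg (55-110) vs height in feet (5.6-6.4).
X = np.array([[55.0, 5.6], [80.0, 6.0], [110.0, 6.4]])
# Standardization gives each feature zero mean and unit variance, so both
# features contribute comparably to Euclidean/Manhattan distances.
print(StandardScaler().fit_transform(X))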

Q30. Which of the following methods is used for finding the optimal number of clusters in the
K-Means algorithm?

A. Elbow method
B. Manhattan method

C. Euclidean method

D. All of the above

E. None of these

Solution: (A)

Out of the given options, only elbow method is used for finding the optimal number of clusters.
The elbow method looks at the percentage of variance explained as a function of the number of
clusters: One should choose a number of clusters so that adding another cluster doesn’t give
much better modeling of the data.
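A minimal sketch of the elbow method with scikit-learn (toy data, assumed for the example), printing the within-cluster SSE (inertia) for each k:

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)  # toy data
for k in range(1, 10):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, round(km.inertia_, 1))  # within-cluster SSE; look for the elbow

Plotting inertia against k and picking the point where the curve bends (the "elbow") gives the usual visual version of this procedure.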

Q31. What is true about K-Mean Clustering?

1. K-means is extremely sensitive to cluster center initializations


2. Bad initialization can lead to poor convergence speed
3. Bad initialization can lead to bad overall clustering

Options:

A. 1 and 3

B. 1 and 2

C. 2 and 3

D. 1, 2 and 3

Solution: (D)

All three of the given statements are true. K-means is extremely sensitive to cluster center
initialization. Bad initialization can lead to poor convergence speed as well as bad overall
clustering.

Q32. Which of the following can be applied to get good results for K-means algorithm
corresponding to global minima?

1. Try to run algorithm for different centroid initialization


2. Adjust number of iterations
3. Find out the optimal number of clusters

Options:

A. 2 and 3

B. 1 and 3

C. 1 and 2

D. All of the above

Solution: (D)

All of these are standard practices that are used in order to obtain good clustering results.
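In scikit-learn these practices map directly onto KMeans parameters, as in this minimal sketch (toy data, assumed for the example):

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
# n_init re-runs K-Means with several random centroid initializations and
# keeps the run with the lowest SSE; max_iter bounds iterations per run.
km = KMeans(n_clusters=3, n_init=25, max_iter=500, random_state=0).fit(X)
print(km.inertia_)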

Q33. What should be the best choice for number of clusters based on the following results:
A. 5

B. 6

C. 14

D. Greater than 14

Solution: (B)

Based on the above results, the best choice of number of clusters using elbow method is 6.

Q34. What should be the best choice for number of clusters based on the following results:
A. 2

B. 4

C. 6

D. 8

Solution: (C)

Generally, a higher average silhouette coefficient indicates better clustering quality. In this plot,
the silhouette coefficient alone would suggest k = 2, where its value is highest. However, the SSE
of that clustering solution (k = 2) is too large. At k = 6, the SSE is much lower, and the average
silhouette coefficient is also very high, only slightly lower than at k = 2. Thus, the best choice is
k = 6.

Q35. Which of the following sequences is correct for a K-Means algorithm using Forgy
method of initialization?
1. Specify the number of clusters
2. Assign cluster centroids randomly
3. Assign each data point to the nearest cluster centroid
4. Re-assign each point to nearest cluster centroids
5. Re-compute cluster centroids

Options:

A. 1, 2, 3, 5, 4

B. 1, 3, 2, 4, 5

C. 2, 1, 3, 4, 5

D. None of these

Solution: (A)

The methods used for initialization in K means are Forgy and Random Partition. The Forgy
method randomly chooses k observations from the data set and uses these as the initial means.
The Random Partition method first randomly assigns a cluster to each observation and then
proceeds to the update step, thus computing the initial mean to be the centroid of the cluster’s
randomly assigned points.

Q36. If you are using Multinomial mixture models with the expectation-maximization
algorithm for clustering a set of data points into two clusters, which of the assumptions are
important:

A. All the data points follow two Gaussian distributions

B. All the data points follow n Gaussian distributions (n > 2)

C. All the data points follow two multinomial distributions

D. All the data points follow n multinomial distributions (n > 2)

Solution: (C)

In the EM algorithm for clustering, it is essential to choose the same number of clusters as the
number of different distributions the data points are expected to be generated from, and the
distributions must be of the same type.
Q37. Which of the following is/are not true about Centroid based K-Means clustering
algorithm and Distribution based expectation-maximization clustering algorithm:

1. Both start with random initializations
2. Both are iterative algorithms
3. Both have strong assumptions that the data points must fulfill
4. Both are sensitive to outliers
5. Expectation maximization algorithm is a special case of K-Means
6. Both require prior knowledge of the no. of desired clusters
7. The results produced by both are non-reproducible.

Options:

A. 1 only

B. 5 only

C. 1 and 3

D. 6 and 7

E. 4, 6 and 7

F. None of the above

Solution: (B)

All of the above statements are true except the 5th; rather, K-Means is a special case of the EM
algorithm in which only the centroids of the cluster distributions are calculated at each iteration.

Q38. Which of the following is/are not true about DBSCAN clustering algorithm:

1. For data points to be in a cluster, they must be within a distance threshold of a core point
2. It has strong assumptions for the distribution of data points in dataspace
3. It has substantially high time complexity of order O(n³)
4. It does not require prior knowledge of the no. of desired clusters
5. It is robust to outliers

Options:

A. 1 only

B. 2 only
C. 4 only

D. 2 and 3

E. 1 and 5

F. 1, 3 and 5

Solution: (D)

 DBSCAN can form a cluster of any arbitrary shape and does not have strong assumptions
for the distribution of data points in the dataspace.
 DBSCAN has a time complexity of O(n log n) when a spatial index is used (worst case
O(n²)), not O(n³).
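A minimal DBSCAN sketch with scikit-learn (toy two-moons data, assumed for the example), showing that no cluster count is supplied and that noise points are labelled -1:

from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaving half-moons: a non-globular shape K-Means handles poorly.
X, _ = make_moons(n_samples=200, noise=0.05, random_state=0)
labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X)
# No n_clusters argument is needed; points labelled -1 are treated as noise.
print(set(labels))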

Q39. Which of the following are the high and low bounds for the existence of F-Score?

A. [0,1]

B. (0,1)

C. [-1,1]

D. None of the above

Solution: (A)

The lowest and highest possible values of the F score are 0 and 1, with 1 representing that every data
point is assigned to the correct cluster and 0 representing that the precision and/or recall of the
clustering analysis are 0. In clustering analysis, a high value of F score is desired.
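For a quick check, scikit-learn's f1_score computes F = 2 * precision * recall / (precision + recall) on hypothetical labels:

from sklearn.metrics import f1_score

y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1]
# F = 2PR/(P + R), bounded in [0, 1].
print(f1_score(y_true, y_pred))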

Q40. Following are the results observed for clustering 6000 data points into 3 clusters: A, B
and C:
What is the F1-Score with respect to cluster B?

A. 3

B. 4

C. 5

D. 6

Solution: (D)

Here,

True Positive, TP = 1200

True Negative, TN = 600 + 1600 = 2200

False Positive, FP = 1000 + 200 = 1200

29-Aug-18
Select the correct option which directly achieves multi-class classification (without
support of binary classifiers)
K Nearest Neighbor SVM Neural networks Decision trees

Classification where each data is mapped to more than one class is called
Multi class classification (X) Multi label classification Binary classification

The classification where each data point is mapped to more than one class is called Multi Label Classification.

Images, documents are examples of Unstructured Data Structured Data

The most widely used package for machine learning in python is


Pillow bottle django sklearn

Sentiment classification is a special task of text classification whose objective is to classify a text
according to the sentimental polarities of opinions it contains (Pang et al., 2002), e.g., favorable or
unfavorable, positive or negative. Scikit-learn is an open-source machine learning library for
the Python programming language.

Imagine you have just finished training a decision tree for spam classification and it
is showing abnormally bad performance on both your training and test sets. Assume
that your implementation has no bugs. What could be the reason for this problem?
Your decision trees are too shallow.
You need to increase the learning rate
You are overfitting.
All the options

a) Download the dataset from:


https://inclass.kaggle.com/c/si650winter11/download/training.txt and load it to the
variable 'sentiment_analysis_data'.
b) Give the column names as 'label' and 'message'
c) Try out the code snippets and answer the questions.
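A minimal loading sketch for steps (a) and (b), assuming the Kaggle file is tab-separated with no header row (label first, then the message text):

import pandas as pd

# Assumption: training.txt is tab-separated with no header row.
sentiment_analysis_data = pd.read_csv('training.txt', sep='\t', header=None,
                                      names=['label', 'message'])
print(sentiment_analysis_data.head())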

What does the command sentiment_analysis_data['label'].value_counts() return?


counts of unique values in the 'label' column
Number of rows in the dataset
Number of columns in the dataset
The total count of elements in 'label' column

Which numerical statistic is used to identify the importance of a rare word in a
document? (It is supposed to be IDF; among the given options, TF-IDF.)
None of the options TF-IDF DF TF

a) Download the dataset from:


https://inclass.kaggle.com/c/si650winter11/download/training.txt and load it to the
variable 'sentiment_analysis_data'.
b) Give the column names as 'label' and 'message'
c) Try out the code snippets and answer the questions.

What command should be given to tokenize a sentence into words?


from nltk import sentence_tokenize, Word_tokens = sentence_tokenize(sentence)
from nltk.tokenize import word_tokenize, Word_tokens = word_tokenize(sentence)
from nltk.tokenizer import word_tokenizer, Word_tokens = word_tokenizer(sentence)
from nltk import tokenize_words, Word_tokens = tokenize_words(sentence)
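A runnable version of the correct option (the second one); the exact resource name downloaded may vary slightly across NLTK versions:

import nltk
nltk.download('punkt')  # one-time download of the tokenizer models
from nltk.tokenize import word_tokenize

sentence = "if you cannot do great things, do small things in a great way"
word_tokens = word_tokenize(sentence)
print(word_tokens)  # word list, with punctuation kept as separate tokens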

19/09/2018
Select the correct statements about Nonlinear classification
kernel tricks are used by Nonlinear classifiers to achieve maximum-margin hyperplanes.
The concept of slack variables is used in SVM for Nonlinear classification
kernel trick is used in SVM for non-linear classification

The fit(X, y) is used to


Initialize the classifier
Test the classifier
Train the Classifier
Evaluate the classifier
Model Tuning helps to increase the accuracy
Can't say False True (X)

TF and IDF use matrix representations: True

Identify the stop words from the following "computer"

Both "the" and "it" "fragment" "it" "the"

Which of the given hyper parameter(s), when increased may cause random forest to
over fit the data?
Number of Trees Learning Rate Depth of Tree

Usually, if we increase the depth of the tree it will cause overfitting. Learning rate is not a
hyperparameter in random forest. Increasing the number of trees reduces variance and does not,
by itself, cause overfitting or underfitting.

a) Download the dataset from: https://inclass.kaggle.com/c/si650winter11/download/training.txt

and load it to the variable 'sentiment_analysis_data'.

b) Give the column names as 'label' and 'message'

c) Try out the code snippets and answer the questions.

What does the command sentiment_analysis_data['label'].value_counts()


return?
Number of columns in the dataset (X)
Number of rows in the dataset
counts of unique values in the 'label' column
The total count of elements in 'label' column

Which of the following is not a preprocessing method used for unstructured data
classification?
confusion_matrix stop word removal lemmatization stemming

Which NLP technique uses lexical knowledge base to obtain the correct base form
of the words?

stop word removal lemmatization Tokenization object standardization


In a Term Frequency Inverse Document Frequency (TFIDF) matrix, the
term importance is expressed by Inverse Document Frequency (IDF).

IDF diminishes the weight of the most commonly occurring words and increases the
weightage of rare words.
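A minimal sketch of this weighting with scikit-learn's TfidfVectorizer (toy documents; get_feature_names_out assumes scikit-learn >= 1.0):

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat", "the cat sat on the mat", "dogs chase cats"]
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(docs)
# Common words like "the" get low IDF weights; rarer words get higher ones.
print(dict(zip(tfidf.get_feature_names_out(), tfidf.idf_.round(2))))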

In a Document Term Matrix (DTM) each row represents ______?


Document TF value word TF-IDF value
Supervised learning differs from unsupervised learning in that supervised
learning requires
Unlabeled data None of the options Labeled data raw data

An algorithm that counts how many times a word appears in a document


DTM Bag-of-words(BOW) TF-IDF TDM

Clustering is a supervised classification True False

Clustering is an unsupervised classification, so the answer is False.

What is the purpose of lemmatization?


To remove redundant words To split into sentences
To convert a sentence to words To convert words to a proper base form

SVM is a
weakly supervised learning algorithm. supervised learning algorithm.
Semi-supervised learning algorithm. unsupervised learning algorithm.

Can we consider sentiment classification as a text classification problem?


No Yes

Which type of cross validation is used for imbalanced dataset?


K –Fold Leave One Out Stratified Shuffle Split
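A minimal sketch of a stratified split with scikit-learn (toy imbalanced labels, assumed for the example):

import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

X = np.arange(40).reshape(20, 2)
y = np.array([0] * 16 + [1] * 4)  # imbalanced labels (80:20)
sss = StratifiedShuffleSplit(n_splits=3, test_size=0.25, random_state=0)
for train_idx, test_idx in sss.split(X, y):
    # Each split preserves the 80:20 class ratio in train and test sets.
    print(np.bincount(y[test_idx]))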

Cross-validation causes over-fitting: False (it is used to estimate generalization and guard against over-fitting).

Pruning is a technique associated with


Decision tree Linear regression Logistic regression SVM
What are the advantages of Naive Bayes?

1. It will converge quicker than discriminative models like logistic regression AND it requires less
training data
2. Requires less training data
3. None of the options
4. It will converge quicker than discriminative models like logistic regression

The fit(X, y) is used to

1. Initialize the classifier


2. Train the Classifier
3. Test the classifier
4. Evaluate the classifier

Higher value of which of the following hyper-parameters is better for decision tree algorithm?

1. Cannot say
2. Number of samples used for split
3. Depth of tree
4. Samples for leaf

Which of the given hyper parameter(s), when increased may cause random forest to over fit the
data?

1. Number of Trees
2. Learning Rate
3. Depth of Tree

Choose the correct sequence for classifier building from the following:

1. Initialize -> Train -> Predict -> Evaluate


2. Train -> Test -> Initialize -> Predict
3. Initialize -> Evaluate -> Train -> Predict
4. None of the options
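The Initialize -> Train -> Predict -> Evaluate sequence in a minimal scikit-learn sketch (toy spam/ham corpus, assumed for the example):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

# Hypothetical toy corpus; in the case study the texts come from the dataset.
texts = ["win a free prize now", "are we meeting today",
         "free cash offer claim now", "see you at lunch",
         "urgent free offer", "lunch at noon works"]
labels = [1, 0, 1, 0, 1, 0]  # 1 = spam, 0 = ham

X = CountVectorizer().fit_transform(texts)
X_tr, X_te, y_tr, y_te = train_test_split(X, labels, test_size=0.33,
                                          random_state=0)
clf = MultinomialNB()                 # Initialize
clf.fit(X_tr, y_tr)                   # Train
pred = clf.predict(X_te)              # Predict
print(accuracy_score(y_te, pred))     # Evaluate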

Which numerical statistics is used to identify the importance of a rare word in a document?

1. TF
2. TF-IDF
3. None of the options
4. DF

Supervised learning differs from unsupervised learning in that supervised learning requires

1. Raw data
2. Labeled data
3. Unlabeled data
4. None of the options

Select the correct statements about Nonlinear classification

1. kernel tricks are used by Nonlinear classifiers to achieve maximum-margin hyperplanes.


2. kernel trick is used in SVM for non-linear classification
3. The concept of slack variables is used in SVM for Nonlinear classification

Which NLP technique uses lexical knowledge base to obtain the correct base form of the words?

1. lemmatization
2. tokenization
3. object standardization
4. stop word removal

What is the output of the sentence “Good words bring good feelings to the heart” after
performing tokenization, lemmatization and stop word removal.

1. ['Good', 'words', 'bring', 'good', 'feelings', 'to', 'the', 'heart']


2. ['Good', 'word', 'bring', 'good', 'feeling', 'to', 'the', 'heart']
3. 'Good word bring good feeling heart'
4. 'Good words bring good feelings heart'
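A minimal NLTK sketch of this pipeline (resource downloads are one-time; exact outputs can vary slightly across NLTK versions):

import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

for pkg in ('punkt', 'stopwords', 'wordnet'):
    nltk.download(pkg)

tokens = word_tokenize("Good words bring good feelings to the heart")
lemmatizer = WordNetLemmatizer()
lemmas = [lemmatizer.lemmatize(t) for t in tokens]  # words -> word, feelings -> feeling
stops = set(stopwords.words('english'))
print(' '.join(w for w in lemmas if w.lower() not in stops))
# -> roughly: Good word bring good feeling heart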

Classification where each data is mapped to more than one class is called

1. Binary classification
2. Multi Label Classification
3. Multi Class Classification

email spam data is an example of

1. Structured Data
2. Unstructured Data

SVM is a

1. weakly supervised learning algorithm. (X)


2. Semi-supervised learning algorithm.
3. supervised learning algorithm.
4. unsupervised learning algorithm.

Stemming and lemmatization gives the same result. (true/false)

 false
Which type of cross validation is used for imbalanced dataset?

 Leave One Out


 K-Fold
 Stratified Shuffle Split (answer)

An algorithm that counts how many times a word appears in a document

1. TF-IDF (Term Frequency Inverse Document Frequency)


2. DTM
3. Bag-of-words(BOW)
4. TDM

 answer: 3. Bag-of-words is predominantly used for calculating the term (word) frequency, i.e. the
number of times a term occurs in a document/sentence.
 The Term Document Matrix (TDM) is a matrix that contains the frequency of occurrence of
terms in a collection of documents.
 In a Term Frequency Inverse Document Frequency (TFIDF) matrix, the term importance is
expressed by Inverse Document Frequency (IDF). IDF diminishes the weight of the most
commonly occurring words and increases the weightage of rare words.
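A minimal bag-of-words sketch with scikit-learn's CountVectorizer (toy documents):

from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat", "the cat sat on the mat"]
bow = CountVectorizer()
counts = bow.fit_transform(docs)
print(bow.get_feature_names_out())  # the learned vocabulary
print(counts.toarray())             # raw term counts per document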

In a Document Term Matrix (DTM) each row represents ______?

A document-term matrix or term-document matrix is a mathematical matrix that describes the
frequency of terms that occur in a collection of documents. In a document-term matrix, rows
correspond to documents in the collection and columns correspond to terms.

Pruning is a technique associated with

1. Logistic regression
2. SVM
3. Linear regression
4. Decision tree

Images and Documents are examples of

 Unstructured data

TF and IDF use matrix representations

 true
 Term Frequency-Inverse Document Frequency

Which of the following is not a pre-processing method used for unstructured data classification?

1. stemming
2. confusion matrix
3. lemmatization
4. stop word removal

Choose the correct sequence from the following:


1. PreProcessing -> Model Building -> Predict
2. Data Analysis -> Pre-Processing -> Model Building -> Predict
3. Data Analysis -> Pre-Processing -> Predict -> Train
4. Pre-Processing -> Predict -> Train

Lemmatization offers better precision than stemming (true)

TF-IDF is a feature extraction technique (true)

Clustering is a supervised classification (false)

Can we consider sentiment classification as a text classification problem? (true)

Which of the following is not a performance evaluation measure?

 Confusion Matrix
 Classification Report
 Decision Tree X
 Accuracy score

Which of the following command is used to view the dataset SIZE and what is the value
returned?
a) Download the dataset from:
https://inclass.kaggle.com/c/si650winter11/download/training.txt and load it
to the variable 'sentiment_analysis_data'.
b) Give the column names as 'label' and 'message'
c) Try out the code snippets and answer the questions.

 sentiment_analysis_data.shape,(7086, 3)

Imagine you have just finished training a decision tree for spam classification and it is showing
abnormally bad performance on both your training and test sets. Assume that your
implementation has no bugs. What could be the reason for this problem?

 You are overfitting (X)


 Your decision trees are too shallow.
 You need to increase the learning rate
 All the options
The most widely used package for machine learning in python is

 sklearn

What is the tokenized output of the sentence "if you cannot do great things, do small things in a
great way"?

A technique used to depict the performance in a tabular form that has 2 dimensions namely
'actual' and 'predicted' sets of data.

 Confusion Matrix
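A minimal confusion-matrix sketch with scikit-learn (hypothetical labels):

from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 1]
# Rows are actual classes, columns are predicted classes.
print(confusion_matrix(y_true, y_pred))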

What is the output of the sentence "Good words bring good feelings to the heart" after
performing tokenization, lemmatization and stop word removal.

 'Good word bring good feeling heart'

Can we consider sentiment classification as a text classification problem?

 YES

Select the correct option which directly achieve multi-class classification (without support of
binary classifiers)

 SVM - SVMs are inherently two-class classifiers.


 Neural networks - ??
 Decision trees - Decision trees are a powerful classification technique. The tree tries to infer
a split of the training data based on the values of the available features to produce a good
generalization. The algorithm can naturally handle binary or multi-class classification problems.
 K Nearest Neighbor - k-nearest neighbors kNN is considered among the oldest non-
parametric classification algorithms

To view the first 3 rows of the dataset, which of the following commands are used?
a) Download the dataset from:
https://inclass.kaggle.com/c/si650winter11/download/training.txt and load it
to the variable 'sentiment_analysis_data'.
b) Give the column names as 'label' and 'message'
c) Try out the code snippets and answer the questions.

 sentiment_analysis_data.head(3)

Select pre-processing techniques from the options

 Stopword removal
 Lemmatization
 All the options
 Tokenization
 Stemming

High classification accuracy always indicates a good classifier.

 False (on a highly imbalanced dataset, a classifier that always predicts the majority class gets
high accuracy yet is useless; see the question on imbalanced classes below)

Inverse Document frequency is used in term document matrix.

 TFIDF, short for term frequency–inverse document frequency, is a numerical statistic that is
intended to reflect how important a word is to a document in a collection or corpus.

 True

Which one of the following is not a classification technique?

1. SGDClassifier
2. StratifiedShuffleSplit
3. SVM
4. Random Forest

A classifer that can compute using numeric as well as categorical values is

1. Decision Tree Classifier


2. SVM Classifier
3. Random Forest Classifier
4. Naive Bayes Classifier

What is the purpose of lemmatization?

 To convert words to their base form

Model Tuning helps to increase the accuracy

 True

What is the output of the following command:


print(sentiment_analysis_data['label'].unique())
a) Download the dataset from:
https://inclass.kaggle.com/c/si650winter11/download/training.txt and load it
to the variable 'sentiment_analysis_data'.
b) Give the column names as 'label' and 'message'
c) Try out the code snippets and answer the questions.

 [yes no]
 None of these
 [1 0] ?
 [true false]
What command should be given to tokenize a sentence into words?
a) Download the dataset from:
https://inclass.kaggle.com/c/si650winter11/download/training.txt and load it
to the variable 'sentiment_analysis_data'.
b) Give the column names as 'label' and 'message'
c) Try out the code snippets and answer the questions.
from nltk.tokenize import word_tokenize
Word_tokens =word_tokenize(sentence)

Let's assume, you are solving a classification problem with highly imbalanced class. The majority
class is observed 99% of times in the training data. Which of the following is true when your
model has 99% accuracy after taking the predictions on test data. ?

1. For imbalanced class problems, precision and recall metrics aren’t good.
2. For imbalanced class problems, accuracy metric is a good idea.
3. For imbalanced class problems, accuracy metric is not a good idea.

Which of the following command is used to view the dataset SIZE and what is the
value returned?
sentiment_analysis_data.shape,(7086, 3)
sentiment_analysis_data.size,(7086, 3)
sentiment_analysis_data.size(),(7086, 2)
sentiment_analysis_data.shape(),(7086, 2)

What command should be given to tokenize a sentence into words?


from nltk.tokenize import word_tokenize, Word_tokens = word_tokenize(sentence)
from nltk import tokenize_words, Word_tokens = tokenize_words(sentence)
from nltk.tokenizer import word_tokenizer, Word_tokens = word_tokenizer(sentence)
from nltk import sentence_tokenize, Word_tokens = sentence_tokenize(sentence)

What is the tokenized output of the sentence "Only do what your heart tells you"?
'Only', 'heart', 'tells'
'Only', 'do', 'what', 'your', 'heart', 'tell', 'you' (X)
'Only', 'do', 'what', 'heart', 'tells'
'Only', 'do', 'what', 'your', 'heart', 'tells', 'you'
Choose the correct sequence from the following:
Data Analysis -> PreProcessing -> Model Building -> Predict
PreProcessing -> Predict -> Train XX
PreProcessing -> Model Building -> Predict XX
Data Analysis -> PreProcessing -> Predict -> Train

Which of the given hyper parameter(s), when increased may cause random forest to over fit the
data?

1. Number of Trees
2. Learning Rate
3. Depth of Tree

What kind of classification is the given case study (Sentiment Analysis dataset)?


Multi class classification Multi label classification Binary classification (X)
