Data Mining Technical
1. Variable Identification
2. Univariate Analysis
3. Bi-variate Analysis
4. Missing values treatment
5. Outlier treatment
6. Variable transformation
7. Variable creation
Finally, we will need to iterate over steps 4 – 7 multiple times before we come up with our refined
model.
Variable Identification
First, identify Predictor (Input) and Target (output) variables. Next, identify the data type and category
of the variables.
Example: Suppose we want to predict whether students will play cricket or not (refer to the data set below). Here you need to identify the predictor variables, the target variable, the data type of the variables and the category of the variables.
Univariate Analysis
At this stage, we explore variables one by one. The method used to perform univariate analysis depends on whether the variable is categorical or continuous. Let's look at these methods and statistical measures for categorical and continuous variables individually:
Continuous Variables:- In the case of continuous variables, we need to understand the central tendency and spread of the variable. These are measured using various statistical metrics and visualization methods, as shown below:
Note: Univariate analysis is also used to highlight missing and outlier values. In the upcoming part of this series, we
will look at methods to handle missing and outlier values.
Categorical Variables: For categorical variables, we'll use a frequency table to understand the distribution of each category. We can also read it as the percentage of values under each category. It can be measured using two metrics, Count and Count%, against each category. A bar chart can be used for visualization.
Bi-variate Analysis
Bi-variate Analysis finds out the relationship between two variables. Here, we look for association and
disassociation between variables at a pre-defined significance level. We can perform bi-variate analysis
for any combination of categorical and continuous variables. The combination can be: Categorical &
Categorical, Categorical & Continuous and Continuous & Continuous. Different methods are used to
tackle these combinations during analysis process.
Let’s understand the possible combinations in detail:
Continuous & Continuous: While doing bi-variate analysis between two continuous variables, we
should look at scatter plot. It is a nifty way to find out the relationship between two variables. The pattern
of scatter plot indicates the relationship between variables. The relationship can be linear or non-linear.
Scatter plot shows the relationship between two variable but does not indicates the strength of
relationship amongst them. To find the strength of the relationship, we use Correlation. Correlation
varies between -1 and +1.
Various tools have functions to identify the correlation between variables. In Excel, the function CORREL() returns the correlation between two variables, and SAS uses the procedure PROC CORR to identify the correlation. These functions return the Pearson correlation value that quantifies the relationship between two variables:
In the above example, we have a good positive relationship (0.65) between the two variables X and Y.
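As a quick illustration of the same Pearson correlation computed outside Excel or SAS, here is a minimal Python sketch; the x and y arrays are made-up values, not the data from the example above:
import numpy as np
# Hypothetical paired observations of two continuous variables
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 2.9, 3.2, 4.8, 5.1, 6.3])
# Pearson correlation: covariance of x and y divided by the product of their standard deviations
r = np.corrcoef(x, y)[0, 1]
print(round(r, 2))  # a value close to +1 indicates a strong positive linear relationship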
Categorical & Categorical: To find the relationship between two categorical variables, we can use
following methods:
Two-way table: We can start analyzing the relationship by creating a two-way table of count
and count%. The rows represents the category of one variable and the columns represent the
categories of the other variable. We show count or count% of observations available in each
combination of row and column categories.
Stacked Column Chart: This method is more of a visual form of Two-way table.
Chi-Square Test: This test is used to derive the statistical significance of the relationship between the variables. It also tests whether the evidence in the sample is strong enough to generalize the relationship to a larger population. Chi-square is based on the difference between the expected and observed frequencies in one or more categories of the two-way table. It returns the probability for the computed chi-square statistic with its degrees of freedom.
A probability less than 0.05 indicates that the relationship between the variables is significant at 95% confidence. The chi-square test statistic for a test of independence of two categorical variables is χ² = Σ (O − E)² / E, where O is the observed frequency and E is the expected frequency in each cell of the two-way table.
From the previous two-way table, the expected count for product category 1 being of small size is 0.22. It is derived by taking the row total for Size (9) times the column total for Product category (2) and dividing by the sample size (81). This procedure is repeated for each cell. Statistical measures used to analyze the strength of the relationship include Cramer's V for nominal categorical variables.
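As a rough sketch of how this test can be run in Python (the counts below are hypothetical, not the table referred to above), scipy's chi-square test of independence returns the statistic, the p-value, the degrees of freedom and the expected frequencies:
import numpy as np
from scipy.stats import chi2_contingency
# Hypothetical two-way table of counts: rows = Size (Small, Medium, Large), columns = Product category (1, 2)
observed = np.array([[2, 7],
                     [5, 25],
                     [12, 30]])
chi2, p_value, dof, expected = chi2_contingency(observed)
print(chi2, dof, p_value)  # p_value < 0.05 suggests a significant relationship
print(expected)            # expected count = (row total * column total) / sample size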
Categorical & Continuous: While exploring relation between categorical and continuous variables, we
can draw box plots for each level of categorical variables. If levels are small in number, it will not show
the statistical significance. To look at the statistical significance we can perform Z-test, T-test or ANOVA.
Z-Test/ T-Test:- Either test assess whether mean of two groups are statistically different from
ANOVA:- It assesses whether the average of more than two groups is statistically different.
Example: Suppose, we want to test the effect of five different exercises. For this, we recruit 20 men
and assign one type of exercise to 4 men (5 groups). Their weights are recorded after a few
weeks. We need to find out whether the effect of these exercises on them is significantly different or
not. This can be done by comparing the weights of the 5 groups of 4 men each.
So far, we have covered the first three stages of data exploration: variable identification, univariate analysis and bivariate analysis. We also looked at various statistical and visual methods to identify the relationships between variables.
Now, we will look at methods of missing value treatment. More importantly, we will also look at why missing values occur in our data and why treating them is necessary.
Missing data in the training data set can reduce the power / fit of a model or can lead to a biased model, because the behavior of and relationships with other variables have not been analysed correctly. This can lead to wrong predictions or classifications.
Notice the missing values in the image shown above: in the left scenario, we have not treated missing values. The inference from that data set is that the chances of males playing cricket are higher than those of females. On the other hand, if you look at the second table, which shows the data after treatment of missing values (based on gender), we can see that females have a higher chance of playing cricket than males.
We looked at the importance of treatment of missing values in a dataset. Now, let’s identify the reasons
for occurrence of these missing values. They may occur at two stages:
1. Data Extraction: It is possible that there are problems with the extraction process. In such cases, we should double-check for correct data with the data guardians. Some hashing procedures can also be used to make sure the data extraction is correct. Errors at the data extraction stage are typically easy to find and can be corrected easily as well.
2. Data collection: These errors occur at the time of data collection and are harder to correct. They can be categorized into four types:
o Missing completely at random: This is a case when the probability of a value being missing is the same for all observations. For example: respondents of a data collection process decide to declare their earnings only after tossing a fair coin. If a head occurs, the respondent declares his / her earnings, otherwise not. Here each observation has an equal chance of having a missing value.
o Missing at random: This is a case when a variable is missing at random and the missing ratio varies for different values / levels of other input variables. For example: when collecting data on age, females may have a higher rate of missing values than males.
o Missing that depends on unobserved predictors: This is a case when the
missing values are not random and are related to the unobserved input variable. For
example: In a medical study, if a particular diagnostic causes discomfort, then there is
higher chance of drop out from the study. This missing value is not at random unless
we have included “discomfort” as an input variable for all patients.
o Missing that depends on the missing value itself: This is a case when the probability of a missing value is directly correlated with the missing value itself. For example: people with higher or lower incomes are more likely not to respond when asked about their earnings.
The common methods to treat missing values are:
1. Deletion: It is of two types: list wise deletion and pair wise deletion.
o In list wise deletion, we delete observations where any of the variables is missing. Simplicity is one of the major advantages of this method, but it reduces the power of the model because it reduces the sample size.
o In pair wise deletion, we perform each analysis with all cases in which the variables of interest are present. The advantage of this method is that it keeps as many cases as possible available for each analysis; one disadvantage is that it uses different sample sizes for different variables.
o Deletion methods are best used when the nature of the missing data is “missing completely at random”; otherwise, non-random missing values can bias the model output.
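The sketch below shows both deletion approaches with pandas on a small hypothetical data set (the column names and values are made up for illustration):
import numpy as np
import pandas as pd
# Hypothetical data set with missing values
df = pd.DataFrame({"Gender": ["M", "F", "M", None, "F"],
                   "Manpower": [25, np.nan, 30, 28, 22],
                   "Sales": [100, 120, np.nan, 90, 110]})
# List wise deletion: drop every observation where any variable is missing
listwise = df.dropna()
# Pair wise deletion: each analysis keeps all cases where its variables of interest are present,
# e.g. the Manpower-Sales correlation only ignores rows missing one of those two columns
pairwise_corr = df[["Manpower", "Sales"]].dropna().corr()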
2. Mean/ Mode/ Median Imputation: Imputation is a method to fill in the missing values with
estimated ones. The objective is to employ known relationships that can be identified in the
valid values of the data set to assist in estimating the missing values. Mean / Mode / Median
imputation is one of the most frequently used methods. It consists of replacing the missing data
for a given attribute by the mean or median (quantitative attribute) or mode (qualitative attribute)
of all known values of that variable. It can be of two types:-
o Generalized Imputation: In this case, we calculate the mean or median of all non-missing values of the variable and then replace the missing values with it. As in the table above, the variable “Manpower” has missing values, so we take the average of all non-missing values of “Manpower” (28.33) and replace the missing values with it.
o Similar case Imputation: In this case, we calculate the average of the non-missing values for gender “Male” (29.75) and “Female” (25) individually and then replace the missing values based on gender. For “Male” we replace missing values of Manpower with 29.75 and for “Female” with 25.
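A minimal pandas sketch of both imputation variants; the table below is hypothetical and will not reproduce the exact averages quoted above:
import numpy as np
import pandas as pd
# Hypothetical table with missing Manpower values
df = pd.DataFrame({"Gender": ["Male", "Male", "Female", "Male", "Female", "Male"],
                   "Manpower": [30, 28, 25, np.nan, np.nan, 31]})
# Generalized imputation: replace missing values with the overall mean of the variable
df["Manpower_general"] = df["Manpower"].fillna(df["Manpower"].mean())
# Similar case imputation: replace missing values with the mean of the matching gender group
df["Manpower_by_gender"] = df["Manpower"].fillna(
    df.groupby("Gender")["Manpower"].transform("mean"))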
3. Prediction Model: A prediction model is one of the more sophisticated methods for handling missing data. Here, we create a predictive model to estimate the values that will substitute the missing data. In this case, we divide our data set into two sets: one with no missing values for the variable and another with missing values. The first data set becomes the training data set of the model, while the second data set (with missing values) is the test data set, and the variable with missing values is treated as the target variable. Next, we create a model to predict the target variable based on the other attributes of the training data set and use it to populate the missing values of the test data set (see the sketch below). We can use regression, ANOVA, logistic regression and various other modeling techniques to do this. There are two drawbacks to this approach:
1. The model-estimated values are usually more well-behaved than the true values.
2. If there are no relationships between the other attributes in the data set and the attribute with missing values, the model will not be precise in estimating the missing values.
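A rough sketch of this model-based imputation with scikit-learn, assuming a hypothetical data set in which "Manpower" has missing values and a made-up "Sales" column is fully observed:
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
# Hypothetical data: "Manpower" has missing values, "Sales" is fully observed
df = pd.DataFrame({"Sales": [100, 120, 90, 110, 95, 105],
                   "Manpower": [30, 35, np.nan, 32, np.nan, 31]})
train = df[df["Manpower"].notna()]  # set with no missing values: training data for the model
test = df[df["Manpower"].isna()]    # set with missing values: the "test" data to be filled
model = LinearRegression().fit(train[["Sales"]], train["Manpower"])
df.loc[df["Manpower"].isna(), "Manpower"] = model.predict(test[["Sales"]])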
4. KNN Imputation: In this method of imputation, the missing values of an attribute are imputed using a given number of observations that are most similar to the observation whose value is missing. The similarity of two observations is determined using a distance function. The method has certain advantages and disadvantages (a sketch follows this list).
o Advantages:
k-nearest neighbour can predict both qualitative & quantitative attributes
Creation of a predictive model for each attribute with missing data is not required
Attributes with multiple missing values can be easily treated
The correlation structure of the data is taken into consideration
o Disadvantages:
The KNN algorithm is very time-consuming when analyzing a large database, because it searches through the entire dataset looking for the most similar instances.
The choice of k is critical. A higher value of k includes neighbours that are significantly different from the case at hand, whereas a lower value of k means significant neighbours may be missed.
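One readily available implementation of this idea is scikit-learn's KNNImputer, sketched below on a small hypothetical numeric matrix; it fills each missing entry from the k most similar observations:
import numpy as np
from sklearn.impute import KNNImputer
# Hypothetical numeric data with missing entries (np.nan)
X = np.array([[1.0, 2.0, np.nan],
              [3.0, 4.0, 3.0],
              [np.nan, 6.0, 5.0],
              [8.0, 8.0, 7.0]])
# Each missing value is replaced using the k nearest rows, with similarity measured by a distance function
imputer = KNNImputer(n_neighbors=2)
X_imputed = imputer.fit_transform(X)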
After dealing with missing values, the next task is to deal with outliers. Often, we tend to neglect outliers while building models. This is a discouraged practice: outliers tend to skew your data and reduce accuracy. Let's learn more about outlier treatment.
What is an Outlier?
Outlier is a term commonly used by analysts and data scientists, as outliers need close attention; otherwise they can result in wildly wrong estimations. Simply speaking, an outlier is an observation that appears far away from and diverges from the overall pattern in a sample.
Let's take an example: we do customer profiling and find out that the average annual income of customers is $0.8 million. But there are two customers with annual incomes of $4 million and $4.2 million. These two customers' annual incomes are much higher than the rest of the population's, so these two observations will be seen as outliers.
What are the types of Outliers?
Outliers can be of two types: univariate and multivariate. Above, we discussed an example of a univariate outlier; such outliers can be found when we look at the distribution of a single variable. Multivariate outliers are outliers in an n-dimensional space, and in order to find them you have to look at distributions in multiple dimensions.
Let us understand this with an example. Say we are examining the relationship between height and weight. Below, we have the univariate and bivariate distributions for height and weight. Take a look at the box plot: we do not have any outlier (beyond 1.5*IQR, the most common method). Now look at the scatter plot: here, we have two values below and one above the average in a specific segment of weight and height.
Whenever we come across outliers, the ideal way to tackle them is to find out the reason of having
these outliers. The method to deal with them would then depend on the reason of their occurrence.
Causes of outliers can be classified in two broad categories:
Data Entry Errors:- Human errors such as errors caused during data collection, recording, or
entry can cause outliers in data. For example: Annual income of a customer is $100,000.
Accidentally, the data entry operator puts an additional zero in the figure. Now the income
becomes $1,000,000 which is 10 times higher. Evidently, this will be the outlier value when
compared with rest of the population.
Measurement Error: It is the most common source of outliers. This is caused when the
measurement instrument used turns out to be faulty. For example: There are 10 weighing
machines. 9 of them are correct, 1 is faulty. Weight measured by people on the faulty machine
will be higher / lower than the rest of people in the group. The weights measured on faulty
machine can lead to outliers.
Experimental Error: Another cause of outliers is experimental error. For example: In a 100m
sprint of 7 runners, one runner missed out on concentrating on the ‘Go’ call which caused him
to start late. Hence, this caused the runner’s run time to be more than other runners. His total
run time can be an outlier.
Intentional Outlier: This is commonly found in self-reported measures that involve sensitive data. For example: teens typically under-report the amount of alcohol they consume, and only a fraction of them report the actual value. Here the actual values might look like outliers because the rest of the teens are under-reporting their consumption.
Data Processing Error: Whenever we perform data mining, we extract data from multiple
sources. It is possible that some manipulation or extraction errors may lead to outliers in the
dataset.
Sampling error: For instance, we have to measure the height of athletes. By mistake, we
include a few basketball players in the sample. This inclusion is likely to cause outliers in the
dataset.
Natural Outlier: When an outlier is not artificial (due to error), it is a natural outlier. For instance: in my last assignment with a renowned insurance company, I noticed that the performance of the top 50 financial advisors was far higher than the rest of the population. Surprisingly, it was not due to any error. Hence, whenever we performed any data mining activity with advisors, we treated this segment separately.
Outliers can drastically change the results of the data analysis and statistical modeling. There are
numerous unfavourable impacts of outliers in the data set:
They increase the error variance and reduce the power of statistical tests
If the outliers are non-randomly distributed, they can decrease normality
They can bias or influence estimates that may be of substantive interest
They can also violate the basic assumptions of regression, ANOVA and other statistical models
To understand the impact more deeply, let's take an example and check what happens to a data set with and without outliers.
Example:
As you can see, data set with outliers has significantly different mean and standard deviation. In the
first scenario, we will say that average is 5.45. But with the outlier, average soars to 30. This would
change the estimate completely.
The most commonly used method to detect outliers is visualization. We use various visualization methods, like box plot, histogram and scatter plot (above, we have used the box plot and scatter plot for visualization). Some analysts also use various thumb rules to detect outliers (a small sketch of the IQR and standard-deviation rules follows this list). Some of them are:
Any value beyond the range of Q1 - 1.5 x IQR to Q3 + 1.5 x IQR
Use capping methods: any value outside the range of the 5th and 95th percentiles can be considered an outlier
Data points three or more standard deviations away from the mean are considered outliers
Outlier detection is merely a special case of the examination of data for influential data points
and it also depends on the business understanding
Bivariate and multivariate outliers are typically measured using either an index of influence or
leverage, or distance. Popular indices such as Mahalanobis’ distance and Cook’s D are
frequently used to detect outliers.
In SAS, we can use PROC UNIVARIATE and PROC SGPLOT. To identify outliers and influential observations, we also look at statistical measures like STUDENT, COOKD, RSTUDENT and others.
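Here is a small Python sketch of the IQR and standard-deviation thumb rules mentioned above, applied to a hypothetical income sample:
import numpy as np
# Hypothetical sample of annual incomes (in $ millions)
income = np.array([0.5, 0.7, 0.8, 0.9, 1.0, 1.1, 4.0, 4.2])
# IQR rule: flag values outside Q1 - 1.5*IQR and Q3 + 1.5*IQR
q1, q3 = np.percentile(income, [25, 75])
iqr = q3 - q1
iqr_outliers = income[(income < q1 - 1.5 * iqr) | (income > q3 + 1.5 * iqr)]
# Standard deviation rule: flag values three or more standard deviations away from the mean
z = (income - income.mean()) / income.std()
sd_outliers = income[np.abs(z) >= 3]
print(iqr_outliers, sd_outliers)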
Most of the ways to deal with outliers are similar to the methods for missing values: deleting observations, transforming them, binning them, treating them as a separate group, imputing values and other statistical methods. Here, we will discuss the common techniques used to deal with outliers:
Deleting observations: We delete outlier values if they are due to data entry or data processing errors, or if the outlier observations are very small in number. We can also use trimming at both ends to remove outliers.
Transforming and binning values: Transforming variables can also eliminate outliers. Taking the natural log of a value reduces the variation caused by extreme values. Binning is also a form of variable transformation. The decision tree algorithm deals with outliers well because of its binning of variables. We can also use the process of assigning weights to different observations.
Imputing: As with imputation of missing values, we can also impute outliers, using mean, median or mode imputation methods. Before imputing values, we should analyse whether the outlier is natural or artificial. If it is artificial, we can go ahead with imputing values. We can also use a statistical model to predict the values of outlier observations and then impute them with the predicted values.
Treat separately: If there is a significant number of outliers, we should treat them separately in the statistical model. One approach is to treat them as two different groups, build an individual model for each group and then combine the outputs.
So far, we have learnt about the steps of data exploration, missing value treatment and techniques of outlier detection and treatment. These three stages will make your raw data better in terms of information availability and accuracy. Let's now proceed to the final stage of data exploration: feature engineering.
Feature engineering is the science (and art) of extracting more information from existing data. You are
not adding any new data here, but you are actually making the data you already have more useful.
For example, let’s say you are trying to predict foot fall in a shopping mall based on dates. If you try and
use the dates directly, you may not be able to extract meaningful insights from the data. This is because
the foot fall is less affected by the day of the month than it is by the day of the week. Now this information
about day of week is implicit in your data. You need to bring it out to make your model better.
This exercise of bringing out information from data is known as feature engineering.
You perform feature engineering once you have completed the first 5 steps in data exploration
– Variable Identification, Univariate, Bivariate Analysis, Missing Values Imputation and Outliers
Treatment. Feature engineering itself can be divided into two steps:
Variable transformation.
Variable / Feature creation.
These two techniques are vital in data exploration and have a remarkable impact on the power of prediction. Let's understand each of these steps in more detail.
In data modelling, transformation refers to the replacement of a variable by a function of that variable. For instance, replacing a variable x by its square / cube root or logarithm is a transformation. In other words, transformation is a process that changes the distribution or relationship of a variable with others.
A symmetric distribution is preferred over a skewed distribution as it is easier to interpret and generate inferences from. Some modeling techniques require normally distributed variables. So, whenever we have a skewed distribution, we can use transformations which reduce skewness. For a right-skewed distribution, we take the square / cube root or logarithm of the variable, and for a left-skewed distribution, we take the square / cube or exponential of the variable.
There are various methods used to transform variables. As discussed, some of them include square
root, cube root, logarithmic, binning, reciprocal and many others. Let’s look at these methods in
detail by highlighting the pros and cons of these transformation methods.
Logarithm: Taking the log of a variable is a common transformation method used to change the shape of its distribution. It is generally used for reducing the right skewness of variables. However, it cannot be applied to zero or negative values.
Square / Cube root: The square and cube root of a variable have a sound effect on its distribution. However, the effect is not as significant as that of the logarithmic transformation. Cube root has its own advantage: it can be applied to negative values as well as zero. Square root can be applied to positive values including zero.
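A minimal numpy sketch of these transformations on a hypothetical right-skewed variable:
import numpy as np
# Hypothetical right-skewed variable (e.g. annual income in $ thousands)
income = np.array([20.0, 25.0, 30.0, 35.0, 40.0, 60.0, 90.0, 400.0])
log_income = np.log(income)    # reduces right skewness; only valid for positive values
sqrt_income = np.sqrt(income)  # milder effect; valid for zero and positive values
cbrt_income = np.cbrt(income)  # cube root can also be applied to zero and negative values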
Feature / variable creation is the process of generating new variables / features based on existing variable(s). For example, say we have date (dd-mm-yy) as an input variable in a data set. We can generate new variables like day, month, year, week and weekday that may have a better relationship with the target variable. This step is used to highlight the hidden relationships in a variable, as sketched below.
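The date strings in this pandas sketch are made up purely for illustration:
import pandas as pd
# Hypothetical input variable: a date column in dd-mm-yy format
df = pd.DataFrame({"date": ["01-01-23", "15-02-23", "28-03-23"]})
df["date"] = pd.to_datetime(df["date"], format="%d-%m-%y")
# New variables derived from the date that may have a better relationship with the target
df["day"] = df["date"].dt.day
df["month"] = df["date"].dt.month
df["year"] = df["date"].dt.year
df["week"] = df["date"].dt.isocalendar().week
df["weekday"] = df["date"].dt.day_name()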
There are various techniques to create new features. Let’s look at the some of the commonly used
methods:
Creating derived variables: This refers to creating new variables from existing variable(s) using a set of functions or different methods. Let's look at it through the "Titanic – Kaggle competition". In this data set, the variable age has missing values. To predict the missing values, we used the salutation (Master, Mr, Miss, Mrs) from the name as a new variable. How do we decide which variable to create? Honestly, this depends on the business understanding of the analyst, their curiosity and the set of hypotheses they might have about the problem. Methods such as taking the log of variables, binning variables and other methods of variable transformation can also be used to create new variables.
Creating dummy variables: One of the most common applications of dummy variables is to convert a categorical variable into numerical variables. Dummy variables are also called indicator variables, and they are useful for taking a categorical variable as a predictor in statistical models. Each dummy variable takes the values 0 and 1. Let's take a variable 'gender'. We can produce two variables, namely "Var_Male" with values 1 (Male) and 0 (Not male), and "Var_Female" with values 1 (Female) and 0 (Not female). We can also create dummy variables for more than two classes of a categorical variable, with n or n-1 dummy variables, as sketched below.
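Here, the gender values are made up and pandas' get_dummies is used as one possible implementation:
import pandas as pd
df = pd.DataFrame({"gender": ["Male", "Female", "Female", "Male"]})
# n dummy variables, one indicator column per class
dummies = pd.get_dummies(df["gender"], prefix="Var")
# n-1 dummy variables (drop one class to avoid redundancy in models with an intercept)
dummies_n_minus_1 = pd.get_dummies(df["gender"], prefix="Var", drop_first=True)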
Machine Learning Algorithms
Broadly, there are 3 types of Machine Learning Algorithms
1. Supervised Learning
How it works: This algorithm consists of a target / outcome variable (or dependent variable) which is to be predicted from a given set of predictors (independent variables). Using this set of variables, we generate a function that maps inputs to desired outputs. The training process continues until the model achieves a desired level of accuracy on the training data. Examples of Supervised Learning: Regression, Decision Tree, Random Forest, KNN, Logistic Regression etc.
2. Unsupervised Learning
How it works: In this algorithm, we do not have any target or outcome variable to predict / estimate. It
is used for clustering population in different groups, which is widely used for segmenting customers in
different groups for specific intervention. Examples of Unsupervised Learning: Apriori algorithm, K-
means.
3. Reinforcement Learning:
How it works: Using this algorithm, the machine is trained to make specific decisions. It works this
way: the machine is exposed to an environment where it trains itself continually using trial and error.
This machine learns from past experience and tries to capture the best possible knowledge to make
accurate business decisions. Example of Reinforcement Learning: Markov Decision Process
Here is the list of commonly used machine learning algorithms. These algorithms can be applied to
almost any data problem:
1. Linear Regression
It is used to estimate real values (cost of houses, number of calls, total sales etc.) based on continuous variable(s). Here, we establish the relationship between independent and dependent variables by fitting the best line. This best-fit line is known as the regression line and is represented by the linear equation Y = a*X + b.
The best way to understand linear regression is to relive this experience of childhood. Let us say, you
ask a child in fifth grade to arrange people in his class by increasing order of weight, without asking
them their weights! What do you think the child will do? He / she would likely look (visually analyze) at
the height and build of people and arrange them using a combination of these visible parameters. This
is linear regression in real life! The child has actually figured out that height and build would be
correlated to the weight by a relationship, which looks like the equation above.
In this equation:
Y – Dependent Variable
a – Slope
X – Independent variable
b – Intercept
These coefficients a and b are derived by minimizing the sum of the squared differences between the data points and the regression line.
Look at the below example. Here we have identified the best fit line having linear
equation y=0.2811x+13.9. Now using this equation, we can find the weight, knowing the height of a
person.
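For instance, plugging a height of 170 into this equation gives a predicted weight of roughly 0.2811 * 170 + 13.9 ≈ 61.7.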
Linear Regression is mainly of two types: Simple Linear Regression and Multiple Linear Regression. Simple Linear Regression is characterized by one independent variable, and Multiple Linear Regression (as the name suggests) is characterized by multiple (more than 1) independent variables. While finding the best fit line, you can also fit a polynomial or curvilinear function; these are known as polynomial or curvilinear regression.
Python Code
#Import Library
from sklearn import linear_model
#Identify feature and response variable(s); values must be numeric and numpy arrays
x_train=input_variables_values_training_datasets
y_train=target_variables_values_training_datasets
x_test=input_variables_values_test_datasets
# Create linear regression object
linear = linear_model.LinearRegression()
# Train the model using the training sets and check score (R-squared on the training data)
linear.fit(x_train, y_train)
linear.score(x_train, y_train)
#Predict Output
predicted= linear.predict(x_test)
R Code
#Identify feature and response variable(s); values must be numeric
#x_train, y_train and x_test are assumed to exist
x <- cbind(x_train,y_train)
# Train the model using the training sets and check score
linear <- lm(y_train ~ ., data = x)
summary(linear)
#Predict Output
predicted= predict(linear,x_test)
2. Logistic Regression
Don't get confused by its name! It is a classification algorithm, not a regression algorithm. It is used to estimate discrete values (binary values like 0/1, yes/no, true/false) based on a given set of independent variable(s). In simple words, it predicts the probability of occurrence of an event by fitting data to a logit function. Hence, it is also known as logit regression. Since it predicts a probability, its output values lie between 0 and 1 (as expected).
Let's say your friend gives you a puzzle to solve. There are only 2 outcome scenarios – either you solve it or you don't. Now imagine that you are being given a wide range of puzzles / quizzes in an attempt to understand which subjects you are good at. The outcome of this study would be something like this: if you are given a trigonometry-based tenth-grade problem, you are 70% likely to solve it. On the other hand, if it is a fifth-grade history question, the probability of getting the answer right is only 30%. This is what Logistic Regression provides you.
Coming to the math, the log odds of the outcome is modeled as a linear combination of the predictor
variables.
odds = p / (1-p); ln(odds) = ln(p/(1-p)) = b0 + b1*X1 + b2*X2 + ... + bk*Xk
Above, p is the probability of presence of the characteristic of interest. The method chooses parameters that maximize the likelihood of observing the sample values, rather than parameters that minimize the sum of squared errors (as in ordinary regression).
Now, you may ask, why take a log? For the sake of simplicity, let's just say that this is one of the best mathematical ways to replicate a step function. I could go into more detail, but that would defeat the purpose of this article.
Python Code
#Import Library
from sklearn.linear_model import LogisticRegression
#Assumed you have X (predictor) and Y (target) for the training data set and x_test (predictor) of the test dataset
# Create logistic regression object
model = LogisticRegression()
# Train the model using the training sets and check score (mean accuracy on the training data)
model.fit(X, y)
model.score(X, y)
#Predict Output
predicted= model.predict(x_test)
R Code
x <- cbind(x_train,y_train)
# Train the model using the training sets and check score
logistic <- glm(y_train ~ ., data = x, family = "binomial")
summary(logistic)
#Predict Output
predicted= predict(logistic,x_test)
Furthermore, there are many different steps that could be tried to improve the model, such as including interaction terms, removing features, or applying regularization techniques.
3. Decision Tree
This is one of my favorite algorithms and I use it quite frequently. It is a type of supervised learning algorithm that is mostly used for classification problems. Surprisingly, it works for both categorical and continuous dependent variables. In this algorithm, we split the population into two or more homogeneous sets. This is done based on the most significant attributes / independent variables, to make the groups as distinct as possible. For more details, you can read: Decision Tree Simplified.
In the image above, you can see that the population is classified into four different groups based on multiple attributes to identify 'if they will play or not'. To split the population into groups that are as distinct from each other as possible, it uses various techniques like Gini, information gain, chi-square and entropy.
The best way to understand how a decision tree works is to play Jezzball – a classic game from Microsoft (image below). Essentially, you have a room with moving walls and you need to create walls such that the maximum area gets cleared off without the balls.
So, every time you split the room with a wall, you are trying to create two different populations within the same room. Decision trees work in a very similar fashion, by dividing a population into groups that are as different as possible.
Python Code
#Import Library
from sklearn import tree
#Assumed you have X (predictor) and Y (target) for the training data set and x_test (predictor) of the test dataset
# Create tree object; for classification you can set the criterion to 'gini' or 'entropy' (information gain). By default it is 'gini'.
model = tree.DecisionTreeClassifier(criterion='gini')
# Train the model using the training sets and check score
model.fit(X, y)
model.score(X, y)
#Predict Output
predicted= model.predict(x_test)
R Code
library(rpart)
x <- cbind(x_train,y_train)
# grow tree
fit <- rpart(y_train ~ ., data = x, method = "class")
summary(fit)
#Predict Output
predicted= predict(fit,x_test)
4. SVM (Support Vector Machine)
It is a classification method. In this algorithm, we plot each data item as a point in n-dimensional space (where n is the number of features you have), with the value of each feature being the value of a particular coordinate.
For example, if we only had two features like height and hair length of an individual, we'd first plot these two variables in a two-dimensional space where each point has two co-ordinates (the observations that end up closest to the separating boundary are known as support vectors).
Now, we will find a line that splits the data between the two differently classified groups of data. This will be the line for which the distance to the closest point in each of the two groups is largest.
In the example shown above, the line which splits the data into two differently classified groups is
the black line, since the two closest points are the farthest apart from the line. This line is our classifier.
Then, depending on where the testing data lands on either side of the line, that’s what class we can
classify the new data as.
Think of this algorithm as playing JezzBall in n-dimensional space. The tweaks in the game are:
You can draw lines / planes at any angles (rather than just horizontal or vertical as in classic
game)
The objective of the game is to segregate balls of different colors in different rooms.
And the balls are not moving.
Python Code
#Import Library
from sklearn import svm
#Assumed you have X (predictor) and Y (target) for the training data set and x_test (predictor) of the test dataset
# Create SVM classification object; there are various options associated with it, this is a simple one for classification
model = svm.SVC()
# Train the model using the training sets and check score
model.fit(X, y)
model.score(X, y)
#Predict Output
predicted= model.predict(x_test)
R Code
library(e1071)
x <- cbind(x_train,y_train)
# Fitting model
fit <-svm(y_train ~ ., data = x)
summary(fit)
#Predict Output
predicted= predict(fit,x_test)
5. Naive Bayes
A Naive Bayes model is easy to build and particularly useful for very large data sets. Along with simplicity, Naive Bayes is known to outperform even highly sophisticated classification methods.
Bayes' theorem provides a way of calculating the posterior probability P(c|x) from P(c), P(x) and P(x|c). Look at the equation below:
P(c|x) = P(x|c) * P(c) / P(x)
Here, P(c|x) is the posterior probability of class c given predictor x, P(c) is the prior probability of the class, P(x|c) is the likelihood (the probability of the predictor given the class) and P(x) is the prior probability of the predictor.
Example: Let's understand it using an example. Below I have a training data set of weather and the corresponding target variable 'Play'. Now, we need to classify whether players will play or not based on the weather condition. Let's follow the steps below to perform it.
Step 1: Convert the data set into a frequency table of weather condition versus 'Play'.
Step 2: Create a likelihood table by finding the probabilities of each weather condition and of playing.
Step 3: Now, use the Naive Bayes equation to calculate the posterior probability for each class. The class with the highest posterior probability is the outcome of the prediction.
We can solve this using the method discussed above: P(Yes | Sunny) = P(Sunny | Yes) * P(Yes) / P(Sunny).
Here we have P(Sunny | Yes) = 3/9 = 0.33, P(Sunny) = 5/14 = 0.36 and P(Yes) = 9/14 = 0.64.
Now, P(Yes | Sunny) = 0.33 * 0.64 / 0.36 = 0.60, which is the higher probability.
Naive Bayes uses a similar method to predict the probability of different classes based on various attributes. This algorithm is mostly used in text classification and in problems having multiple classes.
Python Code
#Import Library
from sklearn.naive_bayes import GaussianNB
#Assumed you have X (predictor) and Y (target) for the training data set and x_test (predictor) of the test dataset
# Create a Gaussian Naive Bayes classification object; there are other variants for other
# distributions, such as MultinomialNB and BernoulliNB
model = GaussianNB()
# Train the model using the training sets and check score
model.fit(X, y)
#Predict Output
predicted= model.predict(x_test)
R Code
library(e1071)
x <- cbind(x_train,y_train)
# Fitting model
fit <- naiveBayes(y_train ~ ., data = x)
summary(fit)
#Predict Output
predicted= predict(fit,x_test)
6. kNN (k-Nearest Neighbors)
It can be used for both classification and regression problems. However, it is more widely used in classification problems in industry. K nearest neighbors is a simple algorithm that stores all available cases and classifies new cases by a majority vote of its k neighbors. The case is assigned to the class that is most common amongst its K nearest neighbors, measured by a distance function.
These distance functions can be Euclidean, Manhattan, Minkowski and Hamming distance. The first three are used for continuous variables and the fourth (Hamming) for categorical variables. If K = 1, then the case is simply assigned to the class of its nearest neighbor. At times, choosing K turns out to be a challenge while performing kNN modeling.
KNN can easily be mapped to our real lives. If you want to learn about a person, of whom you have no
information, you might like to find out about his close friends and the circles he moves in and gain
access to his/her information!
Python Code
#Import Library
from sklearn.neighbors import KNeighborsClassifier
#Assumed you have X (predictor) and Y (target) for the training data set and x_test (predictor) of the test dataset
# Create a KNeighbors classifier object (the default value of n_neighbors is 5)
model = KNeighborsClassifier(n_neighbors=6)
# Train the model using the training sets and check score
model.fit(X, y)
#Predict Output
predicted= model.predict(x_test)
R Code
library(class)
# k-nearest neighbour classification from the 'class' package; it classifies the test
# cases directly, so there is no separate fit / predict step
#Predict Output
predicted <- knn(train = x_train, test = x_test, cl = y_train, k = 5)
summary(predicted)
7. K-Means
It is a type of unsupervised algorithm which solves the clustering problem. Its procedure follows a simple and easy way to classify a given data set through a certain number of clusters (assume k clusters). Data points inside a cluster are homogeneous with each other and heterogeneous with respect to other clusters.
Remember figuring out shapes from ink blots? K-means is somewhat similar to this activity. You look at the shape and spread to decipher how many different clusters / populations are present!
In K-means, we have clusters and each cluster has its own centroid. The sum of the squared differences between the centroid and the data points within a cluster constitutes the within-cluster sum of squares for that cluster. When the within-cluster sums of squares for all clusters are added together, the result is the total within-cluster sum of squares for the cluster solution.
We know that as the number of clusters increases, this value keeps decreasing, but if you plot the result you may see that the sum of squared distances decreases sharply up to some value of k, and then much more slowly after that. Here, we can find the optimum number of clusters.
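A minimal scikit-learn sketch of this "elbow" check on hypothetical random data; KMeans exposes the total within-cluster sum of squares as inertia_:
import numpy as np
from sklearn.cluster import KMeans
# Hypothetical two-dimensional data
X = np.random.rand(200, 2)
# Total within-cluster sum of squares for a range of k values; the "elbow" where the
# curve flattens suggests the optimum number of clusters
inertia = []
for k in range(1, 10):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertia.append(km.inertia_)
print(inertia)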
Python Code
#Import Library
from sklearn.cluster import KMeans
#Assumed you have X (attributes) for the training data set and x_test (attributes) of the test dataset
# Create a KMeans object with the chosen number of clusters
model = KMeans(n_clusters=3, random_state=0)
# Train the model using the training sets and check score
model.fit(X)
#Predict Output
predicted= model.predict(x_test)
R Code
library(cluster)
# 3 cluster solution; kmeans() comes from the base stats package
fit <- kmeans(X, 3)
8. Random Forest
Random Forest is a trademark term for an ensemble of decision trees. In Random Forest, we have a collection of decision trees (hence "forest"). To classify a new object based on its attributes, each tree gives a classification and we say the tree "votes" for that class. The forest chooses the classification having the most votes (over all the trees in the forest). Each tree is planted and grown as follows:
1. If the number of cases in the training set is N, then a sample of N cases is taken at random, but with replacement. This sample will be the training set for growing the tree.
2. If there are M input variables, a number m<<M is specified such that at each node, m variables are selected at random out of the M and the best split on these m is used to split the node. The value of m is held constant while the forest is grown.
3. Each tree is grown to the largest extent possible. There is no pruning.
Python Code
#Import Library
from sklearn.ensemble import RandomForestClassifier
#Assumed you have X (predictor) and Y (target) for the training data set and x_test (predictor) of the test dataset
# Create a Random Forest classifier object
model= RandomForestClassifier()
# Train the model using the training sets and check score
model.fit(X, y)
#Predict Output
predicted= model.predict(x_test)
R Code
library(randomForest)
x <- cbind(x_train,y_train)
# Fitting model
fit <- randomForest(y_train ~ ., data = x, ntree = 500)
summary(fit)
#Predict Output
predicted= predict(fit,x_test)
9. Dimensionality Reduction Algorithms
In the last 4-5 years, there has been an exponential increase in data capture at every possible stage. Corporates, government agencies and research organisations are not only coming up with new data sources, they are also capturing data in great detail.
For example: e-commerce companies are capturing more details about customers, like their demographics, web crawling history, what they like or dislike, purchase history, feedback and many others, to give them personalized attention, more so than your nearest grocery shopkeeper.
As data scientists, the data we are offered also consists of many features. This sounds good for building a robust model, but there is a challenge: how do you identify the highly significant variable(s) out of 1000 or 2000? In such cases, dimensionality reduction algorithms help us, along with various other approaches like decision trees, random forest, PCA, factor analysis, identification based on the correlation matrix, missing value ratio and others.
Python Code
#Import Library
from sklearn import decomposition
#Assumed you have training and test data set as train and test
# Create a PCA object (n_components defaults to keeping all components)
pca = decomposition.PCA()
# For Factor analysis use:
#fa= decomposition.FactorAnalysis()
# Reduce the dimension of the training dataset using PCA, then apply the same transformation to test
train_reduced = pca.fit_transform(train)
test_reduced = pca.transform(test)
R Code
library(stats)
pca <- princomp(train, cor = TRUE)
10.1. GBM
GBM is a boosting algorithm used when we deal with plenty of data and want to make a prediction with high predictive power. Boosting is an ensemble learning technique which combines the predictions of several base estimators in order to improve robustness over a single estimator. It combines multiple weak or average predictors to build a strong predictor. These boosting algorithms always work well in data science competitions like Kaggle, AV Hackathon and CrowdAnalytix.
Python Code
#Import Library
from sklearn.ensemble import GradientBoostingClassifier
#Assumed you have X (predictor) and Y (target) for the training data set and x_test (predictor) of the test dataset
# Create a Gradient Boosting classifier object
model = GradientBoostingClassifier(n_estimators=100, learning_rate=1.0, max_depth=1, random_state=0)
# Train the model using the training sets and check score
model.fit(X, y)
#Predict Output
predicted= model.predict(x_test)
R Code
library(caret)
x <- cbind(x_train,y_train)
# Fitting model with repeated cross-validation via caret
fitControl <- trainControl(method = "repeatedcv", number = 4, repeats = 4)
fit <- train(y_train ~ ., data = x, method = "gbm", trControl = fitControl, verbose = FALSE)
predicted <- predict(fit, x_test)
10.2. XGBoost
Another classic gradient boosting algorithm that’s known to be the decisive choice between winning
and losing in some Kaggle competitions.
XGBoost has immensely high predictive power, which makes it a strong choice for accuracy, as it possesses both a linear model and the tree learning algorithm; it is also almost 10x faster than existing gradient boosting techniques.
The support includes various objective functions, including regression, classification and ranking.
One of the most interesting things about the XGBoost is that it is also called a regularized boosting
technique. This helps to reduce overfit modelling and has a massive support for a range of languages
such as Scala, Java, R, Python, Julia and C++.
It supports distributed training on many machines, including GCE, AWS, Azure and Yarn clusters. XGBoost can also be integrated with Spark, Flink and other cloud dataflow systems, with built-in cross validation at each iteration of the boosting process.
Python Code:
#Import Library
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
#Assumed the data set has been loaded into a numpy array called dataset
X = dataset[:,0:10]
Y = dataset[:,10:]
seed = 1
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.33, random_state=seed)
model = XGBClassifier()
model.fit(X_train, y_train)
#Predict Output
y_pred = model.predict(X_test)
R Code:
require(caret)
x <- cbind(x_train,y_train)
# Fitting model (caret's xgbLinear or xgbTree methods)
TrainControl <- trainControl(method = "repeatedcv", number = 10, repeats = 4)
model <- train(y_train ~ ., data = x, method = "xgbLinear", trControl = TrainControl, verbose = FALSE)
OR
model <- train(y_train ~ ., data = x, method = "xgbTree", trControl = TrainControl, verbose = FALSE)
predicted <- predict(model, x_test)
10.3. LightGBM
LightGBM is a gradient boosting framework that uses tree-based learning algorithms. It is designed to be distributed and efficient, with advantages such as faster training speed, lower memory usage, better accuracy, support for parallel and GPU learning, and the ability to handle large-scale data.
Since LightGBM is based on decision tree algorithms, it splits the tree leaf-wise with the best fit, whereas other boosting algorithms split the tree depth-wise or level-wise rather than leaf-wise. So, when growing on the same leaf, the leaf-wise algorithm can reduce more loss than the level-wise algorithm and hence results in much better accuracy, which can rarely be achieved by any of the existing boosting algorithms.
Python Code:
import lightgbm as lgb
#Assumed: numpy arrays data (features) and label (binary target) already exist,
#and a 'test.svm' validation file is available on disk
train_data = lgb.Dataset(data, label=label)
test_data = train_data.create_valid('test.svm')
param = {'num_leaves': 31, 'objective': 'binary'}
param['metric'] = 'auc'
num_round = 10
bst = lgb.train(param, train_data, num_round, valid_sets=[test_data])
bst.save_model('model.txt')
ypred = bst.predict(data)
R Code:
library(RLightGBM)
data(example.binary)
#Parameters: handle.data, handle.booster, num_iterations and y are assumed to have been
#created beforehand with the package's data / booster constructors and a config list
lgbm.data.setField(handle.data, "label", y)
lgbm.booster.train(handle.booster, num_iterations, 5)
#Predict and test accuracy on held-out data (omitted here)
If you’re familiar with the Caret package in R, this is another way of implementing the LightGBM.
require(caret)
require(RLightGBM)
data(iris)
model <- caretModel.LGBM()
fit <- train(iris[, 1:4], iris$Species, method = model, verbosity = 0)
print(fit)
#Sparse-matrix variant: model.sparse and mat are assumed to come from the package's
#sparse model wrapper and the Matrix package respectively
library(Matrix)
fit <- train(data.frame(idx = 1:nrow(iris)), iris$Species, method = model.sparse, matrix = mat, verbosity = 0)
print(fit)
10.4. Catboost
CatBoost is a recently open-sourced machine learning algorithm from Yandex. It can easily integrate
with deep learning frameworks like Google’s TensorFlow and Apple’s Core ML.
The best part about CatBoost is that it does not require extensive data training like other ML models, and it can work on a variety of data formats, without undermining how robust it can be.
Make sure you handle missing data well before you proceed with the implementation.
Catboost can automatically deal with categorical variables without showing the type conversion error,
which helps you to focus on tuning your model better rather than sorting out trivial errors.
Python Code:
import pandas as pd
import numpy as np
from catboost import CatBoostRegressor
from sklearn.model_selection import train_test_split
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")
#Imputing missing values for both train and test
train.fillna(-999, inplace=True)
test.fillna(-999,inplace=True)
#Creating a training set for modeling and validation set to check model performance
X = train.drop(['Item_Outlet_Sales'], axis=1)
y = train.Item_Outlet_Sales
X_train, X_validation, y_train, y_validation = train_test_split(X, y, train_size=0.7, random_state=1234)
#Indices of the categorical features, passed to CatBoost so it handles them natively
categorical_features_indices = np.where(X.dtypes != float)[0]
#Building and training the model
model = CatBoostRegressor(iterations=50, depth=3, learning_rate=0.1, loss_function='RMSE')
model.fit(X_train, y_train, cat_features=categorical_features_indices,
          eval_set=(X_validation, y_validation))
submission = pd.DataFrame()
submission['Item_Identifier'] = test['Item_Identifier']
submission['Outlet_Identifier'] = test['Outlet_Identifier']
submission['Item_Outlet_Sales'] = model.predict(test)
R Code:
set.seed(1)
require(titanic)
require(caret)
require(catboost)
tt <- titanic::titanic_train[complete.cases(titanic::titanic_train),]
data <- as.data.frame(as.matrix(tt), stringsAsFactors = TRUE)
drop_columns <- c("PassengerId", "Survived", "Name", "Ticket", "Cabin")
x <- data[, !(names(data) %in% drop_columns)]
y <- data[, "Survived"]
fit_control <- trainControl(method = "cv", number = 4, classProbs = TRUE)
#catboost.caret is the caret method object shipped with the catboost R package
report <- train(x, as.factor(make.names(y)), method = catboost.caret, trControl = fit_control)
print(report)
importance <- varImp(report, scale = FALSE)
print(importance)
Additional Reading
Artificial intelligence (AI), deep learning, and neural networks
Artificial intelligence (AI), deep learning, and neural networks represent incredibly exciting and powerful
machine learning-based techniques used to solve many real-world problems.
While human-like deductive reasoning, inference, and decision-making by a computer is still a long time
away, there have been remarkable gains in the application of AI techniques and associated algorithms.
The primary motivation and driving force for these areas of study, and for developing these techniques
further, is that the solutions required to solve certain problems are incredibly complicated, not well
understood, nor easy to determine manually.
Increasingly, we rely on these techniques and machine learning to solve these problems for us, without
requiring explicit programming instructions. This is critical for two reasons. The first is that we likely
wouldn’t be able, or at least know how to write the programs required to model and solve many problems
that AI techniques are able to solve. Second, even if we did know how to write the programs, they would
be inordinately complex and nearly impossible to get right.
Luckily for us, machine learning and AI algorithms, along with properly selected and prepared training
data, are able to do this for us.
Intelligence can be generally described as the ability to perceive information, and retain it as knowledge
to be applied towards adaptive behaviors within an environment or context.
While there are many different definitions of intelligence, they all essentially involve learning,
understanding, and the application of the knowledge learned to achieve one or more goals.
It’s therefore a natural extension to say that AI can be described as intelligence exhibited by machines.
So what does that mean exactly, when is it useful, and how does it work?
A familiar instance of an AI solution is IBM's Watson, which was made famous by beating the two greatest Jeopardy champions in history, and is now being used as a question answering computing system for commercial applications. Apple's Siri and Amazon's Alexa are similar examples as well.
In addition to speech recognition and natural language (processing, generation, and understanding)
applications, AI is also used for other recognition tasks (pattern, text, audio, image, video, facial, …),
autonomous vehicles, medical diagnoses, gaming, search engines, spam filtering, crime fighting,
marketing, robotics, remote sensing, computer vision, transportation, music recognition, classification,
and so on.
Something worth mentioning is a concept known as the AI effect. This describes the case where once an AI application has become somewhat mainstream, it's no longer considered by many as AI. It happens because people tend to no longer think of the solution as involving real intelligence, but only as an application of normal computing.
This despite the fact that these applications still fit the definition of AI regardless of widespread usage.
The key takeaway here is that today’s AI is not necessarily tomorrow’s AI, at least not in some people’s
minds anyway.
There are many different goals of AI as mentioned, with different techniques used for each. The primary
topics of this article are artificial neural networks and an advanced version known as deep learning.
The human brain is exceptionally complex and quite literally the most powerful computing machine
known.
The inner workings of the human brain are often modeled around the concept of neurons and the networks of neurons known as biological neural networks. According to Wikipedia, it's estimated that the human brain contains roughly 100 billion neurons, which are connected along pathways throughout these networks.
At a very high level, neurons interact and communicate with one another through an interface consisting
of axon terminals that are connected to dendrites across a gap (synapse) as shown here.
In plain English, a single neuron will pass a message to another neuron across this interface if the sum
of weighted input signals from one or more neurons (summation) into it is great enough (exceeds
a threshold) to cause the message transmission. This is called activation when the threshold is exceeded
and the message is passed along to the next neuron.
The summation process can be mathematically complex. Each neuron’s input signal is actually
a weighted combination of potentially many input signals, and the weighting of each input means that that
input can have a different influence on any subsequent calculations, and ultimately on the final output
of the entire network.
In addition, each neuron applies a function or transformation to the weighted inputs, which means that
the combined weighted input signal is transformed mathematically prior to evaluating if the activation
threshold has been exceeded. This combination of weighted input signals and the functions applied are
typically either linear or nonlinear.
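As a purely illustrative sketch (not a biological model), the weighted summation and threshold activation described above can be written in a few lines of Python; the input values, weights and threshold are made up:
import numpy as np
def neuron_fires(inputs, weights, threshold):
    # Summation of the weighted input signals
    weighted_sum = np.dot(inputs, weights)
    # Activation: pass the message along only if the threshold is exceeded
    return weighted_sum > threshold
# Hypothetical input signals, weights and threshold for a single neuron
print(neuron_fires(np.array([0.5, 0.9, 0.1]), np.array([0.8, 0.6, 0.4]), threshold=0.7))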
These input signals can originate in many ways, with our senses being some of the most important, as
well as ingestion of gases (breathing), liquids (drinking), and solids (eating) for example. A single neuron
may receive hundreds of thousands of input signals at once that undergo the summation process to
determine if the message gets passed along, and ultimately causes the brain to instruct actions,
memory recollection, and so on.
The ‘thinking’ or processing that our brain carries out, and the subsequent instructions given to our
muscles, organs, and body are the result of these neural networks in action. In addition, the brain’s
neural networks continuously change and update themselves in many ways, including modifications to
the amount of weighting applied between neurons. This happens as a direct result of learning and
experience.
Given this, it’s a natural assumption that for a computing machine to replicate the brain’s functionality
and capabilities, including being ‘intelligent’, it must successfully implement a computer-based or
artificial version of this network of neurons.
This is the genesis of the advanced statistical technique and term known as artificial neural networks.
Artificial neural networks (ANNs) are statistical models directly inspired by, and partially modeled on
biological neural networks. They are capable of modeling and processing nonlinear relationships
between inputs and outputs in parallel. The related algorithms are part of the broader field of machine
learning, and can be used in many applications as discussed.
Artificial neural networks are characterized by containing adaptive weights along paths between neurons
that can be tuned by a learning algorithm that learns from observed data in order to improve the model.
In addition to the learning algorithm itself, one must choose an appropriate cost function.
The cost function is what’s used to learn the optimal solution to the problem being solved. This involves
determining the best values for all of the tunable model parameters, with neuron path adaptive weights
being the primary target, along with algorithm tuning parameters such as the learning rate. It’s usually
done through optimization techniques such as gradient descent or stochastic gradient descent.
These optimization techniques basically try to make the ANN solution be as close as possible to the
optimal solution, which when successful means that the ANN is able to solve the intended problem with
high performance.
Architecturally, an artificial neural network is modeled using layers of artificial neurons, or computational
units able to receive input and apply an activation function along with a threshold to determine if
messages are passed along.
In a simple model, the first layer is the input layer, followed by one hidden layer, and lastly by
an output layer. Each layer can contain one or more neurons.
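A toy numpy sketch of such a model, with made-up weights and a single hidden layer, just to make the layered forward pass concrete:
import numpy as np
def forward(x, w_hidden, w_output):
    # Input layer -> hidden layer: weighted inputs passed through a nonlinear activation
    hidden = np.tanh(x @ w_hidden)
    # Hidden layer -> output layer: sigmoid activation on the weighted hidden outputs
    return 1 / (1 + np.exp(-(hidden @ w_output)))
# Hypothetical weights: 3 inputs, 4 hidden neurons, 1 output neuron
rng = np.random.default_rng(0)
w_hidden = rng.normal(size=(3, 4))
w_output = rng.normal(size=(4, 1))
print(forward(np.array([[0.2, 0.7, 0.1]]), w_hidden, w_output))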
Models can become increasingly complex, and with increased abstraction and problem solving
capabilities by increasing the number of hidden layers, the number of neurons in any given layer, and/or
the number of paths between neurons. Note that an increased chance of overfitting can also occur with
increased model complexity.
Model architecture and tuning are therefore major components of ANN techniques, in addition to the
actual learning algorithms themselves. All of these characteristics of an ANN can have significant impact
on the performance of the model.
Additionally, models are characterized and tunable by the activation function used to convert a neuron’s
weighted input to its output activation. There are many different types of transformations that can be
used as the activation function, and a discussion of them is out of scope for this article.
The abstraction of the output as a result of the transformations of input data through neurons and layers
is a form of distributed representation, as contrasted with local representation. The meaning represented
by a single artificial neuron for example is a form of local representation. The meaning of the entire
network however, is a form of distributed representation due to the many transformations across
neurons and layers.
One thing worth noting is that while ANNs are extremely powerful, they can also be very complex and
are considered black box algorithms, which means that their inner-workings are very difficult to
understand and explain. Choosing whether to employ ANNs to solve problems should therefore be
chosen with that in mind.
Deep learning, while sounding flashy, is really just a term to describe certain types of neural networks
and related algorithms that consume often very raw input data. They process this data through many
layers of nonlinear transformations of the input data in order to calculate a target output.
Unsupervised feature extraction is also an area where deep learning excels. Feature extraction is when
an algorithm is able to automatically derive or construct meaningful features of the data to be used for
further learning, generalization, and understanding. The burden is traditionally on the data scientist or
programmer to carry out the feature extraction process in most other machine learning approaches,
along with feature selection and engineering.
Feature extraction usually involves some amount of dimensionality reduction as well, which is reducing the number of input features and the amount of data required to generate meaningful results. This has many benefits, including simplification, reduced computational and memory requirements, and so on.
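As one hedged example of dimensionality reduction (just one of many possible techniques, and not one this article prescribes), principal component analysis can compress a wide set of input features into a handful of components; the scikit-learn call and the synthetic data below are assumptions made for the sketch.

import numpy as np
from sklearn.decomposition import PCA          # assumes scikit-learn is installed

rng = np.random.default_rng(seed=0)
X = rng.normal(size=(200, 20))                 # 200 samples with 20 original features

pca = PCA(n_components=5)                      # keep 5 components for the example
X_reduced = pca.fit_transform(X)               # project the data onto those components

print(X_reduced.shape)                         # (200, 5): fewer features for later learning
print(pca.explained_variance_ratio_.sum())     # fraction of the original variance retained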
More generally, deep learning falls under the group of techniques known as feature
learning or representation learning. As discussed so far, feature extraction is used to ‘learn’ which
features to focus on and use in machine learning solutions. The machine learning algorithms
themselves ‘learn’ the optimal parameters to create the best performing model.
Paraphrasing Wikipedia, feature learning algorithms allow a machine to both learn for a specific task
using a well-suited set of features, and also learn the features themselves. In other words, these
algorithms learn how to learn!
Deep learning has been used successfully in many applications, and is considered to be one of the
most cutting-edge machine learning and AI techniques at the time of this writing. The associated
algorithms are often used for supervised, unsupervised, and semi-supervised learning problems.
For neural network-based deep learning models, the number of layers is greater than in so-called shallow learning algorithms. Shallow algorithms tend to be less complex and require more up-front knowledge of optimal features to use, which typically involves feature selection and engineering.
In contrast, deep learning algorithms rely more on optimal model selection and optimization through model tuning. They are better suited to solving problems where prior knowledge of features is less desired or necessary, and where labeled data is unavailable or not required for the primary use case.
In addition to statistical techniques, neural networks and deep learning leverage concepts and
techniques from signal processing as well, including nonlinear processing and/or transformations.
You may recall that a nonlinear function is one that is not characterized simply by a straight line. It
therefore requires more than just a slope to model the relationship between the input, or independent
variable, and the output, or dependent variable. Nonlinear functions can include polynomial, logarithmic,
and exponential terms, as well as any other transformation that isn’t linear.
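To ground the distinction, the short sketch below contrasts a purely linear relationship (fully described by a slope and intercept) with polynomial, logarithmic, and exponential transformations of the same input; the coefficients are arbitrary values chosen only for illustration.

import numpy as np

x = np.linspace(1.0, 5.0, 5)                   # sample inputs (kept positive for the log)

linear      = 2.0 * x + 1.0                    # straight line: slope and intercept only
polynomial  = 0.5 * x ** 2 + x                 # curvature: the slope changes with x
logarithmic = 3.0 * np.log(x)                  # fast growth early, flattening later
exponential = np.exp(0.8 * x)                  # slow start, then explosive growth

for name, y in [("linear", linear), ("polynomial", polynomial),
                ("logarithmic", logarithmic), ("exponential", exponential)]:
    print(name, np.round(y, 2))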
Many phenomena observed in the physical universe are actually best modeled with nonlinear
transformations. This is true as well for transformations between inputs and the target output in machine
learning and AI solutions.
As mentioned, input data is transformed throughout the layers of a deep learning neural network by
artificial neurons or processing units. The chain of transformations that occur from input to output is
known as the credit assignment path, or CAP.
The CAP value is a proxy for the measurement or concept of ‘depth’ in a deep learning model
architecture. According to Wikipedia, most researchers in the field agree that deep learning has
multiple nonlinear layers with a CAP greater than two, and some consider a CAP greater than ten to
be very deep learning.
While a detailed discussion of the many different deep learning model architectures and learning algorithms is beyond the scope of this article, some of the more notable ones include the following (a minimal sketch of one of the simpler entries, a multi-layer perceptron, follows the list):
Feed-forward neural networks
Recurrent neural networks
Multi-layer perceptrons (MLP)
Convolutional neural networks
Recursive neural networks
Deep belief networks
Convolutional deep belief networks
Self-organizing maps
Deep Boltzmann machines
Stacked de-noising auto-encoders
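As promised above, here is a hedged Keras sketch of a small multi-layer perceptron; the layer sizes, activations, optimizer, and loss are illustrative assumptions (it also assumes TensorFlow is installed), not recommendations from this article.

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

# A small multi-layer perceptron for a binary classification task with 10 input features.
model = Sequential([
    Dense(32, activation="relu", input_shape=(10,)),    # first hidden layer
    Dense(16, activation="relu"),                       # second hidden layer
    Dense(1, activation="sigmoid"),                     # output layer for a binary target
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()                                         # prints layer structure and parameter counts

Calling model.fit(X, y) on training data would then run the optimization loop described earlier.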
It’s worth pointing out that due to the relative increase in complexity, deep learning and neural network
algorithms can be prone to overfitting. In addition, increased model and algorithmic complexity can
result in very significant computational resource and time requirements.
It’s also important to consider that solutions may represent local minima as opposed to a global optimal
solution. This is due to the complex nature of these models when combined with optimization techniques
such as gradient descent.
Given all of this, proper care must be taken when leveraging artificial intelligence algorithms to solve
problems, including the selection, implementation, and performance assessment of algorithms
themselves. While out of scope for this article, the field of machine learning includes many techniques
that can help with these areas.
Cognitive Computing
The goal of cognitive computing is to simulate human thought processes in a computerized model.
Using self-learning algorithms that use data mining, pattern recognition and natural language
processing, the computer can mimic the way the human brain works.
While computers have been faster at calculations and processing than humans for decades, they
haven’t been able to accomplish tasks that humans take for granted as simple, like understanding
natural language, or recognizing unique objects in an image.
Some people say that cognitive computing represents the third era of computing: we went from
computers that could tabulate sums (1900s) to programmable systems (1950s), and now to cognitive
systems.
These cognitive systems, most notably IBM Watson, rely on deep learning algorithms and neural networks to process information by comparing it to a teaching set of data. The more data the system is exposed to, the more it learns, and the more accurate it becomes over time. The neural network itself forms a complex “tree” of decisions the computer can make to arrive at an answer.
For example, according to a TED Talk video from IBM, Watson could eventually be applied in a
healthcare setting to help collate the span of knowledge around a condition, including patient history,
journal articles, best practices, diagnostic tools, etc., analyze that vast quantity of information, and
provide a recommendation.
The doctor is then able to look at evidence-based treatment options based on a large number of factors
including the individual patient’s presentation and history, to hopefully make better treatment decisions.
In other words, the goal (at this point) is not to replace the doctor, but expand the doctor’s capabilities
by processing the humongous amount of data available that no human could reasonably process and
retain, and provide a summary and potential application.
This sort of process could be done for any field in which large quantities of complex data need to be
processed and analyzed to solve problems, including finance, law, and education.
These systems will also be applied in other areas of business including consumer behavior analysis,
personal shopping bots, customer support bots, travel agents, tutors, security, and diagnostics. Hilton
Hotels recently debuted the first concierge robot, Connie, which can answer questions about the hotel,
local attractions, and restaurants posed to it in natural language.
The personal digital assistants we have on our phones and computers now (Siri and Google among others) are not true cognitive systems; they have a pre-programmed set of responses and can only respond to a preset number of requests. But the time is coming when we will be able to address our phones, our computers, our cars, or our smart houses and get a real, thoughtful response rather than a pre-programmed one.
As computers become more able to think like human beings, they will also expand our capabilities and
knowledge. Just as the heroes of science fiction movies rely on their computers to make accurate
predictions, gather data, and draw conclusions, so we will move into an era when computers can
augment human knowledge and ingenuity in entirely new ways.