Feature generation
Feature generation is also known as feature construction, feature extraction or feature engineering. There are different interpretations of these terms. Some nearly equivalent, yet differing definitions are: construction of features from raw data, creating a mapping to convert original features to new features, and creating new features from one or multiple features.
Two goals of feature generation can be dimensionality reduction and accuracy improvement. When the goal of a feature generation method is dimensionality reduction, the result will be a feature space that contains fewer features than the original feature space. However, when the goal is accuracy improvement, the resulting feature space will most likely contain more features than the original feature space.
We are primarily interested in feature generation methods where the goal is to improve the accuracy of the predictor. Dimensionality reduction does not have a high priority, since the results of feature generation are input to a feature selection phase that aims to reduce the dimensionality of the feature space. Even though the feature generation phase does not have to reduce the dimensionality, it certainly has to take care not to generate an extreme number of new features.
For example, take as features the price and quality of a product. Separately, they will not give much indication of whether a product is purchased often. Combined, they have a high correlation with the purchase of the product. If the price is low and the quality high, then the product will be purchased often. However, a low price or a high quality without knowing the other value cannot guarantee that the product will be purchased often. If both price and quality are low, then the product will not be purchased by many customers. The same can be said when both price and quality are high.
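As a rough illustration of what such a generated feature could look like, the sketch below combines a hypothetical price column and quality column into a single ratio feature; the column names, values and combining rule are assumptions made for this example only.

```python
import pandas as pd

# Hypothetical product data: price and quality alone say little about purchases.
products = pd.DataFrame({
    "price":   [10.0, 80.0, 12.0, 75.0],
    "quality": [9.0,  8.5,  2.0,  3.0],
})

# Generated feature: quality per unit of price. A high value captures the
# "low price, high quality" combination described above in a single column.
products["quality_per_price"] = products["quality"] / products["price"]

print(products)
```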
Feature selection
Feature selection tries to find the optimal subset of a given feature set. The problem of feature selection is essentially equivalent to the problem of finding the optimal subset of a given set, which has been shown to be NP-hard. There is one simple method to find the optimal subset, namely to calculate the evaluation score for each possible subset. The advantage of this method is that it will always find the optimal subset. A disadvantage is that the complexity is O(2^n), where n is the number of features. This means that for 10 features, already 1024 subsets have to be evaluated. This disadvantage is even more significant because feature selection becomes more useful the more features are involved.
Since evaluating every subset separately is practically infeasible, there is a need for a different, smarter way to decide the usefulness of each subset.
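A minimal sketch of the exhaustive approach makes the combinatorial explosion concrete; the scoring function below is a made-up assumption used only for illustration.

```python
from itertools import combinations

def exhaustive_selection(features, score):
    """Evaluate every non-empty subset and return the best one: O(2^n) subsets."""
    best_subset, best_score = None, float("-inf")
    for k in range(1, len(features) + 1):
        for subset in combinations(features, k):
            s = score(subset)
            if s > best_score:
                best_subset, best_score = subset, s
    return best_subset, best_score

# Toy scoring rule (an assumption): prefer small subsets that contain "price".
features = ["price", "quality", "brand", "color"]
score = lambda subset: ("price" in subset) - 0.1 * len(subset)
print(exhaustive_selection(features, score))  # 2^4 - 1 = 15 subsets evaluated
```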
If the original feature set contains a thousand features, it is highly likely that not all features positively influence the dependent variable. For example, in text mining it is common practice to remove the stop words (the, in, a, of, as, and, with, and many more). Although stop word removal can be seen as a part of the data cleaning step, it can also be seen as a separate feature selection step. Consider every word to be a feature; then the stop words are features that do not contribute to the prediction of the dependent value.
Feature irrelevance is the problem that some features are simply not correlated to the dependent feature. These features can even have a negative effect on the performance of a model.
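As a small, hedged illustration of stop word removal acting as feature selection, the sketch below uses scikit-learn's CountVectorizer; the example sentences are made up for this purpose.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the price of the product is low",
        "the quality of the product is high"]

# Every word is a feature; dropping English stop words removes features
# (the, of, is, ...) that carry no predictive signal.
vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)
print(vectorizer.get_feature_names_out())  # e.g. ['high' 'low' 'price' 'product' 'quality']
```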
Top reasons to use feature selection are:
It enables the machine learning algorithm to train faster.
It reduces the complexity of a model and makes it easier to interpret.
It improves the accuracy of a model if the right subset is chosen.
It reduces overfitting.
What is Customer Retention?
Customer retention refers to the activities and actions companies and organizations take to reduce the number of customer defections. The goal of customer retention programs is to help companies retain as many customers as possible, often through customer loyalty and brand loyalty initiatives. It is important to remember that customer retention begins with the first contact a customer has with a company and continues throughout the entire lifetime of the relationship.
Most companies today are creating predictive churn models to understand what causes customers to unsubscribe and how those customers might be better retained. A churn model is simply a predictive algorithm that identifies the likelihood of a customer churning. The type of algorithm used, such as regression or random forest, will depend on the data available, as well as the company’s business model and product offering.
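A minimal sketch of what such a churn model could look like with scikit-learn; the column names and the data are hypothetical and only illustrate the idea, not any particular company's model.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Hypothetical customer data: usage and tenure with a churned/not-churned label.
data = pd.DataFrame({
    "monthly_spend": [20, 75, 15, 90, 30, 60],
    "tenure_months": [3, 24, 2, 36, 5, 18],
    "support_calls": [4, 0, 5, 1, 3, 1],
    "churned":       [1, 0, 1, 0, 1, 0],
})

X, y = data.drop(columns="churned"), data["churned"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)

# Random forest is one of the algorithm choices mentioned above.
model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
print(model.predict_proba(X_test)[:, 1])  # estimated likelihood of churn per customer
```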
3 Steps to retain Customers are:
FIRST, YOU NEED A LOT OF CLEAN DATA.
SECOND, YOU MUST CREATE A CLOSED-LOOP PROCESS.
THIRD, YOU MUST ACT ON YOUR FINDINGS.
Role of Domain Expert in Data Science
The passionate debate nowadays is not whether Data Scientists can deliver business solutions, but rather whether Domain Experts play a major role in the delivery of such solutions. On one end of the spectrum, thought leaders seem to feel that Domain Experts must be involved at every stage of the design, development and implementation of a Machine Learning system. On the other end of the spectrum, KDnuggets and Kaggle have repeatedly proved that expert solutions can be built and tested for performance without the intervention of Domain Experts. The ideal or desirable position is somewhere in between Data Science vs. Domain Expertise.
Domain experts know their business
In cases where vast amounts of transactional or process data are stored in Word files or Excel worksheets, Domain Experts must be available to interpret the esoteric operational or procedural information, so that Data Scientists can gain a better understanding of business processes within particular domains. Without this deep knowledge or insight into domain operations, Data Scientists cannot provide custom solutions tailored for highly specific business tasks or operational decisions.
Data visualization and working together
Another important function that Domain Experts can play is during data visualization, when data is seen and interpreted for accurate insights. An example of this was found during the study of sensor and maintenance data in an airline fleet. Although no prior model existed, an interpretive analysis of the results of path analysis led to improved understanding of aircraft safety conditions, which would not have been possible without sound domain expertise.
Two safe conclusions can be drawn from the above discussions on the cross-functional significance of both Data Science and Domain Expertise in developing robust solutions:
So long as Machine Learning equips Data Scientists to ask relevant questions about a domain, the direct collaboration between Data Scientists and Domain Experts will not only enrich both parties with new knowledge, but also strengthen the value of their partnership. It is not Data Science vs. Domain Expertise, but Data Science and Domain Expertise.
Machine Learning offers an alternative mode of learning that requires no prior domain knowledge, thus easily overcoming domain biases.
Data Scientists with strong Machine Learning skills and an analytical mind can quickly grasp and solve business problems by exchanging and sharing their acquired domain learning with Domain Experts at different stages of system development. The problem isn’t an either/or issue, but rather requires both parties to come to the table.
The issue for Data Scientists, then, remains one of proper skill training in advanced technologies. As far back as 2011-2012 it was discussed that what the industry needed was not more Data Scientists, but Data Scientists with access to advanced data technology skills such as Big Data and Machine Learning. A survey demonstrated that most businesses do not have the skilled manpower to take advantage of cutting-edge data technologies; this problem still exists today for many enterprises, even after the upsurge in Data Scientists entering the workplace, and it is not likely to go away anytime soon. Thus, modern Data Scientists have to become more tech savvy and serve as moderators between technologies like Hadoop, NoSQL, and R, and deliver timely data-rich information and insights to business leaders. The Domain Experts can aid in the visualization and explanation of the insights, but the Data Scientists also need the ability and training to provide them in a comprehensible manner.
The collaborative strength of Data Science and Domain Expertise
Finally, the undisputed fact is that Domain Experts run the daily business; so if Data Scientists succeed in providing an advanced, data-enabled decision machine to these business experts when they need it and where they need it, then the Data Scientists have proved their worth. The ideal solution may be to create templates for standard data inputs for data capture, connect the data tools for seamless analytics activities, and provide excellent visualization platforms like dashboards for quick and effective decision making. These template-driven solutions can equip Domain Experts to directly input the necessary data and arrive at results on their own.
When Domain Experts have ready-made Machine Learning systems at their disposal, they can select any standard domain-specific analytics package available in the market to study the data trends and patterns and gain hidden insights. The Domain Expert’s greatest strength is the ability to identify which questions need to be answered, and the Data Scientist’s role is to maneuver and leverage advanced data technologies to build expert systems that answer those questions.
Feature Selection Algorithms
Various methodologies and techniques can be used to subset your feature space and help your models perform better and more efficiently.
Filter Methods
Set of all features → Selecting the best subset → Learning algorithm → Performance
Filter methods are generally used as a preprocessing step. The selection of features is independent of any machine learning algorithm. Instead, features are selected on the basis of their scores in various statistical tests for their correlation with the outcome variable. Correlation is a subjective term here. For basic guidance, you can refer to the following table for choosing correlation coefficients.
Feature \ Response    Continuous                Categorical
Continuous            Pearson's Correlation     LDA
Categorical           ANOVA                     Chi-Square
Pearson’s Correlation: It is used as a measure for quantifying linear dependence between two continuous variables X and Y. Its value varies from -1 to +1. Pearson’s correlation is given as:
ρ_{X,Y} = cov(X, Y) / (σ_X σ_Y)
LDA: Linear Discriminant Analysis is used to find a linear combination of features that characterizes or separates two or more classes (or levels) of a categorical variable.
ANOVA: ANOVA stands for Analysis of Variance. It is similar to LDA except for the fact that it is operated using one or more categorical independent features and one continuous dependent feature. It provides a statistical test of whether the means of several groups are equal or not.
Chi-Square: It is a statistical test applied to groups of categorical features to evaluate the likelihood of correlation or association between them using their frequency distribution.
One thing that should be kept in mind is that filter methods do not remove multicollinearity, so we must deal with multicollinearity of features as well before training models on our data.
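A hedged sketch of how these filter scores could be applied in practice with scikit-learn and SciPy; the synthetic data and the choice of k are assumptions made for illustration.

```python
from scipy.stats import pearsonr
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic data: 10 features, only a few of which carry signal.
X, y = make_classification(n_samples=200, n_features=10, n_informative=3, random_state=0)

# Pearson's correlation between one continuous feature and the (0/1) response.
r, p_value = pearsonr(X[:, 0], y)
print(f"feature 0: r={r:.2f}, p={p_value:.3f}")

# ANOVA F-test scores used as a filter: keep the 3 highest-scoring features.
selector = SelectKBest(score_func=f_classif, k=3).fit(X, y)
print("selected feature indices:", selector.get_support(indices=True))
```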
Wrappers
Set of all features → Selecting the best subset → Learning algorithm → Performance (with the learning algorithm's performance fed back into the subset selection)
In wrapper methods, we try to use a subset of features and train a model using them. Based on the inferences that we draw from the previous model, we decide to add or remove features from the subset. The problem is essentially reduced to a search problem. These methods are usually computationally very expensive.
Some common examples of wrapper methods are forward feature selection, backward feature elimination, recursive feature elimination, etc.
Forward Selection: Forward selection is an iterative method in which we start with no features in the model. In each iteration, we keep adding the feature which best improves our model, until the addition of a new variable does not improve the performance of the model.
Backward Elimination: In backward elimination, we start with all the features and remove the least significant feature at each iteration, which improves the performance of the model. We repeat this until no improvement is observed on removal of features.
Recursive Feature Elimination: It is a greedy optimization algorithm which aims to find the best performing feature subset. It repeatedly creates models and keeps aside the best or the worst performing feature at each iteration. It constructs the next model with the remaining features until all the features are exhausted. It then ranks the features based on the order of their elimination.
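A small sketch of recursive feature elimination using scikit-learn's RFE wrapper; the synthetic data and the number of features to keep are assumptions, not prescriptions.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=8, n_informative=3, random_state=0)

# Wrapper method: repeatedly fit the model and drop the weakest feature
# until only the requested number of features remains.
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=3)
rfe.fit(X, y)

print("kept features:", rfe.support_)   # boolean mask of selected features
print("ranking:", rfe.ranking_)         # 1 = selected, higher = eliminated earlier
```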
One of the best ways of implementing feature selection with wrapper methods is to use the Boruta package, which finds the importance of a feature by creating shadow features.
It works in the following steps:
Firstly, it adds randomness to the given data set by creating shuffled copies of all features (which are called shadow features).
Then, it trains a random forest classifier on the extended data set and applies a feature importance measure (the default is Mean Decrease Accuracy) to evaluate the importance of each feature, where higher means more important.
At every iteration, it checks whether a real feature has a higher importance than the best of its shadow features (i.e., whether the feature has a higher Z-score than the maximum Z-score of its shadow features) and constantly removes features which are deemed highly unimportant.
Finally, the algorithm stops either when all features get confirmed or rejected, or when it reaches a specified limit of random forest runs.
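A hedged sketch using the BorutaPy implementation of this algorithm; it assumes the boruta package is installed and compatible with your numpy version, and the synthetic data and parameter choices are illustrative only.

```python
import numpy as np
from boruta import BorutaPy
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=10, n_informative=4, random_state=0)

# Boruta compares each real feature against its shuffled "shadow" copies
# using a random forest's feature importances.
rf = RandomForestClassifier(n_jobs=-1, max_depth=5, random_state=0)
boruta = BorutaPy(estimator=rf, n_estimators="auto", random_state=0)
boruta.fit(X, y)  # BorutaPy expects numpy arrays

print("confirmed features:", np.where(boruta.support_)[0])
print("tentative features:", np.where(boruta.support_weak_)[0])
```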
Decision Trees
A decision tree is a widely used, effective, non-parametric machine learning modeling technique for regression and classification problems. To find solutions, a decision tree makes sequential, hierarchical decisions about the outcome variable based on the predictor data.
Hierarchical means the model is defined by a series of questions that lead to a class label or a value when applied to any observation. Once set up, the model acts like a protocol in a series of "if this occurs, then this occurs" conditions that produce a specific result from the input data.
A non-parametric method means that there are no underlying assumptions about the distribution of the errors or the data. It basically means that the model is constructed based on the observed data.
Decision tree models where the target variable uses a discrete set of values are classified as Classification Trees. In these trees, each node or leaf represents a class label while the branches represent conjunctions of features leading to class labels. A decision tree where the target variable takes continuous values, usually numbers, is called a Regression Tree. The two types are commonly referred to together as CART (Classification and Regression Trees).
Each CART model is a case of a Directed Acyclic Graph. These graphs have nodes representing decision points about the main variable given the predictors, and edges are the connections between the nodes.
As the goal of a decision tree is to make the optimal choice at each node, it needs an algorithm that is capable of doing just that. That algorithm is known as Hunt's algorithm, which is both greedy and recursive. Greedy means that at each step it makes the most optimal decision, and recursive means it splits the larger question into smaller questions and resolves them the same way. The decision to split at each node is made according to a metric called purity. A node is 100% impure when its data is split evenly 50/50 and 100% pure when all of its data belongs to a single class.
In order to optimize our model we need to reach maximum purity and avoid impurity. To measure this we use the Gini impurity, which measures how often a randomly chosen element would be labeled incorrectly if it were randomly labeled according to the distribution of labels in the node. It is calculated by summing, over all classes, the probability p_i of an item with label i being chosen multiplied by the probability (1 - p_i) of miscategorizing that item: Gini = Σ_i p_i (1 - p_i). Our goal is to have it reach 0, where the node is minimally impure and maximally pure, falling into one category.
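A minimal sketch of the Gini impurity computation described above; the example label counts are made up.

```python
from collections import Counter

def gini_impurity(labels):
    """Gini = sum over classes of p_i * (1 - p_i), using class proportions at a node."""
    counts = Counter(labels)
    n = len(labels)
    return sum((c / n) * (1 - c / n) for c in counts.values())

print(gini_impurity(["yes"] * 5 + ["no"] * 5))  # 0.5 -> maximally impure 50/50 split
print(gini_impurity(["yes"] * 10))              # 0.0 -> pure node, single class
```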
tee
While this is a great model it does present a large problem by resulting in a model that only stops when all
the information is in a single class or attribute. At the expense of bias the variance for this model is
‘massive and will definitely lead to over fitting. “Decision-tree learners can create over-complex trees that
do not generalize well from the training data.” So how do web combat this. We can either set a maximum
depth of the decision tree ie. how many nodes deep it will go and/or an altemative is to specify a
‘minimum number of data points needed to make a split each decision,
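A short sketch, assuming scikit-learn's decision tree implementation, of the two pruning controls just mentioned; the parameter values are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=6, random_state=0)

# Limit how deep the tree can grow and how small a node may be before it splits,
# trading a little bias for a large reduction in variance (overfitting).
tree = DecisionTreeClassifier(max_depth=4, min_samples_split=20, random_state=0)
tree.fit(X, y)
print("depth:", tree.get_depth(), "leaves:", tree.get_n_leaves())
```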
What other disadvantages does a Decision Tree have? It is locally optimized using a greedy algorithm, so we cannot guarantee it returns the globally optimal decision tree. It is also an incredibly biased model if a single class dominates, unless the dataset is balanced before putting it into a tree.
While there are disadvantages, there are many advantages to Decision Trees. They are incredibly simple to understand due to their visual representation, they require very little data, they can handle qualitative and quantitative data, they can be validated using statistical tests, they can handle large amounts of data and they are quite computationally inexpensive.
Random Forests
Random Forest is a flexible, easy to use machine learning algorithm that produces a great result most of the time, even without hyperparameter tuning. It is also one of the most used algorithms because of its simplicity and the fact that it can be used for both classification and regression tasks.
How it works
Random Forest is a supervised learning algorithm. As you can already see from its name, it creates a forest and makes it somehow random. The forest it builds is an ensemble of Decision Trees, most of the time trained with the "bagging" method. The general idea of the bagging method is that a combination of learning models increases the overall result.
To say it in simple words: Random forest builds multiple decision trees and merges them together to get a more accurate and stable prediction.
One big advantage of random forest is that it can be used for both classification and regression problems, which form the majority of current machine learning systems. I will talk about random forest in classification, since classification is sometimes considered the building block of machine learning.
Below you can see how a random forest with two trees might look:
[Figure: two decision trees, each producing a class prediction for a feature vector; the per-tree predictions are combined into the forest's final prediction.]
With a few exceptions, a random-forest classifier has all the hyperparameters of a decision-tree classifier and also all the hyperparameters of a bagging classifier, to control the ensemble itself. Instead of building a bagging classifier and passing it a decision-tree classifier, you can just use the random-forest classifier class, which is more convenient and optimized for decision trees. Note that there is also a random-forest regressor for regression tasks.
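A brief sketch, assuming scikit-learn, of the equivalence described above: a bagging classifier wrapped around decision trees versus the dedicated random-forest class; the parameter values are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=8, random_state=0)

# Bagging classifier built "by hand" around decision trees...
bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100, random_state=0)

# ...versus the dedicated random-forest class, which is optimized for trees
# and exposes both the tree and the ensemble hyperparameters directly.
forest = RandomForestClassifier(n_estimators=100, random_state=0)

print(bagging.fit(X, y).score(X, y), forest.fit(X, y).score(X, y))
```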
The random-forest algorithm brings extra randomness into the model when it is growing the trees. Instead of searching for the best feature while splitting a node, it searches for the best feature among a random subset of features. This process creates a wide diversity, which generally results in a better model. Therefore, when you are growing a tree in a random forest, only a random subset of the features is considered for splitting a node. You can even make trees more random by additionally using random thresholds for each feature, rather than searching for the best possible thresholds (like a normal decision tree does).
Another great quality of the random forest algorithm is that it is very easy to measure the relative importance of each feature on the prediction.
By looking at the feature importance you can decide which features you may want to drop because they contribute little or nothing to the prediction process. This is important because a general rule in machine learning is that the more features you have, the more likely your model will suffer from overfitting, and vice versa.
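A short sketch, assuming scikit-learn, of reading these feature importances off a fitted forest; the data is synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=6, n_informative=3, random_state=0)

forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Relative importance of each feature; low-importance features are
# candidates for removal before retraining.
for i, importance in enumerate(forest.feature_importances_):
    print(f"feature {i}: {importance:.3f}")
```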