FDS Unit V
Feature generation

Feature generation is also known as feature construction, feature extraction or feature engineering. There are different interpretations of the terms feature generation, construction, extraction and engineering. Some nearly equivalent, yet differing definitions for these terms are: construction of features from raw data, creating a mapping to convert original features to new features, and creating new features from one or multiple existing features.

Two goals of feature generation can be dimensionality reduction and accuracy improvement. When the goal of a feature generation method is dimensionality reduction, the result will be a feature space that contains fewer features than the original feature space. However, when the goal is accuracy improvement, the resulting feature space will most likely contain more features than the original feature space. We are primarily interested in feature generation methods where the goal is to improve the accuracy of the predictor. Dimensionality reduction does not have a high priority, since the results of feature generation are input to a feature selection phase that aims to reduce the dimensionality of the feature space. Even though the feature generation phase does not have to reduce the dimensionality, it certainly has to take care not to generate an extreme number of new features.

For example, take as features the price and quality of a product. Separately, they will not give much indication of whether a product is purchased often. Combined, they have a high correlation with the purchase of the product. If the price is low and the quality high, then the product will be purchased often. However, a low price or a high quality without knowing the other value cannot guarantee that the product will be purchased often. If both price and quality are low, then the product will not be purchased by many customers. The same can be said when both price and quality are high.

Feature selection

Feature selection tries to find the optimal subset of a given feature set. The problem of feature selection is essentially equivalent to the problem of finding the optimal subset of a given set, which has been shown to be NP-hard. There is one simple method to find the optimal subset, namely to calculate the evaluation score for each possible subset. The advantage of this method is that it will always find the optimal subset. A disadvantage is that the complexity is O(2^n), where n is the number of features. This means that for 10 features, already 1024 subsets have to be evaluated. This disadvantage is even more influential because feature selection becomes more useful the more features are involved. Since evaluating every subset separately is practically infeasible, there is a need for a different, smarter way to decide the usefulness of each subset.

If the original feature set contains a thousand features, it is highly likely that not all features positively influence the dependent variable. For example, in text mining it is common practice to remove the stop words (the, in, a, of, as, and, with, and many more). Although stop word removal can be seen as part of the data cleaning step, it can also be seen as a separate feature selection step. Consider every word to be a feature; then the stop words are features that do not contribute to the prediction of the dependent value. A minimal sketch of this idea is shown below.
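The following is a minimal sketch of the stop-word example, assuming scikit-learn is available (the text does not name a specific library); the toy documents are made up purely for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus: every distinct word becomes a feature (a column in the matrix).
docs = [
    "the price of the product is low",
    "a product of high quality",
    "the quality and the price are both low",
]

# Without stop-word removal, words like "the", "of" and "a" become features
# even though they carry no information about the dependent variable.
vec_all = CountVectorizer()
vec_all.fit(docs)
print(len(vec_all.get_feature_names_out()), "features with stop words")

# With stop-word removal, those uninformative features are dropped up front,
# which is effectively a simple feature selection step.
vec_filtered = CountVectorizer(stop_words="english")
vec_filtered.fit(docs)
print(len(vec_filtered.get_feature_names_out()), "features without stop words")
```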
Feature irrelevance is the problem that some features are simply not correlated with the dependent feature. These features can even have a negative effect on the performance of a model. The top reasons to use feature selection are:
- It enables the machine learning algorithm to train faster.
- It reduces the complexity of a model and makes it easier to interpret.
- It improves the accuracy of a model if the right subset is chosen.
- It reduces overfitting.

What is Customer Retention?

Customer retention refers to the activities and actions companies and organizations take to reduce the number of customer defections. The goal of customer retention programs is to help companies retain as many customers as possible, often through customer loyalty and brand loyalty initiatives. It is important to remember that customer retention begins with the first contact a customer has with a company and continues throughout the entire lifetime of the relationship.

Most companies today are creating predictive churn models to understand what causes customers to unsubscribe and how those customers might be better retained. A churn model is simply a predictive algorithm that identifies the likelihood of a customer to churn. The type of algorithm used, whether regression or random forest, will depend on the data available, as well as the company's business model and product offering (a minimal sketch of such a model is given at the end of this section).

The three steps to retain customers are: first, you need a lot of clean data; second, you must create a closed-loop process; third, you must act on your findings.

Role of Domain Expert

The passionate debate nowadays is not whether Data Scientists can deliver business solutions, but rather whether Domain Experts play a major role in the delivery of such solutions. On one end of the spectrum, thought leaders seem to feel that Domain Experts must be involved at every stage of the design, development and implementation of a Machine Learning system. On the other end of the spectrum, KDnuggets and Kaggle have repeatedly proved that expert solutions can be built and tested for performance without the intervention of Domain Experts. The ideal or desirable position is somewhere in between.

Data Science vs. Domain Expertise

Domain experts know their business. In cases where vast amounts of transactional or process data are stored in Word files or Excel worksheets, Domain Experts must be available to interpret the esoteric operational or procedural information, so that Data Scientists can gain a better understanding of business processes within particular domains. Without this deep knowledge or insight into domain operations, Data Scientists cannot provide custom solutions tailored for highly specific business tasks or operational decisions.

Data visualization and working together

Another important function that Domain Experts can play is during data visualization, when data is seen and interpreted for accurate insights. An example of this was found during the study of sensor and maintenance data in an airline fleet. Although no prior model existed, an interpretive analysis of the results of path analysis led to an improved understanding of aircraft safety conditions, which would not have been possible without sound domain expertise.
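Returning to the churn model mentioned earlier, here is a minimal sketch assuming scikit-learn and pandas are available; the file churn.csv and its column names (tenure, monthly_charges, support_calls, churned) are hypothetical placeholders, not a real dataset.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# Hypothetical customer table; churn.csv and its columns are placeholders.
df = pd.read_csv("churn.csv")
X = df[["tenure", "monthly_charges", "support_calls"]]  # predictor features
y = df["churned"]                                        # 1 = customer left, 0 = retained

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# A random forest is one of the algorithm choices mentioned in the text.
model = RandomForestClassifier(n_estimators=200, random_state=42)
model.fit(X_train, y_train)

# Score each customer with a churn probability, which the retention team can act on.
churn_probability = model.predict_proba(X_test)[:, 1]
print("ROC AUC:", roc_auc_score(y_test, churn_probability))
```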
Two safe conclusions can be drawn from the above discussion of the cross-functional significance of both Data Science and Domain Expertise in developing robust solutions:

So long as Machine Learning equips Data Scientists to ask relevant questions about a domain, the direct collaboration between Data Scientists and Domain Experts will not only enrich both parties with new knowledge, but also strengthen the value of their partnership. It is not Data Science vs. Domain Expertise, but Data Science and Domain Expertise.

Machine Learning offers an alternative mode of learning that requires no prior domain knowledge, thus easily overcoming domain biases. Data Scientists with strong Machine Learning skills and an analytical mind can quickly grasp and solve business problems by exchanging and sharing their acquired domain learning with Domain Experts at different stages of system development. The problem is not an either/or issue, but rather requires both parties to come to the table.

The issue for Data Scientists then remains one of proper skill training in advanced technologies. As far back as 2011-2012 it was discussed that what the industry needed was not more Data Scientists, but Data Scientists with access to advanced data technology skills such as Big Data and Machine Learning. A survey demonstrated that most businesses do not have the skilled manpower to take advantage of cutting-edge data technologies; this problem still exists today for many enterprises, even after the upsurge in Data Scientists entering the workplace, and it is not likely to go away anytime soon. Thus, modern Data Scientists have to become more tech-savvy and serve as moderators between technologies like Hadoop, NoSQL, and R, and deliver timely, data-rich information and insights to business leaders. The Domain Experts can aid in the visualization and explanation of the insights, but the Data Scientists also need the ability and training to provide them in a comprehensible manner.

The collaborative strength of Data Science and Domain Expertise

Finally, the undisputed fact is that Domain Experts run the daily business; so if Data Scientists succeed in providing an advanced, data-enabled decision machine to these business experts when they need it and where they need it, then the Data Scientists have proved their worth. The ideal solution may be to create templates for standard data inputs for data capture, connect the data tools for seamless analytics activities and provide excellent visualization platforms like dashboards for quick and effective decision making. These template-driven solutions can equip Domain Experts to directly input the necessary data and arrive at results on their own.

When Domain Experts have ready-made Machine Learning systems at their disposal, they can select any standard domain-specific analytics package available in the market to study the data trends and patterns and gain hidden insights. The Domain Expert's greatest strength is the ability to identify which questions need to be answered, and the Data Scientist's role is to maneuver and leverage advanced data technologies to build expert systems to answer those questions.

Feature Selection Algorithms

There are various methodologies and techniques that you can use to subset your feature space and help your models perform better and more efficiently.

Filter Methods

(Diagram: set of all features → selecting the best subset → learning algorithm → performance.)

Filter methods are generally used as a preprocessing step; a minimal sketch follows this paragraph.
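As a concrete illustration of the filter pipeline, here is a minimal sketch assuming scikit-learn (not named in the text), using its built-in breast cancer dataset as a stand-in for real data; the statistical scoring criteria it relies on are discussed next.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif

# Small example dataset with 30 numeric features.
X, y = load_breast_cancer(return_X_y=True)

# Score every feature with the ANOVA F-test against the outcome variable
# and keep only the 10 highest-scoring features. No model is trained here,
# which is what makes this a filter (preprocessing) method.
selector = SelectKBest(score_func=f_classif, k=10)
X_selected = selector.fit_transform(X, y)

print("original shape:", X.shape)         # (569, 30)
print("selected shape:", X_selected.shape)  # (569, 10)
```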
The selection of features is independent of any machine learning algorithm. Instead, features are selected on the basis of their scores in various statistical tests for their correlation with the outcome variable. Correlation is a subjective term here. For basic guidance, you can refer to the following table when choosing a correlation measure:

Feature \ Response | Continuous             | Categorical
Continuous         | Pearson's correlation  | LDA
Categorical        | ANOVA                  | Chi-Square

Pearson's Correlation: It is used as a measure for quantifying the linear dependence between two continuous variables X and Y. Its value varies from -1 to +1. Pearson's correlation is given as:

ρ(X, Y) = cov(X, Y) / (σ_X σ_Y)

LDA: Linear Discriminant Analysis is used to find a linear combination of features that characterizes or separates two or more classes (or levels) of a categorical variable.

ANOVA: ANOVA stands for Analysis of Variance. It is similar to LDA except for the fact that it is operated using one or more categorical independent features and one continuous dependent feature. It provides a statistical test of whether the means of several groups are equal or not.

Chi-Square: It is a statistical test applied to groups of categorical features to evaluate the likelihood of correlation or association between them using their frequency distribution.

One thing that should be kept in mind is that filter methods do not remove multicollinearity, so we must deal with multicollinearity of the features as well before training models on the data.

Wrappers

(Diagram: set of all features → select a subset → learning algorithm → evaluate performance, with the result fed back into the subset selection.)

In wrapper methods, we try to use a subset of features and train a model using them. Based on the inferences that we draw from the previous model, we decide to add or remove features from the subset. The problem is essentially reduced to a search problem. These methods are usually computationally very expensive. Some common examples of wrapper methods are forward feature selection, backward feature elimination, and recursive feature elimination.

Forward Selection: Forward selection is an iterative method in which we start with no features in the model. In each iteration, we keep adding the feature which best improves our model, until the addition of a new variable no longer improves the performance of the model.

Backward Elimination: In backward elimination, we start with all the features and remove the least significant feature at each iteration, which improves the performance of the model. We repeat this until no improvement is observed on removal of features.

Recursive Feature Elimination: It is a greedy optimization algorithm which aims to find the best performing feature subset. It repeatedly creates models and keeps aside the best or the worst performing feature at each iteration. It constructs the next model with the remaining features until all the features are exhausted. It then ranks the features based on the order of their elimination.

One of the best ways of implementing feature selection with wrapper methods is to use the Boruta package, which finds the importance of a feature by creating shadow features. It works in the following steps: Firstly, it adds randomness to the given data set by creating shuffled copies of all features (which are called shadow features).
Then, it trains a random forest classifier on the extended data set and applies a feature importance measure (the default is Mean Decrease Accuracy) to evaluate the importance of each feature, where higher means more important.

At every iteration, it checks whether a real feature has a higher importance than the best of its shadow features (i.e., whether the feature has a higher Z-score than the maximum Z-score of its shadow features) and constantly removes features which are deemed highly unimportant.

Finally, the algorithm stops either when all features get confirmed or rejected, or when it reaches a specified limit of random forest runs.

Decision Trees

A decision tree is a widely used, effective, non-parametric machine learning modeling technique for regression and classification problems. To find solutions, a decision tree makes sequential, hierarchical decisions about the outcome variable based on the predictor data. Hierarchical means the model is defined by a series of questions that lead to a class label or a value when applied to any observation. Once set up, the model acts like a protocol, a series of "if this occurs then this occurs" conditions that produce a specific result from the input data. A non-parametric method means that there are no underlying assumptions about the distribution of the errors or the data; it basically means that the model is constructed based only on the observed data.

Decision tree models where the target variable takes a discrete set of values are called Classification Trees. In these trees, each leaf represents a class label while the branches represent conjunctions of features leading to those class labels. A decision tree where the target variable takes a continuous value, usually numbers, is called a Regression Tree. The two types are commonly referred to together as CART (Classification And Regression Trees). Each CART model is a case of a Directed Acyclic Graph; these graphs have nodes representing decision points about the main variable given the predictors, and edges are the connections between the nodes.

As the goal of a decision tree is to make the optimal choice at the end of each node, it needs an algorithm that is capable of doing just that. That algorithm is known as Hunt's algorithm, which is both greedy and recursive: greedy meaning that at each step it makes the most optimal decision, and recursive meaning it splits the larger question into smaller questions and resolves them the same way. The decision to split at each node is made according to a metric called purity. A node is 100% impure when its data is split evenly 50/50, and 100% pure when all of its data belongs to a single class. In order to optimize our model we need to reach maximum purity and avoid impurity. To measure this we use the Gini impurity, which measures how often a randomly chosen element would be labeled incorrectly if it were labeled randomly according to the class distribution. It is calculated by summing, over all classes i, the probability p_i of choosing an item with label i multiplied by the probability (1 - p_i) of miscategorizing that item:

Gini = Σ_i p_i (1 - p_i)

Our goal is to have it reach 0, where the node is minimally impure and maximally pure, with all items falling into one category. The other metric used is information gain, which is used to decide which feature to split on at each step in the tree.

While this is a great model, it does present a large problem by resulting in a model that only stops when all the information is in a single class or attribute.
At the expense of bias, the variance of this model is massive and will definitely lead to overfitting: "Decision-tree learners can create over-complex trees that do not generalize well from the training data." So how do we combat this? We can either set a maximum depth of the decision tree, i.e. how many nodes deep it will go, and/or specify a minimum number of data points needed to make a split at each decision.

What other disadvantages does a Decision Tree have? It is locally optimized using a greedy algorithm, so we cannot guarantee that it returns the globally optimal decision tree. It is an incredibly biased model if a single class dominates, unless the dataset is balanced before putting it into the tree. While there are disadvantages, there are many advantages to Decision Trees: they are incredibly simple to understand due to their visual representation, they require very little data preparation, they can handle qualitative and quantitative data, they can be validated using statistical tests, they can handle large amounts of data and they are quite computationally inexpensive.

Random Forests

Random Forest is a flexible, easy to use machine learning algorithm that produces, even without hyper-parameter tuning, a great result most of the time. It is also one of the most used algorithms because of its simplicity and the fact that it can be used for both classification and regression tasks.

How it works

Random Forest is a supervised learning algorithm. As you can already see from its name, it creates a forest and makes it somehow random. The forest it builds is an ensemble of Decision Trees, most of the time trained with the "bagging" method. The general idea of the bagging method is that a combination of learning models increases the overall result.

To say it in simple words: Random forest builds multiple decision trees and merges them together to get a more accurate and stable prediction.

One big advantage of random forest is that it can be used for both classification and regression problems, which form the majority of current machine learning systems. I will talk about random forest in classification, since classification is sometimes considered the building block of machine learning. Below you can see what a random forest with two trees looks like.

(Diagram: an input feature vector is passed down two decision trees; the class probabilities predicted by the individual trees are averaged to produce the forest's final prediction.)

With a few exceptions, a random-forest classifier has all the hyper-parameters of a decision-tree classifier and also all the hyper-parameters of a bagging classifier, to control the ensemble itself. Instead of building a bagging classifier and passing it a decision-tree classifier, you can just use the random-forest classifier class, which is more convenient and optimized for decision trees. Note that there is also a random-forest regressor for regression tasks.

The random-forest algorithm brings extra randomness into the model when it is growing the trees. Instead of searching for the best feature while splitting a node, it searches for the best feature among a random subset of features. This process creates a wide diversity, which generally results in a better model. Therefore, when you are growing a tree in a random forest, only a random subset of the features is considered for splitting a node.
You can even make trees more random by using random thresholds for each feature on top of that, rather than searching for the best possible thresholds (like a normal decision tree does).

Another great quality of the random forest algorithm is that it is very easy to measure the relative importance of each feature for the prediction. By looking at the feature importances you can decide which features you may want to drop, because they contribute too little or nothing to the prediction process. This is important because a general rule in machine learning is that the more features you have, the more likely your model is to suffer from overfitting, and vice versa. A minimal sketch of this is shown below.
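The following is a minimal sketch of the feature-importance idea, assuming scikit-learn is available; the iris dataset is used purely as a stand-in for real data.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

# Small example dataset (a stand-in for real data).
data = load_iris()
X, y = data.data, data.target

# Train a random forest; each tree sees a bootstrap sample ("bagging")
# and a random subset of features at every split.
forest = RandomForestClassifier(n_estimators=200, random_state=42)
forest.fit(X, y)

# Relative importance of each feature for the prediction.
# Low-importance features are candidates to drop to reduce overfitting.
for name, importance in zip(data.feature_names, forest.feature_importances_):
    print(f"{name}: {importance:.3f}")
```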
