
No Free Hunch
(http://blog.kaggle.com/)
Approaching (Almost) Any Machine Learning Problem | Abhishek Thakur
Kaggle Team (http://blog.kaggle.com/author/kaggleteam/) | 07.21.2016

Abhishek Thakur (https://www.kaggle.com/abhishek), a Kaggle Grandmaster, originally published this post here (https://www.linkedin.com/pulse/approaching-almost-any-machine-learning-problem-abhishek-thakur) on July 18th, 2016 and kindly gave us permission to cross-post on No Free Hunch.

An average data scientist deals with loads of data daily. Some say 60-70% of a data scientist's time is spent cleaning and munging data into a format suitable for machine learning models. This post focuses on the second part, i.e., applying machine learning models, including the preprocessing steps. The pipelines discussed in this post are the result of the hundred-plus machine learning competitions that I've taken part in. The discussion here is quite general but very useful; far more sophisticated methods also exist and are practised by professionals.

We will be using python!

Data
Before applying machine learning models, the data must be converted to a tabular form. This whole process is the most time-consuming and difficult step and is depicted in the figure below.

(http://5047-presscdn.pagely.netdna-cdn.com/wp-content/uploads/2016/07/abhishek_1.png)

The machine learning models are then applied to the tabular data. Tabular data is the most common way of representing data in machine learning or data mining. We have a data table with rows holding different samples of the data, X, and labels, y. The labels can be a single column or multiple columns, depending on the type of problem. We will denote data by X and labels by y.

Types of labels
The labels define the problem and can be of different types, such as:

Single column, binary values (classification problem; one sample belongs to one class only and there are only two classes)
Single column, real values (regression problem; prediction of only one value)
Multiple columns, binary values (classification problem; one sample belongs to one class, but there are more than two classes)
Multiple columns, real values (regression problem; prediction of multiple values)
Multilabel (classification problem; one sample can belong to several classes)

Evaluation Metrics
For any kind of machine learning problem, we must know how we are going to evaluate our results, i.e., what the evaluation metric or objective is. For example, for a skewed binary classification problem we generally choose the area under the receiver operating characteristic curve (ROC AUC or simply AUC). For multi-label or multi-class classification problems, we generally choose categorical cross-entropy or multiclass log loss, and mean squared error for regression problems.

I won't go into details of the different evaluation metrics, as there are many types, depending on the problem.

The Libraries
To start with the machine learning libraries, install the basic and most important ones first, for example, numpy and scipy.

To see and do operations on data: pandas (http://pandas.pydata.org/)
For all kinds of machine learning models: scikit-learn (http://scikit-learn.org/stable/)
The best gradient boosting library: xgboost (https://github.com/dmlc/xgboost)
For neural networks: keras (http://keras.io/)
For plotting data: matplotlib (http://matplotlib.org/)
To monitor progress: tqdm (https://pypi.python.org/pypi/tqdm)

I don't use Anaconda (https://www.continuum.io/downloads). It's easy and does everything for you, but I want more freedom. The choice is yours.

The Machine Learning Framework

In 2015, I came up with a framework for automatic machine learning which is still under development and will be released soon. For this post, the same framework will be the basis. The framework is shown in the figure below:

(http://5047-presscdn.pagely.netdna-cdn.com/wp-content/uploads/2016/07/abhishek_2.png)
FIGURE FROM: A. THAKUR AND A. KROHN-GRIMBERGHE, AUTOCOMPETE: A FRAMEWORK FOR MACHINE LEARNING COMPETITIONS, AUTOML WORKSHOP, INTERNATIONAL CONFERENCE ON MACHINE LEARNING 2015.

In the framework shown above, the pink lines represent the most common paths followed. After we have extracted and reduced the data to a tabular format, we can go ahead with building machine learning models.

The very first step is identification of the problem. This can be done by looking at the labels. One must know if the problem is a binary classification, a multi-class or multi-label classification, or a regression problem. After we have identified the problem, we split the data into two different parts, a training set and a validation set, as depicted in the figure below.

(http://5047-presscdn.pagely.netdna-cdn.com/wp-content/uploads/2016/07/abhishek_3.png)

The splitting of data into training and validation sets “must” be done according to labels. In case of any kind of classification problem, use stratified splitting. In Python, you can do this using scikit-learn very easily.

(http://5047-presscdn.pagely.netdna-cdn.com/wp-content/uploads/2016/07/abhishek_4.png)
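
The code itself appears only as a screenshot above. As a minimal sketch, assuming X and y hold the data and labels, a stratified split with the current scikit-learn API (the 2016 post used the older cross_validation module) could look like this:

    from sklearn.model_selection import train_test_split

    # stratify=y keeps the class proportions identical in both parts
    X_train, X_valid, y_train, y_valid = train_test_split(
        X, y, test_size=0.1, stratify=y, random_state=42
    )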

In case of a regression task, a simple K-Fold split should suffice. There are, however, some complex methods which tend to keep the distribution of labels the same for both training and validation sets, and this is left as an exercise for the reader.
(http://5047-presscdn.pagely.netdna-cdn.com/wp-content/uploads/2016/07/abhishek_5.png)
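
A plain K-Fold sketch for regression, under the same assumptions (X, y as numpy arrays):

    from sklearn.model_selection import KFold

    # 10 splits gives a 10% hold-out per fold
    kf = KFold(n_splits=10, shuffle=True, random_state=42)
    train_idx, valid_idx = next(kf.split(X))  # use the first fold as the validation set
    X_train, X_valid = X[train_idx], X[valid_idx]
    y_train, y_valid = y[train_idx], y[valid_idx]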

I have chosen eval_size, the size of the validation set, as 10% of the full data in the examples above, but one can choose this value according to the size of the data they have.

After the splitting of the data is done, leave this data out and don't touch it. Any operations applied on the training set must be saved and then applied to the validation set. The validation set, in any case, should not be joined with the training set. Doing so will result in very good evaluation scores and make the user happy, but he/she will be building a useless model with very high overfitting.

The next step is identification of the different variables in the data. There are usually three types of variables we deal with: numerical variables, categorical variables and variables with text inside them. Let's take the example of the popular Titanic dataset (https://www.kaggle.com/c/titanic/data).

(http://5047-presscdn.pagely.netdna-cdn.com/wp-content/uploads/2016/07/abhishek_6.png)

Here, survival is the label. We have already separated labels from the training data in the previous step. Then, we have pclass, sex, and embarked. These variables have different levels and thus are categorical variables. Variables like age, sibsp, parch, etc. are numerical variables. Name is a variable with text data, but I don't think it's a useful variable to predict survival.

Separate out the numerical variables first. These variables don't need any kind of processing and thus we can start applying normalization and machine learning models to them.

There are two ways in which we can handle categorical data:

Convert the categorical data to labels

(http://5047-presscdn.pagely.netdna-cdn.com/wp-content/uploads/2016/07/abhishek_7.png)

Convert the labels to binary variables (one-hot encoding)

(http://5047-presscdn.pagely.netdna-cdn.com/wp-content/uploads/2016/07/abhishek_8.png)

Please remember to convert categories to numbers first using LabelEncoder before applying OneHotEncoder on it.
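
A sketch of both steps, assuming train and valid are pandas DataFrames holding the split data (the column names come from the Titanic data):

    from sklearn.preprocessing import LabelEncoder, OneHotEncoder

    cat_cols = ["pclass", "sex", "embarked"]

    # Step 1: categories -> integer labels (fit on train only, reuse on validation)
    for col in cat_cols:
        lbl = LabelEncoder()
        train[col] = lbl.fit_transform(train[col].astype(str))
        valid[col] = lbl.transform(valid[col].astype(str))  # will fail on unseen categories

    # Step 2: integer labels -> binary one-hot columns (sparse output)
    ohe = OneHotEncoder(handle_unknown="ignore")
    X_train_cat = ohe.fit_transform(train[cat_cols])
    X_valid_cat = ohe.transform(valid[cat_cols])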

Since the Titanic data doesn't have a good example of text variables, let's formulate a general rule for handling text variables. We can combine all the text variables into one and then use some algorithms which work on text data to convert it to numbers.

The text variables can be joined as follows:

(http://5047-presscdn.pagely.netdna-cdn.com/wp-content/uploads/2016/07/abhishek_9.png)
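
A minimal sketch, assuming df is the DataFrame and the text column names are illustrative:

    # join all free-text columns of df into a single text field, row by row
    text_cols = ["name", "ticket", "cabin"]  # hypothetical column names
    df["joined_text"] = df[text_cols].astype(str).apply(" ".join, axis=1)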

We can then use CountVectorizer or TfidfVectorizer on it:

(http://5047-presscdn.pagely.netdna-cdn.com/wp-content/uploads/2016/07/abhishek_10.png)

or,

(http://5047-presscdn.pagely.netdna-cdn.com/wp-content/uploads/2016/07/abhishek_11.png)

TfidfVectorizer performs better than the counts most of the time, and I have seen that the following parameters for TfidfVectorizer work almost all the time.

(http://5047-presscdn.pagely.netdna-cdn.com/wp-content/uploads/2016/07/abhishek_12.png)
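
The exact parameters live in the screenshot; the configuration commonly associated with this post is roughly the following, to be treated as a starting point rather than the author's verbatim code:

    from sklearn.feature_extraction.text import TfidfVectorizer

    tfv = TfidfVectorizer(
        min_df=3,                   # ignore very rare tokens
        max_features=None,
        strip_accents="unicode",
        analyzer="word",
        token_pattern=r"\w{1,}",
        ngram_range=(1, 2),         # unigrams and bigrams
        use_idf=True,
        smooth_idf=True,
        sublinear_tf=True,
        stop_words="english",
    )
    X_train_text = tfv.fit_transform(train_text)  # fit on training text only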

If you are applying these vectorizers only on the training set, make sure to dump them to the hard drive so that you can use them later on the validation set.

(http://5047-presscdn.pagely.netdna-cdn.com/wp-content/uploads/2016/07/abhishek_13.png)
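
One way to persist a fitted vectorizer, sketched here with joblib (an assumption; the screenshot may use pickle instead):

    import joblib

    joblib.dump(tfv, "tfidf_vectorizer.pkl")   # after fitting on the training set
    tfv = joblib.load("tfidf_vectorizer.pkl")  # later, for the validation set
    X_valid_text = tfv.transform(valid_text)   # transform only, never re-fit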

Next, we come to the stacker module. The stacker module is not a model stacker but a feature stacker. The different features after the processing steps described above can be combined using the stacker module.

(http://5047-presscdn.pagely.netdna-cdn.com/wp-content/uploads/2016/07/abhishek_14.png)

You can horizontally stack all the features before putting them through further processing by using numpy hstack or sparse hstack, depending on whether you have dense or sparse features.

(http://5047-presscdn.pagely.netdna-cdn.com/wp-content/uploads/2016/07/abhishek_15.png)
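
A sketch of both cases (x1, x2 and s1, s2 are assumed dense and sparse feature blocks):

    import numpy as np
    from scipy import sparse

    # dense features: concatenate column-wise
    X_train = np.hstack((x1, x2))
    # sparse features (e.g. TF-IDF plus one-hot blocks)
    X_train = sparse.hstack((s1, s2)).tocsr()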

This can also be achieved by the FeatureUnion module in case there are other processing steps, such as PCA or feature selection (we will visit decomposition and feature selection later in this post).

(http://5047-presscdn.pagely.netdna-cdn.com/wp-content/uploads/2016/07/abhishek_16.png)
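
A minimal FeatureUnion sketch; the component choices here are illustrative, and note that PCA needs dense input:

    from sklearn.pipeline import FeatureUnion
    from sklearn.decomposition import PCA
    from sklearn.feature_selection import SelectKBest, f_classif

    # run PCA and univariate selection in parallel, then concatenate their outputs
    fu = FeatureUnion([
        ("pca", PCA(n_components=10)),
        ("kbest", SelectKBest(f_classif, k=5)),
    ])
    X_train_fu = fu.fit_transform(X_train, y_train)
    X_valid_fu = fu.transform(X_valid)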

Once we have stacked the features together, we can start applying machine learning models. At this stage, the only models you should go for are ensemble tree-based models. These models include:

RandomForestClassifier
RandomForestRegressor
ExtraTreesClassifier
ExtraTreesRegressor
XGBClassifier
XGBRegressor

We cannot apply linear models to the above features since they are not normalized. To use linear models, one can use Normalizer or StandardScaler from scikit-learn.

These normalization methods work only on dense features and don't give very good results if applied to sparse features. One can, however, apply StandardScaler to sparse matrices without using the mean (parameter: with_mean=False).
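
A sketch of scaling sparse features as described:

    from sklearn.preprocessing import StandardScaler

    # with_mean=False skips centering, so sparse matrices stay sparse
    scaler = StandardScaler(with_mean=False)
    X_train_sc = scaler.fit_transform(X_train_sparse)
    X_valid_sc = scaler.transform(X_valid_sparse)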

If the above steps give a “good” model, we can go for optimization of hyperparameters; if they don't, we can go for the following steps and improve our model.

The next steps include decomposition methods:

(http://5047-presscdn.pagely.netdna-cdn.com/wp-content/uploads/2016/07/abhishek_17.png)

For the sake of simplicity, we will leave out LDA and QDA transformations. For high-dimensional data, PCA is generally used to decompose the data. For images, start with 10-15 components and increase this number as long as the quality of the result improves substantially. For other types of data, we select 50-60 components initially (we tend to avoid PCA as long as we can deal with the numerical data as it is).

(http://5047-presscdn.pagely.netdna-cdn.com/wp-content/uploads/2016/07/abhishek_18.png)
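
A PCA sketch under the component counts suggested above:

    from sklearn.decomposition import PCA

    pca = PCA(n_components=12)                # images: start around 10-15 components
    X_train_pca = pca.fit_transform(X_train)  # fit on training data only
    X_valid_pca = pca.transform(X_valid)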

For text data, after conversion of the text to a sparse matrix, go for Singular Value Decomposition (SVD). A variation of SVD called TruncatedSVD can be found in scikit-learn.

(http://5047-presscdn.pagely.netdna-cdn.com/wp-content/uploads/2016/07/abhishek_decomp.png)
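
A TruncatedSVD sketch on TF-IDF features (variable names assumed):

    from sklearn.decomposition import TruncatedSVD

    svd = TruncatedSVD(n_components=120)  # 120-200 usually works for TF-IDF features
    X_train_svd = svd.fit_transform(X_train_tfidf)
    X_valid_svd = svd.transform(X_valid_tfidf)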

The number of SVD components that generally works for TF-IDF or counts is between 120 and 200. Any number above this might improve the performance, but not substantially, and comes at the cost of computing power.

After further evaluation of the models' performance, we move to scaling of the datasets, so that we can evaluate linear models too. The normalized or scaled features can then be sent to the machine learning models or feature selection modules.

(http://5047-presscdn.pagely.netdna-cdn.com/wp-content/uploads/2016/07/abhishek_19.png)

There are multiple ways in which feature selection can be achieved. One of the most common ways is greedy feature selection (forward or backward). In greedy feature selection we choose one feature, train a model, and evaluate the performance of the model on a fixed evaluation metric. We keep adding and removing features one by one and record the performance of the model at every step. We then select the features which have the best evaluation score. One implementation of greedy feature selection with AUC as the evaluation metric can be found here: https://github.com/abhishekkrthakur/greedyFeatureSelection. It must be noted that this implementation is not perfect and must be changed/modified according to the requirements.

Other, faster methods of feature selection include selecting the best features from a model. We can either look at the coefficients of a logit model, or we can train a random forest to select the best features and then use them later with other machine learning models.

(http://5047-presscdn.pagely.netdna-cdn.com/wp-content/uploads/2016/07/abhishek_20.png)
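
One way to express this with scikit-learn's SelectFromModel (an assumption; the screenshot may fit the forest and threshold importances by hand):

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.feature_selection import SelectFromModel

    # small forest, barely tuned, so the selection itself doesn't overfit
    clf = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=42)
    sfm = SelectFromModel(clf)
    X_train_sel = sfm.fit_transform(X_train, y_train)
    X_valid_sel = sfm.transform(X_valid)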

Remember to keep a low number of estimators and minimal optimization of hyperparameters so that you don't overfit.

Feature selection can also be achieved using Gradient Boosting Machines. It is good to use xgboost instead of the GBM implementation in scikit-learn, since xgboost is much faster and more scalable.
(http://5047-presscdn.pagely.netdna-cdn.com/wp-content/uploads/2016/07/abhishek_21.png)

We can also do feature selection on sparse datasets using RandomForestClassifier / RandomForestRegressor and xgboost.

Another popular method for feature selection from positive sparse datasets is chi2-based feature selection, and we also have that implemented in scikit-learn.

(http://5047-presscdn.pagely.netdna-cdn.com/wp-content/uploads/2016/07/abhishek_22.png)

Here, we use chi2 in conjunction with SelectKBest to select 20 features from the data. This also
becomes a hyperparameter we want to optimize to improve the result of our machine learning
models.
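
A sketch of that step, with X_train and y_train as above:

    from sklearn.feature_selection import SelectKBest, chi2

    # chi2 requires non-negative features (e.g. counts or TF-IDF values)
    skb = SelectKBest(chi2, k=20)
    X_train_sel = skb.fit_transform(X_train, y_train)
    X_valid_sel = skb.transform(X_valid)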

Don't forget to dump all the transformers you use at each of these steps. You will need them to evaluate performance on the validation set.

The next (or intermediate) major step is model selection + hyperparameter optimization.

(http://5047-presscdn.pagely.netdna-cdn.com/wp-content/uploads/2016/07/abhishek_23.png)

We generally use the following algorithms in the process of selecting a machine learning model:

Classification:
Random Forest
GBM
Logistic Regression
Naive Bayes
Support Vector Machines
k-Nearest Neighbors

Regression:
Random Forest
GBM
Linear Regression
Ridge
Lasso
SVR

Which parameters should I optimize? How do I choose parameters closest to the best ones? These are a couple of questions people come up with most of the time. One cannot answer these questions without experience with different models and parameters on a large number of datasets. Also, people who have experience are not always willing to share their secrets. Luckily, I have quite a bit of experience too, and I'm willing to give away some of the stuff.

Let's break down the hyperparameters, model-wise:


(http://5047-presscdn.pagely.netdna-cdn.com/wp-content/uploads/2016/07/abhishek_24.png)

RS* = cannot say about proper values; go for random search over these hyperparameters.
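
As a hedged sketch of random search for a random forest; the value ranges below are illustrative, not the exact contents of the table in the screenshot:

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import RandomizedSearchCV

    param_dist = {
        "n_estimators": [120, 300, 500, 800, 1200],
        "max_depth": [5, 8, 15, 25, 30, None],
        "min_samples_split": [2, 5, 10, 15, 100],
        "min_samples_leaf": [1, 2, 5, 10],
        "max_features": ["log2", "sqrt", None],
    }
    search = RandomizedSearchCV(
        RandomForestClassifier(n_jobs=-1),
        param_distributions=param_dist,
        n_iter=25,
        cv=5,
        scoring="roc_auc",
    )
    search.fit(X_train, y_train)
    print(search.best_params_, search.best_score_)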

In my opinion, and strictly my opinion, the above models will outperform any others and we don't need to evaluate any other models.

Once again, remember to save the transformers:

(http://5047-presscdn.pagely.netdna-cdn.com/wp-content/uploads/2016/07/abhishek_25.png)

And apply them on the validation set separately:

(http://5047-presscdn.pagely.netdna-cdn.com/wp-content/uploads/2016/07/abhishek_26.png)
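
A sketch of the save-then-apply pattern (file names and the fitted model are assumed, continuing the joblib assumption from earlier):

    import joblib

    # transformers previously fitted on the training set and dumped to disk
    scaler = joblib.load("scaler.pkl")
    svd = joblib.load("svd.pkl")

    # on the validation set: transform only, never fit
    X_valid_t = svd.transform(scaler.transform(X_valid))
    preds = model.predict_proba(X_valid_t)[:, 1]  # model: the fitted estimator (assumed)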

The above rules and the framework have performed very well on most of the datasets I have dealt with. Of course, they have also failed for very complicated tasks. Nothing is perfect, and we keep on improving on what we learn. Just like in machine learning.

Get in touch with me with any doubts: abhishek4 [at] gmail [dot] com

Bio
Abhishek Thakur (https://www.kaggle.com/abhishek) works as a Senior Data Scientist on the Data Science team at Searchmetrics Inc (http://www.searchmetrics.com/). At Searchmetrics, Abhishek works on some of the most interesting data-driven studies, applies machine learning algorithms and derives insights from huge amounts of data, which requires a lot of data munging, cleaning, feature engineering and the building and optimization of machine learning models.

In his free time, he likes to take part in machine learning competitions and has taken part in over 100 competitions. His research interests include automatic machine learning, deep learning, hyperparameter optimization, computer vision, image analysis and retrieval, and pattern recognition.
(http://5047-presscdn.pagely.netdna-cdn.com/wp-content/uploads/2016/07/abhishek.png)
ABHISHEK THAKUR (HTTPS://WWW.KAGGLE.COM/ABHISHEK), COMPETITIONS GRANDMASTER.

PYTHON (HTTP://BLOG.KAGGLE.COM/TAG/PYTHON/)
TUTORIAL (HTTP://BLOG.KAGGLE.COM/TAG/TUTORIAL/)

44 Comments

Bojan Tunguz • a year ago
Great post! This is one of the most useful writeups about the general approach to ML that I've come across.

stuti awasthi • a year ago
Exceptionally written post. Thanks for sharing the generalized rules for hyperparameter optimization. Definitely a good read. Thanks

Bruce Robbins • a year ago
I agree with the other commentators on the value of this post, if for nothing else, for confirming "the dirty little secret of big data", being the fact "that most data analysts spend the vast majority of their time cleaning and integrating data — not actually analysing it." At a practical level, the overview on hyperparameter optimisation and evaluation is very useful.

Dhiraj Tandon • a year ago
Awesome post Abhishek

Ayush Singh • a year ago
Thanks for sharing this informative post!
I have a question about how to go about choosing the statistical test, i.e. Chi2, ANOVA etc., for SelectKBest in sklearn. I have read that Chi2 is used to see correlation between categorical vs categorical variables and ANOVA for categorical vs continuous variables. I see here that you have used the Chi2 test with k=20, but there are some features that are continuous, like Age and Fare, with the target variable being categorical, so I am a bit confused.

jdunn > Ayush Singh • a year ago
Must learn ANOVA

abhi • a year ago
Thanks all!

PANKAJ • a year ago
Fantastic post... from the vast jungle of possibilities this is one way to get to results!

kevinzakka • a year ago
Great writeup, and couldn't have come at a better time. Thanks for sharing!

Andy • 3 months ago
What do you mean by "you want more freedom"? How does anaconda give you less freedom? Conda allows you to have different dynamically linked C libraries for different python environments, that seems "more freedom" to me. You can still pip install things that are not available on conda - or you can use conda-forge or other channels.

Mohamed Rachidi • a year ago
Very good post, thank you

Yurii Shevchuk • a year ago
Thank you for the interesting post!
I want to clarify one thing. You've mentioned that

> We cannot apply linear models to the above features since they are not normalized.

Is it necessary to apply StandardScaler to the data when you try to fit a linear model? Least Squares should work fine without feature scaling, assuming that the scale of the variables is not too big. Otherwise it can cause computational problems.

Steve Joseph • a year ago
Thanks. I'm curious where neural networks/deep learning fit into this model?

abhi > Steve Joseph • a year ago
I haven't discussed neural nets in this post. But will soon write about parameter tuning and network architecture selection for neural nets.

Sheshachalam Ratnala > abhi • a year ago
I am also a bit curious how it applies to unsupervised problems. The post seems to be looking at supervised problems only.

StartupFlux • a year ago
Great post! Easy to understand and very informative. Thanks for Sharing!

sampath kumar • a year ago
Great Post & Thank you.

Trần Đức Nhuận • a year ago
Amazing great article guidelines. Thanks Abhishek

Meghna Natraj • 20 days ago
Probably the best Machine learning framework post ever! :)

Gavin Zhu • a month ago
Can you give an example about how to save transformations?

Ajay Tanpure • 2 months ago
Great post.
I have one question. Is it possible to retrain the model with new data?
Here is the detailed explanation:
https://stackoverflow.com/q...
△ ▽ • Reply • Share ›

Neeladree Chakravorty • 2 months ago
Thanks for the compilation Abhishek! I would like to discuss with you or other Data Scientists here about how to decide on our choice of an out-of-time (OOT) validation set. Let me know if someone would like to share their experience on choosing an OOT test set from a pool of data for testing predictive models. Thanks!

Belal Chaudhary • 6 months ago
Awesome post Abhishek - thanks for the insights!

jdunn • 6 months ago
@abhi When you say the following, "In my opinion, and strictly my opinion, the above models will out-perform any others and we don't need to evaluate any other models.", how do you include NNs and FM/FFM?
Can I assume you are not including recommenders at all and so can discount FM/FFM?
In that case how do you include things such as Keras models?

Diego • 6 months ago
I think min_samples_split in random forest must be at least 2 or in (0, 1]

Robert • 6 months ago
@abhi Do you see potential for your Autocompete to become a core element of an artificial general intelligence? Like a missing piece that would allow an AI to approach data from *any* field with the same core algorithm to continuously produce new knowledge. (Sorry if you get this question a lot)

My first instinct in solving problems is usually "How can I generalize it so that I will never need to do that again myself?". I started learning ML just recently, and the first thing I did was write a little data preprocessor and learned to use numpy and pandas in the process. So maybe I'll also expand my preprocessor into an "autocompeter" just to understand every step of the flow better =)

Robert > Robert • 6 months ago
@abhi also in your article when you talk about vectorizing text data, you use fitting on the test data, which creates a vocabulary different from the training data. If I'm not mistaken, you should only use transform on the test/validation data without fitting to it, so that you would only count the frequencies of words that existed in the training set and not words that are introduced in the test set for the first time.

Khin hay man • 7 months ago
Hi bro Abhishek. I am Khin, and I am doing a master's in Computer Science; I am now at the experimental stage. I am doing natural language processing using Python, based on training and testing. In my training data, I applied some rules; based on the rules, my testing data will be labelled ambiguous or not using Naive Bayes text classification. Now I have my output, but it is not accurate because it does not apply the rules. I think I need to do feature extraction by writing my own code, but I do not know how to do that. I hope you can give me some suggestions about my experiments.

Thanking you in advance.
Regards, Khin

Jinesh Shah • 7 months ago
Thanks for sharing your knowledge and approach with us!

Ayub Quadri • 8 months ago
One of the best posts I have ever read, which covers:

1. Various types of labels (binary classification, multi-class classification, regression, multilabel)
2. Evaluation metric understanding (AUC, RMSE, Cross Entropy)
3. Types of data (categorical, continuous, text)
4. Feature pre-processing & selection (numerical, categorical & text)
5. Splitting data (test/train) with k-fold validation
6. Model selection (classification or regression)
7. Hyperparameter description

Amongst these, feature pre-processing & selection and hyperparameter selection I had never been through, so thanks for that :)
I really loved the way you explained. Keep doing the good stuff, Abhishek Thakur

Dan Ofer • 8 months ago
The really nice bit is the writeup on hyperparams worth noting and ranges.

Th3OnlyOn3 • 10 months ago
Very impressive!

Huey Kwik • a year ago
Instead of having a separate validation set, why not just use cross-validation?

Adrian Gasinski • a year ago
Wow, it is a great article showing a professional and practical approach to data mining problems. Thank you very much!

Sampath sree • a year ago
Great post. Thanks for sharing.

alchemication • a year ago
Amazing write up, very helpful, thanks!

Jeff • a year ago
I don't think OneHotEncoder is actually necessary compared to simply using the LabelEncoder. RandomForest can handle integer labels. Also, it's interesting that you used next(iter(kf)) instead of using sklearn.cross_validation.train_test_split (which is a wrapper for next(iter(kf))).

abhi > Jeff • a year ago
Yes, RandomForest can handle integer labels, but the post isn't just about random forests or GBMs ;)
I used next(iter(kf)) only to make my simple post a bit complicated... lol

 