Approaching (Almost) Any Machine Learning Problem | Abhishek Thakur
Kaggle Team (http://blog.kaggle.com/author/kaggleteam/) | 07.21.2016
An average data scientist deals with loads of data daily. Some say 60-70% of a data scientist's time is spent cleaning and munging data into a format on which machine learning models can be applied. This post focuses on the second part, i.e., applying machine learning models, including the preprocessing steps. The pipelines discussed in this post are the result of over a hundred machine learning competitions that I've taken part in. It must be noted that the discussion here is quite general but very useful; far more complicated methods also exist and are practised by professionals.
Data
Before applying machine learning models, the data must be converted to a tabular form. This is the most time-consuming and difficult part of the whole process and is depicted in the figure below.
(http://5047-presscdn.pagely.netdna-cdn.com/wp-content/uploads/2016/07/abhishek_1.png)
The machine learning models are then applied to the tabular data. Tabular data is the most common way of representing data in machine learning or data mining: we have a data table with rows containing different samples of the data, X, and labels, y. The labels can be a single column or multiple columns, depending on the type of problem. We will denote the data by X and the labels by y.
Types of labels
The labels define the problem and can be of different types, such as:
Single column, binary values (classification problem, one sample belongs to one class only and there are only two classes)
Single column, real values (regression problem, prediction of only one value)
Multiple columns, binary values (classification problem, one sample belongs to one class, but there are more than two classes)
Multiple columns, real values (regression problem, prediction of multiple values)
And multilabel (classification problem, one sample can belong to several classes)
Evaluation Metrics
For any kind of machine learning problem, we must know how we are going to evaluate our results, i.e., what the evaluation metric or objective is. For example, in the case of a skewed binary classification problem we generally choose the area under the receiver operating characteristic curve (ROC AUC or simply AUC). In the case of multi-label or multi-class classification problems, we generally choose categorical cross-entropy or multiclass log loss, and in the case of regression problems, mean squared error.
I won't go into details of the different evaluation metrics, as many different types exist, depending on the problem.
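As a minimal sketch (not from the original post), all three metrics just mentioned are available in scikit-learn's metrics module; the toy labels and predictions here are assumptions:

    import numpy as np
    from sklearn.metrics import roc_auc_score, log_loss, mean_squared_error

    y_true = np.array([0, 1, 1, 0, 1])             # toy binary labels
    y_prob = np.array([0.1, 0.8, 0.6, 0.3, 0.9])   # predicted probabilities of class 1
    print(roc_auc_score(y_true, y_prob))            # AUC for (skewed) binary classification
    print(log_loss(y_true, y_prob))                 # log loss for classification
    print(mean_squared_error([1.2, 2.4], [1.0, 2.5]))  # MSE for regression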
The Libraries
To start with the machine learning libraries, install the basic and most important ones first, for example, numpy and scipy.
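One way to get set up, assuming a standard Python environment with pip (scikit-learn, pandas and xgboost appear later in this post, so it is convenient to install them at the same time):

    pip install numpy scipy scikit-learn pandas xgboost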
(http://5047-presscdn.pagely.netdna-cdn.com/wp-content/uploads/2016/07/abhishek_2.png)
FIGURE FROM: A. THAKUR AND A. KROHN-GRIMBERGHE, AUTOCOMPETE: A FRAMEWORK FOR MACHINE LEARNING
COMPETITIONS, AUTOML WORKSHOP, INTERNATIONAL CONFERENCE ON MACHINE LEARNING 2015.
In the framework shown above, the pink lines represent the most common paths followed. After we
have extracted and reduced the data to a tabular format, we can go ahead with building machine
learning models.
The very first step is identification of the problem. This can be done by looking at the labels. One must know if the problem is a binary classification, a multi-class or multi-label classification, or a regression problem. After we have identified the problem, we split the data into two different parts, a training set and a validation set, as depicted in the figure below.
(http://5047-presscdn.pagely.netdna-cdn.com/wp-content/uploads/2016/07/abhishek_3.png)
The splitting of data into training and validation sets “must” be done according to labels. In case of any kind of classification problem, use stratified splitting. In Python, you can do this very easily using scikit-learn.
(http://5047-presscdn.pagely.netdna-cdn.com/wp-content/uploads/2016/07/abhishek_4.png)
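The figure above showed the author's snippet; as a sketch using the modern scikit-learn API (the toy X and y are assumptions), a stratified split looks like this:

    import numpy as np
    from sklearn.model_selection import train_test_split

    X = np.random.rand(100, 5)            # toy feature matrix
    y = np.random.randint(0, 2, 100)      # toy binary labels
    X_train, X_valid, y_train, y_valid = train_test_split(
        X, y, test_size=0.1, stratify=y, random_state=42)  # stratified 90/10 split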
In case of a regression task, a simple K-fold split should suffice. There are, however, some complex methods which tend to keep the distribution of labels the same for both the training and validation sets, and this is left as an exercise for the reader.
(http://5047-presscdn.pagely.netdna-cdn.com/wp-content/uploads/2016/07/abhishek_5.png)
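Again as a sketch with the modern scikit-learn API (toy data assumed), a simple K-fold split for a regression problem:

    import numpy as np
    from sklearn.model_selection import KFold

    X = np.random.rand(100, 5)            # toy feature matrix
    y = np.random.rand(100)               # toy real-valued labels
    kf = KFold(n_splits=10, shuffle=True, random_state=42)
    train_idx, valid_idx = next(kf.split(X))        # take the first fold as a 90/10 split
    X_train, X_valid = X[train_idx], X[valid_idx]
    y_train, y_valid = y[train_idx], y[valid_idx]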
I have chosen the eval_size, or the size of the validation set, as 10% of the full data in the examples above, but one can choose this value according to the size of the data at hand.
After the splitting of the data is done, leave this data out and don’t touch it. Any operations that are applied on the training set must be saved and then applied to the validation set. The validation set, in any case, should not be joined with the training set. Doing so will result in very good evaluation scores and make the user happy, but he/she will instead be building a useless model with very high overfitting.
The next step is identification of the different variables in the data. There are usually three types of variables we deal with: numerical variables, categorical variables and variables with text inside them.
Let’s take the example of the popular Titanic dataset (https://www.kaggle.com/c/titanic/data).
(http://5047-presscdn.pagely.netdna-cdn.com/wp-content/uploads/2016/07/abhishek_6.png)
Here, survival is the label. We have already separated the labels from the training data in the previous step. Then we have pclass, sex and embarked. These variables have different levels and are thus categorical variables. Variables like age, sibsp, parch, etc. are numerical variables. Name is a variable with text data, but I don’t think it’s a useful variable for predicting survival.
Separate out the numerical variables first. These variables don’t need any kind of processing, and thus we can start applying normalization and machine learning models to them.
(http://5047-presscdn.pagely.netdna-cdn.com/wp-content/uploads/2016/07/abhishek_7.png)
(http://5047-presscdn.pagely.netdna-cdn.com/wp-content/uploads/2016/07/abhishek_8.png)
Please remember to convert categories to numbers first using LabelEncoder before applying OneHotEncoder on them.
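A minimal sketch of this LabelEncoder-then-OneHotEncoder pattern (the toy embarked column is an assumption; recent versions of OneHotEncoder can also encode strings directly):

    import numpy as np
    from sklearn.preprocessing import LabelEncoder, OneHotEncoder

    embarked = np.array(["S", "C", "Q", "S", "C"])    # toy categorical column
    lbl = LabelEncoder()
    codes = lbl.fit_transform(embarked)                # strings -> integer codes
    ohe = OneHotEncoder()
    onehot = ohe.fit_transform(codes.reshape(-1, 1))   # integer codes -> sparse one-hot matrix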
Since the Titanic data doesn’t have a good example of text variables, let’s formulate a general rule for handling text variables: combine all the text variables into one and then apply an algorithm which works on text data and converts it to numbers.
(http://5047-presscdn.pagely.netdna-cdn.com/wp-content/uploads/2016/07/abhishek_9.png)
(http://5047-presscdn.pagely.netdna-cdn.com/wp-content/uploads/2016/07/abhishek_10.png)
or,
(http://5047-presscdn.pagely.netdna-cdn.com/wp-content/uploads/2016/07/abhishek_11.png)
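As a sketch of the two options just shown, counts or tf-idf, on an assumed toy corpus:

    from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

    docs = ["machine learning is fun", "deep learning is fun too"]  # toy combined text column
    counts = CountVectorizer().fit_transform(docs)    # sparse matrix of token counts
    tfidf = TfidfVectorizer().fit_transform(docs)     # sparse matrix of tf-idf weights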
The TfidfVectorizer performs better than plain counts most of the time, and I have seen that the following parameters for TfidfVectorizer work almost all the time.
(http://5047-presscdn.pagely.netdna-cdn.com/wp-content/uploads/2016/07/abhishek_12.png)
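Reconstructed from memory of the figure above, the settings are along these lines; treat the exact values as assumptions rather than the author's verbatim snippet:

    from sklearn.feature_extraction.text import TfidfVectorizer

    tfv = TfidfVectorizer(min_df=3, max_features=None, strip_accents="unicode",
                          analyzer="word", token_pattern=r"\w{1,}",
                          ngram_range=(1, 2), use_idf=True, smooth_idf=True,
                          sublinear_tf=True, stop_words="english")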
If you are applying these vectorizers only on the training set, make sure to dump them to the hard drive so that you can use them later on the validation set.
(http://5047-presscdn.pagely.netdna-cdn.com/wp-content/uploads/2016/07/abhishek_13.png)
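A sketch of persisting a fitted vectorizer with joblib (the standalone joblib package is assumed here; the original post predates it; file name and toy texts are also assumptions):

    import joblib
    from sklearn.feature_extraction.text import TfidfVectorizer

    tfv = TfidfVectorizer().fit(["some training text", "more training text"])
    joblib.dump(tfv, "tfv.pkl")                       # save the fitted vectorizer to disk
    tfv = joblib.load("tfv.pkl")                      # reload it later
    valid_features = tfv.transform(["unseen validation text"])  # reuse on the validation set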
Next, we come to the stacker module. The stacker module is not a model stacker but a feature stacker: the different features produced by the processing steps described above can be combined using it.
(http://5047-presscdn.pagely.netdna-cdn.com/wp-content/uploads/2016/07/abhishek_14.png)
You can horizontally stack all the features before putting them through further processing by using numpy’s hstack or scipy’s sparse hstack, depending on whether you have dense or sparse features.
(http://5047-presscdn.pagely.netdna-cdn.com/wp-content/uploads/2016/07/abhishek_15.png)
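A minimal sketch of both variants on assumed toy blocks of features:

    import numpy as np
    from scipy import sparse

    block_a = np.random.rand(10, 3)                   # toy dense feature block
    block_b = np.random.rand(10, 2)
    X_dense = np.hstack((block_a, block_b))           # dense features side by side

    X_sparse = sparse.hstack((sparse.csr_matrix(block_a),
                              sparse.csr_matrix(block_b)))  # sparse equivalent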
The same can also be achieved with the FeatureUnion module in case there are other processing steps involved, such as PCA or feature selection (we will visit decomposition and feature selection later in this post).
(http://5047-presscdn.pagely.netdna-cdn.com/wp-content/uploads/2016/07/abhishek_16.png)
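A sketch of FeatureUnion combining PCA components with univariate-selected features (the particular transformers and sizes are illustrative assumptions):

    import numpy as np
    from sklearn.pipeline import FeatureUnion
    from sklearn.decomposition import PCA
    from sklearn.feature_selection import SelectKBest, f_classif

    X = np.random.rand(100, 20)                       # toy data
    y = np.random.randint(0, 2, 100)
    union = FeatureUnion([("pca", PCA(n_components=5)),
                          ("kbest", SelectKBest(f_classif, k=10))])
    X_stacked = union.fit_transform(X, y)             # 5 + 10 = 15 stacked columns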
Once we have stacked the features together, we can start applying machine learning models. At this stage, the only models you should go for are ensemble tree-based models (a fitting sketch follows the list). These models include:
RandomForestClassifier
RandomForestRegressor
ExtraTreesClassifier
ExtraTreesRegressor
XGBClassifier
XGBRegressor
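A minimal fitting sketch with one of these models (toy data assumed; XGBClassifier from the xgboost package exposes the same fit/predict interface):

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    X_train = np.random.rand(100, 10)                 # toy stacked features
    y_train = np.random.randint(0, 2, 100)
    X_valid = np.random.rand(20, 10)

    clf = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=42)
    clf.fit(X_train, y_train)
    valid_probs = clf.predict_proba(X_valid)[:, 1]    # probabilities for the evaluation metric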
We cannot apply linear models to the above features since they are not normalized. To use linear models, one can use Normalizer or StandardScaler from scikit-learn.
These normalization methods work only on dense features and don’t give very good results when applied to sparse features. One can, however, apply StandardScaler to sparse matrices by skipping the mean (parameter: with_mean=False).
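A sketch of scaling a sparse matrix without centering (toy data assumed):

    import numpy as np
    from scipy import sparse
    from sklearn.preprocessing import StandardScaler

    X_sparse = sparse.csr_matrix(np.random.rand(10, 5))   # toy sparse features
    scaler = StandardScaler(with_mean=False)               # skip centering to preserve sparsity
    X_scaled = scaler.fit_transform(X_sparse)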
If the above steps give a “good” model, we can go for optimization of hyperparameters; if they don’t, we can go for the following steps to improve our model.
For the sake of simplicity, we will leave out LDA and QDA transformations. For high-dimensional data, PCA is generally used to decompose the data. For images, start with 10-15 components and increase this number as long as the quality of the result improves substantially. For other types of data, we select 50-60 components initially (we tend to avoid PCA as long as we can deal with the numerical data as it is).
(http://5047-presscdn.pagely.netdna-cdn.com/wp-content/uploads/2016/07/abhishek_18.png)
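A minimal PCA sketch under these guidelines (toy high-dimensional data assumed):

    import numpy as np
    from sklearn.decomposition import PCA

    X = np.random.rand(200, 100)          # toy high-dimensional, non-image data
    pca = PCA(n_components=60)            # start around 50-60 components for such data
    X_reduced = pca.fit_transform(X)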
For text data, after the conversion of text to a sparse matrix, go for Singular Value Decomposition (SVD). A variation of SVD called TruncatedSVD can be found in scikit-learn.
(http://5047-presscdn.pagely.netdna-cdn.com/wp-content/uploads/2016/07/abhishek_decomp.png)
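A sketch of TruncatedSVD on tf-idf features (the randomly generated corpus is an assumption made to keep the example self-contained):

    import numpy as np
    from sklearn.decomposition import TruncatedSVD
    from sklearn.feature_extraction.text import TfidfVectorizer

    rng = np.random.RandomState(0)
    vocab = ["w%d" % i for i in range(500)]                 # synthetic vocabulary
    docs = [" ".join(rng.choice(vocab, 30)) for _ in range(300)]
    tfidf = TfidfVectorizer().fit_transform(docs)           # sparse tf-idf matrix
    svd = TruncatedSVD(n_components=120)                    # 120-200 usually works for tf-idf
    X_svd = svd.fit_transform(tfidf)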
The number of SVD components that generally works for TF-IDF or counts is between 120 and 200. Any number above this might improve the performance, but not substantially, and comes at the cost of computing power.
After further evaluating the performance of the models, we move to scaling the datasets, so that we can evaluate linear models too. The normalized or scaled features can then be sent to the machine learning models or feature selection modules.
(http://5047-presscdn.pagely.netdna-cdn.com/wp-content/uploads/2016/07/abhishek_19.png)
There are multiple ways in which feature selection can be achieved. One of the most common is greedy feature selection (forward or backward). In greedy feature selection we choose one feature, train a model and evaluate the performance of the model on a fixed evaluation metric. We keep adding and removing features one by one and record the performance of the model at every step. We then select the features which gave the best evaluation score. One implementation of greedy feature selection with AUC as the evaluation metric can be found here: https://github.com/abhishekkrthakur/greedyFeatureSelection. It must be noted that this implementation is not perfect and must be changed/modified according to the requirements.
Other, faster methods of feature selection include selecting the best features from a model. We can either look at the coefficients of a logit model, or we can train a random forest to select the best features and then use them later with other machine learning models.
(http://5047-presscdn.pagely.netdna-cdn.com/wp-content/uploads/2016/07/abhishek_20.png)
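A sketch of model-based selection using a random forest with scikit-learn's SelectFromModel (toy data assumed; by default it keeps features whose importance is above the mean):

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.feature_selection import SelectFromModel

    X = np.random.rand(100, 30)                       # toy data
    y = np.random.randint(0, 2, 100)
    rf = RandomForestClassifier(n_estimators=50, random_state=42)  # keep the forest small
    selector = SelectFromModel(rf).fit(X, y)
    X_selected = selector.transform(X)                # features with above-mean importance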
Remember to keep the number of estimators low and do minimal optimization of hyperparameters so that you don’t overfit.
Feature selection can also be achieved using Gradient Boosting Machines. It is good to use xgboost instead of the GBM implementation in scikit-learn, since xgboost is much faster and more scalable.
(http://5047-presscdn.pagely.netdna-cdn.com/wp-content/uploads/2016/07/abhishek_21.png)
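A sketch of ranking features with xgboost (toy data assumed):

    import numpy as np
    import xgboost as xgb

    X = np.random.rand(100, 30)                       # toy data
    y = np.random.randint(0, 2, 100)
    clf = xgb.XGBClassifier(n_estimators=100)
    clf.fit(X, y)
    importances = clf.feature_importances_             # per-feature importance scores
    top10 = np.argsort(importances)[::-1][:10]         # indices of the 10 strongest features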
Another popular method for feature selection on positive sparse datasets is chi-squared (chi2) based feature selection, which is also implemented in scikit-learn.
(http://5047-presscdn.pagely.netdna-cdn.com/wp-content/uploads/2016/07/abhishek_22.png)
Here, we use chi2 in conjunction with SelectKBest to select 20 features from the data. This also
becomes a hyperparameter we want to optimize to improve the result of our machine learning
models.
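A minimal chi2 selection sketch (toy non-negative data assumed, since chi2 requires non-negative features):

    import numpy as np
    from sklearn.feature_selection import SelectKBest, chi2

    X = np.random.rand(100, 50)                       # toy non-negative features
    y = np.random.randint(0, 2, 100)
    skb = SelectKBest(chi2, k=20)
    X_new = skb.fit_transform(X, y)                   # keep the 20 highest-scoring features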
Don’t forget to dump any transformers you use at each of the steps. You will need them to evaluate performance on the validation set.
(http://5047-presscdn.pagely.netdna-cdn.com/wp-content/uploads/2016/07/abhishek_23.png)
We generally use the following algorithms in the process of selecting a machine learning model:
Classification:
Random Forest
GBM
Logistic Regression
Naive Bayes
Support Vector Machines
k-Nearest Neighbors
Regression:
Random Forest
GBM
Linear Regression
Ridge
Lasso
SVR
Which parameters should I optimize? How do I choose parameters closest to the best ones? These are a couple of questions people come up with most of the time. One cannot answer these questions without experience with different models and parameters on a large number of datasets. Also, people who have the experience are not willing to share their secrets. Luckily, I have quite a bit of experience too, and I’m willing to give away some of the stuff.
RS* = proper values cannot be specified; go for random search over these hyperparameters.
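As a sketch of such a random search with scikit-learn's RandomizedSearchCV (the model, toy data and parameter ranges here are illustrative assumptions, not values from the table above):

    import numpy as np
    from scipy.stats import randint
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import RandomizedSearchCV

    X = np.random.rand(100, 10)                       # toy data
    y = np.random.randint(0, 2, 100)
    param_dist = {"n_estimators": randint(120, 1200), # sample integers uniformly from a range
                  "max_depth": randint(5, 30)}
    search = RandomizedSearchCV(RandomForestClassifier(), param_dist,
                                n_iter=10, scoring="roc_auc", cv=3, random_state=42)
    search.fit(X, y)
    print(search.best_params_)                        # best sampled parameter combination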
In my opinion, and strictly my opinion, the above models will outperform any others, and we don’t need to evaluate any other models.
(http://5047-presscdn.pagely.netdna-cdn.com/wp-content/uploads/2016/07/abhishek_26.png)
The above rules and the framework have performed very well on most of the datasets I have dealt with. Of course, they have also failed for very complicated tasks. Nothing is perfect, and we keep on improving on what we learn. Just like in machine learning.
Get in touch with me with any doubts: abhishek4 [at] gmail [dot] com
Bio
Abhishek Thakur (https://www.kaggle.com/abhishek) works as a Senior Data Scientist on the Data Science team at Searchmetrics Inc (http://www.searchmetrics.com/). At Searchmetrics, Abhishek works on some of the most interesting data-driven studies, applies machine learning algorithms, and derives insights from huge amounts of data, which requires a lot of data munging, cleaning, feature engineering, and building and optimizing machine learning models.
In his free time, he likes to compete in machine learning competitions and has taken part in over 100 of them. His research interests include automatic machine learning, deep learning, hyperparameter optimization, computer vision, image analysis and retrieval, and pattern recognition.
(http://5047-presscdn.pagely.netdna-cdn.com/wp-content/uploads/2016/07/abhishek.png)
ABHISHEK THAKUR (HTTPS://WWW.KAGGLE.COM/ABHISHEK), COMPETITIONS GRANDMASTER.
PYTHON (HTTP://BLOG.KAGGLE.COM/TAG/PYTHON/)
TUTORIAL (HTTP://BLOG.KAGGLE.COM/TAG/TUTORIAL/)
Comments
> We cannot apply linear models to the above features since they are not normalized.
Is it necessary to apply StandardScaler to the data when you try to fit a linear model?
Least squares should work fine without feature scaling, assuming that the scale of the variables is not too big. Otherwise it can cause computational problems.
My first instinct in solving problems is usually "How can I generalize it so that I will never
need to do that again myself?". I started learning ML just recently, and the first thing I did
was write a little data preprocessor and learned to use numpy and pandas in the process.
So maybe I'll also expand my preprocessor into an "autocompeter" just to understand
every step of the flow better =)
Thanking you in advance.
Regards, Khin
I had never been through feature preprocessing & selection and hyperparameter selection before, so thanks for that :)
I really loved the way you explained it. Keep doing the good stuff, Abhishek Thakur.