Data Science - Full Archive
blog.DailyDoseofDS.com
Table of Contents
The Must-Know Categorisation of Discriminative Models
Where Did The Regularization Term Originate From?
How to Create The Elegant Moving Bubbles Chart in Python?
Gradient Checkpointing: Save 50-60% Memory When Training a Neural Network
Gaussian Mixture Models: The Flexible Twin of KMeans
Why Correlation (and Other Summary Statistics) Can Be Misleading
MissForest: A Better Alternative To Zero (or Mean) Imputation
A Visual and Intuitive Guide to The Bias-Variance Problem
The Most Under-appreciated Technique To Speed-up Python
The Overlooked Limitations of Grid Search and Random Search
An Intuitive Guide to Generative and Discriminative Models in Machine Learning
Feature Scaling is NOT Always Necessary
Why Sigmoid in Logistic Regression?
Build Elegant Data Apps With The Coolest Mito-Streamlit Integration
A Simple and Intuitive Guide to Understanding Precision and Recall
Skimpy: A Richer Alternative to Pandas' Describe Method
A Common Misconception About Model Reproducibility
The Biggest Limitation Of Pearson Correlation Which Many Overlook
Gigasheet: Effortlessly Analyse Upto 1 Billion Rows Without Any Code
Why Mean Squared Error (MSE)?
A More Robust and Underrated Alternative To Random Forests
The Most Overlooked Problem With Imputing Missing Values Using Zero (or Mean)
A Visual Guide to Joint, Marginal and Conditional Probabilities
Jupyter Notebook 7: Possibly One Of The Best Updates To Jupyter Ever
How to Find Optimal Epsilon Value For DBSCAN Clustering?
Why R-squared is a Flawed Regression Metric
75 Key Terms That All Data Scientists Remember By Heart
Always Validate Your Output Variable Before Using Linear Regression
A Counterintuitive Fact About Python Functions
Why Is It Important To Shuffle Your Dataset Before Training An ML Model
The Limitations Of Heatmap That Are Slowing Down Your Data Analysis
The Limitation Of Pearson Correlation Which Many Often Ignore
Why Are We Typically Advised To Set Seeds for Random Generators?
An Underrated Technique To Improve Your Data Visualizations
A No-Code Tool to Create Charts and Pivot Tables in Jupyter
If You Are Not Able To Code A Vectorized Approach, Try This.
Why Are We Typically Advised To Never Iterate Over A DataFrame?
Manipulating Mutable Objects In Python Can Get Confusing At Times
This Small Tweak Can Significantly Boost The Run-time of KMeans
Most Python Programmers Don't Know This About Python OOP
Who Said Matplotlib Cannot Create Interactive Plots?
Don't Create Messy Bar Plots. Instead, Try Bubble Charts!
You Can Add a List As a Dictionary's Key (Technically)!
Most ML Folks Often Neglect This While Using Linear Regression
35 Hidden Python Libraries That Are Absolute Gems
Use Box Plots With Caution! They May Be Misleading.
An Underrated Technique To Create Better Data Plots
The Pandas DataFrame Extension Every Data Scientist Has Been Waiting For
Supercharge Shell With Python Using Xonsh
Most Command-line Users Don't Know This Cool Trick About Using Terminals
A Simple Trick to Make The Most Out of Pivot Tables in Pandas
Why Python Does Not Offer True OOP Encapsulation
Never Worry About Parsing Errors Again While Reading CSV with Pandas
An Interesting and Lesser-Known Way To Create Plots Using Pandas
Most Python Programmers Don't Know This About Python For-loops
How To Enable Function Overloading In Python
Generate Helpful Hints As You Write Your Pandas Code
Speedup NumPy Methods 25x With Bottleneck
Visualizing The Data Transformation of a Neural Network
Never Refactor Your Code Manually Again. Instead, Use Sourcery!
Draw The Data You Are Looking For In Seconds
Style Matplotlib Plots To Make Them More Attractive
Speed-up Parquet I/O of Pandas by 5x
40 Open-Source Tools to Supercharge Your Pandas Workflow
Stop Using The Describe Method in Pandas. Instead, use Skimpy.
The Right Way to Roll Out Library Updates in Python
Simple One-Liners to Preview a Decision Tree Using Sklearn
Stop Using The Describe Method in Pandas. Instead, use Summarytools.
Never Search Jupyter Notebooks Manually Again To Find Your Code
F-strings Are Much More Versatile Than You Think
Is This The Best Animated Guide To KMeans Ever?
An Effective Yet Underrated Technique To Improve Model Performance
Create Data Plots Right From The Terminal
Make Your Matplotlib Plots More Professional
37 Hidden Python Libraries That Are Absolute Gems
Preview Your README File Locally In GitHub Style
Pandas and NumPy Return Different Values for Standard Deviation. Why?
Visualize Commit History of Git Repo With Beautiful Animations
Perfplot: Measure, Visualize and Compare Run-time With Ease
This GUI Tool Can Possibly Save You Hours Of Manual Work
How Would You Identify Fuzzy Duplicates In A Data With Million Records?
Stop Previewing Raw DataFrames. Instead, Use DataTables.
A Single Line That Will Make Your Python Code Faster
Prettify Word Clouds In Python
How to Encode Categorical Features With Many Categories?
Calendar Map As A Richer Alternative to Line Plot
10 Automated EDA Tools That Will Save You Hours Of (Tedious) Work
Why KMeans May Not Be The Apt Clustering Algorithm Always
Converting Python To LaTeX Has Possibly Never Been So Simple
Density Plot As A Richer Alternative to Scatter Plot
30 Python Libraries to (Hugely) Boost Your Data Science Productivity
Sklearn One-liner to Generate Synthetic Data
Let’s understand.
To recap:
Discriminative models:
Generative models:
Probabilistic models
Examples include:
• Logistic regression
• Neural networks
• CRFs
Labeling models
Examples include:
• Random forests
• kNN
• Decision trees
In other words, say a test instance reaches a specific leaf node for final classification. The model calculates the class probabilities as the fraction of each class's training labels in that leaf node.
This is because the uncertainty is the same for all predictions that
land in the same leaf node.
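As a minimal sklearn sketch of this behavior (the dataset here is just for illustration):

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=2).fit(X, y)

# predict_proba returns the fraction of training labels of each class
# in the leaf node that each instance lands in, so rows for instances
# in the same leaf are identical.
print(tree.predict_proba(X[:3]))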
Over to you: Can you name one more model for each of the probabilistic and labeling categories?
This may happen because the model is trying too hard to capture
all unrelated and random noise in our training dataset, as
shown below:
Now, if you have taken any ML course or read any tutorials about this, the most common solution they teach is to add a penalty (or regularization) term to the cost function, as shown below:
But why?
And if you are curious, then this is precisely the topic of today’s
machine learning deep dive: “The Probabilistic Origin of
Regularization.”
Thus, the objective of this deep dive is to help you build a solid, intuitive, and logical understanding of regularization, purely from a probabilistic perspective.
This restricts us from training larger models and also limits the
max batch size that can potentially fit into memory.
Here, we run the forward pass normally and the core idea is to
optimize the backpropagation step.
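As a minimal PyTorch sketch of this idea (the model and the segment count here are hypothetical, using torch.utils.checkpoint):

import torch
from torch.utils.checkpoint import checkpoint_sequential

model = torch.nn.Sequential(
    torch.nn.Linear(128, 256), torch.nn.ReLU(),
    torch.nn.Linear(256, 256), torch.nn.ReLU(),
    torch.nn.Linear(256, 10),
)
x = torch.randn(32, 128, requires_grad=True)

# Only activations at the segment boundaries are stored during the forward
# pass; the rest are recomputed during backpropagation, trading a little
# extra compute for a large memory saving.
out = checkpoint_sequential(model, 2, x)
out.sum().backward()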
As shown above:
Over to you: What are some ways you use to optimize a neural network's training?
To begin:
The notion that a single model can learn diverse data distributions
is truly captivating.
• the correlation
• the regression fit
This can save you from drawing wrong conclusions, which you
may have drawn otherwise by solely looking at the summary
statistics.
This lets me infer whether the scatter plot of two variables and their corresponding correlation measure agree with each other.
Over to you: What are some other precautions you take when using summary statistics?
• inaccurate modeling
• incorrect conclusions, and more.
• Creating small bins will overfit the PDF. This leads to high
variance.
• Creating large bins will underfit the PDF. This leads to high
bias.
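A quick matplotlib sketch of this trade-off on synthetic data:

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = rng.normal(size=1_000)

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
for ax, bins in zip(axes, [100, 15, 3]):  # high variance -> balanced -> high bias
    ax.hist(x, bins=bins, density=True)
    ax.set_title(f"bins={bins}")
plt.show()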
This will run at native machine code speed. Just invoke the
method:
>>> foo_c(2)
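For reference, here is a minimal sketch of what such a compiled method can look like, assuming the compilation route is Cython (foo_c stands in for the compiled counterpart of a hypothetical pure-Python foo):

# foo.pyx (build in place with: cythonize -i foo.pyx)
def foo_c(int n):
    # typed C variables let the loop compile to native machine code
    cdef long total = 0
    cdef int i
    for i in range(n):
        total += i * i
    return total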
• Grid search
• Random search
For instance:
Grid search and random search can only try discrete values for continuous hyperparameters.
• Generative
• Discriminative
Discriminative models:
Examples include:
• Logistic regression
• Random Forest
• Neural Networks
• Decision Trees, etc.
Generative models:
Examples include:
• Naive Bayes
• Linear Discriminant Analysis (LDA)
• Gaussian Mixture Models, etc.
Can you figure out which of the above is generative and which
one is discriminative?
It is like:
Imagine the amount of data you would need to learn all languages
(generative approach) vs. the amount of data you would need to
understand some distinctive patterns (discriminative approach).
As shown above:
Decision tree
Thus, it’s important to understand the nature of your data and the
algorithm you intend to use.
But why?
The most common reason we hear is that Sigmoid maps all real values to the range (0, 1).
We are covering:
Precision
But it’s important that every positive prediction you get should
actually be positive.
Precision Mindset: It's okay to miss out on some good books, but every book you do recommend should be good.
So even if this system recommended only one book and you liked
it, this gives a Precision of 100%.
Recall
When you are in a Recall Mindset, you care about getting each
and every positive sample correctly classified.
So even if this system says that all candidates (good or bad) are
fit for an interview, it gives you a Recall of 100%.
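In terms of true positives (TP), false positives (FP), and false negatives (FN), the two metrics are:

Precision = TP / (TP + FP)
Recall = TP / (TP + FN)

A Precision mindset penalizes false positives; a Recall mindset penalizes false negatives.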
This includes:
• data shape
• column data types
• column summary statistics
• distribution chart,
• missing stats, etc.
You trained the model again and got the same performance.
Here, we feed the input data to neural networks with the same
architecture but different randomizations. Next, we visualize the
transformation using a 2D dummy layer, as I depicted in one of
my previous posts below:
All models separate the data pretty well and give 100% accuracy,
don’t they?
No, right?
• What is Pytest?
• How does it simplify pipeline testing?
• How do you write and execute tests with Pytest?
• How do you customize Pytest's test search?
• How do you create an organized testing suite using Pytest markers?
• How do you use fixtures to make your testing suite concise and reliable?
• and more.
All in all, building test suites is one of the best skills you can
develop to build large and reliable data science pipelines.
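As a minimal sketch (the clean() function is a hypothetical pipeline step; save the file as test_pipeline.py and run the pytest command):

import pytest

def clean(values):
    # hypothetical pipeline step: drop missing entries
    return [v for v in values if v is not None]

def test_clean_drops_none():
    assert clean([1, None, 2]) == [1, 2]

@pytest.fixture
def sample_data():
    # fixtures let tests share setup without repeating it
    return [1, None, 2, None]

def test_clean_with_fixture(sample_data):
    assert None not in clean(sample_data)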
Monotonicity in data
Thus, you can do all of the following without worrying about any
infra issues:
o merge,
o plot,
o group,
o sort,
o summary stats, etc.
• Import data from any source like AWS S3, Drive,
databases, etc., and analyze it, and more.
Let’s begin.
Here, epsilon is an error term that captures the random noise for a
specific data point (i).
Thus, we get:
Likelihood function
Simplifying, we get:
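In plain-text form, assuming the standard Gaussian-noise setup this derivation rests on (y_i = w·x_i + epsilon_i, with epsilon_i ~ N(0, sigma^2)):

P(y_i | x_i; w) = (1 / sqrt(2 * pi * sigma^2)) * exp(-(y_i - w·x_i)^2 / (2 * sigma^2))

L(w) = product over i of P(y_i | x_i; w)

log L(w) = -(n/2) * log(2 * pi * sigma^2) - (1 / (2 * sigma^2)) * sum over i of (y_i - w·x_i)^2

Maximizing log L(w) over w is therefore the same as minimizing the sum of (y_i - w·x_i)^2, i.e., the squared error.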
See, there's clear proof and reasoning behind using squared error as a loss function in linear regression.
In other words, have you ever wondered about the origin of linear regression's assumptions? The assumptions can't just appear out of thin air, can they?
Thus, today's deep dive walks you through the origin of each of the assumptions of linear regression in detail.
Note:
Make sure you run it with bootstrap=True; otherwise, it will use the whole dataset for each tree.
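Assuming the model in question is sklearn's ExtraTreesClassifier (an assumption consistent with the note above, since its bootstrap default is False, unlike RandomForestClassifier's), a minimal sketch:

from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

X, y = make_classification(n_samples=1_000, random_state=0)

# bootstrap defaults to False here; setting it to True makes each tree
# train on a bootstrapped sample instead of the whole dataset.
model = ExtraTreesClassifier(bootstrap=True, random_state=0).fit(X, y)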
• inaccurate modeling
• incorrect conclusions, and more.
This issue had mathematical derivations and many diagrams. Please read it here: https://www.blog.dailydoseofds.com/p/a-visual-guide-to-joint-marginal
The update is in beta. You can read more about it here: Jupyter
Notebook 7.
For every data point, plot the distance to its kth nearest neighbor
(in increasing order).
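A minimal sklearn sketch of this k-distance plot (synthetic data; k is often set to DBSCAN's min_samples):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors

X, _ = make_blobs(n_samples=500, centers=3, random_state=0)

k = 5
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)  # +1: each point is its own nearest neighbor
distances, _ = nn.kneighbors(X)
k_dist = np.sort(distances[:, -1])  # distance to the kth neighbor, in increasing order

plt.plot(k_dist)
plt.xlabel("Points (sorted)")
plt.ylabel("Distance to kth nearest neighbor")
plt.show()  # the elbow of this curve is a good epsilon candidate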
Let’s understand!
It is defined as follows:
Read my full blog on the A-Z of R2, what it is, its limitations, and much more here: Flaws of R2 Metric.
Data science has a diverse glossary. The sheet lists the 75 most
common and important terms that data scientists use almost every
day.
• A:
o Accuracy: The number of correct predictions divided by the total number of predictions.
o Area Under Curve: Metric representing the area
under the Receiver Operating Characteristic (ROC)
curve, used to evaluate classification models.
o ARIMA: Autoregressive Integrated Moving
Average, a time series forecasting method.
• B:
o Bias: The difference between the true value and the
predicted value in a statistical model.
o Bayes Theorem: Probability formula that calculates
the likelihood of an event based on prior knowledge.
o Binomial Distribution: Probability distribution that
models the number of successes in a fixed number of
independent Bernoulli trials.
• C:
o Clustering: Grouping data points based on
similarities.
o Confusion Matrix: Table used to evaluate the
performance of a classification model.
o Cross-validation: Technique to assess model
performance by dividing data into subsets for training
and testing.
• D:
o Decision Trees: Tree-like model used for
classification and regression tasks.
o Dimensionality Reduction: Process of reducing the
number of features in a dataset while preserving
important information.
o Discriminative Models: Models that learn the
boundary between different classes.
• E:
• R:
o Random Forest: Ensemble learning method using
multiple decision trees to make predictions.
o Recall: Proportion of true positive predictions among
all actual positive instances in a classification model.
o ROC Curve (Receiver Operating Characteristic
Curve): Graph showing the performance of a binary
classifier at different thresholds.
• S:
o SVM (Support Vector Machine): Supervised
machine learning algorithm used for classification
and regression.
o Standardisation: Scaling data to have a mean of 0
and a standard deviation of 1.
o Sampling: Process of selecting a subset of data
points from a larger dataset.
• T:
o t-SNE (t-Distributed Stochastic Neighbor
Embedding): Dimensionality reduction technique for
visualizing high-dimensional data in lower
dimensions.
o t-distribution: Probability distribution used in
hypothesis testing when the sample size is small.
o Type I/II Error: Type I error is a false positive, and
Type II error is a false negative in hypothesis testing.
• U:
o Underfitting: When a model is too simple to capture
the underlying patterns in the data.
o UMAP (Uniform Manifold Approximation and
Projection): Dimensionality reduction technique for
visualizing high-dimensional data.
o Uniform Distribution: Probability distribution
where all outcomes are equally likely.
• V:
• GloVe
• Word2Vec
• FastText, etc.
For instance, running the vector operation (King - Man) + Woman would
return a vector near the word “Queen”.
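A minimal sketch with gensim's downloadable pretrained GloVe vectors:

import gensim.downloader as api

model = api.load("glove-wiki-gigaword-100")  # pretrained GloVe vectors

# (King - Man) + Woman: add "king" and "woman", subtract "man"
print(model.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
# typically [('queen', ...)]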
BERT pre-training
Training DistilBERT
• A financial institution
• Sloping land
• A Long Ridge, and more.
• Initialize centroids
• Find the nearest centroid for each point
• Reassign centroids
• Repeat until convergence
O(i * n * k * d), where i is the number of iterations, n the number of points, k the number of clusters, and d the number of dimensions.
In fact, you can add another factor here: "the repetition factor", where we run the whole clustering repeatedly to avoid convergence issues.
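As a minimal NumPy sketch of this loop (illustrative only), with the i, n, k, and d factors visible:

import numpy as np

def kmeans(X, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]  # initialize centroids
    for _ in range(iters):  # i iterations
        # O(n * k * d): distance of every point to every centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)  # nearest centroid for each point
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])  # reassign centroids
        if np.allclose(new_centroids, centroids):  # repeat until convergence
            break
        centroids = new_centroids
    return centroids, labels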
For more info, here’s the paper that discussed it: Very Sparse
Random Projections.
Today, I want to build on that and help you cultivate what I think
is one of the MOST overlooked and underappreciated skills in
developing linear models.
I can guarantee that harnessing this skill will give you so much
clarity and intuition in the modeling stages.
Recap
Having a non-negative response in the training data does not stop
linear regression from outputting negative values.
While this is not an issue per se, negative outputs may not make
sense in cases where you can never have such outcomes.
For instance:
It’s just that, in this specific use case, the data generation process
didn’t perfectly align with what linear regression is designed to
handle.
For instance:
See…
I am confident this will help you get rid of that annoying and
helpless habit of relentlessly using a specific sklearn algorithm
without truly knowing why you are using it.
They would:
Why?
Even though these embeddings have the same length, they live in different, unaligned spaces.
But this is highly unlikely because there are infinitely many ways
axes may orient relative to each other.
Anyway.
Let’s understand!
For instance:
To summarize…
In Probability:
In Likelihood:
This includes:
• column statistics,
• data type info,
• frequency,
• distribution chart, and
• missing stats.
In other words, there are always some specific methods that are
most widely used.
Having used NumPy for over 4 years, I can confidently say that
you will use these methods 95% of the time working with
NumPy.
If you are looking for an in-depth guide, you can read my article
on Medium here: Medium NumPy article.
This involves:
When the data comes out of the last hidden layer and progresses towards the output layer for another transformation, EVERY activation function in the network has already been applied.
And while progressing from the last hidden layer to the output
layer, the data will pass through one final transformation before it
spits some output.
But given that the transformation from the last hidden layer to the
output layer is entirely linear (or without any activation function),
there is no further scope for non-linearity in the network.
To summarize…
While transforming the data through all its hidden layers and just
before reaching the output layer, a neural network is constantly
hustling to project the data to a latent space where it becomes
linearly separable.
Once it does, the output layer can easily handle the data.
It’s simple.
Feel free to respond with any queries that you may have.
It is because the log function grows faster for lower values. Thus,
it stretches out the lower values more than the higher values.
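A quick numeric sketch of this effect on right-skewed data:

import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)
x = rng.lognormal(mean=0, sigma=1, size=10_000)  # right-skewed data

print(skew(x))          # strongly positive
print(skew(np.log(x)))  # roughly 0 after the log transform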
Graph of log(x)
Thus,
• In case of left-skewness:
This is because:
Here, jitter (strip) plots and KDE plots are immensely helpful.
These include:
P.S. If the name “Raincloud plot” isn’t obvious yet, it comes from
the visual appearance of the plot:
Evaluate
This adds another metric to my recently proposed methods: Clustering Performance Without Ground Truth Labels.
The effectiveness of DBCV is also evident from the image below:
This time, the score for the clustering output of KMeans is worse, and
that of density-based clustering is higher.
Leave-One-Out Cross-Validation
K-Fold Cross-Validation
Rolling Cross-Validation
Blocked Cross-Validation
Stratified Cross-Validation
Over to you: What are some other ways you use to prevent decision trees from overfitting?
Silhouette Coefficient:
1. For every point, find the average distance to all other points within its cluster (A).
2. For every point, find the average distance to all points in the nearest cluster (B).
3. The score for a point is (B - A) / max(B, A).
4. Compute the average of all individual scores to get the overall clustering score.
5. It is computed on all samples; thus, it is computationally expensive.
6. A higher score indicates better, well-separated clusters.
I covered this here if you wish to understand the Silhouette Coefficient with diagrams: The Limitations Of Elbow Curve And What You Should Replace It With.
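A minimal sklearn sketch (synthetic data for illustration):

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

# average of all per-point scores; ranges from -1 to 1, higher is better
print(silhouette_score(X, labels))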
Calinski-Harabasz Index:
Silhouette Coefficient
Calinski-Harabasz Index
Davies-Bouldin Index
Over to you: What are some other ways to evaluate clustering performance in such situations?
The visual presents a syntax comparison of Polars and Pandas for various operations.
It is clear that the Polars API is extremely similar to Pandas'.
Thus, contrary to common belief, the transition from Pandas to Polars is not that intimidating or tedious.
If you know Pandas, you (mostly) know Polars.
In most cases, the transition will require minimal code updates.
But you get to experience immense speed-ups, which you don't get with Pandas.
I recently did a comprehensive benchmarking of Pandas and Polars, which you can read here: Pandas vs Polars — Run-time and Memory Comparison.
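A small sketch of that similarity (a hypothetical group-and-sum in each library; note that recent Polars versions name the method group_by):

import pandas as pd
import polars as pl

pdf = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "x"]})
pldf = pl.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "x"]})

print(pdf.groupby("b")["a"].sum())                # Pandas
print(pldf.group_by("b").agg(pl.col("a").sum()))  # Polars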
Over to you: What are some other faster alternatives to Pandas that you are aware of?
Instead, you can replace them with Dot plots. They are like scatter
plots but with one categorical and one continuous axis.
If you have ever struggled to understand CNNs, you should use CNN Explainer.
It is an incredible interactive tool to visualize the internal workings of a CNN.
Essentially, you can play around with different layers of a CNN and visualize how it applies different operations.
In the image above, the scale of Income could massively impact the
overall prediction. Scaling (or standardizing) the data to a similar
range can mitigate this and improve the model’s performance.
Yet, contrary to common belief, these transformations NEVER change the underlying distribution.
Instead, they just alter the range of values.
Thus:
1. Normal distribution → stays Normal
2. Uniform distribution → stays Uniform
3. Skewed distribution → stays Skewed
4. and so on…
We can also verify this with a quick check:
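Here is a minimal numeric sketch using scipy's skewness measure on synthetic, right-skewed data:

import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=10_000)  # a skewed distribution

z = (x - x.mean()) / x.std()  # standardize to mean 0, std 1

print(skew(x), skew(z))  # skewness is unchanged: the shape stays the same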