week_11_features_categorical
We need these new types of features because ULTIMATELY linear models require
MULTIPLYING SLOPES with NUMERIC INPUTS!!!!!!
Import Modules
In [1]: import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns # used for the exploratory plots below
import statsmodels.formula.api as smf # used to fit the linear models below
Read data
In [3]: df = pd.read_csv('week_11_categorical_input.csv')
In [4]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 155 entries, 0 to 154
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 x 155 non-null object
1 y 155 non-null float64
dtypes: float64(1), object(1)
memory usage: 2.5+ KB
In [5]: df.nunique()
Out[5]: x 10
y 155
dtype: int64
In [6]: df.isna().sum()
Out[6]: x 0
y 0
dtype: int64
Explore data
In [8]: sns.displot(data = df, x='y', kind='hist')
plt.show()
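The code for a second exploratory plot was not captured in this export. A plausible sketch, purely an assumption, is that the cell counted the observations per category:

sns.countplot(data = df, x='x')
plt.show()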
Let's now examine if the AVERAGE OUTPUT is different across the categories of the
categorical input!!!
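The cell producing this comparison is not shown in the export. A plausible sketch, assuming a seaborn pointplot of the per-category averages:

# plot the average y (with confidence intervals) for each category of x
sns.catplot(data = df, x='x', y='y', kind='point')
plt.show()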
Why does this matter when we are talking about LINEAR MODELS?!?!?!
We have learned that LINEAR MODELS are fundamentally MODELING the AVERAGE
OUTPUT!!!!!!
When you have categorical inputs...the goal of the LINEAR MODEL is to PREDICT the
AVERAGE OUTPUT PER CATEGORY!!!!!!!!
In [12]: df
Out[12]: x y
0 A 103.324129
1 G 96.530846
2 D 95.772873
3 C 100.190541
4 J 129.608242
... ... ...
150 B 92.286905
151 I 74.533121
152 F 46.537280
153 C 83.826336
154 E 176.863684

[155 rows x 2 columns]
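The cell that fit fit_a is not shown in this export. A likely reconstruction, assuming the statsmodels formula interface (the R-style formula automatically dummy-encodes the object column x with treatment coding):

# 1 intercept + 9 dummy slopes = 10 parameters, matching fit_a.params.size
fit_a = smf.ols(formula='y ~ x', data=df).fit()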
In [14]: fit_a.params
In [15]: fit_a.params.size
Out[15]: 10
But when you have NON-NUMERIC inputs...FEATURES must be DERIVED from the INPUT to
allow MULTIPLYING each FEATURE by a SLOPE!!!
If the DUMMY equals 1, then the categorical variable EQUALS that category.
If the DUMMY equals 0, then the categorical variable does NOT equal
that category!!!
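As a quick illustration (not a cell from the original notebook), pandas can derive these dummy features directly:

# one column per category; drop_first=True removes the reference
# category A, leaving K - 1 = 9 dummy features
dummy_features = pd.get_dummies(df.x, drop_first=True)
dummy_features.head()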
In [16]: df.x.nunique()
Out[16]: 10
In [17]: df.x.nunique() - 1
Out[17]: 9
In [19]: df.groupby('x').\
aggregate(avg_y = ('y', 'mean')).\
reset_index()
Out[19]: x avg_y
0 A 98.727693
1 B 103.201995
2 C 101.903418
3 D 100.522801
4 E 153.381749
5 F 53.921726
6 G 97.511650
7 H 99.046431
8 I 85.686113
9 J 119.564021
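The cell defining avg_y_at_xA is not captured in the export. A plausible reconstruction:

# average output for the reference category A
avg_y_at_xA = df.loc[df.x == 'A', 'y'].mean()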
In [21]: avg_y_at_xA
Out[21]: 98.72769293463894
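The cell creating df_summary is also not captured. It presumably stores the groupby result computed above:

df_summary = df.groupby('x').\
aggregate(avg_y = ('y', 'mean')).\
reset_index()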
In [23]: df_summary
Out[23]: x avg_y
0 A 98.727693
1 B 103.201995
2 C 101.903418
3 D 100.522801
4 E 153.381749
5 F 53.921726
6 G 97.511650
7 H 99.046431
8 I 85.686113
9 J 119.564021
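Between Out[23] and the display below, a column holding the reference category's average was evidently attached. A plausible reconstruction (the column name avg_y_at_xA is an assumption):

# attach the reference average so each row can be compared against it
df_summary = df_summary.assign(avg_y_at_xA = avg_y_at_xA)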
In [25]: df_summary
Out[25]: x avg_y avg_y_at_xA
0 A 98.727693 98.727693
1 B 103.201995 98.727693
2 C 101.903418 98.727693
3 D 100.522801 98.727693
4 E 153.381749 98.727693
5 F 53.921726 98.727693
6 G 97.511650 98.727693
7 H 99.046431 98.727693
8 I 85.686113 98.727693
9 J 119.564021 98.727693
In [28]: df_summary.round(3)
In [29]: fit_a.params
When DUMMY VARIABLES are DEFINED, a REFERENCE category must be CHOSEN!!! All
DUMMY SLOPES are calculated RELATIVE to this REFERENCE POINT!!! By default the
REFERENCE CATEGORY is the FIRST alphabetical category!!!
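Concretely, each DUMMY SLOPE is that category's average MINUS the reference average. Using the Out[19] values above, the slope for category B should be roughly 103.202 - 98.728 = 4.474. A quick check (x[T.B] is statsmodels' default treatment-coding name for the B dummy):

# the dummy slope for B equals avg_y(B) - avg_y(A)
print(fit_a.params['x[T.B]'])
print(103.201995 - 98.727693)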
In [30]: fit_a.pvalues
fig, ax = plt.subplots()
# the model being visualized; assumed to be fit_a from above
mod = fit_a
ax.errorbar( y=mod.params.index,
x=mod.params,
xerr = 2 * mod.bse,
fmt='o', color='k', ecolor='k', elinewidth=2, ms=10)
ax.set_xlabel('coefficient value')
plt.show()
Predictions
If we PREDICT with our MODEL, it can ONLY predict at the OBSERVED CATEGORIES!!!!
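The cell building input_grid is not shown. Presumably it is a one-column DataFrame of the unique observed categories:

# prediction grid: one row per observed category, sorted alphabetically
input_grid = pd.DataFrame({'x': np.sort(df.x.unique())})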
In [45]: input_grid
Out[45]: x
0 A
1 B
2 C
3 D
4 E
5 F
6 G
7 H
8 I
9 J
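The prediction call itself (In [46]) is not captured. It is presumably:

# predict the average output for each category in the grid
fit_a.predict(input_grid)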
Out[46]: 0 98.727693
1 103.201995
2 101.903418
3 100.522801
4 153.381749
5 53.921726
6 97.511650
7 99.046431
8 85.686113
9 119.564021
dtype: float64
In [48]: df_summary
An alternative
An alternative approach is to REMOVE the intercept so that EVERY category receives its
own DUMMY feature. This is known as ONE-HOT encoding!!!
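The cell fitting fit_b is not shown. A likely reconstruction, removing the intercept with the formula's - 1 term:

# without an intercept, ALL 10 categories receive a dummy feature,
# and each coefficient equals that category's average output
fit_b = smf.ols(formula='y ~ x - 1', data=df).fit()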
In [50]: fit_b.params
In [51]: fit_a.params
The estimated coefficients when ONE HOT encoding is used...correspond DIRECTLY to the
AVERAGE OUTPUT per category!!
In [56]: fit_b.params.reset_index(drop=True)
Out[56]: 0 98.727693
1 103.201995
2 101.903418
3 100.522801
4 153.381749
5 53.921726
6 97.511650
7 99.046431
8 85.686113
9 119.564021
dtype: float64
In [59]: df_summary
In [61]: df_summary
In [63]: fit_b.pvalues
One Hot encoding causes you to NO LONGER be able to USE the p-values to identify statistical
significance!!!!!!!!!!!! Each ONE HOT coefficient's p-value tests whether that category's
AVERAGE OUTPUT equals ZERO, NOT whether the categories DIFFER from the reference or
from each other!!!
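A quick side-by-side check of the two encodings (a sketch):

# fit_a p-values: does each category differ from the reference A?
print(fit_a.pvalues.round(4))
# fit_b p-values: does each category's average differ from ZERO?
print(fit_b.pvalues.round(4))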