hw-3
PID: A17054351
Instructions
Submit your solutions online on Gradescope
Look at the detailed instructions here
I certify that the following write-up is my own work, and have abided by the UCSD Academic Integrity Guidelines.
Yes
No
In [ ]: import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats
import statsmodels.api as sm
import statsmodels.formula.api as smf
import sklearn
## Configurations
%matplotlib inline
file:///Users/dylanoquendo/Downloads/hw-3.html 1/20
3/17/24, 12:00 AM hw-3
Question 1
Linear regression
The data folder contains the housing.csv dataset which contains housing prices in California from the 1990 California
census. The objective is to predict the median house price for California districts based on various features. The features are
the following:
1. longitude : A measure of how far west a house is; a higher value is farther west
2. latitude : A measure of how far north a house is; a higher value is farther north
3. housing_median_age : Median age of a house within a block; a lower number is a newer building
4. total_rooms : Total number of rooms within a block
5. total_bedrooms : Total number of bedrooms within a block
6. population : Total number of people residing within a block
7. households : Total number of households, a group of people residing within a home unit, for a block
8. median_income : Median income for households within a block of houses
9. median_house_value : Median house value for households within a block
10. ocean_proximity : Location of the house w.r.t ocean/sea
a. Load the dataset and display the first 5 rows of the dataset.
In [ ]: path = 'https://raw.githubusercontent.com/ucsd-math189-wi24/materials/main/data/housing.csv'
df = pd.read_csv(path)
df.head()
Out[ ]: longitude latitude housing_median_age total_rooms total_bedrooms population households median_income med
0 -122.23 37.88 41.0 880.0 129.0 322.0 126.0 8.3252
1 -122.22 37.86 21.0 7099.0 1106.0 2401.0 1138.0 8.3014
2 -122.24 37.85 52.0 1467.0 190.0 496.0 177.0 7.2574
3 -122.25 37.85 52.0 1274.0 235.0 558.0 219.0 5.6431
4 -122.25 37.85 52.0 1627.0 280.0 565.0 259.0 3.8462
b. Describe the data type (e.g., categorical, discrete quantitative, etc.) of each variable in the dataset. If you
identify any categorical variables, explicitly convert them to categorical variables in your pandas dataframe.
In [ ]: df.dtypes
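Part b also asks for an explicit conversion of any categorical variables. In this dataset only ocean_proximity is non-numeric, so it is the one to convert. A minimal sketch, using a hypothetical mini-frame in place of the full housing data:

```python
import pandas as pd

# Hypothetical mini-frame standing in for the housing data
df_demo = pd.DataFrame({
    'ocean_proximity': ['NEAR BAY', 'INLAND', 'ISLAND'],
    'median_income': [8.3, 5.6, 3.8],
})

# ocean_proximity is the only non-numeric column, so convert it to categorical
df_demo['ocean_proximity'] = df_demo['ocean_proximity'].astype('category')
print(df_demo.dtypes['ocean_proximity'])  # category
```

After this conversion, `smf.ols` treats ocean_proximity as a factor and creates the `T.INLAND`, `T.ISLAND`, etc. dummy columns automatically.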
c. Fit a linear regression model to predict the median_house_value based on all other covariates.
In [ ]: full_model = smf.ols(formula="median_house_value ~ longitude + latitude + housing_median_age + total_rooms + total_bedrooms + population + households + median_income + ocean_proximity", data=df).fit()
print(full_model.summary())
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 7.24e+05. This might indicate that there are
strong multicollinearity or other numerical problems.
In [ ]: residuals = full_model.resid
plt.figure(figsize=(8, 6))
plt.scatter(full_model.fittedvalues, residuals, alpha=0.5)
plt.xlabel('Fitted Values')
plt.ylabel('Residuals')
plt.title('Residual Plot')
plt.axhline(y=0, color='r', linestyle='--')
plt.grid(True)
plt.show()
d. Based on the summary of the linear regression model, do you think this is a good fit for the data? Explain your
answer.
No. The condition number is extremely large (7.24e+05), which suggests strong multicollinearity, and the residual plot shows a clear pattern rather than random scatter around zero.
e. Comment on the model assumptions and to what extent they are satisfied or not satisfied.
For smf OLS regression, we assume that the relationship between variables is linear, meaning that changes in one
variable are consistently associated with changes in another. We also assume that the errors in our predictions are
independent, meaning one error doesn't predict another, and that the spread of these errors remains consistent across all
values of the independent variables. Additionally, we assume that the errors follow a normal distribution, that there is no
perfect relationship between independent variables (multicollinearity), and that errors don't follow a pattern over time or
across observations (autocorrelation).
Based on the summary notes, the no-multicollinearity assumption appears to be violated, and my residual plot shows a clear pattern, which violates the assumption of homoscedasticity.
f. Compute the variance inflation factor (VIF) for each covariate. What do you observe?
In [ ]: from statsmodels.stats.outliers_influence import variance_inflation_factor

exog = full_model.model.exog
names = full_model.params.index
for i in range(1, exog.shape[1]):
    print(f'VIF: {names[i]}: {variance_inflation_factor(exog, i): .3f}')
g. Drop the covariate(s) with a variance inflation factor greater than 5 and fit the linear regression model again.
In [ ]: new_model = smf.ols(formula="median_house_value ~ housing_median_age + median_income + ocean_proximity", data=df).fit()
print(new_model.summary())
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.02e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
h. Based on the summary of the regression model new_model interpret the coefficients of the covariates.
Intercept: The intercept coefficient represents the predicted median house value when all other predictors are zero. In
this case, it's $51,510.
ocean_proximity Categories:
For each category of ocean_proximity, such as INLAND, ISLAND, NEAR BAY, and NEAR OCEAN, the coefficients
represent the difference in median house value compared to the reference category (which is not shown in the
summary but would be one of the categories if the variable is categorical with multiple levels).
For example, the coefficient for ocean_proximity[T.INLAND] is -71,650. This means that, on average, inland houses have a median value that is $71,650 lower than the reference
category.
Similarly, the coefficients for ocean_proximity[T.ISLAND], ocean_proximity[T.NEAR BAY], and
ocean_proximity[T.NEAR OCEAN] indicate the difference in median house value for houses located on islands, near
the bay, and near the ocean, respectively, compared to the reference category.
housing_median_age: For every one-unit increase in housing median age, the median house value is expected to
increase by $926.82, on average.
median_income: For every one-unit increase in median income, the median house value is expected to increase by
$38,160, on average.
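The interpretations above can be checked by assembling a prediction by hand from the summary coefficients. A sketch for a hypothetical inland block (coefficient values rounded as read from the summary; median_income in this dataset is expressed in tens of thousands of dollars):

```python
# Coefficients as read (rounded) from the new_model summary above
intercept = 51510.0
b_age = 926.82        # per year of housing_median_age
b_income = 38160.0    # per unit of median_income
b_inland = -71650.0   # offset for ocean_proximity = INLAND

# Hypothetical block: 30-year-old inland housing, median income 5
pred = intercept + b_age * 30 + b_income * 5 + b_inland
print(f'${pred:,.0f}')  # → $198,465
```

Each coefficient contributes additively, which is exactly what "holding other covariates constant" means for a linear model.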
i. Holding all other covariates constant, which of the ocean_proximity categories do you expect to find a
house with the highest median house value? Why?
In [ ]: df.ocean_proximity.value_counts()
Out[ ]: ocean_proximity
<1H OCEAN 9136
INLAND 6551
NEAR OCEAN 2658
NEAR BAY 2290
ISLAND 5
Name: count, dtype: int64
ISLAND, since its coefficient is 1.849e+05: holding all other covariates constant, the median house value in island areas is about $184,900 higher than in the reference category (<1H OCEAN). This coefficient is the largest among all the ocean_proximity categories.
Question 2
For this question, we are going to use the abortion dataset which consists of Abortion Opinions in the General Social
Survey (GSS) from 1977 to 2018. The article related to the dataset can be found here.
The data has been preprocessed and is available in the data folder as abortion.csv . The dataset contains the following
columns:
1. abortion : Do you think that abortion should be legal for any reason?
2. year : Year of the survey
3. age : Respondent's age
4. sex : Respondent's sex
5. race : Respondent's race
6. education : How many years of education has the respondent completed
7. relactiv : Self-reported religiosity
8. pid : Respondent's political party identification (0: strong democrat ... 6: strong republican)
a. Load the dataset and display the first 5 rows of the dataset.
In [ ]: path = 'https://raw.githubusercontent.com/ucsd-math189-wi24/materials/main/data/abortion.csv'
df = pd.read_csv(path)
df.head()
b. Summarize the data type (e.g., categorical, discrete quantitative, etc.) of each variable in the dataset. If you
identify any categorical variables, explicitly convert them to categorical variables in your pandas dataframe.
year is categorical
age is discrete quantitative
race is categorical
sex is categorical
educ is discrete quantitative
relactiv is categorical
pid is categorical
abortion is categorical
In [ ]: df.year = df.year.astype('category')
df.race = df.race.astype('category')
df.sex = df.sex.astype('category')
df.relactiv = df.relactiv.astype('category')
df.pid = df.pid.astype('category')
df.abortion = df.abortion.astype('category')
c. Fit a logistic regression model to predict the abortion based on all other covariates.
In [ ]: df['abortion'] = df['abortion'].cat.codes
df.dtypes
In [ ]: model1 = smf.logit('abortion ~ year + age + race+ sex + educ + relactiv + pid', data=df).fit()
print(model1.summary())
d. Identify the covariates which are statistically significant at a 15% significance level.
race[T.Other], sex[T.Male], all of the relactiv levels, all of the pid levels, age, and educ were significant at the 15% level.
e. Based on the variables you identified in part d, fit a new logistic regression model only including those
covariates.
In [ ]: model2 = smf.logit('abortion ~ age + race + sex + educ + relactiv + pid', data=df).fit()
print(model2.summary())
f. Include an interaction term between sex and pid in your logistic regression model.
In [ ]: df['sex'] = df['sex'].cat.codes
df['pid'] = df['pid'].cat.codes
df['sex_pid_interaction'] = df['sex'] * df['pid']
In [ ]: model3 = smf.logit('abortion ~ age + race + sex + educ + relactiv + pid + sex_pid_interaction', data=df).fit()
print(model3.summary())
g. Is there sufficient evidence to conclude that the sex moderates the effect of pid on abortion opinion?
Explain your answer.
The interaction coefficient of 0.0587 is positive, suggesting that the association between pid and abortion opinion may differ by sex. However, its p-value of 0.111 is not statistically significant at conventional levels (e.g., α = 0.05), so there is not sufficient evidence to conclude that sex moderates the effect of pid.
h. Interpret each coefficient associated with the covariates in the new logistic regression model, model3 .
Categorical Variables (Race and Sex): For categorical variables like race and sex, each coefficient represents the
difference in log-odds of the outcome relative to the reference category. For example:
For race: The coefficient for 'race[T.Other]' (-0.2210) represents the difference in log-odds of the outcome between
the 'Other' race category and the reference category (e.g., 'White' race), holding other variables constant.
For sex: The coefficient for 'sex[T.Male]' (0.2159) represents the difference in log-odds of the outcome between
males and females, holding other variables constant.
Ordinal Variables (Education, PID): For ordinal variables like education and PID (political identification), each
coefficient represents the change in log-odds of the outcome associated with a one-unit increase in the predictor,
holding other variables constant. For example:
For education: The coefficient of 0.1292 indicates that for each one-unit increase in education level, the log-odds of the
outcome increase by 0.1292, holding other variables constant.
For PID: The coefficient of -0.2825 indicates that for each one-unit increase in political identification, the log-odds of the
outcome decrease by 0.2825, holding other variables constant.
Interaction Term (Sex_Pid_Interaction): The coefficient for the interaction term represents the change in the log-odds of
the outcome associated with the interaction between 'sex' and 'pid' variables. In this case, the coefficient of 0.0587
implies the change in the log-odds of the outcome for each unit change in the interaction between 'sex' and 'pid', holding
other variables constant.
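Log-odds coefficients are often easier to read after exponentiating them into odds ratios. A short sketch using the (rounded) values quoted above:

```python
import numpy as np

# Coefficients as read (rounded) from the model3 summary above
coefs = {'sex[T.Male]': 0.2159, 'educ': 0.1292, 'pid': -0.2825}

# exp(coef) converts an additive log-odds change into a multiplicative
# change in the odds of supporting legal abortion
for name, b in coefs.items():
    print(f'{name}: odds ratio {np.exp(b):.3f}')
```

For example, exp(-0.2825) ≈ 0.754 means each one-unit move toward strong Republican multiplies the odds of a "yes" response by about 0.75, holding other variables constant.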
i. Print the confusion matrix and report the classification accuracy of model3 .
In [ ]: ...