The document is a homework assignment for Math 189, focusing on linear regression analysis using a dataset of housing prices in California. It includes instructions for submission, data preparation, model fitting, and evaluation of model assumptions, as well as variance inflation factor calculations to address multicollinearity. The student, Dylan Oquendo, is required to analyze the data and provide insights based on the regression results.


3/17/24, 12:00 AM hw-3

HW-3 • Math 189 • Wi 2024


Due Date: Sat, Mar 16th 2024
NAME: Dylan Oquendo

PID: A17054351

Instructions
Submit your solutions online on Gradescope
Look at the detailed instructions here
I certify that the following write-up is my own work, and have abided by the UCSD Academic Integrity Guidelines.
Yes
No

In [ ]: import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats
import statsmodels.api as sm
import statsmodels.formula.api as smf
import sklearn

## Configurations
%matplotlib inline


/var/folders/px/_70k765d22d1hlv8t1lktgw40000gn/T/ipykernel_19238/1996746605.py:2: DeprecationWarning:
Pyarrow will become a required dependency of pandas in the next major release of pandas (pandas 3.0),
(to allow more performant data types, such as the Arrow string type, and better interoperability with other
libraries)
but was not found to be installed on your system.
If this would cause problems for you,
please provide us feedback at https://github.com/pandas-dev/pandas/issues/54466

import pandas as pd

Question 1
Linear regression
The data folder contains the housing.csv dataset which contains housing prices in California from the 1990 California
census. The objective is to predict the median house price for California districts based on various features. The features are
the following:
1. longitude : A measure of how far west a house is; a higher value is farther west
2. latitude : A measure of how far north a house is; a higher value is farther north
3. housing_median_age : Median age of a house within a block; a lower number is a newer building
4. total_rooms : Total number of rooms within a block
5. total_bedrooms : Total number of bedrooms within a block
6. population : Total number of people residing within a block
7. households : Total number of households, a group of people residing within a home unit, for a block
8. median_income : Median income for households within a block of houses
9. median_house_value : Median house value for households within a block
10. ocean_proximity : Location of the house w.r.t ocean/sea
a. Load the dataset and display the first 5 rows of the dataset.
In [ ]: path = 'https://raw.githubusercontent.com/ucsd-math189-wi24/materials/main/data/housing.csv'
df = pd.read_csv(path)


df.head()

Out[ ]: longitude latitude housing_median_age total_rooms total_bedrooms population households median_income med
0 -122.23 37.88 41.0 880.0 129.0 322.0 126.0 8.3252
1 -122.22 37.86 21.0 7099.0 1106.0 2401.0 1138.0 8.3014
2 -122.24 37.85 52.0 1467.0 190.0 496.0 177.0 7.2574
3 -122.25 37.85 52.0 1274.0 235.0 558.0 219.0 5.6431
4 -122.25 37.85 52.0 1627.0 280.0 565.0 259.0 3.8462

b. Describe the data type (e.g., categorical, discrete quantitative, etc.) of each variable in the dataset. If you
identify any categorical variables, explicitly convert them to categorical variables in your pandas dataframe.

longitude is continuous quantitative
latitude is continuous quantitative
housing_median_age is discrete quantitative
total_rooms is discrete quantitative
total_bedrooms is discrete quantitative
population is discrete quantitative
households is discrete quantitative
median_income is continuous quantitative
median_house_value is continuous quantitative
ocean_proximity is categorical
In [ ]: df.ocean_proximity = df.ocean_proximity.astype('category')

In [ ]: df.dtypes


Out[ ]: longitude float64


latitude float64
housing_median_age float64
total_rooms float64
total_bedrooms float64
population float64
households float64
median_income float64
median_house_value float64
ocean_proximity category
dtype: object
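When ocean_proximity enters the smf.ols formula below, statsmodels expands it into dummy variables measured against a reference level, which is what produces the ocean_proximity[T.INLAND]-style terms in the regression summaries. A minimal sketch of the same treatment coding in plain pandas (hypothetical toy column, not the full dataset):

```python
# Sketch: expanding a categorical column into reference-coded dummies.
import pandas as pd

toy = pd.DataFrame({"ocean_proximity": pd.Categorical(
    ["<1H OCEAN", "INLAND", "ISLAND", "<1H OCEAN"])})
dummies = pd.get_dummies(toy.ocean_proximity, prefix="ocean_proximity",
                         drop_first=True)
print(dummies.columns.tolist())
# the first level (<1H OCEAN) is dropped and becomes the reference,
# absorbed into the intercept of the regression
```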

c. Fit a linear regression model to predict the median_house_value based on all other covariates.
In [ ]: full_model = smf.ols(formula="median_house_value ~ longitude + latitude + housing_median_age + total_rooms + total_bedrooms + population + households + median_income + ocean_proximity", data=df).fit()
print(full_model.summary())


OLS Regression Results


==============================================================================
Dep. Variable: median_house_value R-squared: 0.646
Model: OLS Adj. R-squared: 0.646
Method: Least Squares F-statistic: 3112.
Date: Sat, 16 Mar 2024 Prob (F-statistic): 0.00
Time: 22:51:58 Log-Likelihood: -2.5655e+05
No. Observations: 20433 AIC: 5.131e+05
Df Residuals: 20420 BIC: 5.132e+05
Df Model: 12
Covariance Type: nonrobust
=================================================================================================
coef std err t P>|t| [0.025 0.975]
-------------------------------------------------------------------------------------------------
Intercept -2.27e+06 8.8e+04 -25.791 0.000 -2.44e+06 -2.1e+06
ocean_proximity[T.INLAND] -3.928e+04 1744.258 -22.522 0.000 -4.27e+04 -3.59e+04
ocean_proximity[T.ISLAND] 1.529e+05 3.07e+04 4.974 0.000 9.26e+04 2.13e+05
ocean_proximity[T.NEAR BAY] -3954.0516 1913.339 -2.067 0.039 -7704.350 -203.753
ocean_proximity[T.NEAR OCEAN] 4278.1343 1569.525 2.726 0.006 1201.739 7354.530
longitude -2.681e+04 1019.651 -26.296 0.000 -2.88e+04 -2.48e+04
latitude -2.548e+04 1004.702 -25.363 0.000 -2.75e+04 -2.35e+04
housing_median_age 1072.5200 43.886 24.439 0.000 986.501 1158.540
total_rooms -6.1933 0.791 -7.825 0.000 -7.745 -4.642
total_bedrooms 100.5563 6.869 14.640 0.000 87.093 114.019
population -37.9691 1.076 -35.282 0.000 -40.078 -35.860
households 49.6173 7.451 6.659 0.000 35.012 64.222
median_income 3.926e+04 338.005 116.151 0.000 3.86e+04 3.99e+04
==============================================================================
Omnibus: 5049.292 Durbin-Watson: 0.977
Prob(Omnibus): 0.000 Jarque-Bera (JB): 19123.138
Skew: 1.197 Prob(JB): 0.00
Kurtosis: 7.090 Cond. No. 7.24e+05
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 7.24e+05. This might indicate that there are
strong multicollinearity or other numerical problems.

In [ ]: residuals = full_model.resid

# Plot the residuals


plt.figure(figsize=(8, 6))
plt.scatter(full_model.fittedvalues, residuals, alpha=0.5)
plt.xlabel('Fitted Values')
plt.ylabel('Residuals')
plt.title('Residual Plot')
plt.axhline(y=0, color='r', linestyle='--')
plt.grid(True)
plt.show()


d. Based on the summary of the linear regression model, do you think this is a good fit for the data? Explain your
answer.
No: the condition number is extremely large (7.24e+05), which points to multicollinearity, and the residual plot shows a clear pattern rather than random scatter around zero.
e. Comment on the model assumptions and to what extent they are satisfied or not satisfied.

OLS regression assumes that the relationship between the response and the covariates is linear, that the errors are independent (one error does not predict another), that the error variance is constant across all values of the covariates (homoscedasticity), that the errors are normally distributed, and that no covariate is a perfect linear combination of the others (no perfect multicollinearity).
Based on the summary notes, the no-multicollinearity assumption appears to be violated, and the pattern in my residual plot suggests that homoscedasticity is violated as well.
f. Compute the variance inflation factor (VIF) for each covariate. What do you observe?
In [ ]: from statsmodels.stats.outliers_influence import variance_inflation_factor

exog = full_model.model.exog
names = full_model.params.index
for i in range(1, exog.shape[1]):
    print(f'VIF: {names[i]}: {variance_inflation_factor(exog, i): .3f}')


VIF: ocean_proximity[T.INLAND]: 2.860


VIF: ocean_proximity[T.ISLAND]: 1.002
VIF: ocean_proximity[T.NEAR BAY]: 1.567
VIF: ocean_proximity[T.NEAR OCEAN]: 1.197
VIF: longitude: 18.091
VIF: latitude: 19.969
VIF: housing_median_age: 1.324
VIF: total_rooms: 12.966
VIF: total_bedrooms: 36.310
VIF: population: 6.446
VIF: households: 35.173
VIF: median_income: 1.786

g. Drop the covariate(s) with a variance inflation factor greater than 5 and fit the linear regression model again.
In [ ]: new_model = smf.ols(formula="median_house_value ~ housing_median_age + median_income + ocean_proximity", data=df).fit()
print(new_model.summary())


OLS Regression Results


==============================================================================
Dep. Variable: median_house_value R-squared: 0.597
Model: OLS Adj. R-squared: 0.597
Method: Least Squares F-statistic: 5092.
Date: Sat, 16 Mar 2024 Prob (F-statistic): 0.00
Time: 22:53:37 Log-Likelihood: -2.6049e+05
No. Observations: 20640 AIC: 5.210e+05
Df Residuals: 20633 BIC: 5.211e+05
Df Model: 6
Covariance Type: nonrobust
=================================================================================================
coef std err t P>|t| [0.025 0.975]
-------------------------------------------------------------------------------------------------
Intercept 5.151e+04 2049.957 25.130 0.000 4.75e+04 5.55e+04
ocean_proximity[T.INLAND] -7.165e+04 1249.504 -57.345 0.000 -7.41e+04 -6.92e+04
ocean_proximity[T.ISLAND] 1.849e+05 3.28e+04 5.640 0.000 1.21e+05 2.49e+05
ocean_proximity[T.NEAR BAY] 1.35e+04 1750.783 7.711 0.000 1.01e+04 1.69e+04
ocean_proximity[T.NEAR OCEAN] 1.787e+04 1616.074 11.057 0.000 1.47e+04 2.1e+04
housing_median_age 926.8172 43.459 21.326 0.000 841.635 1011.999
median_income 3.816e+04 281.715 135.448 0.000 3.76e+04 3.87e+04
==============================================================================
Omnibus: 4646.761 Durbin-Watson: 0.815
Prob(Omnibus): 0.000 Jarque-Bera (JB): 12388.581
Skew: 1.211 Prob(JB): 0.00
Kurtosis: 5.922 Cond. No. 2.02e+03
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.02e+03. This might indicate that there are
strong multicollinearity or other numerical problems.

h. Based on the summary of the regression model new_model , interpret the coefficients of the covariates.

Intercept: The intercept coefficient represents the predicted median house value when all other predictors are zero. In
this case, it's $51,510.
ocean_proximity Categories:

For each category of ocean_proximity (INLAND, ISLAND, NEAR BAY, NEAR OCEAN), the coefficient represents the difference in median house value relative to the reference category, <1H OCEAN, which is absorbed into the intercept.
For example, the coefficient for ocean_proximity[T.INLAND] is -71,650. This means that, on average, inland houses have a median value that is $71,650 lower than the reference
category.
Similarly, the coefficients for ocean_proximity[T.ISLAND], ocean_proximity[T.NEAR BAY], and
ocean_proximity[T.NEAR OCEAN] indicate the difference in median house value for houses located on islands, near
the bay, and near the ocean, respectively, compared to the reference category.
housing_median_age: For every one-unit increase in housing median age, the median house value is expected to
increase by $926.82, on average.
median_income: For every one-unit increase in median income, the median house value is expected to increase by
$38,160, on average.

i. Holding all other covariates constant, which of the ocean_proximity categories do you expect to find a
house with the highest median house value? Why?
In [ ]: df.ocean_proximity.value_counts()

Out[ ]: ocean_proximity
<1H OCEAN 9136
INLAND 6551
NEAR OCEAN 2658
NEAR BAY 2290
ISLAND 5
Name: count, dtype: int64

ISLAND, since its coefficient (1.849e+05) is the largest of all the categories: holding the other covariates constant, island districts have a predicted median house value about $184,900 higher than the <1H OCEAN reference category.
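As a sanity check on this interpretation, a prediction from the reduced model can be composed by hand from the printed coefficients (values copied from the summary above; the district characteristics are hypothetical, and median_income in this dataset is recorded in units of roughly $10,000):

```python
# Hand-composed prediction using coefficients copied from the reduced-model
# summary above (illustrative; a hypothetical ISLAND district).
intercept = 51510.0          # Intercept
island = 184900.0            # ocean_proximity[T.ISLAND]
age_coef = 926.8172          # housing_median_age
income_coef = 38160.0        # median_income

# a 30-year-old block with median income 5.0 (i.e., about $50,000)
pred = intercept + island + age_coef * 30 + income_coef * 5.0
print(f"predicted median house value: ${pred:,.0f}")
```

This comes out to about $455,015, consistent with ISLAND being the highest-value category.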


Question 2
For this question, we are going to use the abortion dataset which consists of Abortion Opinions in the General Social
Survey (GSS) from 1977 to 2018. The article related to the dataset can be found here.
The data has been preprocessed and is available in the data folder as abortion.csv . The dataset contains the following
columns:
1. abortion : Do you think that abortion should be legal for any reason?
2. year : Year of the survey
3. age : Respondent's age
4. sex : Respondent's sex
5. race : Respondent's race
6. education : How many years of education has the respondent completed
7. relactiv : Self-reported religiosity
8. pid : Respondent's political party identification (0: strong democrat ... 6: strong republican)

a. Load the dataset and display the first 5 rows of the dataset.
In [ ]: path = 'https://raw.githubusercontent.com/ucsd-math189-wi24/materials/main/data/abortion.csv'
df = pd.read_csv(path)
df.head()


Out[ ]: year age race sex educ relactiv pid abortion


0 2006 50.0 Black Female 13.0 4.0 0.0 1.0
1 2006 50.0 Black Female 12.0 1.0 0.0 1.0
2 2006 20.0 Black Male 14.0 1.0 0.0 1.0
3 2006 29.0 Black Female 12.0 1.0 3.0 1.0
4 2006 23.0 Black Female 16.0 1.0 0.0 1.0

b. Summarize the data type (e.g., categorical, discrete quantitative, etc.) of each variable in the dataset. If you
identify any categorical variables, explicitly convert them to categorical variables in your pandas dataframe.

year is categorical
age is discrete quantitative
race is categorical
sex is categorical
educ is discrete quantitative
relactiv is categorical
pid is categorical
abortion is categorical
In [ ]: df.year = df.year.astype('category')
df.race = df.race.astype('category')
df.sex = df.sex.astype('category')
df.relactiv = df.relactiv.astype('category')
df.pid = df.pid.astype('category')
df.abortion = df.abortion.astype('category')

c. Fit a logistic regression model to predict the abortion based on all other covariates.
In [ ]: df['abortion'] = df['abortion'].cat.codes


df.dtypes

Out[ ]: year category


age float64
race category
sex category
educ float64
relactiv category
pid category
abortion int8
dtype: object

In [ ]: model1 = smf.logit('abortion ~ year + age + race + sex + educ + relactiv + pid', data=df).fit()
print(model1.summary())


Optimization terminated successfully.


Current function value: 0.282245
Iterations 7
Logit Regression Results
==============================================================================
Dep. Variable: abortion No. Observations: 10133
Model: Logit Df Residuals: 10106
Method: MLE Df Model: 26
Date: Sat, 16 Mar 2024 Pseudo R-squ.: 0.1299
Time: 23:56:21 Log-Likelihood: -2860.0
converged: True LL-Null: -3286.8
Covariance Type: nonrobust LLR p-value: 3.396e-163
====================================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------------
Intercept 1.5849 0.242 6.539 0.000 1.110 2.060
year[T.2008] 0.0264 0.130 0.203 0.839 -0.229 0.282
year[T.2010] -0.0550 0.130 -0.422 0.673 -0.310 0.200
year[T.2012] -0.1799 0.127 -1.414 0.157 -0.429 0.069
year[T.2014] -0.0558 0.121 -0.461 0.645 -0.293 0.182
year[T.2016] -0.0979 0.117 -0.837 0.403 -0.327 0.131
year[T.2018] 0.0730 0.127 0.573 0.566 -0.177 0.323
race[T.Other] -0.2239 0.142 -1.573 0.116 -0.503 0.055
race[T.White] 0.0364 0.114 0.321 0.748 -0.186 0.259
sex[T.Male] 0.2191 0.073 3.017 0.003 0.077 0.361
relactiv[T.2.0] 0.2103 0.198 1.063 0.288 -0.177 0.598
relactiv[T.3.0] -0.4240 0.124 -3.432 0.001 -0.666 -0.182
relactiv[T.4.0] -1.1479 0.113 -10.200 0.000 -1.368 -0.927
relactiv[T.5.0] -0.9510 0.141 -6.761 0.000 -1.227 -0.675
relactiv[T.6.0] -1.5177 0.127 -11.937 0.000 -1.767 -1.268
relactiv[T.7.0] -1.8855 0.165 -11.432 0.000 -2.209 -1.562
relactiv[T.8.0] -1.7725 0.117 -15.204 0.000 -2.001 -1.544
relactiv[T.9.0] -2.0374 0.297 -6.852 0.000 -2.620 -1.455
relactiv[T.10.0] -1.4981 0.393 -3.807 0.000 -2.269 -0.727
pid[T.1.0] -0.1959 0.152 -1.290 0.197 -0.493 0.102
pid[T.2.0] 0.0460 0.179 0.257 0.797 -0.305 0.397
pid[T.3.0] -0.8528 0.139 -6.143 0.000 -1.125 -0.581
pid[T.4.0] -1.1361 0.157 -7.259 0.000 -1.443 -0.829
pid[T.5.0] -0.9807 0.146 -6.695 0.000 -1.268 -0.694
pid[T.6.0] -1.5697 0.143 -10.961 0.000 -1.850 -1.289
age 0.0069 0.002 3.244 0.001 0.003 0.011


educ 0.1259 0.012 10.938 0.000 0.103 0.148


====================================================================================

d. Identify the covariates which are statistically significant at a 15% significance level.
race[T.Other], sex[T.Male], age, educ, and most levels of relactiv and pid are significant at the 15% level (relactiv[T.2.0], pid[T.1.0], and pid[T.2.0] are not); none of the year dummies are significant, so year is dropped in the next model.
e. Based on the variables you identified in part d, fit a new logistic regression model only including those
covariates.
In [ ]: model2 = smf.logit('abortion ~ age + race + sex + educ + relactiv + pid', data=df).fit()
print(model2.summary())


Optimization terminated successfully.


Current function value: 0.282467
Iterations 7
Logit Regression Results
==============================================================================
Dep. Variable: abortion No. Observations: 10133
Model: Logit Df Residuals: 10112
Method: MLE Df Model: 20
Date: Sat, 16 Mar 2024 Pseudo R-squ.: 0.1292
Time: 23:56:24 Log-Likelihood: -2862.2
converged: True LL-Null: -3286.8
Covariance Type: nonrobust LLR p-value: 5.152e-167
====================================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------------
Intercept 1.5504 0.233 6.667 0.000 1.095 2.006
race[T.Other] -0.2210 0.142 -1.555 0.120 -0.500 0.058
race[T.White] 0.0383 0.113 0.337 0.736 -0.184 0.261
sex[T.Male] 0.2159 0.073 2.976 0.003 0.074 0.358
relactiv[T.2.0] 0.2103 0.198 1.064 0.287 -0.177 0.598
relactiv[T.3.0] -0.4260 0.123 -3.450 0.001 -0.668 -0.184
relactiv[T.4.0] -1.1467 0.112 -10.203 0.000 -1.367 -0.926
relactiv[T.5.0] -0.9512 0.141 -6.769 0.000 -1.227 -0.676
relactiv[T.6.0] -1.5176 0.127 -11.950 0.000 -1.767 -1.269
relactiv[T.7.0] -1.8906 0.165 -11.463 0.000 -2.214 -1.567
relactiv[T.8.0] -1.7721 0.116 -15.219 0.000 -2.000 -1.544
relactiv[T.9.0] -2.0340 0.297 -6.849 0.000 -2.616 -1.452
relactiv[T.10.0] -1.4910 0.394 -3.787 0.000 -2.263 -0.719
pid[T.1.0] -0.1990 0.152 -1.312 0.189 -0.496 0.098
pid[T.2.0] 0.0483 0.179 0.270 0.787 -0.302 0.399
pid[T.3.0] -0.8510 0.139 -6.139 0.000 -1.123 -0.579
pid[T.4.0] -1.1319 0.156 -7.245 0.000 -1.438 -0.826
pid[T.5.0] -0.9770 0.146 -6.680 0.000 -1.264 -0.690
pid[T.6.0] -1.5627 0.143 -10.926 0.000 -1.843 -1.282
age 0.0068 0.002 3.209 0.001 0.003 0.011
educ 0.1255 0.011 10.940 0.000 0.103 0.148
====================================================================================

f. Include an interaction term between sex and pid in your logistic regression model.


In [ ]: df['sex'] = df['sex'].cat.codes
df['pid'] = df['pid'].cat.codes
df['sex_pid_interaction'] = df['sex'] * df['pid']

In [ ]: model3 = smf.logit('abortion ~ age + race + sex + educ + relactiv + pid + sex_pid_interaction', data=df).fit()
print(model3.summary())


Optimization terminated successfully.


Current function value: 0.283887
Iterations 7
Logit Regression Results
==============================================================================
Dep. Variable: abortion No. Observations: 10133
Model: Logit Df Residuals: 10116
Method: MLE Df Model: 16


Date: Sat, 16 Mar 2024 Pseudo R-squ.: 0.1248


Time: 23:57:01 Log-Likelihood: -2876.6
converged: True LL-Null: -3286.8
Covariance Type: nonrobust LLR p-value: 2.856e-164
=======================================================================================
coef std err z P>|z| [0.025 0.975]
---------------------------------------------------------------------------------------
Intercept 1.6479 0.210 7.832 0.000 1.235 2.060
race[T.Other] -0.2124 0.140 -1.515 0.130 -0.487 0.062
race[T.White] 0.0698 0.112 0.624 0.533 -0.150 0.289
relactiv[T.2.0] 0.2192 0.197 1.111 0.267 -0.168 0.606
relactiv[T.3.0] -0.4243 0.123 -3.446 0.001 -0.666 -0.183
relactiv[T.4.0] -1.1445 0.112 -10.227 0.000 -1.364 -0.925
relactiv[T.5.0] -0.9485 0.140 -6.775 0.000 -1.223 -0.674
relactiv[T.6.0] -1.5223 0.127 -12.031 0.000 -1.770 -1.274
relactiv[T.7.0] -1.8796 0.164 -11.432 0.000 -2.202 -1.557
relactiv[T.8.0] -1.7555 0.116 -15.178 0.000 -1.982 -1.529
relactiv[T.9.0] -2.0614 0.296 -6.972 0.000 -2.641 -1.482
relactiv[T.10.0] -1.5255 0.393 -3.885 0.000 -2.295 -0.756
age 0.0066 0.002 3.170 0.002 0.003 0.011
sex 0.0155 0.147 0.105 0.916 -0.273 0.304
educ 0.1292 0.011 11.473 0.000 0.107 0.151
pid -0.2825 0.024 -11.645 0.000 -0.330 -0.235
sex_pid_interaction 0.0587 0.037 1.593 0.111 -0.014 0.131
=======================================================================================

g. Is there sufficient evidence to conclude that the sex moderates the effect of pid on abortion opinion?
Explain your answer.
The interaction coefficient of 0.0587 suggests a positive moderating effect, but its p-value of 0.111 is not statistically significant at conventional levels (e.g., α = 0.05), so there is not sufficient evidence to conclude that sex moderates the effect of pid on abortion opinion.
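An equivalent way to ask this question is a likelihood-ratio test of the model with the interaction against the model without it. A sketch on toy data (the GSS file isn't bundled here; with the homework models you would compare model3.llf against the same model refit without sex_pid_interaction):

```python
# Likelihood-ratio test for an interaction term on simulated data.
import numpy as np
import pandas as pd
import scipy.stats as stats
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
n = 2000
toy = pd.DataFrame({"sex": rng.integers(0, 2, n),
                    "pid": rng.integers(0, 7, n)})
logit_p = 1.0 - 0.3 * toy.pid + 0.1 * toy.sex
toy["y"] = rng.binomial(1, 1 / (1 + np.exp(-logit_p)))

full = smf.logit("y ~ sex + pid + sex:pid", data=toy).fit(disp=0)
reduced = smf.logit("y ~ sex + pid", data=toy).fit(disp=0)

lr = 2 * (full.llf - reduced.llf)        # ~ chi-squared with 1 df under H0
p_value = stats.chi2.sf(lr, df=1)
print(f"LR stat {lr:.3f}, p-value {p_value:.3f}")
```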
h. Interpret each coefficient associated with the covariates in the new logistic regression model, model3 .


Categorical variables (race, relactiv): each coefficient represents the difference in log-odds of the outcome relative to the reference category. The coefficient for race[T.Other] (-0.2124) is the difference in log-odds between the 'Other' race category and the reference category ('Black'), holding other variables constant, and race[T.White] (0.0698) is the analogous difference for 'White'. The relactiv dummies are interpreted the same way against the lowest religiosity level.
sex: after the cat.codes conversion, sex enters the model numerically (0 = Female, 1 = Male under the default category ordering), so its coefficient (0.0155) is the difference in log-odds between males and females when pid is 0, holding other variables constant.
Ordinal variables (educ, pid): each coefficient represents the change in log-odds of the outcome associated with a one-unit increase in the predictor, holding other variables constant. For educ, the coefficient of 0.1292 means each additional year of education increases the log-odds by 0.1292; for pid, the coefficient of -0.2825 means each one-unit shift toward strong Republican decreases the log-odds by 0.2825 (for females).
Interaction term (sex_pid_interaction): the coefficient of 0.0587 is the additional change in log-odds per unit of pid for males; that is, the slope of pid is -0.2825 for females and -0.2825 + 0.0587 = -0.2238 for males, holding other variables constant.
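These log-odds interpretations translate into odds ratios by exponentiating. A quick sketch using coefficients copied from the model3 summary above (illustrative):

```python
import numpy as np

# Coefficients copied from the model3 summary above.
coefs = {"educ": 0.1292, "pid": -0.2825, "sex_pid_interaction": 0.0587}
odds_ratios = {name: float(np.exp(b)) for name, b in coefs.items()}
for name, ratio in odds_ratios.items():
    print(f"{name}: odds ratio {ratio:.3f}")
# e.g. each extra year of education multiplies the odds of supporting
# legal abortion by roughly 1.14
```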

i. Print the confusion matrix and report the classification accuracy of model3 .
In [ ]: ...
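A hedged sketch of what part i asks for, shown on a toy logit since the GSS data isn't bundled here (with the homework you would substitute model3 and the real df): threshold the predicted probabilities at 0.5, then tabulate a confusion matrix and the accuracy.

```python
# Confusion matrix and accuracy for a fitted logit on simulated data.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
n = 1000
toy = pd.DataFrame({"x": rng.normal(size=n)})
toy["y"] = rng.binomial(1, 1 / (1 + np.exp(-1.5 * toy.x)))
model = smf.logit("y ~ x", data=toy).fit(disp=0)

pred = (model.predict(toy) >= 0.5).astype(int)
confusion = pd.crosstab(toy.y, pred, rownames=["actual"], colnames=["predicted"])
accuracy = (pred == toy.y).mean()
print(confusion)
print(f"accuracy: {accuracy:.3f}")
```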

j. Plot the ROC curve and compute the AUC of model3 .


In [ ]: ...
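And a hedged sketch of part j on the same kind of toy logit (with the homework you would use model3's predicted probabilities): compute ROC curve points and the AUC with scikit-learn.

```python
# ROC curve and AUC for a fitted logit on simulated data.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from sklearn.metrics import roc_auc_score, roc_curve

rng = np.random.default_rng(4)
n = 1000
toy = pd.DataFrame({"x": rng.normal(size=n)})
toy["y"] = rng.binomial(1, 1 / (1 + np.exp(-1.5 * toy.x)))
model = smf.logit("y ~ x", data=toy).fit(disp=0)

probs = model.predict(toy)
fpr, tpr, _ = roc_curve(toy.y, probs)
auc = roc_auc_score(toy.y, probs)
print(f"AUC: {auc:.3f}")
# to draw the curve: plt.plot(fpr, tpr); plt.plot([0, 1], [0, 1], "--")
```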

