hw-3
PID: A17054351
Instructions
Submit your solutions online on Gradescope
Look at the detailed instructions here
I certify that the following write-up is my own work, and have abided by the UCSD Academic Integrity Guidelines.
Yes
No
In [ ]: import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats
import statsmodels.api as sm
import statsmodels.formula.api as smf
import sklearn
## Configurations
%matplotlib inline
file:///Users/dylanoquendo/Downloads/hw-3.html 1/20
3/17/24, 12:00 AM hw-3
Question 1
Linear regression
The data folder contains the housing.csv dataset which contains housing prices in California from the 1990 California
census. The objective is to predict the median house price for California districts based on various features. The features are
the following:
1. longitude : A measure of how far west a house is; a higher value is farther west
2. latitude : A measure of how far north a house is; a higher value is farther north
3. housing_median_age : Median age of a house within a block; a lower number is a newer building
4. total_rooms : Total number of rooms within a block
5. total_bedrooms : Total number of bedrooms within a block
6. population : Total number of people residing within a block
7. households : Total number of households, a group of people residing within a home unit, for a block
8. median_income : Median income for households within a block of houses
9. median_house_value : Median house value for households within a block
10. ocean_proximity : Location of the house w.r.t ocean/sea
a. Load the dataset and display the first 5 rows of the dataset.
In [ ]: path = 'https://raw.githubusercontent.com/ucsd-math189-wi24/materials/main/data/housing.csv'
df = pd.read_csv(path)
df.head()
Out[ ]: longitude latitude housing_median_age total_rooms total_bedrooms population households median_income med
0 -122.23 37.88 41.0 880.0 129.0 322.0 126.0 8.3252
1 -122.22 37.86 21.0 7099.0 1106.0 2401.0 1138.0 8.3014
2 -122.24 37.85 52.0 1467.0 190.0 496.0 177.0 7.2574
3 -122.25 37.85 52.0 1274.0 235.0 558.0 219.0 5.6431
4 -122.25 37.85 52.0 1627.0 280.0 565.0 259.0 3.8462
b. Describe the data type (e.g., categorical, discrete quantitative, etc.) of each variable in the dataset. If you
identify any categorical variables, explicitly convert them to categorical variables in your pandas dataframe.
In [ ]: df.dtypes
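Part b also asks for an explicit conversion of any categorical variables. In this dataset only ocean_proximity is non-numeric, so it is the one to convert. A minimal sketch, using a hypothetical mini-frame in place of the full housing data:

```python
import pandas as pd

# Hypothetical mini-frame standing in for the housing data
df_demo = pd.DataFrame({
    'ocean_proximity': ['NEAR BAY', 'INLAND', 'ISLAND'],
    'median_income': [8.3, 5.6, 3.8],
})

# ocean_proximity is the only non-numeric column, so convert it to categorical
df_demo['ocean_proximity'] = df_demo['ocean_proximity'].astype('category')
print(df_demo.dtypes['ocean_proximity'])  # category
```

After this conversion, `smf.ols` treats ocean_proximity as a factor and creates the `T.INLAND`, `T.ISLAND`, etc. dummy columns automatically.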
c. Fit a linear regression model to predict the median_house_value based on all other covariates.
In [ ]: full_model = smf.ols(formula="median_house_value ~ longitude + latitude + housing_median_age + total_rooms + total_bedrooms + population + households + median_income + ocean_proximity", data=df).fit()
print(full_model.summary())
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 7.24e+05. This might indicate that there are
strong multicollinearity or other numerical problems.
In [ ]: residuals = full_model.resid
plt.figure(figsize=(8, 6))
plt.scatter(full_model.fittedvalues, residuals, alpha=0.5)
plt.xlabel('Fitted Values')
plt.ylabel('Residuals')
plt.title('Residual Plot')
plt.axhline(y=0, color='r', linestyle='--')
plt.grid(True)
plt.show()
d. Based on the summary of the linear regression model, do you think this is a good fit for the data? Explain your
answer.
No. The condition number is extremely large (7.24e+05), which suggests strong multicollinearity, and the residual plot shows a clear pattern rather than random scatter around zero.
e. Comment on the model assumptions and to what extent they are satisfied or not satisfied.
For smf OLS regression, we assume that the relationship between variables is linear, meaning that changes in one
variable are consistently associated with changes in another. We also assume that the errors in our predictions are
independent, meaning one error doesn't predict another, and that the spread of these errors remains consistent across all
values of the independent variables. Additionally, we assume that the errors follow a normal distribution, that there is no
perfect relationship between independent variables (multicollinearity), and that errors don't follow a pattern over time or
across observations (autocorrelation).
Based on the summary notes, the no-multicollinearity assumption appears to be violated, and my residual plot shows a clear pattern, which violates the assumption of homoscedasticity.
f. Compute the variance inflation factor (VIF) for each covariate. What do you observe?
In [ ]: from statsmodels.stats.outliers_influence import variance_inflation_factor

exog = full_model.model.exog
names = full_model.params.index
for i in range(1, exog.shape[1]):
    print(f'VIF: {names[i]}: {variance_inflation_factor(exog, i): .3f}')
g. Drop the covariate(s) with a variance inflation factor greater than 5 and fit the linear regression model again.
In [ ]: new_model = smf.ols(formula="median_house_value ~ housing_median_age + median_income + ocean_proximity", data=df).fit()
print(new_model.summary())
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.02e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
h. Based on the summary of the regression model new_model interpret the coefficients of the covariates.
Intercept: The intercept coefficient represents the predicted median house value when all other predictors are zero. In
this case, it's $51,510.
ocean_proximity Categories:
For each category of ocean_proximity, such as INLAND, ISLAND, NEAR BAY, and NEAR OCEAN, the coefficients
represent the difference in median house value compared to the reference category (which is not shown in the
summary but would be one of the categories if the variable is categorical with multiple levels).
For example, the coefficient for ocean_proximity[T.INLAND] is -71,650. This means that, on average, inland houses have a median value that is $71,650 lower than the reference
category.
Similarly, the coefficients for ocean_proximity[T.ISLAND], ocean_proximity[T.NEAR BAY], and
ocean_proximity[T.NEAR OCEAN] indicate the difference in median house value for houses located on islands, near
the bay, and near the ocean, respectively, compared to the reference category.
housing_median_age: For every one-unit increase in housing median age, the median house value is expected to
increase by $926.82, on average.
median_income: For every one-unit increase in median income, the median house value is expected to increase by
$38,160, on average.
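The interpretations above can be checked by assembling a prediction by hand from the summary coefficients. A sketch for a hypothetical inland block (coefficient values rounded as read from the summary; median_income in this dataset is expressed in tens of thousands of dollars):

```python
# Coefficients as read (rounded) from the new_model summary above
intercept = 51510.0
b_age = 926.82        # per year of housing_median_age
b_income = 38160.0    # per unit of median_income
b_inland = -71650.0   # offset for ocean_proximity = INLAND

# Hypothetical block: 30-year-old inland housing, median income 5
pred = intercept + b_age * 30 + b_income * 5 + b_inland
print(f'${pred:,.0f}')  # → $198,465
```

Each coefficient contributes additively, which is exactly what "holding other covariates constant" means for a linear model.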
i. Holding all other covariates constant, which of the ocean_proximity categories do you expect to find a
house with the highest median house value? Why?
In [ ]: df.ocean_proximity.value_counts()
Out[ ]: ocean_proximity
<1H OCEAN 9136
INLAND 6551
NEAR OCEAN 2658
NEAR BAY 2290
ISLAND 5
Name: count, dtype: int64
ISLAND, since its coefficient is 1.849e+05: holding all other covariates constant, the median house value in island areas is about $184,900 higher than in the reference category (<1H OCEAN). This coefficient is the largest among all the ocean_proximity categories.
Question 2
For this question, we are going to use the abortion dataset which consists of Abortion Opinions in the General Social
Survey (GSS) from 1977 to 2018. The article related to the dataset can be found here.
The data has been preprocessed and is available in the data folder as abortion.csv . The dataset contains the following
columns:
1. abortion : Do you think that abortion should be legal for any reason?
2. year : Year of the survey
3. age : Respondent's age
4. sex : Respondent's sex
5. race : Respondent's race
6. education : How many years of education has the respondent completed
7. relactiv : Self-reported religiosity
8. pid : Respondent's political party identification (0: strong democrat ... 6: strong republican)
a. Load the dataset and display the first 5 rows of the dataset.
In [ ]: path = 'https://raw.githubusercontent.com/ucsd-math189-wi24/materials/main/data/abortion.csv'
df = pd.read_csv(path)
df.head()
b. Summarize the data type (e.g., categorical, discrete quantitative, etc.) of each variable in the dataset. If you
identify any categorical variables, explicitly convert them to categorical variables in your pandas dataframe.
year is categorical
age is discrete quantitative
race is categorical
sex is categorical
educ is discrete quantitative
relactiv is categorical
pid is categorical
abortion is categorical
In [ ]: df.year = df.year.astype('category')
df.race = df.race.astype('category')
df.sex = df.sex.astype('category')
df.relactiv = df.relactiv.astype('category')
df.pid = df.pid.astype('category')
df.abortion = df.abortion.astype('category')
c. Fit a logistic regression model to predict the abortion based on all other covariates.
In [ ]: df['abortion'] = df['abortion'].cat.codes
df.dtypes
In [ ]: model1 = smf.logit('abortion ~ year + age + race+ sex + educ + relactiv + pid', data=df).fit()
print(model1.summary())
d. Identify the covariates which are statistically significant at a 15% significance level.
race[T.Other], sex[T.Male], all of the relactiv levels, all of the pid levels, age, and educ were significant at the 15% level.
e. Based on the variables you identified in part d, fit a new logistic regression model only including those
covariates.
In [ ]: model2 = smf.logit('abortion ~ age + race + sex + educ + relactiv + pid', data=df).fit()
print(model2.summary())
f. Include an interaction term between sex and pid in your logistic regression model.
In [ ]: df['sex'] = df['sex'].cat.codes
df['pid'] = df['pid'].cat.codes
df['sex_pid_interaction'] = df['sex'] * df['pid']
In [ ]: model3 = smf.logit('abortion ~ age + race + sex + educ + relactiv + pid + sex_pid_interaction', data=df).fit()
print(model3.summary())
g. Is there sufficient evidence to conclude that the sex moderates the effect of pid on abortion opinion?
Explain your answer.
The interaction coefficient of 0.0587 is positive, suggesting that the association between pid and abortion opinion may differ by sex. However, its p-value of 0.111 is not statistically significant at conventional levels (e.g., α = 0.05), so there is not sufficient evidence to conclude that sex moderates the effect of pid.
h. Interpret each coefficient associated with the covariates in the new logistic regression model, model3 .
Categorical Variables (Race and Sex): For categorical variables like race and sex, each coefficient represents the
difference in log-odds of the outcome relative to the reference category. For example:
For race: The coefficient for 'race[T.Other]' (-0.2210) represents the difference in log-odds of the outcome between
the 'Other' race category and the reference category (e.g., 'White' race), holding other variables constant.
For sex: The coefficient for 'sex[T.Male]' (0.2159) represents the difference in log-odds of the outcome between
males and females, holding other variables constant.
Ordinal Variables (Education, PID): For ordinal variables like education and PID (political identification), each
coefficient represents the change in log-odds of the outcome associated with a one-unit increase in the predictor,
holding other variables constant. For example:
For education: The coefficient of 0.1292 indicates that for each one-unit increase in education level, the log-odds of the
outcome increase by 0.1292, holding other variables constant.
For PID: The coefficient of -0.2825 indicates that for each one-unit increase in political identification, the log-odds of the
outcome decrease by 0.2825, holding other variables constant.
Interaction Term (Sex_Pid_Interaction): The coefficient for the interaction term represents the change in the log-odds of
the outcome associated with the interaction between 'sex' and 'pid' variables. In this case, the coefficient of 0.0587
implies the change in the log-odds of the outcome for each unit change in the interaction between 'sex' and 'pid', holding
other variables constant.
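Log-odds coefficients are often easier to read after exponentiating them into odds ratios. A short sketch using the (rounded) values quoted above:

```python
import numpy as np

# Coefficients as read (rounded) from the model3 summary above
coefs = {'sex[T.Male]': 0.2159, 'educ': 0.1292, 'pid': -0.2825}

# exp(coef) converts an additive log-odds change into a multiplicative
# change in the odds of supporting legal abortion
for name, b in coefs.items():
    print(f'{name}: odds ratio {np.exp(b):.3f}')
```

For example, exp(-0.2825) ≈ 0.754 means each one-unit move toward strong Republican multiplies the odds of a "yes" response by about 0.75, holding other variables constant.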
i. Print the confusion matrix and report the classification accuracy of model3 .
In [ ]: ...