Project PDF
Dataset Selection: We have chosen the USA Real Estate Dataset from Kaggle, which contains
information about real estate properties in the USA.
https://www.kaggle.com/datasets/ahmedshahriarsakib/usa-real-estate-dataset?resource=download
Goal Definition: The goal is to build a system that predicts the cost of a home from features
such as square footage, number of beds, and number of baths, on a per-state basis.
Feature Selection: In this case, the relevant features are "footage," "beds," and "baths" (the
house_size, bed, and bath columns in the dataset). These features will be used to predict the
cost of a home.
Preprocessing: Perform preprocessing steps on the dataset to prepare it for analysis. This may
include handling missing values, encoding categorical variables, and scaling numerical features.
Skipping this for now.
Exploration and Visualization: Here we are exploring the relationship between variables by using
plots such as histograms, scatter plots, or box plots.
In [12]: import pandas as pd

# Alternate path (Google Drive), kept for reference:
# df = pd.read_csv('/content/drive/MyDrive/realtor-data.csv')
# df.head(100)
df = pd.read_csv(r'C:\Users\Crystian\Documents\GitHub\CST383\realtor-data.csv')
df.head(20)
Out[12]:
      status  bed  bath  acre_lot           city        state  zip_code  house_size  prev_sold_date     price
0   for_sale  3.0   2.0      0.12       Adjuntas  Puerto Rico     601.0       920.0             NaN  105000.0
1   for_sale  4.0   2.0      0.08       Adjuntas  Puerto Rico     601.0      1527.0             NaN   80000.0
2   for_sale  2.0   1.0      0.15     Juana Diaz  Puerto Rico     795.0       748.0             NaN   67000.0
3   for_sale  4.0   2.0      0.10          Ponce  Puerto Rico     731.0      1800.0             NaN  145000.0
4   for_sale  6.0   2.0      0.05       Mayaguez  Puerto Rico     680.0         NaN             NaN   65000.0
5   for_sale  4.0   3.0      0.46  San Sebastian  Puerto Rico     612.0      2520.0             NaN  179000.0
6   for_sale  3.0   1.0      0.20         Ciales  Puerto Rico     639.0      2040.0             NaN   50000.0
7   for_sale  3.0   2.0      0.08          Ponce  Puerto Rico     731.0      1050.0             NaN   71600.0
8   for_sale  2.0   1.0      0.09          Ponce  Puerto Rico     730.0      1092.0             NaN  100000.0
9   for_sale  5.0   3.0      7.46     Las Marias  Puerto Rico     670.0      5403.0             NaN  300000.0
10  for_sale  3.0   2.0     13.39        Isabela  Puerto Rico     662.0      1106.0             NaN   89000.0
11  for_sale  3.0   2.0      0.08     Juana Diaz  Puerto Rico     795.0      1045.0             NaN  150000.0
12  for_sale  3.0   2.0      0.10          Lares  Puerto Rico     669.0      4161.0             NaN  155000.0
13  for_sale  5.0   2.0      0.12         Utuado  Puerto Rico     641.0      1620.0             NaN   79000.0
14  for_sale  5.0   5.0      0.74          Ponce  Puerto Rico     731.0      2677.0             NaN  649000.0
15  for_sale  3.0   2.0      0.08          Yauco  Puerto Rico     698.0      1100.0             NaN  120000.0
16  for_sale  4.0   4.0      0.22       Mayaguez  Puerto Rico     680.0      3450.0             NaN  235000.0
17  for_sale  3.0   2.0      0.08          Ponce  Puerto Rico     728.0      1500.0             NaN  105000.0
18  for_sale  3.0   2.0      3.88  San Sebastian  Puerto Rico     685.0      4000.0             NaN  575000.0
19  for_sale  6.0   3.0      0.25         Anasco  Puerto Rico     610.0      1230.0             NaN  140000.0
New Section
In [13]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 407890 entries, 0 to 407889
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 status 407890 non-null object
1 bed 320108 non-null float64
2 bath 321618 non-null float64
3 acre_lot 331873 non-null float64
4 city 407838 non-null object
5 state 407890 non-null object
6 zip_code 407693 non-null float64
7 house_size 324365 non-null float64
8 prev_sold_date 140950 non-null object
9 price 407890 non-null float64
dtypes: float64(6), object(4)
memory usage: 31.1+ MB
In [14]: df.describe()
Running checks to see which columns have missing data and calculating the percentage of
missing values relative to the total number of rows.
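The cell that performed this check is not shown, so the following is only a minimal sketch of one way to do it. The `missing_report` helper and the small `sample` frame are illustrative names, not part of the original notebook; the sample values are four rows copied from the df.head(20) output above.

```python
import pandas as pd

# Four rows taken from the df.head(20) output (rows 3 to 6); illustrative only.
sample = pd.DataFrame({
    "bed": [4.0, 6.0, 4.0, 3.0],
    "bath": [2.0, 2.0, 3.0, 1.0],
    "house_size": [1800.0, float("nan"), 2520.0, 2040.0],
    "price": [145000.0, 65000.0, 179000.0, 50000.0],
})


def missing_report(frame):
    """Return per-column missing counts and their share of all rows, in percent."""
    missing = frame.isna().sum()
    percent = (missing / len(frame) * 100).round(2)
    return pd.DataFrame({"missing": missing, "percent": percent})


report = missing_report(sample)
print(report)
```

Applied to the full dataset, the counts from df.info() imply that bed is missing in (407890 - 320108) / 407890 of the rows, roughly 21.5%.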
We will drop rows with missing values in bed, bath, or house_size, so that every remaining
home has a price, a bed count, a bath count, and a house size.
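A minimal sketch of this step, assuming it is done with `dropna` on the three columns. The `sample` frame and the `required` and `cleaned` names are illustrative; the rows are copied from the df.head(20) output above. Resetting the index is an assumption, suggested by the renumbered rows in the later df.head(15) output.

```python
import pandas as pd

# Rows 3 to 5 of the df.head(20) output; row 4 has no house_size.
sample = pd.DataFrame({
    "bed": [4.0, 6.0, 4.0],
    "bath": [2.0, 2.0, 3.0],
    "house_size": [1800.0, float("nan"), 2520.0],
    "price": [145000.0, 65000.0, 179000.0],
})

# Columns that must be present for a row to be kept.
required = ["bed", "bath", "house_size"]

# Drop rows missing any required value, then renumber the rows from 0.
cleaned = sample.dropna(subset=required).reset_index(drop=True)
print(cleaned)
```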
Creating a simple scatter plot to view how the number of beds relates to the cost of a home.
(Price of home in scientific notation)
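The plotting cell itself is not shown; a minimal sketch of such a scatter plot with matplotlib could look like this. The `sample` frame holds the first five rows of bed and price from the table above, and the title text is an assumption.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# bed and price for rows 0 to 4 of the df.head(20) output; illustrative only.
sample = pd.DataFrame({
    "bed": [3.0, 4.0, 2.0, 4.0, 6.0],
    "price": [105000.0, 80000.0, 67000.0, 145000.0, 65000.0],
})

fig, ax = plt.subplots()
ax.scatter(sample["bed"], sample["price"], alpha=0.5)
ax.set_xlabel("bed")
ax.set_ylabel("price")
ax.set_title("Price vs. number of beds")  # assumed title
```

On the full dataset, matplotlib switches the price axis to scientific notation by default once the values get large, which matches the note above.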
df.head(15)
Out[20]:
      status  bed  bath  acre_lot           city        state  zip_code  house_size  prev_sold_date     price
0   for_sale  3.0   2.0      0.12       Adjuntas  Puerto Rico     601.0       920.0             NaN  105000.0
1   for_sale  4.0   2.0      0.08       Adjuntas  Puerto Rico     601.0      1527.0             NaN   80000.0
2   for_sale  2.0   1.0      0.15     Juana Diaz  Puerto Rico     795.0       748.0             NaN   67000.0
3   for_sale  4.0   2.0      0.10          Ponce  Puerto Rico     731.0      1800.0             NaN  145000.0
4   for_sale  4.0   3.0      0.46  San Sebastian  Puerto Rico     612.0      2520.0             NaN  179000.0
5   for_sale  3.0   1.0      0.20         Ciales  Puerto Rico     639.0      2040.0             NaN   50000.0
6   for_sale  3.0   2.0      0.08          Ponce  Puerto Rico     731.0      1050.0             NaN   71600.0
7   for_sale  2.0   1.0      0.09          Ponce  Puerto Rico     730.0      1092.0             NaN  100000.0
8   for_sale  5.0   3.0      7.46     Las Marias  Puerto Rico     670.0      5403.0             NaN  300000.0
9   for_sale  3.0   2.0     13.39        Isabela  Puerto Rico     662.0      1106.0             NaN   89000.0
10  for_sale  3.0   2.0      0.08     Juana Diaz  Puerto Rico     795.0      1045.0             NaN  150000.0
11  for_sale  3.0   2.0      0.10          Lares  Puerto Rico     669.0      4161.0             NaN  155000.0
12  for_sale  5.0   2.0      0.12         Utuado  Puerto Rico     641.0      1620.0             NaN   79000.0
13  for_sale  5.0   5.0      0.74          Ponce  Puerto Rico     731.0      2677.0             NaN  649000.0
14  for_sale  3.0   2.0      0.08          Yauco  Puerto Rico     698.0      1100.0             NaN  120000.0
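In [24]: from sklearn.linear_model import LinearRegression

# Simple linear regression: predict price from the number of beds alone
X = df[['bed']].values
y = df[['price']].values
reg = LinearRegression()
reg.fit(X,y)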
intercept: 401245.26
coefficient for bed: [11307.83415314]
r-squared value: 0.00
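An r-squared of 0.00 means that bed alone explains almost none of the variation in price. The cell that computed the value is not shown; the sketch below assumes it came from scikit-learn's `LinearRegression.score`, which returns r-squared, and checks it against the definition 1 - SS_res / SS_tot. The six data points are bed and price for rows 0 to 5 of the df.head(15) output, so the numbers it prints will not match the full-dataset result.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# bed and price for rows 0 to 5 of the df.head(15) output; illustrative only.
X = np.array([[3.0], [4.0], [2.0], [4.0], [4.0], [3.0]])
y = np.array([105000.0, 80000.0, 67000.0, 145000.0, 179000.0, 50000.0])

reg = LinearRegression().fit(X, y)

# score() returns the coefficient of determination (r-squared).
r2 = reg.score(X, y)

# The same quantity from its definition: 1 - SS_res / SS_tot.
residual = y - reg.predict(X)
r2_manual = 1 - (residual ** 2).sum() / ((y - y.mean()) ** 2).sum()

print(f"r-squared value: {r2:.2f}")
```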
# Predictor list inferred from the coefficients printed in the output
predictors = ['bed', 'bath']
X = df[predictors]
y = df['price']
reg2 = LinearRegression()
reg2.fit(X,y)
print(f'intercept: {reg2.intercept_:.2f}')
print('coefficients:')
for i, coef in enumerate(reg2.coef_):
    print(f' {predictors[i]}: {coef:.2f}')
intercept: 290364.02
coefficients:
bed: -31057.80
bath: 112282.24
results = reg2.predict(matrix)[0]
print(f'{results:.2f}')
534037.32
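The notebook does not show how `matrix` was built. As a worked check, a linear model's prediction is the intercept plus each coefficient times its feature value. Assuming a home with 3 beds and 3 baths (an assumption, chosen because it reproduces the reported value to within rounding of the printed coefficients), the arithmetic is:

```python
# Intercept and coefficients as printed for reg2 above (rounded to 2 decimals).
intercept = 290364.02
coefficients = {"bed": -31057.80, "bath": 112282.24}

# Assumed input; the values actually passed as `matrix` are not shown.
home = {"bed": 3, "bath": 3}

# prediction = intercept + sum(coefficient * feature value)
prediction = intercept + sum(coefficients[name] * home[name] for name in coefficients)
print(f"{prediction:.2f}")
```

The result lands within a few cents of 534037.32; the small gap comes from using coefficients rounded to two decimals.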
plt.title(title)
plt.xlabel('actual')
plt.ylabel('predicted')
plt.ticklabel_format(style='plain')
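The four lines above look like the tail of a plotting helper whose beginning was lost. A minimal sketch of what such a helper could look like follows; the function name `plot_actual_predicted`, the dashed reference line, and the predicted values in the usage line are all hypothetical and not taken from the notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line inside a notebook
import matplotlib.pyplot as plt


def plot_actual_predicted(actual, predicted, title):
    """Scatter predicted against actual values, with a y = x reference line."""
    fig, ax = plt.subplots()
    ax.scatter(actual, predicted, alpha=0.5)
    # Points on this diagonal would be perfect predictions.
    low = min(min(actual), min(predicted))
    high = max(max(actual), max(predicted))
    ax.plot([low, high], [low, high], linestyle="--", color="gray")
    ax.set_title(title)
    ax.set_xlabel("actual")
    ax.set_ylabel("predicted")
    ax.ticklabel_format(style="plain")  # avoid scientific notation on the axes
    return ax


# Usage with made-up predictions for three prices from the table above.
ax = plot_actual_predicted(
    [105000.0, 80000.0, 67000.0],
    [120000.0, 95000.0, 70000.0],
    "price: actual vs. predicted",
)
```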
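# Predictor list inferred from the coefficients printed in the output
predictors = ['bed', 'bath', 'zip_code']
X = df[predictors]
y = df['price']
reg3 = LinearRegression()
reg3.fit(X,y)
print(f'intercept: {reg3.intercept_:.2f}')
print('coefficients:')
for i, coef in enumerate(reg3.coef_):
    print(f' {predictors[i]}: {coef:.2f}')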
intercept: 347289.66
coefficients:
bed: -30941.97
bath: 114300.17
zip_code: -18.78