ADS_LAB8

The document outlines the data science lifecycle for house price prediction, detailing steps from problem definition to model deployment and monitoring. It includes coding examples using Python for data processing, model training, and evaluation, demonstrating the relationship between house features and prices. The conclusion highlights the effectiveness of the regression model in predicting house prices based on area and furnishing status.

Uploaded by abhijaysingh66
Copyright: © All Rights Reserved

EXPERIMENT NO. 08
AIM: Illustrate the data science lifecycle for the selected case study. (Prepare a case study
document for the selected case study.)
THEORY:
House price prediction is the process of using data analysis and statistical techniques to forecast
the selling or buying price of a house. This prediction is typically based on various factors such
as location, size, number of bedrooms and bathrooms, neighborhood amenities, historical sales
data, economic indicators, and more.
The data science life cycle, in the context of house price prediction, typically follows these
steps:
1. Problem Definition: This phase involves understanding the objective of the prediction task.
For house price prediction, the goal is to estimate the selling or buying price of a house
accurately.
2. Data Acquisition: Data acquisition involves gathering relevant datasets that contain
information about houses such as features (e.g., square footage, number of bedrooms, location)
and their corresponding sale prices. Data can be obtained from various sources like real estate
websites, government databases, or through web scraping.
3. Data Cleaning and Preprocessing: Raw data often contains errors, missing values, or
inconsistencies that need to be addressed before analysis. Data cleaning involves removing
duplicates, handling missing values, and standardizing formats. Preprocessing involves
transforming the data into a suitable format for analysis, which may include feature scaling,
normalization, or encoding categorical variables.
4. Exploratory Data Analysis (EDA): EDA involves analyzing and visualizing the data to
understand patterns, relationships, and distributions. In the context of house price prediction,
EDA might include creating histograms, scatter plots, or correlation matrices to explore the
relationship between house features and prices.
5. Feature Engineering: Feature engineering involves selecting, creating, or transforming
features that are most relevant for the prediction task. This might include techniques such as
feature selection, dimensionality reduction, or creating new features based on domain
knowledge.
6. Model Selection and Training: In this phase, various machine learning algorithms are
evaluated and trained on the prepared dataset. Common algorithms for house price prediction
include linear regression, decision trees, random forests, and neural networks. The dataset is
typically split into training and testing sets to evaluate the performance of the models.
7. Model Evaluation: Models are evaluated on the held-out test set using appropriate metrics
such as mean squared error (MSE), root mean squared error (RMSE), or mean absolute error
(MAE). The performance of different models is compared to select the one with the lowest
prediction error.
8. Model Deployment: Once a satisfactory model is selected, it can be deployed into production
for making predictions on new, unseen data. This might involve creating an application
interface, API, or integrating the model into existing systems.
9. Monitoring and Maintenance: After deployment, it's important to monitor the model's
performance over time and update it as needed. This might involve retraining the model with
new data or adjusting its parameters to adapt to changing conditions.
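As an illustration of step 3, the following is a minimal sketch of cleaning and preprocessing with pandas. The toy records, the median fill for missing values, and the variable names (raw, clean, encoded) are assumptions made for this example; they are not taken from Housing.csv or from the experiment's code.

```python
import pandas as pd

# Hypothetical toy records standing in for raw listings (not from Housing.csv).
raw = pd.DataFrame({
    "area": [1200.0, 1200.0, None, 2000.0],
    "furnishingstatus": ["furnished", "furnished", "unfurnished", "semi-furnished"],
    "price": [100.0, 100.0, 80.0, 150.0],
})

# Cleaning: remove exact duplicate rows, then fill the missing area with the
# median (one possible strategy; dropping the row is another).
clean = raw.drop_duplicates().reset_index(drop=True)
clean["area"] = clean["area"].fillna(clean["area"].median())

# Preprocessing: one-hot encode the categorical column. drop_first=True removes
# one level so the remaining indicator columns are not redundant.
encoded = pd.get_dummies(clean, columns=["furnishingstatus"], drop_first=True)

# Feature scaling: standardize area to zero mean and unit (sample) standard deviation.
encoded["area_scaled"] = (
    (encoded["area"] - encoded["area"].mean()) / encoded["area"].std()
)

print(encoded)
```

On a real dataset the choice between filling and dropping missing values depends on how much data is missing and why, so the median fill here is only a placeholder decision.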
Throughout this life cycle, data scientists apply their expertise in statistics, machine learning,
and domain knowledge to build accurate and reliable house price prediction models.
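The three error metrics named in step 7 can be computed directly from their definitions. The sketch below uses NumPy; the helper name regression_errors and the sample numbers are hypothetical and serve only to show how the metrics relate to each other.

```python
import numpy as np

def regression_errors(y_true, y_pred):
    """Return MSE, RMSE and MAE for two equal-length sequences of prices."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    mse = float(np.mean(residuals ** 2))      # mean of squared residuals
    rmse = float(np.sqrt(mse))                # same units as the price
    mae = float(np.mean(np.abs(residuals)))   # mean of absolute residuals
    return {"mse": mse, "rmse": rmse, "mae": mae}

# Made-up actual and predicted prices, for illustration only.
errors = regression_errors([100, 200, 300], [110, 190, 330])
print(errors)
```

RMSE and MAE are expressed in the same units as the price, which usually makes them easier to interpret than MSE; RMSE penalizes large individual errors more heavily than MAE does.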
CODING:
import pandas as pd
file_path = '/content/Housing.csv'
df = pd.read_csv(file_path, encoding='latin1')
# print(df)
# df.info()
df.head()

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_squared_error

data = pd.read_csv('/content/Housing.csv')
X = data[['bedrooms', 'mainroad']]  # Features: number of bedrooms, main-road access
y = data['price']  # Target variable: house price

X_train, X_test, y_train, y_test = train_test_split(X, y,
test_size=0.2, random_state=42)
preprocessor = ColumnTransformer(
    transformers=[
        ('cat', OneHotEncoder(drop='first'), ['mainroad'])
    ],
    remainder='passthrough'
)
model = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('regressor', LinearRegression())
])
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
print('Mean Squared Error:', mse)
results_df = pd.DataFrame({'Actual Price': y_test, 'Predicted Price': y_pred})
plt.figure(figsize=(12, 8))

# Average predicted price for each bedroom count in the test set
avg_predicted_prices = (
    results_df.groupby(X_test['bedrooms'])['Predicted Price'].mean().reset_index()
)
sns.barplot(x='bedrooms', y='Predicted Price', data=avg_predicted_prices)
plt.xlabel('Number of Bedrooms')
plt.ylabel('Average Predicted Price')
plt.title('Average Predicted Prices based on Number of Bedrooms')
plt.show()

X = data[['area']]
y = data['price']
X_train, X_test, y_train, y_test = train_test_split(X, y,
test_size=0.2, random_state=42)
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
print('Mean Squared Error:', mse)
plt.figure(figsize=(12, 8))
plt.scatter(X_test, y_test, color='blue', label='Actual Price')
plt.scatter(X_test, y_pred, color='red', label='Predicted Price')
plt.xlabel('Area')
plt.ylabel('House Price')
plt.title('Actual vs Predicted House Prices based on Area')
plt.legend()
plt.show()

X = data[['area', 'furnishingstatus']]  # Features: area, furnishing status
y = data['price']  # Target variable: house price
X_train, X_test, y_train, y_test = train_test_split(X, y,
test_size=0.2, random_state=42)
preprocessor = ColumnTransformer(
    transformers=[
        ('cat', OneHotEncoder(drop='first'), ['furnishingstatus'])
    ],
    remainder='passthrough'
)

model = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('regressor', LinearRegression())
])

model.fit(X_train, y_train)
y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
print('Mean Squared Error:', mse)
results_df = pd.DataFrame({'Area': X_test['area'], 'Furnishing Status':
X_test['furnishingstatus'], 'Actual Price': y_test, 'Predicted Price':
y_pred})
plt.figure(figsize=(12, 8))
sns.scatterplot(data=results_df, x='Area', y='Actual Price',
hue='Furnishing Status', palette='deep')
plt.scatter(results_df['Area'], results_df['Predicted Price'],
color='black', marker='x', label='Predicted Price')
plt.xlabel('Area')
plt.ylabel('House Price')
plt.title('Actual vs Predicted House Prices based on Area and Furnishing Status')
plt.legend()
plt.show()

CONCLUSION: The visualizations suggest a positive correlation between house prices and
area, with larger houses generally commanding higher prices. Furnishing status also appears to
influence prices, with furnished houses tending to have higher prices compared to
semi-furnished or unfurnished ones. In the scatter plot, the regression model's predicted prices
follow the actual prices in most cases, which suggests that it captures the relationship between
area, furnishing status, and house prices; the mean squared error printed for each model gives
the quantitative measure of fit. These insights can inform pricing strategies and marketing
approaches in the real estate market.
