Housing[1]

This document discusses the application of the Random Forest algorithm to predict housing prices using the California housing dataset, highlighting its effectiveness in capturing non-linear relationships and achieving an R² score of 0.82 with a Mean Squared Error of approximately 25500. It also addresses ethical concerns regarding bias in machine learning models and emphasizes the importance of responsible AI use in real estate and urban planning. The findings suggest potential improvements to the model by incorporating additional variables and exploring more complex algorithms.

Uploaded by

Aounaiza Ahmed

Contents

1. Introduction
2. Data Description
3. Methodology
4. Results and Discussion
5. Legal, Social, Ethical, and Professional Issues (LSEPI)
6. Conclusion
7. References

1. Introduction
The features of the California housing dataset mainly describe housing properties, for example
the size of the house, the median value of the house, the number of rooms, and population
density. Because of its content and variety, the dataset is well suited to applying and testing
machine learning methods in practice. The main aim of this work is to forecast housing prices,
the target variable, using the Random Forest algorithm, an ensemble machine learning technique
chosen for its strong predictive accuracy and its compatibility with large datasets (Adetunji
et al., 2022; Yoshida et al., 2024). Housing price prediction is not only one of the most
relevant applications of Artificial Intelligence (AI), but also an essential instrument for
stakeholders in real estate, urban planning, and economic studies (Kang et al., 2021; Soltani
et al., 2022). This work therefore seeks to apply the model to housing price prediction and to
improve its efficiency and interpretability in real-world scenarios that include spatial
dependencies and market factors (Imran et al., 2021; Ho et al., 2021).

2. Data Description
The data, sourced from Kaggle, comprise a set of housing-related variables for different regions
of California. These variables include median income, housing age, total rooms, total bedrooms,
population, households, latitude, and longitude, with median house value as the dependent
variable. The target is a continuous quantitative measure: the median price of the houses within
a block group. The dataset combines socioeconomic and geographic attributes and is therefore
suitable for predictive modelling with machine learning (Adetunji et al., 2022). It contains
20,640 rows and ten columns, which is large enough for training and evaluating the models.
However, it has some drawbacks, such as missing values in the total_bedrooms column, which
require appropriate preprocessing so that they do not degrade the performance of the model.

The properties of the dataset were studied through Exploratory Data Analysis (EDA). There were
blank entries in 207 rows of the total_bedrooms column, about 1.0% of the total records (207 of
20,640), and these were handled with an appropriate imputation technique. The latitude and
longitude information showed that high median house values were clustered mainly along the
coast, particularly around San Francisco and Los Angeles. The distribution of the target
variable, median house value, showed signs of a ceiling at $500,000, which suggests that the
values in the dataset are capped at an upper bound. The descriptive analysis indicated that the
feature distributions varied greatly; median_income, for example, ranged from 0.5 to 15,
reflecting income differences between areas. Correlation analysis revealed that median_income
had the highest positive correlation with house prices, while features such as
housing_median_age and total_rooms were also positively correlated with house prices (Soltani
et al., 2022; Yoshida et al., 2024). These observations inform the subsequent stages of feature
engineering and model development.

3. Methodology
Data Preprocessing
The dataset had to be preprocessed before the models could be trained and evaluated. The missing
values in the total_bedrooms column were filled by median imputation. This method preserves the
central tendency of the column and is robust to outliers. Feature scaling was judged unnecessary
for the Random Forest algorithm because tree-based splits are invariant to the scale of the
features, as noted in the notebook implementation. The data were then split into a training set
containing 80 percent of the records and a test set containing the remaining 20 percent. Holding
out a test set means that the performance of the model is measured on data it has not
encountered before. For this step the code applied the train_test_split function from
sklearn.model_selection to randomize the split in a reproducible way.
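
The preprocessing steps described above can be sketched as follows. This is a minimal
illustration rather than the notebook's actual code: the data frame is a small synthetic
stand-in for the housing table, and the seed value 42, the 1,000-row size, and the coefficients
used to generate the target are assumptions made only for the example.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the housing table (all values are illustrative only).
rng = np.random.default_rng(0)
n_rows = 1000
df = pd.DataFrame({
    "median_income": rng.uniform(0.5, 15.0, size=n_rows),
    "total_rooms": rng.integers(100, 5000, size=n_rows).astype(float),
    "total_bedrooms": rng.integers(20, 1000, size=n_rows).astype(float),
})
df["median_house_value"] = 40000 * df["median_income"] + rng.normal(0, 20000, size=n_rows)

# Blank out about 1% of total_bedrooms to mimic the missing entries.
missing_idx = rng.choice(n_rows, size=10, replace=False)
df.loc[missing_idx, "total_bedrooms"] = np.nan

# Median imputation: replace each missing value with the column median.
bedrooms_median = df["total_bedrooms"].median()
df["total_bedrooms"] = df["total_bedrooms"].fillna(bedrooms_median)

# 80/20 split; a fixed random_state (assumed here) makes the shuffle reproducible.
X = df.drop(columns="median_house_value")
y = df["median_house_value"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```

The sketch imputes before splitting, in the order the text describes. Computing the median on
the training split only would keep information from the test rows out of the imputation.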

Overview of Random Forest Algorithm


The algorithm used in this study was Random Forest, an ensemble learning technique that combines
many decision trees to make better predictions. It builds a large number of decision trees
during training and, for regression, averages their outputs, which reduces the risk of
overfitting (Adetunji et al., 2022). Every tree in the forest is grown on a bootstrap sample of
the training data, and only a random subset of the features is considered at each split. This
randomization makes the trees more diverse and, in turn, makes the resulting model more robust.
The algorithm performs well on regression problems because it handles non-linear relationships,
and it also provides a measure of feature importance that helps to interpret the factors behind
the predictions (Soltani et al., 2022).
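
To make the mechanism concrete, the sketch below builds a small forest by hand from individual
decision trees: each tree is fitted on a bootstrap sample, considers a random subset of features
at each split, and the forest prediction is the average over trees. It is a teaching
illustration on synthetic data, not how the study's model was trained; the number of trees (25),
the feature subset size (2), and the data-generating formula are all assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic non-linear regression data (illustrative only).
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(500, 4))
y = X[:, 0] ** 2 + np.sin(X[:, 1]) + rng.normal(0, 0.1, size=500)

n_trees = 25
trees = []
for i in range(n_trees):
    # Bootstrap sample: draw as many rows as the data has, with replacement.
    idx = rng.integers(0, len(X), size=len(X))
    # max_features=2 makes every split consider a random subset of 2 features.
    tree = DecisionTreeRegressor(max_features=2, random_state=i)
    tree.fit(X[idx], y[idx])
    trees.append(tree)

# Regression forest prediction: the mean of the per-tree predictions.
X_new = rng.uniform(-3, 3, size=(5, 4))
per_tree = np.stack([t.predict(X_new) for t in trees])  # shape (n_trees, 5)
forest_pred = per_tree.mean(axis=0)                     # shape (5,)
```

In practice RandomForestRegressor performs these steps internally; the loop is written out only
to show where the bootstrap and the feature subsampling enter.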

Model Implementation
The model was implemented with the RandomForestRegressor class from the sklearn.ensemble
package. In the code, the hyperparameters were tuned with grid search cross-validation
(GridSearchCV) over the number of trees, the maximum depth of the trees, and the minimum number
of samples required to split a node. The search scanned a pre-specified parameter grid to find
the combination with the lowest MSE. Because the target is continuous, this is a regression
problem rather than a classification problem; cross-validation was used to check how well each
candidate generalizes to unseen data within the folds of the training set. The best model was
then evaluated on the test set, where it achieved a competitive R² and a low MSE, as presented
in the results section. This procedure shows that the algorithm can handle the various factors
inherent in housing price prediction.
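
A tuning loop of this kind might look like the sketch below. The three tuned parameters match
those named in the text, but the grid values, the number of folds (3), the seed, and the
synthetic data from make_regression are assumptions chosen to keep the example small; the
study's actual grid is not given in the source.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

# Small synthetic regression problem standing in for the housing data.
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Hypothetical grid over the three hyperparameters named in the text.
param_grid = {
    "n_estimators": [20, 50],
    "max_depth": [None, 10],
    "min_samples_split": [2, 5],
}

# scikit-learn maximizes scores, so MSE is passed in negated form.
search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid,
    scoring="neg_mean_squared_error",
    cv=3,
)
search.fit(X_train, y_train)

# GridSearchCV refits the best configuration on the full training set.
best_model = search.best_estimator_
test_r2 = best_model.score(X_test, y_test)  # R² on the held-out test set
print(search.best_params_, -search.best_score_, test_r2)
```

Only the training split is passed to the search, so the test set stays untouched until the
final evaluation.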

4. Results and Discussion


Performance Metrics
The Random Forest model showed high predictive accuracy and performed well on both the training
and the test data. The Mean Squared Error (MSE) on the test set was approximately 25500, which
measures how closely the predictions match the actual housing prices. Moreover, the R² score of
0.82 on the test set indicates that the model accounted for 82 percent of the variation in the
target variable. These metrics confirm that the Random Forest algorithm is suited to the
challenges of this dataset, such as non-linearity and the variety of the features. Both metrics
were computed with the mean_squared_error and r2_score functions from the sklearn.metrics
module.
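
The two metrics and the relationship between them can be checked on a toy example. The four
values below are made up for illustration and have nothing to do with the study's results; the
point is only that R² equals one minus the MSE divided by the variance of the true values.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Toy targets and predictions (illustrative values only).
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 8.0])

mse = mean_squared_error(y_true, y_pred)  # mean of squared errors = 0.375
r2 = r2_score(y_true, y_pred)             # 1 - SS_res / SS_tot = 0.925

# Same quantity by hand: MSE divided by the (population) variance of y_true.
manual_r2 = 1.0 - mse / np.var(y_true)
print(mse, r2, manual_r2)
```

Because MSE is expressed in squared units of the target, its square root (RMSE) is often easier
to compare with actual prices.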

Comparison with Other Models
During exploratory experimentation, other machine learning algorithms, including Linear
Regression and Decision Trees, were also tried. Linear Regression was less effective at
capturing the non-linear trends in the data and gave an R² of 0.65. The Decision Tree was more
accurate, with an R² of 0.78, but it was found to overfit and therefore showed higher variance
when tested on new data. Random Forest mitigated these problems because it trains multiple
decision trees and averages their results. This comparison shows that ensemble methods worked
better for this regression problem, as observed in the notebook.
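
A comparison of this kind can be sketched as below. The data are synthetic, with a deliberately
non-linear target, so the scores will not match the figures reported above; the model settings
and the seed are likewise assumptions. Recording both training and test R² makes overfitting
visible as a gap between the two.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data with a quadratic (non-linear) dependence on the first feature.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(1000, 3))
y = X[:, 0] ** 2 + rng.normal(0, 0.5, size=1000)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

models = {
    "linear": LinearRegression(),
    "tree": DecisionTreeRegressor(random_state=42),
    "forest": RandomForestRegressor(n_estimators=100, random_state=42),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = {
        "train_r2": model.score(X_train, y_train),
        "test_r2": model.score(X_test, y_test),
    }

for name, s in scores.items():
    print(f"{name}: train R2 = {s['train_r2']:.3f}, test R2 = {s['test_r2']:.3f}")
```

An unpruned decision tree fits the training rows almost perfectly, so a noticeably lower test
score for the tree than for the forest is the pattern the text describes as overfitting.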

Insights Derived from Results


The feature importance analysis of the Random Forest model established that median_income was
the most important factor in predicting housing prices across tracts, followed by latitude and
longitude, which capture geography and location. This corresponds to real-world conditions,
where income level and proximity to urban or coastal zones influence property prices (Yoshida
et al., 2024). The remaining features, such as the total number of rooms, had a much smaller
effect on the model, suggesting that simple aggregate counts are less informative. These
insights can guide readers in choosing real estate investment opportunities and in planning the
future development of cities. Furthermore, the performance of the model illustrates the
applicability of machine learning to real-life problems, which strengthens the case for its use
in AI-driven analytics (Kang et al., 2021; Soltani et al., 2022).
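
The importance ranking discussed above comes from the feature_importances_ attribute of a fitted
forest. The sketch below shows how such a ranking is read off; the three columns borrow names
from the dataset, but the values and the formula that generates the target are invented so that
median_income dominates by construction.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic features named after dataset columns (values are illustrative only).
rng = np.random.default_rng(0)
n = 1000
X = pd.DataFrame({
    "median_income": rng.uniform(0.5, 15.0, size=n),
    "latitude": rng.uniform(32.5, 42.0, size=n),
    "total_rooms": rng.uniform(100, 5000, size=n),
})
# Assumed target: driven mostly by income, slightly by latitude, not by rooms.
y = 40000 * X["median_income"] - 5000 * X["latitude"] + rng.normal(0, 10000, size=n)

model = RandomForestRegressor(n_estimators=100, random_state=42).fit(X, y)

# Impurity-based importances, one per feature, normalized to sum to 1.
importances = pd.Series(
    model.feature_importances_, index=X.columns
).sort_values(ascending=False)
print(importances)
```

These importances are impurity-based and can favour features with many distinct values, so they
are best read as a ranking rather than as exact effect sizes.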

5. Legal, Social, Ethical, and Professional Issues (LSEPI)


Using machine learning models such as Random Forest for housing price prediction raises ethical
concerns. One issue is that the algorithm may amplify prejudice that is already present in the
dataset. For example, if the historical data are biased against a certain group of people, the
model inherits that bias and produces biased results with discriminatory pricing patterns. This
could harm populations that can barely afford housing, thereby deepening social inequalities
(Soltani et al., 2022). Furthermore, socioeconomic variables such as median_income can raise
ethical issues when they are incorporated into the model and act as proxies for one or more
sensitive attributes such as race or ethnicity. It is therefore necessary to explain clearly how
these models are built and used, and to put measures in place that prevent bias. Ethical AI
means AI that is fair, accountable, and explainable, and this study underlines that these values
are crucial for preserving trust and acting in a socially responsible way (Sarker, 2021).

The consequences of housing price forecasts are not only ethical; they touch the interests of
many participants. Accurate estimates can help real estate agents, city planners, and policy
makers to make better decisions. For instance, developers can use the forecasts to locate areas
with high rental demand, whereas policy makers can rely on them to address housing scarcity or
unaffordable prices (Ho et al., 2021). There are also risks. Reliance on automated predictions
could produce undesired results: markets could become grounds for speculation, or investors
could use the predictions to manipulate prices. Moreover, an automated system may work from
incorrect input data, which would harm stakeholders, for instance through mispriced properties
or through the exclusion of groups of people with certain characteristics from appropriate
housing opportunities. To overcome these challenges, the use of AI in housing must be legal,
ethical, and socially responsible so that it supports equality in housing (Kang et al., 2021).

6. Conclusion
The Random Forest algorithm was applied effectively to the California Housing Dataset to predict
housing prices, with an R² value of 0.82 and a Mean Squared Error of around 25500. These results
demonstrate the ability of the algorithm to identify the complex non-linear patterns present in
the data. The findings showed that median_income and the geographical coordinates (latitude and
longitude) were the strongest influences on housing prices, in line with what is observed in the
real world. Nevertheless, certain upgrades could improve the accuracy and practicality of the
model. It is possible to include other variables, for instance economic indicators or market
trends, or to use more complex models such as Gradient Boosting or XGBoost to seek better
accuracy. Furthermore, extending the dataset with external data and accounting for the capping
of house prices, for example, would help the model to be applied more widely. Beyond real
estate, the methodology and findings of this study could extend to other large cities,
transportation systems, infrastructure construction, and financial forecasting, and hence show
the reach and significance of machine learning in addressing societal issues.

7. References
Adetunji, A. B., Akande, O. N., Ajala, F. A., Oyewo, O., Akande, Y. F., & Oluwadara, G. (2022).
House price prediction using random forest machine learning technique. Procedia Computer
Science, 199, 806-813.
Imran, I., Zaman, U., Waqar, M., & Zaman, A. (2021). Using machine learning algorithms for
housing price prediction: The case of Islamabad housing data. Soft Computing and Machine
Intelligence, 1(1), 11-23.
Yoshida, T., Murakami, D., & Seya, H. (2024). Spatial prediction of apartment rent using
regression-based and machine learning-based approaches with a large dataset. The Journal of
Real Estate Finance and Economics, 69(1), 1-28.
Soltani, A., Heydari, M., Aghaei, F., & Pettit, C. J. (2022). Housing price prediction incorporating
spatio-temporal dependency into machine learning algorithms. Cities, 131, 103941.
Kang, Y., Zhang, F., Peng, W., Gao, S., Rao, J., Duarte, F., & Ratti, C. (2021). Understanding
house price appreciation using multi-source big geo-data and machine learning. Land Use
Policy, 111, 104919.
Rico-Juan, J. R., & de La Paz, P. T. (2021). Machine learning with explainability or spatial
hedonics tools? An analysis of the asking prices in the housing market in Alicante,
Spain. Expert Systems with Applications, 171, 114590.
Ho, W. K., Tang, B. S., & Wong, S. W. (2021). Predicting property prices with machine learning
algorithms. Journal of Property Research, 38(1), 48-70.
Zaki, J., Nayyar, A., Dalal, S., & Ali, Z. H. (2022). House price prediction using hedonic pricing
model and machine learning techniques. Concurrency and Computation: Practice and
Experience, 34(27), e7342.
Sarker, I. H. (2021). Machine learning: Algorithms, real-world applications and research
directions. SN Computer Science, 2(3), 160.
Balaji, T. K., Annavarapu, C. S. R., & Bablani, A. (2021). Machine learning algorithms for social
media analysis: A survey. Computer Science Review, 40, 100395.