Housing[1]
1. Introduction
2. Data Description
3. Methodology
4. Results and Discussion
5. Legal, Social, Ethical, and Professional Issues (LSEPI)
6. Conclusion
7. References
1. Introduction
The features of the California housing dataset mainly describe housing properties, for
example the size of the house, the median value of the house, the number of rooms, and
population density. Because of its content and variety, the dataset is well suited to applying
and testing machine learning methods in practice. The main aim of this work is to predict
housing prices, the target variable, using the Random Forest algorithm, an ensemble machine
learning technique chosen for its strong predictive accuracy and its compatibility with large
datasets (Adetunji et al., 2022; Yoshida et al., 2024). Housing price prediction is not only one
of the most relevant applications of Artificial Intelligence (AI), but also an essential
instrument for supporting stakeholders in real estate, urban planning, and economic studies
(Kang et al., 2021; Soltani et al., 2022). This work therefore seeks to apply this model to
housing price prediction and to improve its efficiency and interpretability in real-world
scenarios that include spatial dependencies and market factors (Imran et al., 2021; Ho et al.,
2021).
2. Data Description
The data, sourced from Kaggle, comprises a set of housing-related variables for different
regions of California. These variables include median income, housing age, total rooms, total
bedrooms, population, households, latitude, and longitude, with median house value as the
dependent variable. The target variable is a continuous quantitative measure: the median price
of the houses within a block group. The dataset combines socioeconomic and geographic
attributes, which makes it suitable for predictive modelling with machine learning (Adetunji et
al., 2022). It contains 20,640 rows and ten columns, which is large enough for training and
evaluating the models. However, it has certain drawbacks, such as missing values in the
total_bedrooms column, which require appropriate preprocessing so that they do not degrade
the performance of the model.
The properties of the dataset were examined through Exploratory Data Analysis (EDA). The
total_bedrooms column had blank entries in 207 rows, which is about 1.0% of the total records
(207 of 20,640). These gaps were later handled with an appropriate imputation technique. The
latitude and longitude information brought out the geographical distribution of high median
house values, which were mainly clustered on the coast, particularly around San Francisco and
Los Angeles. The distribution of the target variable, median house value, showed signs of a
ceiling at $500,000, which suggests that the dataset has an upper bound on recorded values. The
descriptive analysis indicated that the features' distributions varied greatly; median_income,
for example, ranged from 0.5 to 15, reflecting income
differences between areas. Correlation analysis revealed that median_income had the highest
positive correlation with house prices, while features such as housing_median_age and
total_rooms were also positively correlated with house prices (Soltani et al., 2022; Yoshida et
al., 2024). These observations are useful in the subsequent stages of model development and
feature engineering.
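The two EDA checks described above (share of missing values and correlation with the target) can be sketched as follows. This is a minimal illustration, not the notebook's actual code: the helper names missing_share and target_correlations are hypothetical, and the small data frame is made-up stand-in data that only reuses the column names from the text.

```python
import numpy as np
import pandas as pd


def missing_share(df: pd.DataFrame, column: str) -> float:
    """Percentage of rows in which `column` is missing."""
    return 100.0 * df[column].isna().sum() / len(df)


def target_correlations(df: pd.DataFrame, target: str) -> pd.Series:
    """Pearson correlation of each numeric feature with `target`, highest first."""
    corr = df.select_dtypes("number").corr()[target].drop(target)
    return corr.sort_values(ascending=False)


# Made-up stand-in data (assumption): NOT the real Kaggle file.
demo = pd.DataFrame({
    "median_income": [1.5, 3.0, 4.5, 6.0, 8.0],
    "total_bedrooms": [200.0, np.nan, 350.0, 410.0, 500.0],
    "median_house_value": [90000.0, 150000.0, 210000.0, 300000.0, 450000.0],
})

demo_missing = missing_share(demo, "total_bedrooms")
demo_corr = target_correlations(demo, "median_house_value")
print(demo_missing)
print(demo_corr)

# Arithmetic check of the share reported in the text: 207 of 20,640 rows.
reported_share = round(100 * 207 / 20640, 2)
print(reported_share)  # 1.0
```

On the real data the same two helpers would be called on the loaded data frame; the file name and loading step are left out here because the text does not specify them.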
3. Methodology
Data Preprocessing
The dataset had to be preprocessed before the models could be trained and evaluated. Some
values were missing in the total_bedrooms column, and these were filled by median imputation.
Because the median is robust to outliers, this method fills the gaps without substantially
distorting the distribution of the column. Feature scaling was judged unnecessary because the
Random Forest algorithm is invariant to the scale of the features, as noted in the notebook
implementation of the algorithm. The data was split into a training set containing 80 percent
of the records and a test set containing the remaining 20 percent, so that the performance of
the model is measured on data it has not encountered before. In this step the code applied the
train_test_split function from sklearn.model_selection to randomize the data split and
make it reproducible.
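The preprocessing steps above can be sketched as follows. This is an illustration under stated assumptions rather than the notebook's code: the helper name prepare_split, the seed value 42, and the synthetic data are hypothetical. The text does not say whether imputation happened before or after the split; this sketch takes the median from the training rows only, which is one common way to keep test data out of preprocessing.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split


def prepare_split(df, target="median_house_value", test_size=0.2, seed=42):
    """Split 80/20, then median-impute total_bedrooms using the training median."""
    X = df.drop(columns=[target])
    y = df[target]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=seed  # fixed seed: reproducible split
    )
    # Assumption: the median is computed on training rows only.
    median_bedrooms = X_train["total_bedrooms"].median()
    X_train = X_train.fillna({"total_bedrooms": median_bedrooms})
    X_test = X_test.fillna({"total_bedrooms": median_bedrooms})
    return X_train, X_test, y_train, y_test


# Synthetic stand-in data (assumption): 100 rows, every tenth bedroom count missing.
rng = np.random.default_rng(0)
n = 100
demo = pd.DataFrame({
    "median_income": rng.uniform(0.5, 15.0, n),
    "total_bedrooms": rng.integers(50, 1000, n).astype(float),
    "median_house_value": rng.uniform(50_000, 500_000, n),
})
demo.loc[demo.index[::10], "total_bedrooms"] = np.nan

X_train, X_test, y_train, y_test = prepare_split(demo)
print(len(X_train), len(X_test))  # 80 20
```

No scaler appears in the sketch, in line with the observation that tree-based splits do not depend on feature scale.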
Model Implementation
The Random Forest regressor was implemented with the RandomForestRegressor class from the
sklearn.ensemble package. In the code, the hyperparameters of the model were tuned with grid
search cross-validation (GridSearchCV) over parameters including the number of trees, the
maximum depth of the trees, and the minimum number of samples required to split a node. The
search scanned the pre-specified parameter grid to find the combination with the lowest mean
squared error (MSE). Since this is a regression problem, cross-validation was used to check how
well the model performs on held-out folds of the training set. The best model was then
evaluated on the test set, where it achieved a competitive R² and a low MSE, as presented in
the results section. This application shows that the algorithm is capable of addressing the
various factors inherent in housing price prediction.
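The tuning and evaluation procedure described above can be sketched as follows. The grid values, the three-fold setting, the seed, and the synthetic data are all assumptions made for illustration; the text names the tuned parameters but not the values that were searched.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic regression data (assumption): stands in for the housing features.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(300, 3))
y = 3.0 * X[:, 0] + np.sin(X[:, 1]) + rng.normal(0, 0.5, 300)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Hypothetical grid over the three hyperparameters named in the text.
param_grid = {
    "n_estimators": [50, 100],      # number of trees
    "max_depth": [None, 10],        # maximum depth of each tree
    "min_samples_split": [2, 5],    # minimum samples required to split a node
}

search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid,
    scoring="neg_mean_squared_error",  # maximising this minimises the MSE
    cv=3,
)
search.fit(X_train, y_train)

# Evaluate the refitted best model on the held-out test set.
best_model = search.best_estimator_
y_pred = best_model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(search.best_params_)
print(f"test MSE={mse:.3f}, R2={r2:.3f}")
```

GridSearchCV refits the best parameter combination on the full training set by default, so best_estimator_ can be used directly for the final test-set evaluation.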
5. Legal, Social, Ethical, and Professional Issues (LSEPI)
The consequences of housing price forecasts are not merely ethical; they are broader and touch
on several concerns of numerous stakeholders. Accurate estimates can help real estate agents,
city planners, and policy makers to make better decisions. For instance, developers can use the
forecasts to locate areas with high rental demand, whereas policy makers can rely on the
forecasts to address housing scarcity or unaffordable prices (Ho et al., 2021). But it is not
all good news; there are also risks. Reliance on automated predictions could produce undesired
results: markets could turn into speculation grounds, or investors could use the predictions to
manipulate prices. Moreover, an automated system may be fed incorrect data, which would have a
negative effect on stakeholders, for instance through mispriced properties or the exclusion of
groups of people with certain characteristics from appropriate housing opportunities. To
overcome these challenges, the use of AI for housing must be legal, ethical, and socially
responsible in order to promote equality in housing (Kang et al., 2021).
6. Conclusion
The Random Forest algorithm was applied effectively to the California Housing Dataset to
predict housing prices, achieving an R² of 0.82 and a Mean Squared Error of around 25500.
These results demonstrate the ability of the algorithm to identify the complex non-linear
patterns present in the data. The findings showed that variables such as median_income and the
geographical coordinates (latitude and longitude) strongly influenced housing prices,
consistent with what is observed in the real world. Nevertheless, there could be
certain upgrades that, when incorporated in the model, would improve its accuracy and
practicality. Other variables could be included, for instance economic indicators or market
trends, and boosting-based models such as Gradient Boosting or XGBoost could be tried in
pursuit of better accuracy. Furthermore, extending the dataset with external data and
correcting for the capping effect on house prices, for example, would make the model more
widely applicable. Beyond real estate, the methodology and findings of this study could extend
to other large cities, transportation systems, infrastructure construction, and financial
forecasting, and hence show the reach and significance of machine learning in addressing
various societal issues.
7. References
Adetunji, A. B., Akande, O. N., Ajala, F. A., Oyewo, O., Akande, Y. F., & Oluwadara, G. (2022).
House price prediction using random forest machine learning technique. Procedia Computer
Science, 199, 806-813.
Imran, I., Zaman, U., Waqar, M., & Zaman, A. (2021). Using machine learning algorithms for
housing price prediction: the case of Islamabad housing data. Soft Computing and Machine
Intelligence, 1(1), 11-23.
Yoshida, T., Murakami, D., & Seya, H. (2024). Spatial prediction of apartment rent using
regression-based and machine learning-based approaches with a large dataset. The Journal of
Real Estate Finance and Economics, 69(1), 1-28.
Soltani, A., Heydari, M., Aghaei, F., & Pettit, C. J. (2022). Housing price prediction incorporating
spatio-temporal dependency into machine learning algorithms. Cities, 131, 103941.
Kang, Y., Zhang, F., Peng, W., Gao, S., Rao, J., Duarte, F., & Ratti, C. (2021). Understanding
house price appreciation using multi-source big geo-data and machine learning. Land Use
Policy, 111, 104919.
Rico-Juan, J. R., & de La Paz, P. T. (2021). Machine learning with explainability or spatial
hedonics tools? An analysis of the asking prices in the housing market in Alicante,
Spain. Expert Systems with Applications, 171, 114590.
Ho, W. K., Tang, B. S., & Wong, S. W. (2021). Predicting property prices with machine learning
algorithms. Journal of Property Research, 38(1), 48-70.
Zaki, J., Nayyar, A., Dalal, S., & Ali, Z. H. (2022). House price prediction using hedonic pricing
model and machine learning techniques. Concurrency and Computation: Practice and
Experience, 34(27), e7342.
Sarker, I. H. (2021). Machine learning: Algorithms, real-world applications and research
directions. SN Computer Science, 2(3), 160.
Balaji, T. K., Annavarapu, C. S. R., & Bablani, A. (2021). Machine learning algorithms for social
media analysis: A survey. Computer Science Review, 40, 100395.