
WW-M1 Bernardo

The document discusses improving prediction model performance in machine learning through careful feature selection. It describes building a multiple linear regression model to predict car selling prices using variables like year. Initial metrics on the training data show the model fits reasonably well but leaves room for improvement, with a mean squared error of 126.64. Testing on new data finds stronger performance, suggesting the model generalizes well.

Predictive Modeling Strategies

Danielito U. Bernardo
Bachelor of Science in Information Technology
Jose Rizal University
Mandaluyong City, Philippines
[email protected]

Abstract - This paper outlines the significance of attribute selection in the formulation of effective predictive models, with particular emphasis on its application within multiple linear regression frameworks. The careful identification and selection of important features from a dataset is pivotal in enhancing the accuracy, computational efficiency, and clarity of predictive models. Correlation analysis is emphasized as a fundamental technique for attribute selection, providing a pathway for isolating variables that contribute significantly to the predictive capability of the model. Furthermore, this document details the procedural steps for constructing both simple and multiple linear regression models using the selected variables. The overarching goal of this research is to devise models that not only predict with high accuracy but also yield insights and interpretations that are directly applicable to real-world data scenarios.

Keywords - Coefficient, Correlation, Linear Regression, Multiple Linear Regression, Datasets, Models, Attributes, Values, Predictive Modeling

I. INTRODUCTION

A. Provide an overview of the exercise objective and the importance of improving prediction model performance in machine learning.

The objective of this exercise is to enhance the capability of prediction models in machine learning. This is accomplished by carefully selecting the most relevant attributes or features from the datasets, which supports both the accuracy and the efficiency of the models. Improving prediction model performance matters in machine learning for several reasons. First, it yields more precise predictions, a critical necessity across sectors such as finance, healthcare, and marketing. Precise predictions help businesses make well-informed decisions, streamline processes, and ultimately improve outcomes. Second, it improves computational efficiency. Through the judicious selection of relevant features, the model's complexity is reduced, which lowers memory usage and processing time. This is particularly significant when working with sizable datasets or real-time applications where speed is of paramount importance. Third, it improves interpretability. By pinpointing the most informative attributes, a deeper comprehension of the underlying patterns and relationships within the data is attained. This, in turn, enables the extraction of meaningful insights and facilitates more informed decision-making based on the model's predictions. In summary, improving prediction model performance is indispensable for achieving precise predictions, computational efficiency, and interpretability. It supports the development of resilient and insightful models capable of furnishing accurate predictions while enriching our comprehension of the data.

B. Briefly summarize the prediction model built in the previous exercise and its baseline performance

In the previous exercise, a multiple linear regression model was developed to forecast the selling price of cars. This model uses carefully chosen variables with strong correlations as predictors. Evaluating the model's initial performance involves several metrics. Notably, the correlation coefficient (r) between the year and selling price stands at -0.37, indicating a weak negative correlation: selling price tends to decrease slightly as the year value increases. Another important metric is the R-squared score, which measures the share of variability in selling price explained by the linear model incorporating the year. With a training-data R-squared score of 0.005, merely 0.5% of the selling price variability can be attributed to the year variable alone. Furthermore, the model has an intercept of -729.72, meaning that if all predictor variables were zero, the predicted selling price would be -729.72. Overall, the multiple linear regression model explains a notable portion of the selling price variance and generalizes effectively to new data. Nevertheless, it is crucial to acknowledge that the model's performance may be influenced by data cleaning, validation procedures, and the potential for overfitting.

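The metrics named above (correlation coefficient, R-squared, intercept) and the usual error metrics can be reproduced on a small scale. The following is a minimal sketch in Python, not the code used in the exercise: the year and selling_price values are invented toy data, and the helper names (pearson_r, fit_simple_ols, r_squared, mse) are hypothetical. It fits a one-predictor least-squares line and computes each metric by hand.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def fit_simple_ols(x, y):
    """Least-squares slope and intercept for y = intercept + slope * x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    return slope, my - slope * mx

def r_squared(y, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    my = sum(y) / len(y)
    ss_res = sum((a - p) ** 2 for a, p in zip(y, y_pred))
    ss_tot = sum((a - my) ** 2 for a in y)
    return 1 - ss_res / ss_tot

def mse(y, y_pred):
    """Mean of the squared differences between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(y, y_pred)) / len(y)

# Invented toy data (an assumption for illustration, not the exercise dataset).
year = [2010, 2012, 2014, 2016, 2018, 2020]
selling_price = [3.1, 4.0, 3.6, 5.2, 6.1, 5.8]

r = pearson_r(year, selling_price)
slope, intercept = fit_simple_ols(year, selling_price)
predicted = [intercept + slope * x for x in year]
r2 = r_squared(selling_price, predicted)
error = mse(selling_price, predicted)
rmse = math.sqrt(error)  # same units as selling_price
```

For a one-predictor least-squares fit evaluated on its own training data, R-squared equals the square of r, so a correlation of -0.37 would correspond to an R-squared of about 0.137. This gives a quick consistency check when comparing the two figures.
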
II. INITIAL MODEL PERFORMANCE

A. Describe the initial performance of the prediction model, including evaluation metrics such as accuracy, precision, recall, or mean squared error (MSE).

When we first tested the prediction model using our training datasets, the results were promising. The Mean Squared Error (MSE), which measures the average of the squared differences between actual and predicted values, was recorded at 126.64. This figure might seem abstract at first glance, but it's crucial for assessing the model's accuracy. In simpler terms, the average squared deviation of the model's predictions from the actual selling prices is 126.64.

Complementing the MSE, the Root Mean Squared Error (RMSE) stood at 11.25. The RMSE is particularly insightful because it brings the scale of our errors back to the original units of our target variable, making it easier to interpret. An RMSE of 11.25 means that, typically, the model's predictions are off by approximately 11.25 units from the actual selling prices.

The significance of these metrics can't be overstated. A lower MSE value indicates a model that closely mirrors the training data, suggesting a better fit. In layman's terms, these initial metrics are encouraging because they show our model has a strong grasp of the datasets, promising reliable predictions of selling prices. However, it's crucial to remember that these numbers are just the starting point. They set the baseline for how well the model can be expected to perform, and ideally, we'd like to see these errors fall further as we refine the model.

B. Present any insights gained from analyzing the model's performance on the test datasets.

When we assess the model's performance, we look at key metrics like the correlation coefficient and the R-squared score. These metrics help us understand how accurately the model predicts car selling prices. For instance, the correlation coefficient between the year and the selling price is -0.37. This suggests a weak negative correlation, indicating a slight decrease in selling price as the year value increases. Then there's the R-squared score, which tells us how much of the selling price variability can be explained by the model. In the training data, the R-squared score is only 0.005, meaning just 0.5% of the variability can be attributed to the year alone. But here's the interesting part: when we test the model on new data, the R-squared score jumps to 0.864. This shows that the model performs really well on unseen data, accurately predicting selling prices for cars not in the training set. The model also includes an intercept of -729.72, which is the predicted value when all other variables are zero.

Overall, while the model does a good job on new data and captures some relationship between the year and the selling price, it's clear that relying solely on the year isn't enough. We need to consider other factors to make more precise predictions about car prices.

III. CHALLENGES FACED

A. Identify and discuss the main challenges encountered during model evaluation, as discussed in the lecture on the main challenges of machine learning methods.

During the exercise, several challenges were encountered in the context of predictive modeling. One of the main challenges discussed was dealing with missing values in the datasets. It was acknowledged that if the datasets had missing values, it could have been difficult to decide on an appropriate strategy.

Another challenge mentioned was inconsistent data formats or errors in measurement, which can lead to noise in the model.

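One common way to handle the missing-value challenge noted above is to fill gaps with the mean or median of the observed values. The sketch below is a minimal illustration in Python under stated assumptions: the impute_missing helper and the mileage readings are hypothetical and are not taken from the exercise dataset, and missing entries are assumed to be represented as None.

```python
from statistics import mean, median

def impute_missing(values, strategy="median"):
    """Return a copy of values with None entries replaced by the
    mean or median of the observed (non-None) entries."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    if strategy == "median":
        fill = median(observed)
    elif strategy == "mean":
        fill = mean(observed)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [fill if v is None else v for v in values]

# Hypothetical mileage readings; None marks a missing value.
mileage = [42000, None, 58000, 61000, None, 250000]

median_filled = impute_missing(mileage, "median")
mean_filled = impute_missing(mileage, "mean")
```

With these toy values the median fill is 59500 and the mean fill is 102750. The gap comes from the single very large reading, which is why the median is often preferred when a column contains outliers.
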
The issue of overfitting, where the model fits the training data too closely and fails to generalize to new data, was also discussed as a challenge.

B. For each challenge, provide specific examples from the datasets or model evaluation process

When evaluating models with datasets that contain missing values, determining a suitable strategy can pose a challenge. For instance, if the car dataset has missing values in the "mileage" attribute, these gaps must be addressed during the model evaluation process. This typically involves imputing missing values with the mean or median, or employing more sophisticated techniques such as regression imputation.

In the model evaluation phase, when the linear regression model shows exceptional performance on the training datasets but struggles with unseen data, it suggests overfitting. For example, if the model accurately predicts car selling prices within the training set but fails to generalize to new data, techniques like cross-validation become necessary to detect and curb overfitting and enhance the model's generalization ability.

When the linear regression model lacks complexity and cannot adequately capture the inherent patterns within the car datasets, it leads to underfitting. Consequently, the model exhibits high bias and low variance, resulting in subpar performance on both the training and test datasets. To address underfitting, enhancements to the feature engineering process, the inclusion of additional pertinent features, or the adoption of more complex models may be necessary during the model evaluation stage.

IV. STRATEGIES APPLIED

A. Discuss and show how you built both the simple and multiple linear models using the selected variables.

B. Show model performance evaluation for both models and discuss your interpretation of the results.

V. IMPROVED MODEL PERFORMANCE

A. Present the updated performance metrics of the prediction model after applying the improvement strategies.

B. Provide a comparative analysis between the initial and improved model performance.

C. Discuss what you think are implications of your observation for real-world applications

VI. CONCLUSIONS

A. Summarize the key findings of the laboratory exercise, including the challenges faced, strategies applied, and the impact on model performance.

B. Reflect on the importance of addressing model challenges and continuous improvement in machine learning projects

REFERENCES

[1] Salim, F., & Abu, N. A. (2021). Used car price estimation: Moving from linear regression towards a new S-curve model. International Journal of Business and Society, 22(3), 1174-1187. https://www.researchgate.net/publication/357130079_Used_Car_Price_Estimation_Moving_from_Linear_Regression_towards_a_New_S-Curve_Model

[2] Sharma, A. D., & Sharma, V. (2020). Used car price prediction using linear regression model. IRJMETS, 2, 946-953. https://www.irjmets.com/uploadedfiles/paper/volume2/issue_11_november_2020/4868/1628083194.pdf

[3] Muti, S., & Yıldız, K. (2023). Using linear regression for used car price prediction. International Journal of Computational and Experimental Science and Engineering, 9(1), 11-16. https://www.researchgate.net/publication/369079425_Using_Linear_Regression_For_Used_Car_Price_Prediction
