DS-203: E2 Assignment - Linear Regression Report: Sahil Barbade (210040131) 29th Jan 2024
Abstract
This report presents a comprehensive analysis of three datasets, namely E2-set1.csv, E2-
set2.csv, and E2-set3.csv, with a focus on understanding the interplay between data quality,
sample size, and various metrics in the context of Linear Regression. The study involves pre-
processing the datasets to assess their characteristics, comparing their quality, and subsequently
applying Linear Regression models to investigate the impact of different sample sizes.
Introduction
This report explores three datasets—E2-set1.csv, E2-set2.csv, and E2-set3.csv—focusing on how
data quality, sample size, and various metrics impact Linear Regression. We start by examining
and comparing dataset characteristics through preprocessing.
Moving on, we analyze the effect of different sample sizes (5 to 100) on Linear Regression
metrics. This includes investigating coefficients, p-values, Confidence Intervals, and key metrics
like R2, MSE, and F-Statistic. Our analysis goes beyond surface-level observations, incorporating
a thoughtful exploration grounded in theory.
We also review the provided Python notebook, E2-process-data.ipynb, to understand its flow
and outputs, providing a foundational understanding for subsequent analyses.
Specific datasets, like E2-set1.csv, are scrutinized for peculiar trends, such as the stagnant average R2 value despite an increase in sample size. Similar attention is given to E2-set3.csv, where
a surprisingly high average R2 value with a small sample size prompts thoughtful conclusions.
In summary, this report not only fulfills exercise requirements but also serves as a practical
guide to navigating the complexities of Linear Regression with different datasets and sample sizes.
The subsequent sections present detailed analyses, critical observations, and practical guidelines
for evaluating and accepting Linear Regression models.
PART - A
In our analysis of the three provided datasets, we aim to delve into their distinctive characteristics.
To achieve a comprehensive understanding, we address the following inquiries:
• Mean Squared Error (MSE): MSE measures the average squared difference between actual and predicted values. A lower MSE indicates that the model's predictions are closer to the actual values, representing better accuracy.
• Residual Analysis (QQ Plot, Density Plot, Scatter Plot of Errors): Residual analysis assesses the distribution and patterns of errors after training the model. A distribution of errors centered around zero, as seen in QQ plots and density plots, indicates that the model's predictions are unbiased. Scatter plots of errors help visually identify any patterns or trends in the residuals.
• Pearson Correlation Coefficient (r): The Pearson correlation coefficient quantifies the degree of linear correlation between two variables. The value of r ranges from -1 to 1. A positive value (closer to 1) indicates a positive linear correlation (as one variable increases, the other tends to increase). A negative value (closer to -1) indicates a negative linear correlation (as one variable increases, the other tends to decrease). A value of 0 indicates no linear correlation. The closer the absolute value of r is to 1, the stronger the linear relationship; the sign indicates the direction of the relationship.
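As an illustrative sketch of the two metrics above, using hypothetical values rather than the actual dataset files, MSE and Pearson's r can be computed directly with NumPy:

```python
import numpy as np

# Hypothetical paired observations and model predictions (illustrative only)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
y_pred = 2.0 * x  # predictions from a hypothetical fitted line y = 2x

# MSE: mean of squared residuals
mse = np.mean((y - y_pred) ** 2)

# Pearson r: off-diagonal entry of the 2x2 correlation matrix
r = np.corrcoef(x, y)[0, 1]

print(mse, r)  # small MSE, r close to 1 (strong positive linear relationship)
```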
Significance Metrics
• F-Statistic: The F-Statistic tests the overall significance of the regression model. A higher
F-Statistic suggests that the linear regression model is statistically significant, indicating that
at least one independent variable has a nonzero effect on the dependent variable.
• P-values for Coefficients (Intercept and Slope): P-values indicate the probability that the observed results (or more extreme ones) could have occurred by chance. Lower p-values (typically below 0.05) suggest that the corresponding coefficients are statistically significant. In the context of linear regression, this means the intercept and slope are likely not zero.
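A minimal sketch of obtaining the slope's p-value and the overall F-statistic with SciPy, on synthetic data (not the assignment's CSV files). For simple linear regression with a single predictor, the overall F-statistic equals the squared t-statistic of the slope, which is what the snippet exploits:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = np.linspace(0, 10, 50)
y = 3.0 * x + 2.0 + rng.normal(0, 1.0, size=x.size)  # synthetic linear data

res = stats.linregress(x, y)
print(res.pvalue)  # p-value for the slope (H0: slope = 0)

# SLR identity: F = t^2 = (slope / stderr)^2
f_stat = (res.slope / res.stderr) ** 2
print(f_stat)  # large F => variation explained is unlikely to be chance
```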
A higher Pearson correlation coefficient suggests a more predictable and consistent linear relationship between the variables. For linear regression, a strong correlation is often desirable, as it implies that the independent variable can explain a significant portion of the variance in the dependent variable. However, while a high Pearson correlation indicates a potentially good fit for linear regression, other factors (such as the statistical significance of coefficients, normality of residuals, etc.) should also be considered.
Data Distribution and SLR Fitting
• Dataset 1: The scatter plot for Dataset 1 illustrates a wide dispersion of data points around the regression line, indicating considerable variability in the relationship between the independent and dependent variables. The scattered distribution suggests the possible presence of heteroscedasticity or subtle non-linear patterns in the data. These deviations from a perfectly linear relationship in Set 1 warrant further investigation to comprehend the underlying dynamics influencing the observed variability.
• Dataset 2: In contrast, the scatter plot for Dataset 2 shows a more tightly clustered group
of data points around the regression line, indicating a strong and clear linear relationship
with reduced scatter. This concentrated clustering suggests lower variability and a consistent
linear pattern between the variables. The cohesive nature of the data points in Set 2 indicates
a higher level of predictability, making it an ideal candidate for linear regression modeling.
The reduced scatter implies a stronger correlation between the variables, contributing to a
more dependable predictive model.
• Dataset 3: Moving on to Dataset 3, the scatter plot demonstrates an even more concentrated
distribution of data points around the regression line compared to Sets 1 and 2. This suggests
an extremely consistent linear relationship with minimal scatter. The clustered distribution
of data points in Set 3 indicates minimal variability, pointing toward an almost perfect
linear pattern in the relationship between the variables. The precision and tightness of this
relationship in Set 3 make it an intriguing dataset, potentially reflecting a well-defined and
predictable relationship between the variables. The minimal scatter suggests a high level of
homoscedasticity and reinforces the reliability of this linear model.
Figures: scatter plots with fitted regression lines for Dataset 2 and Dataset 3.
• From the scatter plot of errors, one thing to note is that there is no clear funnel shape, which is good because a funnel shape would indicate heteroscedasticity, where the variance of the residuals changes with the level of the prediction.
• In Quantile-Quantile Plot, there are some deviations from the line at the ends of the plot,
which may indicate that the extreme values of the residuals do not perfectly align with the
normal distribution (potentially some light tails).
• Plots indicate that the residuals from the SLR on Dataset 1 are mostly randomly distributed,
centered around zero (mean), generally follow a normal distribution, and do not exhibit clear
signs of heteroscedasticity. There are minor deviations from normality, especially in the tails,
but these might not be significant depending on the context and the specifics of the analysis.
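As a sketch of how such residual diagnostics could be computed, using synthetic data (the actual CSV files are not reproduced here) and SciPy's probplot, whose straight-line fit to the QQ points gives a quick numeric normality check:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 100)
y = 2.0 * x + 1.0 + rng.normal(0, 0.5, size=x.size)  # synthetic linear data

# Fit SLR and compute residuals
slope, intercept, *_ = stats.linregress(x, y)
residuals = y - (intercept + slope * x)

# Residuals from least squares are centered at zero by construction
print(residuals.mean())

# probplot returns ordered residuals vs. theoretical normal quantiles,
# plus the r of a straight-line fit to the QQ points (near 1 => near-normal)
(osm, osr), (qq_slope, qq_intercept, qq_r) = stats.probplot(residuals)
print(qq_r)
```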
Analysis Comparison
• Mean Squared Error (MSE): Data 3 has the lowest MSE (0.00975), indicating that the
model’s predictions are closest to the actual values when compared to Data 1 (0.548419) and
Data 2 (0.243742). This suggests that the model built using Data 3 has the smallest average
squared differences between predicted and actual values.
• R-squared (R2): R2 measures the proportion of the variance in the dependent variable that
is predictable from the independent variable(s). Data 3 has the highest R2 (0.981812),
indicating that it explains a significant portion of the variance in the dependent variable,
followed by Data 2 (0.672446) and Data 1 (0.469074). This suggests that the model based
on Data 3 provides the best fit to the data.
• F-Statistic: The F-Statistic tests the overall significance of the regression model. Data 3
has an exceptionally high F-Statistic (53872.290354), indicating a strong overall fit of the
model, surpassing both Data 1 (881.733866) and Data 2 (2048.830373). This suggests that
the overall regression model based on Data 3 is highly significant.
• Coefficients and P-values: All datasets have highly statistically significant coefficients, as indicated by the very low p-values. However, Data 3 stands out with p-values of 0.0 for both coefficients, indicating an extremely high level of significance.
• Confidence Intervals: The confidence intervals for coefficients in all datasets are relatively
narrow, suggesting a high level of precision in the estimates. This indicates a high level of
confidence in the range of values within which the true population parameters are likely to
fall.
Conclusion
• In conclusion, the comparison of regression models across three datasets reveals distinct
performance differences. Data 3 stands out as the top-performing dataset, with the lowest
mean squared error (MSE) and the highest coefficient of determination (R2), indicating the
best fit for the regression model. Additionally, its F-Statistic is substantially higher and its coefficient p-values are effectively zero, signifying a high level of statistical significance.
• Data 2 also demonstrates strong performance, particularly in terms of R2 and F-Statistic,
suggesting a good fit for the regression model and high statistical significance. However,
Data 1 lags behind in these metrics, indicating a relatively weaker fit and lower statistical
significance.
• Furthermore, all datasets exhibit narrow confidence intervals for coefficients, indicating a
high level of precision in the estimates and a high degree of confidence in the range of values
within which the true population parameters are likely to fall.
2 Sample-Wise Summary
Dataset 1
Figure 8: Dataset 1 : Sample Size 10
Figure 10: Dataset 1 : Sample Size 50
Dataset 2
Figure 14: Dataset 2 : Sample Size 20
Figure 16: Dataset 2 : Sample Size 100
Dataset 3
Figure 19: Dataset 3 : Sample Size 20
Figure 21: Dataset 3 : Sample Size 100
3 Significant Findings
3.1 Dataset 1
1] The fact that the average R2 value does not improve even when increasing the sample size from 10 to 100 could be due to several factors:
• The underlying relationship between the independent and dependent variable is weak or
non-linear, and the model may be underfitted to the data.
• The additional data does not provide new information that helps better explain the variation
in the dependent variable.
• There is high variability in the dataset that is not explained by the model, possibly due to
high levels of noise in the data.
• The model may not be the best option for the data, suggesting a different type of model or
additional explanatory variables may be needed.
Since the R2 value remains low even with a larger sample size, it may indicate that the models
do not adequately capture the underlying relationship, making them less reliable for predictions
or understanding the data.
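The sampling experiment behind this observation can be sketched as follows. The snippet uses a synthetic noisy population as a stand-in for Dataset 1 (the real data files are not reproduced here): because the noise, not the sample size, limits how much variance the line can explain, the average R2 stays roughly flat as n grows.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Synthetic population: weak linear signal drowned in large noise
N = 5000
x_pop = rng.uniform(0, 10, N)
y_pop = 0.5 * x_pop + rng.normal(0, 2.0, N)

def avg_r2(sample_size, n_repeats=200):
    """Average R^2 over repeated random samples of a given size."""
    r2s = []
    for _ in range(n_repeats):
        idx = rng.choice(N, size=sample_size, replace=False)
        res = stats.linregress(x_pop[idx], y_pop[idx])
        r2s.append(res.rvalue ** 2)
    return float(np.mean(r2s))

# Average R^2 barely moves with sample size when noise dominates
for n in (10, 30, 100):
    print(n, round(avg_r2(n), 3))
```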
Figure 22: Metrics Variation over Sample Size
2] From the variation of average metrics (R2, MSE, p-value, etc.) with sample size, we can draw these insights:
• The mean squared error (MSE) decreases with larger sample sizes, indicating the model’s
predictions become more accurate on average.
• The R-squared (R2) value increases with more data, meaning a higher proportion of the
variability in the target variable is explained by the model.
• In terms of usability, the trends suggest the models will perform better - providing more
reliable and accurate predictions - when trained on a larger sample of data.
• Whether to use the models in a real-world application would depend on factors like acceptable error thresholds, the project goals, and the costs of data collection versus the gains in precision. If performance continues improving and meets requirements, a model trained on the largest available sample size would likely be the best choice.
3] In terms of the usability of the models, the decreasing trend in the significance values indicates
the model parameters become more meaningful as the sample size increases. This suggests the
model reliability and predictive ability may improve with larger datasets. If the significance drops
below common thresholds like 0.05, it would be reasonable to consider using the model.
Figure 24: Model’s Significance over Samples
4] As the sample size increases, the F-statistic of the simple linear regression (SLR) model also increases, showing a positive correlation. This indicates that the model is becoming more statistically significant as the sample size gets larger. Generally speaking, a higher F-statistic value suggests the variation explained by the model is likely not due to chance.
5] As the sample size increases, the confidence intervals for both the constant and the slope narrow, indicating more precise estimates. The blue line for the constant shows a steep initial decline and then levels off, suggesting marginal gains from further increasing the sample size. The orange dotted line for the slope follows a similar pattern but with generally lower values.
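The narrowing of the slope's confidence interval with sample size can be sketched on synthetic data (a hypothetical linear process, not the assignment's datasets), using the slope standard error from SciPy and the t critical value:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

def slope_ci_width(n, alpha=0.05):
    """Half-width of the 95% CI for the slope on a random sample of size n."""
    x = rng.uniform(0, 10, n)
    y = 1.5 * x + rng.normal(0, 1.0, n)  # hypothetical linear process
    res = stats.linregress(x, y)
    t_crit = stats.t.ppf(1 - alpha / 2, df=n - 2)  # t critical value, df = n - 2
    return t_crit * res.stderr

w_small = slope_ci_width(10)
w_large = slope_ci_width(100)
print(w_small, w_large)  # the interval narrows as n grows
```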
3.2 Dataset 3
1] The average R2 value (averaged across samples for a given sample size) does not improve even when the sample size is increased from 10 to 100, for these possible reasons:
• The MSE appears to be constant across all sample sizes with a value slightly above 0. On
the other hand, the R2 is also constant but with a value slightly below 1.0, indicating an
almost perfect or near perfect fit.
• A low MSE close to 0, as shown in the graph, indicates that the model's predictions are very close to the actual data points, which is desirable, although it may also signal overfitting. An R2 value close to 1 indicates a model that explains almost all the variability of the response data around its mean. A high R2 value is generally considered indicative of a good fit.
• Given that these metrics show consistent and ideal performance across all sample sizes, the
conclusions that can be drawn about the usefulness of the models based on this dataset are:
• The model appears to be highly accurate, as indicated by the low MSE. The model has excellent predictive power, as indicated by the high R2. Therefore, based on this dataset and the provided metrics, it seems justified to use this model.
• However, it is important to note that this is an idealized scenario which is unlikely in real-
world data. This leads to some skepticism as models rarely perform perfectly across different
sample sizes unless the data is very simple or the model is overfit. In practice, other factors like data complexity, number of features, overfitting, data quality, and how the model generalizes to unseen data are also important to consider.
2] The significance value drops sharply as sample size increases, then levels off close to zero. A lower significance value typically means the slope coefficient is statistically significant, suggesting the explanatory variable has a significant impact on the outcome variable. For the constant coefficient, the significance also drops as sample size increases, but at a slower rate than for the slope coefficient. It levels off approaching zero.
3] The graph shows the significance of the simple linear regression (SLR) model increasing as the sample size increases. Generally speaking, larger sample sizes provide more precise estimates of a model's parameters and higher confidence in its predictions. The trend of increasing significance, which corresponds to decreasing p-values, suggests the SLR model becomes more reliable at making predictions as the sample size increases.
Figure 28: Model’s Significance over Samples in Dataset 3
References
Conceptual reference: DS203 Lecture Notes
Implementation process: E2-process-data.ipynb