
DS-203 : E2 Assignment - Linear Regression Report

Sahil Barbade (210040131)


29th Jan 2024

Abstract
This report presents a comprehensive analysis of three datasets, namely E2-set1.csv, E2-
set2.csv, and E2-set3.csv, with a focus on understanding the interplay between data quality,
sample size, and various metrics in the context of Linear Regression. The study involves pre-
processing the datasets to assess their characteristics, comparing their quality, and subsequently
applying Linear Regression models to investigate the impact of different sample sizes.

Introduction
This report explores three datasets—E2-set1.csv, E2-set2.csv, and E2-set3.csv—focusing on how
data quality, sample size, and various metrics impact Linear Regression. We start by examining
and comparing dataset characteristics through preprocessing.
Moving on, we analyze the effect of different sample sizes (5 to 100) on Linear Regression
metrics. This includes investigating coefficients, p-values, Confidence Intervals, and key metrics
like R2, MSE, and F-Statistic. Our analysis goes beyond surface-level observations, incorporating
a thoughtful exploration grounded in theory.
We also review the provided Python notebook, E2-process-data.ipynb, to understand its flow
and outputs, providing a foundational understanding for subsequent analyses.
Specific datasets, like E2-set1.csv, are scrutinized for peculiar trends, such as the stagnant av-
erage R2 value despite an increase in sample size. Similar attention is given to E2-set3.csv, where
a surprisingly high average R2 value with a small sample size prompts thoughtful conclusions.

In summary, this report not only fulfills exercise requirements but also serves as a practical
guide to navigating the complexities of Linear Regression with different datasets and sample sizes.
The subsequent sections present detailed analyses, critical observations, and practical guidelines
for evaluating and accepting Linear Regression models.

PART - A
In our analysis of the three provided datasets, we aim to delve into their distinctive characteristics.
To achieve a comprehensive understanding, we address the following inquiries:

• Understanding the characteristics of the data.
• Analysis through visualisation of various meaningful plots.
• A statistical study of the distribution of the provided data.

Parameters Used for Data Quality and LR Analysis


Regression Metrics
• R-squared (R2): R-squared measures the proportion of the variance in the dependent
variable (y) that is predictable from the independent variable (x). A higher R-squared
value indicates a better fit of the model to the data. It is a crucial metric for
understanding the goodness of fit.
• Mean Squared Error (MSE): MSE measures the average squared difference between actual
and predicted values. A lower MSE indicates that the model's predictions are closer to
the actual values, representing better accuracy.
• Residual Analysis (QQ Plot, Density Plot, Scatter Plot of Errors): Residual analysis
assesses the distribution and patterns of errors after training the model. A distribution
of errors centered around zero, as seen in QQ plots and density plots, indicates that the
model's predictions are unbiased. Scatter plots of errors help visually identify any
patterns or trends in the residuals.
• Pearson Correlation Coefficient (r): The Pearson correlation coefficient quantifies the
degree of linear correlation between two variables. The value of r ranges from -1 to 1. A
positive value (closer to 1) indicates a positive linear correlation (as one variable
increases, the other tends to increase), while a negative value (closer to -1) indicates a
negative linear correlation (as one variable increases, the other tends to decrease). A
value of 0 indicates no linear correlation. The closer the absolute value of r is to 1,
the stronger the linear relationship; the sign indicates its direction.
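The Pearson coefficient described above can be computed directly from its definition. The snippet below is a minimal NumPy sketch on synthetic data; the variable names and sample values are illustrative, not taken from the assignment datasets.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson r: covariance of x and y over the product of their
    standard deviations (the sample-size factors cancel)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()
    yc = y - y.mean()
    return float((xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc)))

# A perfectly linear relation gives |r| = 1; added noise pulls r toward 0.
x = np.arange(10.0)
print(round(pearson_r(x, 3 * x + 2), 4))   # 1.0
print(round(pearson_r(x, -x), 4))          # -1.0
```

In practice one would use scipy.stats.pearsonr or pandas' DataFrame.corr, which additionally return or support p-values; the hand-rolled version is shown only to make the formula explicit.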

Significance Metrics
• F-Statistic: The F-Statistic tests the overall significance of the regression model. A
higher F-Statistic suggests that the linear regression model is statistically significant,
indicating that at least one independent variable has a nonzero effect on the dependent
variable.
• P-values for Coefficients (Intercept and Slope): P-values indicate the probability that
the observed results (or more extreme ones) could have occurred by chance. Lower p-values
(typically below 0.05) suggest that the corresponding coefficients are statistically
significant. In the context of linear regression, this means the intercept and slope are
likely not zero.
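To make the relationship between these metrics concrete, the following NumPy-only sketch computes the SLR fit, R2, MSE, and the overall F-statistic in closed form. The synthetic data, the function name, and the choice of MSE = SSE/n are assumptions for illustration; the report's own numbers come from the provided notebook.

```python
import numpy as np

def slr_metrics(x, y):
    """Closed-form simple linear regression returning the metrics above."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    slope = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
    intercept = y.mean() - slope * x.mean()
    resid = y - (intercept + slope * x)
    sse = float(np.sum(resid ** 2))
    sst = float(np.sum((y - y.mean()) ** 2))
    r2 = 1 - sse / sst
    return {
        "slope": slope,
        "intercept": intercept,
        "r2": r2,
        "mse": sse / n,                    # average squared error
        "f": r2 / (1 - r2) * (n - 2),      # F-statistic for SLR (1 and n-2 df)
    }

rng = np.random.default_rng(42)
x = rng.uniform(0, 10, 200)
y = 2.4 * x + 1.0 + rng.normal(0, 0.5, 200)
m = slr_metrics(x, y)
print(round(m["slope"], 2), round(m["r2"], 3))
```

Note that for simple linear regression the F-statistic is tied to R2 by F = R2/(1-R2) * (n-2), which is why a near-perfect R2 (as in Dataset 3) produces an enormous F value.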

Dataset and its Quality Analysis


Pearson Correlations

Dataset     Correlation Coefficient
Dataset 1   0.6849
Dataset 2   0.8200
Dataset 3   0.9909

Table 1: Pearson Correlation Coefficients for Different Datasets

• Dataset 1: A positive correlation of 0.6849 suggests a moderate linear relationship
between the independent variable (x) and the dependent variable (y) in Dataset 1. The
moderate correlation indicates that there is a discernible linear trend in the data, but
it is not extremely strong.
• Dataset 2: A strong positive correlation of 0.8200 suggests a more pronounced linear
relationship between x and y in Dataset 2. The stronger correlation indicates a more
reliable linear trend, making this dataset potentially well-suited for linear regression
analysis.
• Dataset 3: A very high positive correlation of 0.9909 indicates an extremely strong
linear relationship between x and y in Dataset 3. The exceptionally high correlation
suggests a nearly perfect linear trend, making this dataset highly suitable for linear
regression analysis.

A higher Pearson correlation coefficient suggests a more predictable and consistent linear
relationship between the variables. For linear regression, a strong correlation is often
desirable, as it implies that the independent variable can explain a significant portion of
the variance in the dependent variable. However, while a high Pearson correlation is an
indicator of a potentially good fit for linear regression, other factors (such as the
statistical significance of the coefficients, the normality of the residuals, etc.) should
also be considered.

Data Distribution and SLR Fitting

Figure 1: Data Distribution of Dataset 1

• Dataset 1: The scatter plot for Dataset 1 illustrates a wide dispersion of data points
around the regression line, indicating considerable variability in the relationship between
the independent and dependent variables. The scattered distribution suggests the possible
presence of heteroscedasticity or subtle non-linear patterns in the data. These deviations
from a perfectly linear relationship in Set 1 warrant further investigation to understand
the underlying dynamics influencing the observed variability.

Figure 2: Data Distribution of Dataset 2

• Dataset 2: In contrast, the scatter plot for Dataset 2 shows a more tightly clustered group
of data points around the regression line, indicating a strong and clear linear relationship
with reduced scatter. This concentrated clustering suggests lower variability and a consistent
linear pattern between the variables. The cohesive nature of the data points in Set 2 indicates
a higher level of predictability, making it an ideal candidate for linear regression modeling.
The reduced scatter implies a stronger correlation between the variables, contributing to a
more dependable predictive model.

Figure 3: Data Distribution of Dataset 3

• Dataset 3: Moving on to Dataset 3, the scatter plot demonstrates an even more concentrated
distribution of data points around the regression line compared to Sets 1 and 2. This suggests
an extremely consistent linear relationship with minimal scatter. The clustered distribution
of data points in Set 3 indicates minimal variability, pointing toward an almost perfect
linear pattern in the relationship between the variables. The precision and tightness of this
relationship in Set 3 make it an intriguing dataset, potentially reflecting a well-defined and
predictable relationship between the variables. The minimal scatter suggests a high level of
homoscedasticity and reinforces the reliability of this linear model.

Error or SLR Residual Analysis


Dataset 1

Figure 4: Residual Distribution of Dataset 1

Dataset 2

Figure 5: Residual Distribution of Dataset 2

Dataset 3

Figure 6: Residual Distribution of Dataset 3

• From the scatter plot, one thing to note is that there is no clear funnel shape, which is
good, because a funnel shape would indicate heteroscedasticity, where the variance of the
residuals changes with the level of prediction.
• In Quantile-Quantile Plot, there are some deviations from the line at the ends of the plot,
which may indicate that the extreme values of the residuals do not perfectly align with the
normal distribution (potentially some light tails).

• Plots indicate that the residuals from the SLR on Dataset 1 are mostly randomly distributed,
centered around zero (mean), generally follow a normal distribution, and do not exhibit clear
signs of heteroscedasticity. There are minor deviations from normality, especially in the tails,
but these might not be significant depending on the context and the specifics of the analysis.
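The plot-based checks above can be backed by simple numerical diagnostics. The sketch below uses synthetic data (the noise level and sample size are assumptions) to verify that residuals center on zero and that their spread stays roughly constant across the range of x, i.e. no funnel shape.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 500)
y = 2.4 * x + 1.0 + rng.normal(0, 0.7, 500)

# Fit the SLR model and compute residuals.
slope, intercept = np.polyfit(x, y, 1)
resid = y - (intercept + slope * x)

# 1) With an intercept in the model, OLS residuals average exactly zero,
#    so any visible offset in the density plot signals a fitting problem.
print(abs(round(resid.mean(), 6)))  # 0.0

# 2) Funnel-shape check: compare residual spread in the lower and upper
#    halves of x; similar spreads suggest homoscedasticity.
lo = resid[x < np.median(x)].std()
hi = resid[x >= np.median(x)].std()
print(round(lo / hi, 2))
```

For the QQ plot itself, statsmodels' qqplot or scipy.stats.probplot would be the usual tools; the ratio check here is only a rough numerical stand-in for eyeballing the scatter plot of errors.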

Metrics & Significance Evaluation

Dataset   MSE        R2         Coeff_0    pvalue_0         Coeff_0_CI_Low   Coeff_0_CI_High   Coeff_1
Data 1    0.548419   0.469074   2.431085   8.227393e-293    2.341188         2.226369          2.383911
Data 2    0.243742   0.672446   2.38739    0.0              2.327459         2.317579          2.422607
Data 3    0.00975    0.981812   2.317478   0.0              2.305492         2.463516          2.484521

Dataset   pvalue_1         Coeff_1_CI_Low   Coeff_1_CI_High   F-Statistic    F-pvalue
Data 1    2.281438e-139    2.520982         2.541453          881.733866     2.281438e-139
Data 2    4.112217e-244    2.447322         2.527635          2048.830373    4.112217e-244
Data 3    0.0              2.329464         2.505527          53872.290354   0.0

Table 2: Regression Model Comparison

Analysis Comparison

• Mean Squared Error (MSE): Data 3 has the lowest MSE (0.00975), indicating that the
model’s predictions are closest to the actual values when compared to Data 1 (0.548419) and
Data 2 (0.243742). This suggests that the model built using Data 3 has the smallest average
squared differences between predicted and actual values.

• R-squared (R2): R2 measures the proportion of the variance in the dependent variable that
is predictable from the independent variable(s). Data 3 has the highest R2 (0.981812),
indicating that it explains a significant portion of the variance in the dependent variable,
followed by Data 2 (0.672446) and Data 1 (0.469074). This suggests that the model based
on Data 3 provides the best fit to the data.

• F-Statistic: The F-Statistic tests the overall significance of the regression model. Data 3
has an exceptionally high F-Statistic (53872.290354), indicating a strong overall fit of the
model, surpassing both Data 1 (881.733866) and Data 2 (2048.830373). This suggests that
the overall regression model based on Data 3 is highly significant.
• Coefficients and P-values: All datasets have highly statistically significant coefficients,
as indicated by the very low p-values. However, Data 3 stands out with p-values of 0.0 for
both coefficients, indicating an extremely high level of significance.
• Confidence Intervals: The confidence intervals for coefficients in all datasets are relatively
narrow, suggesting a high level of precision in the estimates. This indicates a high level of
confidence in the range of values within which the true population parameters are likely to
fall.
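To make the confidence-interval point concrete, here is a sketch of the textbook CI for the slope of an SLR model, using the normal approximation (z = 1.96) since n is large; the data are synthetic and the true slope of 2.4 is an assumption for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000
x = rng.uniform(0, 10, n)
y = 2.4 * x + 1.0 + rng.normal(0, 0.5, n)

slope, intercept = np.polyfit(x, y, 1)
resid = y - (intercept + slope * x)

s2 = np.sum(resid ** 2) / (n - 2)        # unbiased residual variance
sxx = np.sum((x - x.mean()) ** 2)
se_slope = np.sqrt(s2 / sxx)             # standard error of the slope

# ~95% CI; z = 1.96 is adequate for large n (for small samples,
# replace it with the t critical value at n-2 degrees of freedom).
ci = (slope - 1.96 * se_slope, slope + 1.96 * se_slope)
print(round(ci[0], 3), round(ci[1], 3))
```

The interval narrows as n grows (se_slope shrinks like 1/sqrt(n)), which is exactly the behavior discussed for the sample-size experiments in Part B.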

Conclusion
• In conclusion, the comparison of regression models across three datasets reveals distinct
performance differences. Data 3 stands out as the top-performing dataset, with the lowest
mean squared error (MSE) and the highest coefficient of determination (R2), indicating the
best fit for the regression model. Additionally, its F-Statistic and p-values for coefficients are
substantially higher, signifying a high level of statistical significance.
• Data 2 also demonstrates strong performance, particularly in terms of R2 and F-Statistic,
suggesting a good fit for the regression model and high statistical significance. However,
Data 1 lags behind in these metrics, indicating a relatively weaker fit and lower statistical
significance.

• Furthermore, all datasets exhibit narrow confidence intervals for coefficients, indicating a
high level of precision in the estimates and a high degree of confidence in the range of values
within which the true population parameters are likely to fall.

Notebook Overview - E2-process-data.ipynb


• The analysis begins by importing the dataset from the CSV file "E2-set2.csv" using the
Pandas library for data manipulation and analysis. A scatter plot of the dataset is then
generated using the Matplotlib library, offering an initial visual representation of the data
distribution.
• To ensure reproducibility in the random sampling process, the NumPy seed is set to 42.
Subsequently, the code fits 10 simple linear regression (SLR) models to random samples of
the dataset. For each model, a range of metrics such as Mean Squared Error (MSE), R2
value, coefficients, p-values, confidence intervals, F-Statistic, and F-pvalue are computed.
• The predicted values from each SLR model are plotted on the same graph to visualize the
variability across different samples. The results of each SLR model are then stored in a
Pandas DataFrame named "results df" for further analysis.
• Finally, a summary table is generated to display the sample number, MSE, R2, coefficients,
p-values, F-Statistic, and F-pvalue for each SLR model.
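Based on this description, the notebook's core loop might look roughly like the sketch below. The synthetic data stand in for E2-set2.csv (which is not reproduced here), and the sample size of 100 and the dictionary-based results table are assumptions, not the notebook's exact code.

```python
import numpy as np

rng = np.random.default_rng(42)   # seed 42 for reproducibility, as in the notebook

# Synthetic stand-in for pd.read_csv("E2-set2.csv").
x_all = rng.uniform(0, 10, 1000)
y_all = 2.4 * x_all + 1.0 + rng.normal(0, 0.5, 1000)

results = []
for sample_id in range(10):                    # 10 SLR models on random samples
    idx = rng.choice(len(x_all), size=100, replace=False)
    x, y = x_all[idx], y_all[idx]
    slope, intercept = np.polyfit(x, y, 1)
    resid = y - (intercept + slope * x)
    results.append({
        "sample": sample_id,
        "mse": float(np.mean(resid ** 2)),
        "r2": float(1 - np.sum(resid ** 2) / np.sum((y - y.mean()) ** 2)),
        "slope": float(slope),
        "intercept": float(intercept),
    })

# Summary table analogous to the notebook's results_df.
for row in results:
    print(row["sample"], round(row["mse"], 3), round(row["r2"], 3))
```

In the actual notebook the list of dicts is wrapped in a Pandas DataFrame, and statsmodels would supply the p-values, confidence intervals, and F-statistic omitted from this sketch.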

PART - B : Sample-Wise Analysis


1 Abstract
This study investigates the influence of sample size on the performance of linear regression models
using three distinct datasets. By systematically manipulating the sample size and analyzing the
resulting model statistics and metrics, we uncover valuable insights into the impact of sample
size on model performance. Our findings reveal nuanced relationships between sample size, model
accuracy, and statistical significance, providing a deeper understanding of the factors influencing
regression model quality.

2 Sample-Wise Summary
Dataset 1

Figure 7: Dataset 1 : Sample Size 5

Figure 8: Dataset 1 : Sample Size 10

Figure 9: Dataset 1 : Sample Size 20

Figure 10: Dataset 1 : Sample Size 50

Figure 11: Dataset 1 : Sample Size 100

Dataset 2

Figure 12: Dataset 2 : Sample Size 5

Figure 13: Dataset 2 : Sample Size 10

Figure 14: Dataset 2 : Sample Size 20

Figure 15: Dataset 2 : Sample Size 50

Figure 16: Dataset 2 : Sample Size 100

Dataset 3

Figure 17: Dataset 3 : Sample Size 5

Figure 18: Dataset 3 : Sample Size 10

Figure 19: Dataset 3 : Sample Size 20

Figure 20: Dataset 3 : Sample Size 50

Figure 21: Dataset 3 : Sample Size 100

3 Significant findings
3.1 Dataset 1
1] The fact that the average R2 value does not improve even when the sample size is
increased from 10 to 100 could be due to several factors:
• The underlying relationship between the independent and dependent variable is weak or
non-linear, and the model may be underfitted to the data.
• The additional data does not provide new information that helps better explain the variation
in the dependent variable.
• There is high variability in the dataset that is not explained by the model, possibly due to
high levels of noise in the data.
• The model may not be the best option for the data, suggesting a different type of model or
additional explanatory variables may be needed.

Since the R2 value remains low even with a larger sample size, it may indicate that the models
do not adequately capture the underlying relationship, making them less reliable for predictions
or understanding the data.
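This plateau effect can be reproduced with a small simulation: with noisy data, the average R2 levels off at a value set by the signal-to-noise ratio rather than by n, while near-noiseless data (as for Dataset 3 later) keeps R2 near 1 even for tiny samples. The noise levels and slope below are illustrative assumptions, not the actual datasets.

```python
import numpy as np

rng = np.random.default_rng(0)

def avg_r2(noise_sd, n, trials=200):
    """Average R^2 of SLR fits over repeated random samples of size n."""
    r2s = []
    for _ in range(trials):
        x = rng.uniform(0, 10, n)
        y = 2.4 * x + rng.normal(0, noise_sd, n)
        slope, intercept = np.polyfit(x, y, 1)
        resid = y - (intercept + slope * x)
        r2s.append(1 - np.sum(resid ** 2) / np.sum((y - y.mean()) ** 2))
    return float(np.mean(r2s))

# High noise (Dataset-1-like): R^2 plateaus well below 1 no matter how
# large n gets. Low noise (Dataset-3-like): R^2 stays near 1 even at n=5.
for n in (5, 10, 50, 100):
    print(n, round(avg_r2(3.0, n), 2), round(avg_r2(0.3, n), 2))
```

The simulation makes the report's point directly: collecting more data stabilizes the R2 estimate but cannot raise its ceiling, because that ceiling is fixed by the unexplained variance in the data-generating process.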

Figure 22: Metrics Variation over Sample Size

2] From the variation of the average metrics (R2, MSE, p-value, etc.) over sample size, we
can draw the following insights:

• The mean squared error (MSE) decreases with larger sample sizes, indicating the model’s
predictions become more accurate on average.
• The R-squared (R2) value increases with more data, meaning a higher proportion of the
variability in the target variable is explained by the model.
• In terms of usability, the trends suggest the models will perform better - providing more
reliable and accurate predictions - when trained on a larger sample of data.
• Whether to use the models in a real-world application would depend on factors like
acceptable error thresholds, the project goals, and the cost of data collection versus the
gain in precision. If performance continues improving and meets requirements, a model
trained on the largest available sample size would likely be the best choice.

Figure 23: Coefficient Significance over Samples

3] In terms of the usability of the models, the decreasing trend in the significance values indicates
the model parameters become more meaningful as the sample size increases. This suggests the
model reliability and predictive ability may improve with larger datasets. If the significance drops
below common thresholds like 0.05, it would be reasonable to consider using the model.

Figure 24: Model’s Significance over Samples

4] As the sample size increases, the significance of the simple linear regression (SLR)
model also increases, showing a positive correlation. This indicates that the model becomes
more statistically significant as the sample size grows. Generally speaking, a higher
F-statistic value suggests that the variation explained by the model is unlikely to be due
to chance.

Figure 25: Confidence Interval Variation over Samples

5] As the sample size increases, the confidence intervals for both the constant and slope decrease,
indicating more precise estimates. The blue line for constant significance shows a steep initial
decline, then levels off, suggesting marginal gains from further increasing sample size. The orange
dotted line for slope significance follows a similar pattern but with generally lower values.

3.2 Dataset 3
1] The average R2 value (averaged across samples for a given sample size) remains
essentially unchanged even when the sample size is increased from 10 to 100, for these
possible reasons:

• The MSE appears to be constant across all sample sizes with a value slightly above 0. On
the other hand, the R2 is also constant but with a value slightly below 1.0, indicating an
almost perfect or near perfect fit.
• A low MSE close to 0, as shown in the graph, indicates that the model's predictions are
very close to the actual data points, which is desirable, although it may also signal
overfitting. An R2 value close to 1 indicates a model that explains almost all of the
variability of the response data around its mean. A high R2 value is generally considered
indicative of a good fit.
• Given that these metrics show consistent and near-ideal performance across all sample
sizes, the conclusions that can be drawn about the usefulness of the models based on this
dataset are the following: the model appears to be highly accurate, as indicated by the low
MSE, and has excellent predictive power, as indicated by the high R2. Therefore, based on
this dataset and the provided metrics, it seems justified to use this model.
• However, it is important to note that this is an idealized scenario which is unlikely in real-
world data. This leads to some skepticism as models rarely perform perfectly across different
sample sizes unless the data is very simple or the model is overfit. In practice, other factors
like data complexity, number of features, overfitting, data quality and how the model gener-
alizes to unseen data are also important to consider.

Figure 26: Metrics Variation over Samples in Dataset 3

2] The significance value drops sharply as the sample size increases, then levels off close
to zero. A lower significance value typically means the slope coefficient is statistically
significant, suggesting the explanatory variable has a significant impact on the outcome
variable. For the constant coefficient, the significance also drops as the sample size
increases, but at a slower rate than for the slope coefficient; it levels off approaching
zero.

Figure 27: Parameter Significance over Samples in Dataset 3

3] The graph shows the significance of the simple linear regression (SLR) model increasing
as the sample size increases. Generally speaking, larger sample sizes provide more precise
estimates of a model's parameters and higher confidence in its predictions. The trend of
increasing significance, which correlates with decreasing p-values, suggests the SLR model
becomes more reliable at making predictions as the sample size increases.

Figure 28: Model’s Significance over Samples in Dataset 3

References
Conceptual reference: DS203 Lecture Notes
Implementation process: E2-process-data.ipynb
