DS-203: E2 Assignment - Linear Regression Report: Sahil Barbade (210040131) 29th Jan 2024
Abstract
This report presents a comprehensive analysis of three datasets, namely E2-set1.csv, E2-
set2.csv, and E2-set3.csv, with a focus on understanding the interplay between data quality,
sample size, and various metrics in the context of Linear Regression. The study involves pre-
processing the datasets to assess their characteristics, comparing their quality, and subsequently
applying Linear Regression models to investigate the impact of different sample sizes.
Introduction
This report explores three datasets—E2-set1.csv, E2-set2.csv, and E2-set3.csv—focusing on how
data quality, sample size, and various metrics impact Linear Regression. We start by examining
and comparing dataset characteristics through preprocessing.
Moving on, we analyze the effect of different sample sizes (5 to 100) on Linear Regression
metrics. This includes investigating coefficients, p-values, Confidence Intervals, and key metrics
like R2, MSE, and F-Statistic. Our analysis goes beyond surface-level observations, incorporating
a thoughtful exploration grounded in theory.
We also review the provided Python notebook, E2-process-data.ipynb, to understand its flow
and outputs, providing a foundational understanding for subsequent analyses.
Specific datasets, like E2-set1.csv, are scrutinized for peculiar trends, such as the stagnant average R2 value despite an increase in sample size. Similar attention is given to E2-set3.csv, where
a surprisingly high average R2 value with a small sample size prompts thoughtful conclusions.
In summary, this report not only fulfills exercise requirements but also serves as a practical
guide to navigating the complexities of Linear Regression with different datasets and sample sizes.
The subsequent sections present detailed analyses, critical observations, and practical guidelines
for evaluating and accepting Linear Regression models.
PART - A
In our analysis of the three provided datasets, we aim to delve into their distinctive characteristics.
To achieve a comprehensive understanding, we address the following inquiries:
• Mean Squared Error (MSE): MSE measures the average squared difference between actual and predicted values. A lower MSE indicates that the model's predictions are closer to the actual values, representing better accuracy.
• Residual Analysis (QQ Plot, Density Plot, Scatter Plot of Errors): Residual analysis assesses the distribution and patterns of errors after training the model. A distribution of errors centered around zero, as seen in QQ plots and density plots, indicates that the model's predictions are unbiased. Scatter plots of errors help visually identify any patterns or trends in the residuals.
• Pearson Correlation Coefficient (r): The Pearson correlation coefficient quantifies the degree of linear correlation between two variables. The value of r ranges from -1 to 1. A positive value (closer to 1) indicates a positive linear correlation (as one variable increases, the other tends to increase). A negative value (closer to -1) indicates a negative linear correlation (as one variable increases, the other tends to decrease). A value of 0 indicates no linear correlation. The closer the absolute value of r is to 1, the stronger the linear relationship; the sign indicates the direction of the relationship.
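As an illustrative sketch of the two metrics above, using hypothetical values rather than the actual dataset files, MSE and Pearson's r can be computed directly with NumPy:

```python
import numpy as np

# Hypothetical paired observations and model predictions (illustrative only)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
y_pred = 2.0 * x  # predictions from a hypothetical fitted line y = 2x

# MSE: mean of squared residuals
mse = np.mean((y - y_pred) ** 2)

# Pearson r: off-diagonal entry of the 2x2 correlation matrix
r = np.corrcoef(x, y)[0, 1]

print(mse, r)  # small MSE, r close to 1 (strong positive linear relationship)
```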
Significance Metrics
• F-Statistic: The F-Statistic tests the overall significance of the regression model. A higher
F-Statistic suggests that the linear regression model is statistically significant, indicating that
at least one independent variable has a nonzero effect on the dependent variable.
• P-values for Coefficients (Intercept and Slope): P-values indicate the probability that the observed results (or more extreme ones) could have occurred by chance. Lower p-values (typically below 0.05) suggest that the corresponding coefficients are statistically significant. In the context of linear regression, this means the intercept and slope are likely not zero.
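A minimal sketch of obtaining the slope's p-value and the overall F-statistic with SciPy, on synthetic data (not the assignment's CSV files). For simple linear regression with a single predictor, the overall F-statistic equals the squared t-statistic of the slope, which is what the snippet exploits:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = np.linspace(0, 10, 50)
y = 3.0 * x + 2.0 + rng.normal(0, 1.0, size=x.size)  # synthetic linear data

res = stats.linregress(x, y)
print(res.pvalue)  # p-value for the slope (H0: slope = 0)

# SLR identity: F = t^2 = (slope / stderr)^2
f_stat = (res.slope / res.stderr) ** 2
print(f_stat)  # large F => variation explained is unlikely to be chance
```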
A higher Pearson correlation coefficient suggests a more predictable and consistent linear relationship between the variables. For linear regression, a strong correlation is often desirable, as it implies that the independent variable can explain a significant portion of the variance in the dependent variable. However, while a high Pearson correlation indicates a potentially good fit for linear regression, other factors (such as the statistical significance of coefficients, normality of residuals, etc.) should also be considered.
Data Distribution and SLR Fitting
• Dataset 1: The scatter plot for Dataset 1 illustrates a wide dispersion of data points around the regression line, indicating considerable variability in the relationship between the independent and dependent variables. The scattered distribution suggests the possible presence of heteroscedasticity or subtle non-linear patterns in the data. These deviations from a perfectly linear relationship in Set 1 warrant further investigation to comprehend the underlying dynamics influencing the observed variability.
• Dataset 2: In contrast, the scatter plot for Dataset 2 shows a more tightly clustered group
of data points around the regression line, indicating a strong and clear linear relationship
with reduced scatter. This concentrated clustering suggests lower variability and a consistent
linear pattern between the variables. The cohesive nature of the data points in Set 2 indicates
a higher level of predictability, making it an ideal candidate for linear regression modeling.
The reduced scatter implies a stronger correlation between the variables, contributing to a
more dependable predictive model.
• Dataset 3: Moving on to Dataset 3, the scatter plot demonstrates an even more concentrated
distribution of data points around the regression line compared to Sets 1 and 2. This suggests
an extremely consistent linear relationship with minimal scatter. The clustered distribution
of data points in Set 3 indicates minimal variability, pointing toward an almost perfect
linear pattern in the relationship between the variables. The precision and tightness of this
relationship in Set 3 make it an intriguing dataset, potentially reflecting a well-defined and
predictable relationship between the variables. The minimal scatter suggests a high level of
homoscedasticity and reinforces the reliability of this linear model.
Figures: scatter plots with fitted regression lines for Dataset 2 and Dataset 3.
• From the scatter plot of errors, one thing to note is that there is no clear funnel shape, which is good because a funnel shape would indicate heteroscedasticity, where the variance of the residuals changes with the level of the prediction.
• In Quantile-Quantile Plot, there are some deviations from the line at the ends of the plot,
which may indicate that the extreme values of the residuals do not perfectly align with the
normal distribution (potentially some light tails).
• Plots indicate that the residuals from the SLR on Dataset 1 are mostly randomly distributed,
centered around zero (mean), generally follow a normal distribution, and do not exhibit clear
signs of heteroscedasticity. There are minor deviations from normality, especially in the tails,
but these might not be significant depending on the context and the specifics of the analysis.
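As a sketch of how such residual diagnostics could be computed, using synthetic data (the actual CSV files are not reproduced here) and SciPy's probplot, whose straight-line fit to the QQ points gives a quick numeric normality check:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 100)
y = 2.0 * x + 1.0 + rng.normal(0, 0.5, size=x.size)  # synthetic linear data

# Fit SLR and compute residuals
slope, intercept, *_ = stats.linregress(x, y)
residuals = y - (intercept + slope * x)

# Residuals from least squares are centered at zero by construction
print(residuals.mean())

# probplot returns ordered residuals vs. theoretical normal quantiles,
# plus the r of a straight-line fit to the QQ points (near 1 => near-normal)
(osm, osr), (qq_slope, qq_intercept, qq_r) = stats.probplot(residuals)
print(qq_r)
```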
Analysis Comparison
• Mean Squared Error (MSE): Data 3 has the lowest MSE (0.00975), indicating that the
model’s predictions are closest to the actual values when compared to Data 1 (0.548419) and
Data 2 (0.243742). This suggests that the model built using Data 3 has the smallest average
squared differences between predicted and actual values.
• R-squared (R2): R2 measures the proportion of the variance in the dependent variable that
is predictable from the independent variable(s). Data 3 has the highest R2 (0.981812),
indicating that it explains a significant portion of the variance in the dependent variable,
followed by Data 2 (0.672446) and Data 1 (0.469074). This suggests that the model based
on Data 3 provides the best fit to the data.
• F-Statistic: The F-Statistic tests the overall significance of the regression model. Data 3
has an exceptionally high F-Statistic (53872.290354), indicating a strong overall fit of the
model, surpassing both Data 1 (881.733866) and Data 2 (2048.830373). This suggests that
the overall regression model based on Data 3 is highly significant.
• Coefficients and P-values: All datasets have highly statistically significant coefficients, as indicated by the very low p-values. However, Data 3 stands out with p-values of 0.0 for both coefficients, indicating an extremely high level of significance.
• Confidence Intervals: The confidence intervals for coefficients in all datasets are relatively
narrow, suggesting a high level of precision in the estimates. This indicates a high level of
confidence in the range of values within which the true population parameters are likely to
fall.
Conclusion
• In conclusion, the comparison of regression models across three datasets reveals distinct
performance differences. Data 3 stands out as the top-performing dataset, with the lowest
mean squared error (MSE) and the highest coefficient of determination (R2), indicating the
best fit for the regression model. Additionally, its F-Statistic is substantially higher and its coefficient p-values are effectively zero, signifying a high level of statistical significance.
• Data 2 also demonstrates strong performance, particularly in terms of R2 and F-Statistic,
suggesting a good fit for the regression model and high statistical significance. However,
Data 1 lags behind in these metrics, indicating a relatively weaker fit and lower statistical
significance.
• Furthermore, all datasets exhibit narrow confidence intervals for coefficients, indicating a
high level of precision in the estimates and a high degree of confidence in the range of values
within which the true population parameters are likely to fall.
2 Sample-Wise Summary
Dataset 1
Figure 8: Dataset 1 : Sample Size 10
Figure 10: Dataset 1 : Sample Size 50
Dataset 2
Figure 14: Dataset 2 : Sample Size 20
Figure 16: Dataset 2 : Sample Size 100
Dataset 3
Figure 19: Dataset 3 : Sample Size 20
Figure 21: Dataset 3 : Sample Size 100
3 Significant Findings
3.1 Dataset 1
1] The fact that the average R2 value does not improve even when increasing the sample size from 10 to 100 could be due to several factors:
• The underlying relationship between the independent and dependent variable is weak or
non-linear, and the model may be underfitted to the data.
• The additional data does not provide new information that helps better explain the variation
in the dependent variable.
• There is high variability in the dataset that is not explained by the model, possibly due to
high levels of noise in the data.
• The model may not be the best option for the data, suggesting a different type of model or
additional explanatory variables may be needed.
Since the R2 value remains low even with a larger sample size, it may indicate that the models
do not adequately capture the underlying relationship, making them less reliable for predictions
or understanding the data.
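The sampling experiment behind this observation can be sketched as follows. The snippet uses a synthetic noisy population as a stand-in for Dataset 1 (the real data files are not reproduced here): because the noise, not the sample size, limits how much variance the line can explain, the average R2 stays roughly flat as n grows.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Synthetic population: weak linear signal drowned in large noise
N = 5000
x_pop = rng.uniform(0, 10, N)
y_pop = 0.5 * x_pop + rng.normal(0, 2.0, N)

def avg_r2(sample_size, n_repeats=200):
    """Average R^2 over repeated random samples of a given size."""
    r2s = []
    for _ in range(n_repeats):
        idx = rng.choice(N, size=sample_size, replace=False)
        res = stats.linregress(x_pop[idx], y_pop[idx])
        r2s.append(res.rvalue ** 2)
    return float(np.mean(r2s))

# Average R^2 barely moves with sample size when noise dominates
for n in (10, 30, 100):
    print(n, round(avg_r2(n), 3))
```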
Figure 22: Metrics Variation over Sample Size
2] From the variation of average metrics (R2, MSE, p-value, etc.) with sample size, we can draw these insights:
• The mean squared error (MSE) decreases with larger sample sizes, indicating the model’s
predictions become more accurate on average.
• The R-squared (R2) value increases with more data, meaning a higher proportion of the
variability in the target variable is explained by the model.
• In terms of usability, the trends suggest the models will perform better - providing more
reliable and accurate predictions - when trained on a larger sample of data.
• Whether to use the models in a real-world application would depend on factors like acceptable error thresholds, the project goals, and the costs of data collection versus the gains in precision. If performance continues improving and meets requirements, a model trained on the largest available sample size would likely be the best choice.
3] In terms of the usability of the models, the decreasing trend in the significance values indicates
the model parameters become more meaningful as the sample size increases. This suggests the
model reliability and predictive ability may improve with larger datasets. If the significance drops
below common thresholds like 0.05, it would be reasonable to consider using the model.
Figure 24: Model’s Significance over Samples
4] As the sample size increases, the F-statistic of the simple linear regression (SLR) model also increases, showing a positive correlation. This indicates that the model is becoming more statistically significant as the sample size gets larger. Generally speaking, a higher F-statistic value suggests the variation explained by the model is likely not due to chance.
5] As the sample size increases, the confidence intervals for both the constant and the slope narrow, indicating more precise estimates. The blue line for the constant shows a steep initial decline and then levels off, suggesting marginal gains from further increasing the sample size. The orange dotted line for the slope follows a similar pattern but with generally lower values.
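The narrowing of the slope's confidence interval with sample size can be sketched on synthetic data (a hypothetical linear process, not the assignment's datasets), using the slope standard error from SciPy and the t critical value:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

def slope_ci_width(n, alpha=0.05):
    """Half-width of the 95% CI for the slope on a random sample of size n."""
    x = rng.uniform(0, 10, n)
    y = 1.5 * x + rng.normal(0, 1.0, n)  # hypothetical linear process
    res = stats.linregress(x, y)
    t_crit = stats.t.ppf(1 - alpha / 2, df=n - 2)  # t critical value, df = n - 2
    return t_crit * res.stderr

w_small = slope_ci_width(10)
w_large = slope_ci_width(100)
print(w_small, w_large)  # the interval narrows as n grows
```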
3.2 Dataset 3
1] The average R2 value (averaged across samples for a given sample size) does not improve even when the sample size is increased from 10 to 100, for these possible reasons:
• The MSE appears to be constant across all sample sizes with a value slightly above 0. On
the other hand, the R2 is also constant but with a value slightly below 1.0, indicating an
almost perfect or near perfect fit.
• A low MSE close to 0, as shown in the graph, indicates that the model's predictions are very close to the actual data points, which is desirable, although it may also signal overfitting. An R2 value close to 1 indicates a model that explains almost all the variability of the response data around its mean. A high R2 value is generally considered indicative of a good fit.
• Given that these metrics show consistent and ideal performance across all sample sizes, the
conclusions that can be drawn about the usefulness of the models based on this dataset are:
• The model appears to be highly accurate, as indicated by the low MSE. The model has excellent predictive power, as indicated by the high R2. Therefore, based on this dataset and the provided metrics, it seems justified to use this model.
• However, it is important to note that this is an idealized scenario which is unlikely in real-
world data. This leads to some skepticism as models rarely perform perfectly across different
sample sizes unless the data is very simple or the model is overfit. In practice, other factors like data complexity, number of features, overfitting, data quality, and how the model generalizes to unseen data are also important to consider.
2] The significance value drops sharply as sample size increases, then levels off close to zero. A lower significance value typically means the slope coefficient is statistically significant, suggesting the explanatory variable has a significant impact on the outcome variable. For the constant coefficient, the significance also drops as sample size increases, but at a slower rate than for the slope coefficient. It levels off approaching zero.
3] The graph shows the significance of the simple linear regression (SLR) model increasing as the sample size increases. Generally speaking, larger sample sizes provide more precise estimates of a model's parameters and higher confidence in its predictions. The trend of increasing significance, which corresponds to decreasing p-values, suggests the SLR model becomes more reliable at making predictions as the sample size increases.
Figure 28: Model’s Significance over Samples in Dataset 3
References
Conceptual reference: DS203 Lecture Notes
Implementation process: E2-process-data.ipynb