DADM NOTES and Cheat Sheet
Notes:
The bar graph is the graphical representation of categorical data.
A histogram is the graphical representation of quantitative data.
Categorical Data (qualitative):
Covariance:
If x and y move in the same direction: positive covariance
If x and y move in different directions: negative covariance
Covariance does not tell us about the strength of the linear association
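A quick sketch of the covariance points above, using made-up numbers (the variable names hours, score and errors are only for illustration). It shows the sign rule, and why the size of the covariance says nothing about strength: changing the units of x changes the covariance even though the relationship is the same.

```python
def covariance(xs, ys):
    """Sample covariance: sum of products of deviations, divided by n - 1."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

# Made-up illustrative data
hours = [1, 2, 3, 4, 5]
score = [52, 55, 61, 64, 70]   # rises as hours rise
errors = [9, 8, 6, 5, 2]       # falls as hours rise

cov_pos = covariance(hours, score)    # same direction -> positive
cov_neg = covariance(hours, errors)   # different directions -> negative

# Same data, but x measured in minutes instead of hours:
# the covariance is 60 times larger, the relationship is unchanged.
minutes = [h * 60 for h in hours]
cov_minutes = covariance(minutes, score)

print(cov_pos, cov_neg, cov_minutes)
```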
Correlation:
r can be positive or negative
-1 ≤ r ≤ 1
The closer r is to the extremes (-1 or 1), the stronger the LINEAR relationship is
If r is 0 or close to 0, there is little or no linear relationship (a nonlinear relationship may still exist)
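A small sketch of the correlation properties above, assuming the usual Pearson formula (covariance divided by the product of the standard deviations). The data are made up. Unlike covariance, r does not change when a variable is rescaled.

```python
import math

def correlation(xs, ys):
    """Pearson r = cov(x, y) / (sd(x) * sd(y)); the (n - 1) factors cancel out."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

x = [1, 2, 3, 4, 5]
r_up = correlation(x, [2, 4, 6, 8, 10])      # points on a rising line
r_down = correlation(x, [10, 8, 6, 4, 2])    # points on a falling line
r_flat = correlation(x, [3, 1, 4, 1, 3])     # no linear trend in these points

# Rescaling x (e.g. hours -> minutes) leaves r unchanged
r_rescaled = correlation([v * 60 for v in x], [2, 4, 6, 8, 10])

print(r_up, r_down, r_flat, r_rescaled)
```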
Drawbacks:
Correlation is very sensitive to outliers
A single outlier can change the value of r substantially
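To see the outlier drawback in numbers, here is a sketch with made-up data: six points on a perfect line, then the same points with only the last y value replaced by an extreme one.

```python
import math

def correlation(xs, ys):
    """Pearson correlation coefficient r."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

x = [1, 2, 3, 4, 5, 6]
y_clean = [2, 4, 6, 8, 10, 12]       # perfect rising line
y_outlier = [2, 4, 6, 8, 10, -30]    # same data, last point replaced by an outlier

r_clean = correlation(x, y_clean)
r_with_outlier = correlation(x, y_outlier)

# One changed point is enough to flip the sign of r here
print(r_clean, r_with_outlier)
```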
Causation:
Causal relationship - a change in one variable causes a change in the other variable
Correlation and covariance do not imply causation
In spurious relations, two variables are wrongly assumed to be related to each other
In spurious relations, there is typically a 3rd lurking variable that drives both variables.
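A simulated sketch of a lurking variable, under made-up assumptions: z drives both x and y, and x and y have no direct link to each other. The names x, y, z and the noise levels are hypothetical. x and y still come out strongly correlated, and the correlation disappears once the part driven by z is removed.

```python
import math
import random

def correlation(xs, ys):
    """Pearson correlation coefficient r."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

rng = random.Random(0)   # fixed seed so the run is repeatable
n = 5000

z = [rng.gauss(0, 1) for _ in range(n)]       # lurking variable
x = [zi + rng.gauss(0, 0.5) for zi in z]      # x depends only on z plus noise
y = [zi + rng.gauss(0, 0.5) for zi in z]      # y depends only on z plus noise

r_xy = correlation(x, y)                      # strong, but not causal

# Take out the part of each variable that z explains
x_rest = [xi - zi for xi, zi in zip(x, z)]
y_rest = [yi - zi for yi, zi in zip(y, z)]
r_after = correlation(x_rest, y_rest)         # close to 0

print(round(r_xy, 2), round(r_after, 2))
```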
2. Multiple linear regression = linear regression with more than one predictor
(quantitative or categorical)
3. Logistic regression = nonlinear regression with a categorical dependent variable
R Square:
The coefficient of determination
0 ≤ r² ≤ +1
r² shows how much of the variability in Y is explained by variability in X (i.e., how much is
explained by the linear model). In other words, r² shows how good the linear model is
The closer to +1, the better the model
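A sketch of how r² can be computed for a simple linear regression, assuming the usual definition r² = 1 - SS_residual / SS_total with a least-squares line. The data are made up.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def r_squared(xs, ys):
    """Share of the variability in y explained by the fitted line."""
    slope, intercept = fit_line(xs, ys)
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

x = [1, 2, 3, 4, 5]
r2_good = r_squared(x, [2.1, 3.9, 6.2, 7.8, 10.1])   # nearly a straight line
r2_poor = r_squared(x, [3, 1, 4, 1, 3])              # no linear trend

print(r2_good, r2_poor)
```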
Multiple Regression:
In multiple regression, R-squared is always ≥ adjusted R-squared.
R-squared never goes down when you add a predictor, even a bad one. However, adjusted R-squared
may go down if you add a bad predictor
Adjusted R-squared penalizes the number of predictors, so it never exceeds R-squared
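A sketch of the adjusted R-squared penalty, assuming the standard textbook formula 1 - (1 - R²)(n - 1)/(n - p - 1), where n is the number of observations and p the number of predictors (the formula is not taken from these notes). The R-squared values below are made-up numbers, not fitted results.

```python
def adjusted_r_squared(r2, n, p):
    """Adjusted R-squared for n observations and p predictors."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

n = 30  # hypothetical sample size

# Model with 2 predictors
adj_before = adjusted_r_squared(0.800, n, 2)

# Add a bad third predictor: R-squared barely moves, adjusted R-squared drops
adj_bad = adjusted_r_squared(0.801, n, 3)

# Add a useful third predictor instead: both go up
adj_good = adjusted_r_squared(0.900, n, 3)

print(adj_before, adj_bad, adj_good)
```

In every case the adjusted value stays below the plain R-squared it was computed from.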
Logistic regression:
Logistic regression is used to model situations when the dependent variable (Y) is categorical
and may take only two values.
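A minimal sketch of logistic regression with one predictor and a 0/1 outcome, fitted here by plain gradient descent (one possible fitting method; statistical software typically uses other algorithms). The data, variable names, learning rate and step count are all made up for illustration.

```python
import math

def sigmoid(t):
    """Logistic function: maps any number to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-t))

def fit_logistic(xs, ys, lr=0.1, steps=5000):
    """Fit P(y = 1) = sigmoid(b0 + b1 * x) by gradient descent on the log loss."""
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(b0 + b1 * x) - y   # predicted probability minus actual 0/1
            g0 += err
            g1 += err * x
        b0 -= lr * g0 / n
        b1 -= lr * g1 / n
    return b0, b1

# Made-up data: hours studied and pass (1) / fail (0)
hours = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0]
passed = [0, 0, 0, 1, 0, 1, 0, 1, 1, 1]

b0, b1 = fit_logistic(hours, passed)
p_low = sigmoid(b0 + b1 * 1.0)    # estimated pass probability after 1 hour
p_high = sigmoid(b0 + b1 * 5.0)   # estimated pass probability after 5 hours

print(b0, b1, p_low, p_high)
```

The output of the model is a probability, which is then turned into one of the two categories (for example, predict 1 when the probability is above 0.5).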
Kahoot:
1. Pivot tables can be used for categorical and quantitative variables: (2 answers)
Ans: TRUE, IT DEPENDS
Ans: False
Ans: True
Ans: False
Ans: True
9. Correlation = 0 means independence
Ans: False
10. When correlation = 0, it always means that X and Y are not related to each other
Ans: False
Ans: True
13. Cheese consumption is positively correlated with the number of deaths from becoming
tangled in bedsheets because?
15. Multiple linear regression means there are several linear regression equations
Ans: False
17. In a multiple regression, when we add a new X variable, R-squared ALWAYS goes up.
Ans: True
18. In a multiple regression, when we add a new X variable, R-squared adjusted ALWAYS
goes up.
Ans: False
Ans: False
Ans: True
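Questions 9 and 10 can be checked with a small made-up example: y is completely determined by x (y = x squared), yet the correlation is 0 because the relationship is not linear.

```python
import math

def correlation(xs, ys):
    """Pearson correlation coefficient r."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

x = [-3, -2, -1, 0, 1, 2, 3]
y = [v ** 2 for v in x]     # y depends entirely on x, but not linearly

r = correlation(x, y)       # 0: no LINEAR relationship, yet x and y are clearly related
print(r)
```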
IMP Formulas: