Encounted AI - Interview Questions

1. The document contains questions related to machine learning concepts, algorithms, and applications. 2. It includes questions about handling issues like missing values, outliers, imbalanced data and more. 3. It also asks about specific machine learning algorithms like decision trees, their working mechanisms, advantages and limitations.

Uploaded by

MohitKhemka
Copyright
© All Rights Reserved

Question
Linear
What is word embedding
Difference between TF-IDF & word embedding
What is a loss function
What is gradient descent
Difference between append & extend methods in Python
How to handle sarcastic comments in sentiment analysis
What is the meaning of Dirichlet in LDA
How to handle an imbalanced data set
OneVsRestClassifier
What is Deep Learning?
What is ANN?
What are vanishing gradients?
What is backpropagation?
What is the sigmoid function?
How to evaluate which model is best?
What is an ROC curve
What is SMOTE
What are eigenvalues and eigenvectors?
How does PCA work?
Which technique did you use to identify features in your project?
What is the continuous bag-of-words model?
What are resampling techniques, and which technique did you use in your project?
How to handle an imbalanced data set

What is a confusion matrix and F1 score?


--> F1 = 2 * (precision * recall) / (precision + recall)
Precision -> of the items predicted as a class, the share that is correct
Recall -> of the items that actually belong to a class, the share that is correctly identified
What are precision and recall?
What are hyperparameters? Name some in Random Forest and SVM
What is the learning rate in gradient descent?
What is a dummy variable and why do we need to delete it?
What is one hot encoding?
What are the assumptions of logistic regression?
What is NER and how do you implement it?
What is the use of a POS tagger?
What is an n-gram?
Difference between stemming and lemmatization?
Time series analysis (Univariate analysis)
Market Basket Analysis
Apriori
Eclat
FP Growth
LSTM
Feature extraction like PCA, LDA/
KBest, RFE and univariate
ANOVA method analysis
Regularization techniques to avoid overfitting, like Ridge, Lasso,
POS Tagging
NLTK, spaCy and TextBlob from Analytics Vidhya
N-gram
Upper Confidence Bound
Thompson Sampling
ANOVA, chi-square (Analytics Vidhya)
Assumptions of Linear regression (called Multivariate analysis)
Text summarization
Momentum in gradient descent

feature selection in machine learning


Parzen window
What is the difference between a generative and a discriminative algorithm?
out of bag error
how to identify outliers in a data set
multicollinearity is a problem
t-test, z-test, F-test
interquartile range
hinge function
tanh
L1 and L2 regression - lasso regression and ridge regression
generator function
systematic sampling
skewness of data
what are the filters we use in a CNN
deep learning optimization techniques
How to avoid overfitting
A/B testing
Data cleansing part in NLP
hinge loss
logistic loss
backpropagation (GANs, kernel trick, large-margin classifier)
Gaussian discriminant (naïve Bayes, SVM, Bernoulli)
maximum likelihood estimation in logistic regression
sets in Python
transformer network
BERT in machine learning
transfer learning
topic modelling
Descriptive and generative model

ensembling methods (Bagging (bootstrap aggregation), Boosting, Random Forest)


Churn modeling
forward selection, backward elimination

decision tree
Model Selection with AIC and BIC
Four Types Of Cross Validation| K-Fold | Leave One Out |Bootstrap | Hold Out
Bias Vs Variance

Regularization and Bias, Variances


R square Vs Adjusted R square
OpenCV document scanner and edge detection
eigenvector and eigenvalue
complete explanation about K-Means Clustering and K-NN algorithms.
Clustering Algorithms.
Determine the optimal value of K in K-Means Clustering?
How can we say that the K value we provided is good?
How can we plot all the clusters?
What are decision tree and random forest, and how do they work?
entropy and information gain in Decision tree?
what is pruning in decision tree?
How PCA works and after implementing PCA what will happen?
What is dimensionality reduction?
What are PCA and LDA and their uses? What will happen after applying PCA?
Model validation techniques.
How can we say whether a model is accurate or not?
Questions on Precision and recall in Classification report?
What is t-SNE, and why do we use it
linear and non-linear activation functions
leaky ReLU and ReLU
parametric, swish
named entity recognition
sets in Python
transformer network
BERT in machine learning
transfer learning
topic modelling
eigenvector and eigenvalue
Gini index
BiDAF
ULMFiT
ELMo
BERT
Stemming vs lemmatization
sigmoid function: why do you use it
Tanh function: why do you use it
AUC
confusion matrix
SEQ2SEQ
transformer in BERT
How do you evaluate your model
Kafka scaling
Latent Dirichlet allocation
Latent semantic analysis
Hidden Markov models
Ranking algorithms
what error will help you in backpropagation
does MAE have a gradient?
GloVe?
Language modelling
How do you find the global minimum when you have many local minima using any gradient descent algorithm
Gaussian mixture model
Learning to rank, Personalisation, Factor Analysis, Predictive Modelling, Numeric Optimization
ML Pipeline
CatBoost
LightGBM
DBSCAN clustering
PAM clustering
Difference between list & array
Difference between matplotlib & pandas
Yellow algorithm
Difference between map & filter in Python
Have you worked on Flask deployment
Encoder - Decoder architecture
Null hypothesis in linear regression
what is vgg15
Encoder - Decoder architecture
Difference between descriptive statistics and inferential statistics
N-gram describe
k-means clustering (x-axis - value of k) & (y-axis - distortion)
Context analysis of NLP
seq2seq architecture
what is machine learning framework
what is deep learning framework
various types of bias for sampling
what is bias-variance tradeoff
difference between SVM & KSVM
how to handle missing data
why is a neural network non-linear
what is chain rule
can you write any algorithm without using any packages
what is elbow method
what is reinforcement learning
how backpropagation works in a neural network
what is the combination of probability
when can use supervised k-means clustering
is random forest a bagging or boosting algorithm
How to select variables in a decision tree
How is splitting decided for a decision tree
How can I use a decision tree when selecting variables in logistic regression
what approaches do you follow to convert unstructured data to structured data
what is difference between skip gram & cbow
how to choose right machine learning algorithms
difference between logistic regression and sigmoid function
difference between pca & autoencoders
Learning Links
https://medium.com/@mlengineer/generative-and-discriminative-models-af5637a66a3

https://www.analyticsvidhya.com/blog/2017/06/word-embeddings-count-word2veec/

https://www.sciencedirect.com/science/article/pii/S235286481630027X

https://www.youtube.com/watch?v=xugjARegisk

https://www.youtube.com/watch?v=VPZiJGNX4_s
https://www.youtube.com/watch?v=j-EB6RqqjGI
https://www.youtube.com/watch?v=HBi-P5j0Kec

Accuracy: the percentage of predictions that are correct.


Recall: the percentage of a given class that is correctly identified.
Precision: the percentage that is correct for a given predicted class.
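
The accuracy, recall and precision definitions above, together with the F1 score (the harmonic mean of precision and recall), can be sketched in a few lines of plain Python. This is a minimal illustration for a binary problem, not a replacement for a library implementation; the function name `classification_metrics`, the `positive` parameter and the sample labels are assumptions made for the example.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1 for one class (hypothetical helper)."""
    pairs = list(zip(y_true, y_pred))
    # Confusion-matrix cells for the chosen positive class
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    accuracy = (tp + tn) / len(pairs)
    # Precision: share of predicted positives that are correct
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: share of actual positives that were found
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1: harmonic mean of precision and recall
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Made-up labels, for illustration only
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

The zero-denominator guards return 0.0 when a class is never predicted or never present; that is one common convention, chosen here for simplicity.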
https://sebastianraschka.com/Articles/2014_python_lda.html
https://www.youtube.com/watch?v=xlHk4okO8Ls

https://www.youtube.com/watch?v=Aw77aMLj9uM
https://github.com/icoxfog417/awesome-text-summarization

https://www.datacamp.com/community/tutorials/predictive-analytics-machine-learning
https://towardsdatascience.com/feat

https://stackoverflow.com/questions/879432/what-is-the-difference-between-a-generative-and-a-discriminative-algorithm

https://medium.com/@mlengineer/generative-and-discriminative-models-af5637a66a3
https://medium.com/@D.Laupheimer

https://www.youtube.com/watch?v=nelJ3svz0_o
https://www.youtube.com/watch?v=MRD67WgWonA

https://www.geeksforgeeks.org/decision-tree-introduction-example/
http://dataaspirant.com/2017/01/30/how-decision-tree

https://www.youtube.com/watch?v=e0JcXMzhtdY
https://www.youtube.com/watch?v=fDQkUN9yw44

https://www.youtube.com/watch?v=4MKN-JkNGXY
https://www.youtube.com/watch?v=lpkSGTT8uMg
https://www.youtube.com/watch?v=4MKN-JkNGXY
https://www.pyimagesearch.com/2014/09/01/build-kick-ass-mobile-document-scanner-just-5-minutes/
SL No Question
1 What issues were faced during deployment
2 How do you do string to int conversion
3 Explain about your projects. What are dependent and independent variables. What is the goal, what has been achieved?
4 Decision tree - how it will classify
5 Missing values - how to handle (by imputation), by boxplots, scatter plots
6 Outliers - how to handle (how many rows got removed)
7 Bold highlighted things -> why did you use it, what improvements you observed
8 Choose an algorithm, its drawbacks, advantages
9 Where we can apply ML problems. Giving a problem and provide solution
10 Imbalanced data -> how to balance it
11 Why used that algorithm only in the project. Why not others?
12 Dataset what columns and features importance of them
13 Why dropped few columns
14 EDA - Type of data (categorical, numerical) – are we doing classification
15 What is the process of NLP.
16 What algorithm used for NLP
17 Performance metrics and when to use which
18 Tell about favorite model in detail and how it works
19 More about the project's goal, dataset, why that model, hyperparameter tuning
20 Gradient descent
21 What is train, test and validation data set and difference
22 Hadoop implementation mapreduce
Fractol 23 Performance metrics in NN models. Transformers used?
24 In current project, why used neural networks rather than traditional ML models
25 How do you initialize the weights in the neural network
26 Why RNN other than other NN models? How do you validate the performance of these models?
27 Why using RNN instead of LSTM model?
28 What is activation function and its use?
29 Significance of ReLU activation function? Why it is preferred compared to others?
30 Time series ARIMAX. How it works? Steps in ARIMAX? How the parameters are determined?
31 Loan prediction project: Explain about the project, how many features, how many are important. How did you determine the important features? Dimensionality reduction. Which model gave best accuracy?
32 Bias-variance trade-off explain
33 Different performance metrics.
34 Why we use cross validation.
35 Difference between supervised and unsupervised learning
36 F1 score significance. What is the significance of precision and recall in it, explain.
37 Explain SVM model. What is the difference between Linear and RBF kernels?
38 In NLP, what are different conversion models? Provided a classification example. (Need to classify the tweets. Positive, negative, neutral). What is better model? (Told as TF-IDF) Why TF-IDF. Explain it. When classifying TF-IDF will decrease the significance of high count of words.
39 What model did you use for this NLP? Said Naïve Bayes. Told to explain about it.
40 What is conditional probability? What is covariance. Range of covariance (Told as 0 to 1).
41 If two events are independent, what is the probability of one with respect to the other?
Wipro 42 Difference between covariance and standard deviation?
43 What is AIC
44 ROC
45 How to check multicollinearity - How to test
46 Before and after treatment - How to identify if there i
47 What is covariance
48 Explain R2
49 Null and residual deviance
50 Categorical data: how is it related to output finding
51 What is VIF
HCL 52 Explainability vs accuracy
53 Explain confusion matrix
54 Write Python code to find prime numbers
Related to statistics. Can perform T-Test
Learning Links
https://medium.com/@mlengineer/generative-and-discriminative-models-af5637a66a3
https://acadgild.com/blog/top-20-apache-spark-interview-questions-2019
https://www.kaggle.com/sumi25/understand-arima-and-tune-p-d-q
https://www.edureka.co/blog/interview-questions/top-50-hadoop-interview-questions-2016/
https://intellipaat.com/blog/interview-question/big-data-hadoop-interview-questions/
https://www.dezyre.com/article/top-100-hadoop-interview-questions-and-answers-2019/159

Python

https://www.edureka.co/blog/interview-questions/python-interview-questions/
https://www.guru99.com/python-interview-questions-answers.html
https://data-flair.training/blogs/top-python-interview-questions-answer/
https://www.analyticsvidhya.com/blog/2015/08/comprehensive-guide-regression/
https://www.analyticsvidhya.com/blog/2019/08/11-important-model-evaluation-error-metrics/
https://www.analyticsvidhya.com/blog/2017/09/common-machine-learning-algorithms/
https://www.analyticsvidhya.com/blog/2016/04/complete-tutorial-tree-based-modeling-scratch-in-python/
Statistics for Data Science and Business Analysis
Glossary

Section | Lesson | Word | Definition
1 | Population vs sample | population | The collection of all items of interest to our study; denoted N.
1 | Population vs sample | sample | A subset of the population; denoted n.
1 | Population vs sample | parameter | A value that refers to a population. It is the opposite of a statistic.
1 | Population vs sample | statistic | A value that refers to a sample. It is the opposite of a parameter.
1 | Population vs sample | random sample | A sample where each member is chosen from the population strictly by chance.
2 | Types of data | representative sample | A sample taken from the population to reflect the population as a whole.
2 | Types of data | variable | A characteristic of a unit which may assume more than one value, e.g. height, occupation, age.
2 | Types of data | type of data | A way to classify data. There are two types of data - categorical and numerical.
2 | Types of data | categorical data | A subgroup of types of data. Describes categories or groups.
2 | Types of data | numerical data | A subgroup of types of data. Represents numbers. Can be further classified into discrete and continuous.
2 | Types of data | discrete data | Data that can be counted in a finite manner. Opposite of continuous.
2 | Types of data | continuous data | Data that is 'infinite' and impossible to count. Opposite of discrete.
2 | Levels of measurement | levels of measurement | A way to classify data. There are two levels of measurement - qualitative and quantitative, which are further classed into nominal & ordinal, and ratio & in
2 | Levels of measurement | qualitative data | A subgroup of levels of measurement. There are two types of qualitative data - nominal and ordinal.
2 | Levels of measurement | quantitative data | A subgroup of levels of measurement. There are two types of quantitative data - ratio and interval.
2 | Levels of measurement | nominal | Refers to variables that describe different categories and cannot be put in any order.
2 | Levels of measurement | ordinal | Refers to variables that describe different categories, but can be ordered.
2 | Levels of measurement | ratio | A number that has a unique and unambiguous zero point, no matter if a whole number or a fraction.
2 | Levels of measurement | interval | An interval variable represents a number or an interval. There isn't a unique and unambiguous zero point. For example, degrees in Celsius and Fahrenheit
2 | Categorical variables. Visualization techniques | frequency distribution table | A table that represents the frequency of each variable.
2 | Categorical variables. Visualization techniques | frequency | Measures the occurrence of a variable.
2 | Categorical variables. Visualization techniques | absolute frequency | Measures the NUMBER of occurrences of a variable.
2 | Categorical variables. Visualization techniques | relative frequency | Measures the RELATIVE NUMBER of occurrences of a variable. Usually expressed in percentages.
2 | Categorical variables. Visualization techniques | cumulative frequency | The sum of relative frequencies so far. The cumulative frequency of all members is 100% or 1.
2 | Categorical variables. Visualization techniques | Pareto diagram | A type of bar chart where frequencies are shown in descending order. There is an additional line on the chart, showing the cumulative frequency.
2 | The Histogram | histogram | A type of bar chart that represents numerical data. It is divided into intervals (or bins) that are not overlapping and span from the first observation to the la
2 | The Histogram | bins (histogram) | The intervals that are represented in a histogram.
2 | Cross table and scatter plot | cross table | A table which represents categorical data. On one axis we have the categories, and on the other - their frequencies. It can be built with absolute or relative
2 | Cross table and scatter plot | contingency table | See cross table.
2 | Cross table and scatter plot | scatter plot | A plot that represents numerical data. Graphically, each observation looks like a point on the scatter plot.
2 | Mean, median and mode | measures of central tendency | Measures that describe the data through 'averages'. The most common are the mean, median and mode. There is also geometric mean, harmonic mean, w
2 | Mean, median and mode | mean | The simple average of the dataset. Denoted μ.
2 | Mean, median and mode | median | The middle number in an ordered dataset.
2 | Mean, median and mode | mode | The value that occurs most often. A dataset can have 0, 1 or multiple modes.
2 | Skewness | measures of asymmetry | Measures that describe the data through the level of symmetry that is observed. The most common are skewness and kurtosis.
2 | Skewness | skewness | A measure that describes the dataset's symmetry around its mean.
2 | Variance | sample formula | A formula that is calculated on a sample. The value obtained is a statistic.
2 | Variance | population formula | A formula that is calculated on a population. The value obtained is a parameter.
2 | Variance | measures of variability | Measures that describe the data through the level of dispersion (variability). The most common ones are variance and standard deviation.
2 | Variance | variance | Measures the dispersion of the dataset around its mean. It is measured in units squared. Denoted σ² for a population and s² for a sample.
2 | Standard deviation and coefficient of variation | standard deviation | Measures the dispersion of the dataset around its mean. It is measured in original units. It is equal to the square root of the variance. Denoted σ for a popu
2 | Standard deviation and coefficient of variation | coefficient of variation | Measures the dispersion of the dataset around its mean. It is also called 'relative standard deviation'. It is useful for comparing different datasets in terms
2 | Covariance | univariate measure | A measure which refers to a single variable.
2 | Covariance | multivariate measure | A measure which refers to multiple variables.
2 | Covariance | covariance | A measure of relationship between two variables. Usually, because of its scale of measurement, covariance is not directly interpretable. Denoted σxy for a p
2 | Correlation | linear correlation coefficient | A measure of relationship between two variables. Very useful for direct interpretation as it takes on values from [-1,1]. Denoted ρxy for a population and rxy
2 | Correlation | correlation | A measure of the relationship between two variables. There are several ways to compute it, the most common being the linear correlation coefficient.
3 | What is a distribution | distribution | A function that shows the possible values for a variable and the probability of their occurrence.
3 | The normal distribution | Bell curve | A common name for the normal distribution.
3 | The normal distribution | Gaussian distribution | The original name of the normal distribution. Named after the famous mathematician Gauss, who was the first to explore it through his work on the Gauss
3 | The normal distribution | to control for the mean/std/etc | While holding a particular value constant, we change the other variables and observe the effect.
3 | The standard normal distribution | standard normal distribution | A normal distribution with a mean of 0 and a standard deviation of 1.
3 | The standard normal distribution | z-statistic | The statistic associated with the normal distribution.
3 | The standard normal distribution | standardized variable | A variable which has been standardized using the z-score formula - by first subtracting the mean and then dividing by the standard deviation.
3 | The central limit theorem | central limit theorem | No matter the distribution of the underlying dataset, the sampling distribution of the means of the dataset approximates a normal distribution, provided the samples are sufficiently large.
3 | The central limit theorem | sampling distribution | The distribution of a statistic (such as the sample mean) computed over many samples drawn from the same population.
3 | Standard error | standard error | The standard error is the standard deviation of the sampling distribution. It takes the size of the sample into account.
3 | Estimators and estimates | estimator | Estimations we make according to a function or rule.
3 | Estimators and estimates | estimate | The particular value that was estimated through an estimator.
3 | Estimators and estimates | bias | An unbiased estimator has an expected value equal to the population parameter. A biased one has an expected value different from the population parameter. The
3 | Estimators and estimates | efficiency (in estimators) | In the context of estimators, the efficiency loosely refers to 'lack of variability'. The most efficient estimator is the one with the least variability. It is a comp
3 | Estimators and estimates | point estimator | A function or a rule, according to which we make estimations that will result in a single number.
3 | Estimators and estimates | point estimate | A single number that is derived from a certain point estimator.
3 | Estimators and estimates | interval estimator | A function or a rule, according to which we make estimations that will result in an interval. In this course, we will only consider confidence intervals. Anoth
3 | Estimators and estimates | interval estimate | A particular result that was obtained from an interval estimator. It is an interval.
3 | Definition of confidence intervals | confidence interval | A confidence interval is the range within which you expect the population parameter to be. You have a certain probability of it being correct, equal to the s
3 | Definition of confidence intervals | reliability factor | A value from a z-table, t-table, etc. that is associated with our test.
3 | Definition of confidence intervals | level of confidence | Shows in what % of cases we expect the population parameter to fall into the confidence interval we obtained. Denoted 1 - α. Example: 95% confidence le
3 | Population variance known, z-score | critical value | A value coming from a table for a specific statistic (z, t, F, etc.) associated with the probability (α) that the researcher has chosen.
3 | Population variance known, z-score | z-table | A table associated with the Z-statistic, where given a probability (α), we can see the value of the standardized variable, following the standard normal distr
3 | Student's T distribution | t-statistic | A statistic that is generally associated with the Student's T distribution, in the same way the z-statistic is associated with the normal distribution.
3 | Student's T distribution | a rule of thumb | A principle which is approximately true and is widely used in practice due to its simplicity.
3 | Student's T distribution | t-table | A table associated with the t-statistic, where given a probability (α), and certain degrees of freedom, we can check the reliability factor.
3 | Student's T distribution | degrees of freedom | The number of variables in the final calculation that are free to vary.
3 | Margin of error | margin of error | Half the width of a confidence interval. It drives the width of the interval.
4 | Null vs alternative | hypothesis | Loosely, a hypothesis is 'an idea that can be tested'.
4 | Null vs alternative | hypothesis test | A test that is conducted in order to verify if a hypothesis is true or false.
4 | Null vs alternative | null hypothesis | The null hypothesis is the one to be tested. Whenever we are conducting a test, we are trying to reject the null hypothesis.
4 | Null vs alternative | alternative hypothesis | The alternative hypothesis is the opposite of the null. It is usually the opinion of the researcher, as he is trying to reject the null hypothesis and thus accept
4 | Null vs alternative | to accept a hypothesis | The statistical evidence shows that the hypothesis is likely to be true.
4 | Null vs alternative | to reject a hypothesis | The statistical evidence shows that the hypothesis is likely to be false.
4 | Null vs alternative | one-tailed (one-sided) test | Tests which determine if a value is lower (or equal) or higher (or equal) to a certain value are one-sided. This is because they can only be rejected on one s
4 | Null vs alternative | two-tailed (two-sided) test | Tests which determine if a value is equal (or different) to a certain value are two-sided. This is because they can be rejected on two sides - if the paramete
4 | Rejection region and significance level | significance level | The probability of rejecting the null hypothesis, if it is true. Denoted α. You choose the significance level. All else equal, the lower the level, the better the t
4 | Rejection region and significance level | rejection region | The part of the distribution, for which we would reject the null hypothesis.
4 | Type I error vs type II error | type I error (false positive) | This error consists of rejecting a null hypothesis that is true. The probability of committing it is α, the significance level.
4 | Type I error vs type II error | type II error (false negative) | This error consists of accepting a null hypothesis that is false. The probability of committing it is β.
4 | Type I error vs type II error | power of the test | Probability of rejecting a null hypothesis that is false (the researcher's goal). Denoted by 1 - β.
4 | Test for the mean. Population variance known | z-score | The standardized variable associated with the dataset we are testing. It is observed in the table with an α equal to the level of significance of the test.
4 | Test for the mean. Population variance known | μ0 | The hypothesized population mean.
4 | p-value | p-value | The smallest level of significance at which we can still reject the null hypothesis given the observed sample statistic.
4 | Test for the mean. Population variance unknown | email open rate | A measure of how many people on an email list actually open the emails they have received.
5 | Correlation vs causation | causation | Causation refers to a causal relationship between two variables. When one variable changes, the other changes accordingly. When we have causality, varia
5 | Correlation vs causation | GDP | Gross domestic product is a monetary measure of the market value of all final goods and services produced for a specific country for a period.
5 | The linear regression model | regression analysis | A statistical process for estimating relationships between variables. Usually, it is used for building predictive models.
5 | The linear regression model | linear regression model | A linear approximation of a causal relationship between two or more variables.
5 | The linear regression model | dependent variable ( ŷ ) | The variable that is going to be predicted. It also 'depends' on the other variables. Usually denoted y.
5 | The linear regression model | independent variable ( xi ) | A variable that is going to predict. It is the observed data (your sample data). Usually denoted x1, x2 to xk.
5 | The linear regression model | coefficient ( βi ) | A numerical or constant quantity placed before and multiplying the variable in an algebraic expression.
5 | The linear regression model | constant ( β0 ) | This is a constant value, which does not affect any independent variable, but affects the dependent one in a constant manner.
5 | The linear regression model | epsilon ( ε ) | The error of prediction. Difference between the observed value and the (unobservable) true value.
5 | The linear regression model | regression equation | An equation, where the coefficients are estimated from the sample data. Think of it as an estimator of the linear regression model.
5 | The linear regression model | b0, b1,…, bk | Estimates of the coefficients β0, β1, … βk.
5 | Geometrical representation | regression line | The best-fitting line through the data points.
5 | Geometrical representation | residual ( e ) | Difference between the observed value and the estimated value by the regression line. Point estimate of the error ( ε ).
5 | Geometrical representation | b0 | The intercept of the regression line with the y-axis for a simple linear regression.
5 | Geometrical representation | b1 | The slope of the regression line for a simple linear regression.
5 | Example | SAT | The SAT is a standardized test for college admission in the US.
5 | Example | GPA | Grade point average.
5 | Decomposition | ANOVA | Abbreviation of 'analysis of variance'. A statistical framework for analyzing variance of means.
5 | Decomposition | SST | Sum of squares total. SST is the sum of the squared differences between the observed dependent variable and its mean.
5 | Decomposition | SSR | Sum of squares regression. SSR is the sum of the squared differences between the predicted value and the mean of the dependent variable. This is the variability e
5 | Decomposition | SSE | Sum of squares error. SSE is the sum of the squared differences between the observed value and the predicted value. This is the variability that is NOT explained by
5 | R-squared | r-squared ( R² ) | A measure ranging from 0 to 1 that shows how much of the total variability of the dataset is explained by our regression model.
5 | OLS | OLS | An abbreviation of 'ordinary least squares'. It is a method for estimation of the regression equation coefficients.
5 | Regression tables | regression tables | In this context, they refer to the tables that are going to be created after you use software to determine your regression equation.
5 | Multivariate linear regression model | multivariate linear regression | Also known as multiple linear regression. There is a slight difference between the two, but they are generally used interchangeably. In this course, it refers to a
5 | Adjusted R-squared | adjusted r-squared | A measure, based on the idea of R-squared, which penalizes the excessive use of independent variables.
5 | F-test | F-statistic | The F-statistic is connected with the F-distribution in the same way the z-statistic is related to the Normal distribution.
5 | F-test | F-test | A test for the overall significance of the model.
5 | Assumptions | assumptions | When performing linear regression analysis, there are several assumptions about your data. They are known as the linear regression assumptions.
5 | Assumptions | linearity | Refers to a linear relationship between the dependent variable and the independent variables.
5 | Assumptions | homoscedasticity | Literally means the same variance.
5 | Assumptions | endogeneity | In statistics, refers to a situation where an independent variable is correlated with the error term.
5 | Assumptions | autocorrelation | When different error terms in the same model are correlated to each other.
5 | Assumptions | multicollinearity | Refers to high correlation between independent variables.
5 | A2. No endogeneity | omitted variable bias | A bias to the error term, which is introduced when you forget to include an important variable in your model.
5 | A3. Normality and homoscedasticity | heteroscedasticity | Literally means a different variance. Opposite of homoscedasticity.
5 | A3. Normality and homoscedasticity | log transformation | A transformation of a variable(s) in your model, where you substitute that variable(s) with its logarithm.
5 | A3. Normality and homoscedasticity | semi-log model | One part of the model is log, the other is not.
5 | A3. Normality and homoscedasticity | log-log model | Both parts of the model are logarithmic.
5 | A4. No autocorrelation | serial correlation | Autocorrelation.
5 | A4. No autocorrelation | cross-sectional data | Data taken at one moment in time.
5 | A4. No autocorrelation | time series data | A type of panel data. Usually, time series is a sequence taken at successive, equally spaced points in time, e.g. stock prices.
5 | A4. No autocorrelation | day of the week effect | A well-known phenomenon in finance. Consists in disproportionately high returns on Fridays and low returns on Mondays.
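
The Decomposition and R-squared entries in the glossary can be checked numerically. The sketch below fits a simple linear regression by ordinary least squares in plain Python and confirms that SST = SSR + SSE and that R² = SSR / SST. The function name `ols_decomposition` and the data points are made up for illustration; SSR and SSE follow the glossary's naming (regression and error), which some textbooks swap.

```python
def ols_decomposition(x, y):
    """Fit y = b0 + b1*x by OLS and return the sums of squares (hypothetical helper)."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n

    # OLS estimates for a simple (one-variable) linear regression
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    b1 = sxy / sxx                # slope
    b0 = y_mean - b1 * x_mean     # intercept

    y_hat = [b0 + b1 * xi for xi in x]

    sst = sum((yi - y_mean) ** 2 for yi in y)              # total variability
    ssr = sum((yh - y_mean) ** 2 for yh in y_hat)          # explained by the regression
    sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))  # unexplained (residual)
    r_squared = ssr / sst
    return {"b0": b0, "b1": b1, "sst": sst, "ssr": ssr,
            "sse": sse, "r_squared": r_squared}


# Made-up data, for illustration only
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
result = ols_decomposition(x, y)
print(result)
```

For this data the fitted line has slope 0.6 and intercept 2.2, and the regression explains 60% of the total variability. The identity SST = SSR + SSE holds here because the model is fitted by OLS with an intercept.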
